Data files
The raw data files are downloaded from Qualtrics™ into a folder called data_raw with their default Qualtrics™ names, which include the survey name plus the date and time of the download. The data is exported from Qualtrics™ as SPSS .sav files with the extra long labels option.
The Qualtrics™ questionnaire was based on a design codeplan saved in an Excel .xlsx file, which is stored in a folder named study_design. The same folder also contains a spreadsheet with details about the survey respondents who also took part in the follow-up qualitative interview phase of the data collection.
The data_raw and study_design folders contain the following files:
| path | size | modified |
|---|---|---|
| Printed on 17 May 2024 | | |
| data_raw/postSA_YSJ_2023_3+February+2024_21.40.sav | 7.29M | 2024-02-03 21:41:12 |
| data_raw/preSA_YSJ_2023_12+April+2024_12.08.sav | 23.44M | 2024-04-12 12:08:56 |
| data_raw/Study Abroad Expectations_September 11, 2023_17.06.sav | 43.24M | 2023-09-12 00:09:39 |
| data_raw/Study+Abroad+Expectations+–+External_September+11,+2023_17.09.sav | 23.82M | 2023-09-12 00:09:19 |
| study_design/MAXOUT-SA_Codeplan.xlsx | 55.82K | 2024-04-18 16:43:18 |
| study_design/MAXOUT-SA_Interviewees.xlsx | 12.83K | 2024-04-25 13:21:27 |
| study_design/postSA_survey_labels.csv | 11.85K | 2024-05-16 06:23:26 |
| study_design/preSA_survey_labels.csv | 20.87K | 2024-05-16 06:23:25 |
The code below sets up functional links to these files in R:
#### File paths ----------------------------------------------------------------------------------
(datafiles <- list.files("data_raw", pattern = "\\.sav")) # List `.sav` files
(designfiles <- list.files("study_design", pattern = "\\.xlsx")) # List `.xlsx` files
## 2020 pre-SA
preSA20_ysj_path <- file.path("data_raw", grep("Study Abroad Expectations", datafiles, value = TRUE)) # 2020 YSJ student data
preSA20_ext_path <- file.path("data_raw", grep("External", datafiles, value = TRUE)) # 2020 Non-YSJ student data
## 2023 pre-SA
preSA23_ysj_path <- file.path("data_raw", grep("preSA_YSJ_2023", datafiles, value = TRUE)) # 2023 YSJ pre-SA data
## 2023 post-SA
postSA23_ysj_path <- file.path("data_raw", grep("postSA_YSJ_2023", datafiles, value = TRUE)) # 2023 YSJ post-SA data
## Design
codeplan_path <- file.path("study_design", grep("Codeplan", designfiles, value = TRUE)) # Excel survey codebook
interviewees_path <- file.path("study_design", grep("Interviewees", designfiles, value = TRUE)) # Excel list of interviewees
Pre-SA datasets
Questionnaire/variable differences
The 2020 pilot data collection consists of a YSJ and an external dataset. The difference between the two questionnaires was a single item that asked YSJ respondents whether they would also be interested in participating in a qualitative interview study. Qualitative data was not collected from external respondents:
In YSJ but not in External data:
[1] "ysj_interview"
[1] "Accept to participate in an interview"
The position of the variable in the dataset is:
[1] 303
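The code that produced the output above is not shown. A minimal base R sketch of this kind of comparison, using two hypothetical toy data frames (`ysj_toy` and `ext_toy` are illustrative stand-ins, not the real datasets), might look like this:

```r
# Toy stand-ins for the YSJ and External datasets (hypothetical columns)
ysj_toy <- data.frame(age = 20, language = "Japanese",
                      ysj_interview = 1, email = "a@example.org")
ext_toy <- data.frame(age = 21, language = "Korean",
                      email = "b@example.org")

# Variables present in the first dataset but not in the second
only_in_ysj <- setdiff(names(ysj_toy), names(ext_toy))
only_in_ysj

# Column position of each such variable in the first dataset
pos <- match(only_in_ysj, names(ysj_toy))
pos
```

In the real data, a position found this way is the index that has to be dropped from the codeplan names and labels when they are applied to the External dataset.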
The 2023 pre-SA survey had several differences compared to the 2020 questionnaire:
| 2020 | 2023 |
|---|---|
| In which academic year do you expect to go on a Study Abroad year? [ | In which year do you expect to go on a Study Abroad year/semester? [ |
| Who do you expect to socialize with most while on Study Abroad? [ | Who do you expect to socialize with most while on Study Abroad? [ |
| Block of 16 questions on “imagined self” was not asked | |
The email question asked for “university email address” specifically |
The sayr and expect_socialise variables from 2023 were given the _23 suffix to their variable names (in the Codeplan document).
Data import
The code below imports into R the pre-SA data (.sav files), the variable information (names, labels) from the codeplan document (study_design/MAXOUT-SA_Codeplan.xlsx), and information about which respondents also participated in qualitative follow-up interviews (study_design/MAXOUT-SA_Interviewees.xlsx):
#### Import from raw ----------------------------------------------------------------------------------------
## Pre-SA Codeplan
codeplan_pre <- read_excel(codeplan_path, sheet = "preSAvars")
## Interviewees
interviewees <- read_excel(interviewees_path) |> data_select(c("Random_ID", "interviewed_preSA", "interviewed_postSA"))
## 2020 pre-SA YSJ
preSA20_ysj <- read_spss(preSA20_ysj_path) # Import from spss
names(preSA20_ysj) <- codeplan_pre$varname_pre20 # Assign variable names
sjlabelled::set_label(preSA20_ysj) <- codeplan_pre$varlabel_pre20 # Assign variable labels
## 2020 pre-SA External
preSA20_ext <- read_spss(preSA20_ext_path)
names(preSA20_ext) <- codeplan_pre$varname_pre20[-303] # Assign variable names removing YSJ-specific var
sjlabelled::set_label(preSA20_ext) <- codeplan_pre$varlabel_pre20[-303] # Assign variable labels removing YSJ-specific var
## 2023 pre-SA YSJ
preSA23_ysj <- read_spss(preSA23_ysj_path)
names(preSA23_ysj) <- na.omit(codeplan_pre$varname_pre23) # Assign variable names
sjlabelled::set_label(preSA23_ysj) <- na.omit(codeplan_pre$varlabel_pre23) # Assign variable labels
Data management variables
Before merging the datasets, we create an additional cohort column which records the academic year of the pre-SA data collection. The survey for the YSJ study had been kept open for several months, spanning the second semester of the 2019/2020 academic year and the first semester of the 2020/2021 AY, and therefore the preSA20_ysj dataset contains responses from two student cohorts (2019/2020 and 2020/2021). Data for the preSA20_ext dataset should only contain responses from the 2019/2020 student cohort due to the outbreak of the Covid-19 pandemic, which interfered with the data collection since international travel - and Study Abroad years - were put on hold. However, there is one response that was submitted in March 2021. This response will be removed from the dataset:
#### Create `cohort` column ------------------------------------------------------------------------------------
preSA20_ysj <- preSA20_ysj |>
mutate(cohort = case_when(StartDate < as.POSIXct("2020-09-01") ~ "19/20",
StartDate >= as.POSIXct("2020-09-01") ~ "20/21"))
preSA20_ext <- preSA20_ext |>
mutate(cohort = "19/20") |>
dplyr::filter(StartDate < as.POSIXct("2020-09-01")) # remove response dating "2021-03-20 16:11:51"
preSA23_ysj <- preSA23_ysj |>
mutate(cohort = "23/24")
Merging the pre-SA datasets
The merged dataset should therefore have ncol(preSA20_ysj) + 3 = 312 variables.
The code below merges the datasets and checks its dimensions:
#### Merge all pre-SA datasets from 2020 and 2023 ---------------------------------------------------------------
preSA <- sjmisc::add_rows(preSA20_ysj, ## `sjmisc::add_rows` keeps the `label` attribute but drops other, non-relevant attributes
preSA20_ext, ## `dplyr::bind_rows` removes the variable label attributes
preSA23_ysj) ## `datawizard::data_merge` keeps all SPSS-specific attributes (`display_width`, `format.spss`)
### Check dimensions of merged dataframe
dim(preSA)
[1] 243 312
Number of columns as expected: TRUE
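The same kind of dimension check can be tried on toy data. The sketch below is a base R illustration only: `bind_fill()` is a hypothetical helper that stands in for `sjmisc::add_rows` (it does not preserve label attributes), and the three small data frames are invented for the example.

```r
# Row-bind data frames whose columns only partly overlap, filling gaps with NA
bind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA  # add missing columns
    d[all_cols]                                              # same column order everywhere
  })
  do.call(rbind, filled)
}

# Hypothetical toy data: `d2` lacks one column of `d1`, `d3` adds two new ones
d1 <- data.frame(id = 1:2, interview = c("yes", "no"))
d2 <- data.frame(id = 3L)
d3 <- data.frame(id = 4L, interview = "yes", sayr_23 = "2024", extra_23 = "x")

merged <- bind_fill(d1, d2, d3)
expected_cols <- ncol(d1) + 2   # two variables exist only in `d3`

dim(merged)
ncol(merged) == expected_cols
```

Computing the expected column count from the inputs, rather than hard-coding it, keeps the check meaningful if the codeplan later gains or loses variables.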
Replacing piped text
The Qualtrics™ questionnaire included piped text for Japanese and Korean language students, and these appear with non-human-readable characters in the variable and value labels, so we replace these characters with the phrase “JP/KO”. For example, see the value label of the A1_comjpko variable before and after the replacement:
| Speaks with JP/KO friends in JP/KO (A1_comjpko) | |||||
|---|---|---|---|---|---|
| Value | Label | N | Raw % | Valid % | Cum. % |
| 1 | ${lm://Field/1} | 38 | 15.64 | 100 | 100 |
| Speaks with JP/KO friends in JP/KO (A1_comjpko) | |||||
|---|---|---|---|---|---|
| Value | Label | N | Raw % | Valid % | Cum. % |
| 1 | JP/KO | 38 | 15.64 | 100 | 100 |
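The replacement pattern can be tried on a few labels before it is applied to the whole dataset. This is a minimal sketch that uses base R `gsub()` with `perl = TRUE` in place of `stringr::str_replace_all()`; the first label is the one shown in the table above, while the other two are hypothetical examples.

```r
# Example labels: the first two carry a Qualtrics piped-text placeholder
labels_raw <- c("${lm://Field/1}",
                "Speaks ${lm://Field/1} at home",   # hypothetical label
                "English")                          # hypothetical label

# Replace everything from `$` up to and including the next `}`
labels_clean <- gsub("\\$[^\\}]*\\}", "JP/KO", labels_raw, perl = TRUE)
labels_clean
```

Labels without a placeholder are returned unchanged, so the replacement is safe to run over every variable.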
The code below makes the replacements across all the value labels in the dataset:
#### Replace shortcodes for "Japanese" and "Korean" -----------------------------------------------------
## Get all value labels as list
labs <- sjlabelled::get_labels(preSA)
## Change all the values labels in all the variables in list
labs <- lapply(labs, function(x) str_replace_all(x,
'\\$[^\\}]*\\}',
"JP/KO"))
## Apply changed labels to dataset; keep labels as attribute (don't do `as_label(as.numeric)` beforehand)
preSA <- sjlabelled::set_labels(preSA, labels = labs, force.labels = TRUE)
Converting categorical variables
We convert the values of all labelled factor (categorical) variables to their labels, so that later we can manipulate the values directly as text.
#### Convert labelled factor variables ------------------------------
## This keeps the unused labels as well
preSA <- preSA |>
mutate(across(where(is.factor), sjlabelled::as_numeric),
across(everything(), sjlabelled::as_label))
## This keeps only the labels of categories that had valid responses
# preSA_alt <- preSA |>
# mutate(across(where(is.factor), labels_to_levels))
Combining Japanese and Korean versions of variables
The survey questions were broken down by language studied (Japanese/Korean), and we have duplicate variables coding the same question (prefixed with “A1_” for Japanese and “A2_” for Korean). With the code below we combine these variables:
#### Unify variables split by language ---------------------------------------------------------
korean <- preSA |>
dplyr::filter(language == "Korean") |>
select(!starts_with("A1")) |>
rename_with(stringr::str_replace,
pattern = "A2_", replacement = "",
matches("A2_"))
japanese <- preSA |>
dplyr::filter(language == "Japanese") |>
select(!starts_with("A2")) |>
rename_with(stringr::str_replace,
pattern = "A1_", replacement = "",
matches("A1_"))
missing <- preSA |>
dplyr::filter(is.na(language)) |> # 13 missing answers to language
datawizard::remove_empty_columns() # remove all empty columns
preSA <- sjmisc::add_rows(japanese, korean, missing)
Removing incomplete responses
There were 13 responses with missing data on language. Since the language studied was a core compulsory-answer item, these 13 cases were also unfinished responses. Of the 243 responses in the pre-SA dataset, 183 were finished and submitted. We keep only the finished cases:
#### Keep only completed and submitted responses ---------------------------
preSA <- preSA |>
dplyr::filter(Finished == "True")
Removing duplicates
E-mail addresses were requested primarily for the purposes of contacting students who opted in for participation in a follow-up qualitative interview and/or future (post-SA) rounds of data collection, as well as for contacting the winner of the randomly selected participation prize. Respondent e-mail addresses and IP Addresses are also helpful for identifying any data reliability issues, such as duplicate responses (n.b. the IPAddress collected by Qualtrics™ is “external”, so those connecting to the same network will share an IP; for this reason, selecting on IP Address is not useful here).
We find four email addresses with duplicate responses:
| No. of duplicate emails | ID_first | ID_second |
|---|---|---|
| 2 | 9419 | 8436 |
| 2 | 9514 | 8175 |
| 2 | 95339 | 70145 |
| 2 | 84953 | 44422 |
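The code that identified these duplicates is not shown. One possible approach is sketched below in base R on hypothetical toy data; the column names `email`, `StartDate` and `Random_ID` mirror the Qualtrics-style fields mentioned in this document, and normalising the e-mail addresses (lower case, trimmed spaces) is an assumption rather than a step the original analysis is known to include.

```r
# Hypothetical toy responses (not the real data)
resp <- data.frame(
  Random_ID = c("101", "102", "103", "104", "105"),
  email     = c("a@example.org", "b@example.org", "A@example.org ",
                "c@example.org", "b@example.org"),
  StartDate = as.POSIXct(c("2023-10-01 09:00:00", "2023-10-02 10:00:00",
                           "2023-10-05 11:00:00", "2023-10-06 12:00:00",
                           "2023-10-03 13:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumption: normalise e-mails so that case and stray spaces do not hide duplicates
resp$email_norm <- tolower(trimws(resp$email))

# Order by start date, then flag every response after the first one per e-mail
resp <- resp[order(resp$StartDate), ]
resp$is_later_dup <- duplicated(resp$email_norm)

later_ids <- resp$Random_ID[resp$is_later_dup]  # candidates for removal
later_ids

kept <- resp[!resp$is_later_dup, ]              # earliest response per e-mail
nrow(kept)
```

Because the rows are sorted by `StartDate` first, `duplicated()` flags the later submission in each pair and leaves the earliest one in place.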
We will keep the earlier responses. The reason for this choice is that the later responses could be contaminated by the respondent having already completed the survey (practice effects). Incidentally, the earlier responses also have fewer missing answers (albeit marginally, a difference of one in two cases):
#### Delete the later responses (incidentally, the earlier ones also have fewer NAs) -----------------
`%not_in%` <- Negate(`%in%`)
preSA <- preSA |>
data_filter(Random_ID %not_in% c("8436", "8175", "70145", "44422")) # keeps original "rownames"; `rownames(preSA) <- NULL` to renumber
This leaves us with 179 responses/cases/rows.
We can also check whether any Random_ID numbers have been allocated multiple times (unfortunately, Qualtrics™ doesn’t have a system to fine-tune the randomisation of numbers…). We find that the Random_ID number 3591 has been allocated twice: