The goal of gsedread is to read validation data of the project Global Scales for Early Development (GSED).
Installation
Install the gsedread
package from GitHub as follows:
install.packages("remotes")
remotes::install_github("d-score/gsedread")
There is no CRAN version.
Example
You need access to the WHO SharePoint site and sync the data to a local OneDrive. In the file .Renviron
in your home directory add a line specifying the location of your synced OneDrive, e.g.,
='/Users/username/Library/CloudStorage/OneDrive-Sharedlibraries-WorldHealthOrganization/CAVALLERA, Vanessa - GSED Validation 2021_phase I' ONEDRIVE_GSED
After setting the environmental variable ONEDRIVE_GSED
, restart R, and manually check whether you are able to read the OneDrive directory.
dir(Sys.getenv("ONEDRIVE_GSED"))
#> [1] "-DESKTOP-GU6P9PF.RData"
#> [2] "-DESKTOP-GU6P9PF.Rhistory"
#> [3] "Bangladesh Validation"
#> [4] "Baseline Analysis - OLD - NOV 2021"
#> [5] "Data Cleaning Script MK1 - Run before merge.R"
#> [6] "Data Merge Script MK1.R"
#> [7] "Final Phase 1 Data - May 10th 2022"
#> [8] "GSED Final Collated Phase 1 Data Files 18_05_22"
#> [9] "GSED PHASE 1 DATA COLLECTED LOG"
#> [10] "GSED_data_quality_1_output_LF_TEST.csv"
#> [11] "GSED_data_quality_1_output.csv"
#> [12] "GSED_phase1_merged_11_11_21.csv"
#> [13] "GSED_phase1_merged_20_07_22.csv"
#> [14] "interim DAZ values combined.csv"
#> [15] "Interim validation data_phase I_May2021"
#> [16] "Master_data_dictionary_MAIN_v0.9.1_2021.04.22.xlsx"
#> [17] "merged_lf.dta"
#> [18] "Norming work"
#> [19] "Pakistan Validation"
#> [20] "Pemba Validation"
#> [21] "Phase 1 Data for Sunil"
#> [22] "PREDICTIVE VALIDITY GSED 2.0"
#> [23] "QUALITATIVE"
#> [24] "QUALITATIVE DATA PHASE 1 MAY 2022"
#> [25] "Stop rule change exploration"
The following commands reads all SF data from GSED Final Collated Phase 1 Data Files 18_05_22
directory and returns a tibble with one record per administration.
Count the number of records per file:
table(data$file)
#>
#> ban_sf_2021_11_03 ban_sf_new_enrollment_17_05_2022
#> 1421 72
#> ban_sf_predictive_17_05_2022 pak_sf_2022_05_17
#> 473 1761
#> pak_sf_new_enrollment_2022_05_17 pak_sf_predictive_2022_05_17
#> 72 459
#> tza_sf_2021_11_01 tza_sf_new_enrollment_10_05_2022
#> 1427 74
#> tza_sf_predictive_10_05_2022
#> 469
Process variable names user-friendly alternative:
rename_vector(colnames(data)[c(1:3, 19, 21:25)], lexout = "gsed2", trim = "Ma_SF_")
#> [1] "file" "gsed_id" "parent_id" "date" "gpalac001" "gpacgc002"
#> [7] "gpafmc003" "gpasec004" "gpamoc005"
Operations
The package reads and processes GSED data. It does not store data. The read_sf()
and read_lf()
functions takes the following actions:
- Constructs the paths to the files OneDrive sync file;
- Reads all specified datasets in a list;
- Internally specifies the desired format for each column;
- Specifies the available date and data-time formats per file;
- Recodes empty,
NA
,-8888
,-8,888.00
and-9999
values asNA
; - Repairs problems with mixed data-time formats in the adaptive Pakistan data;
- Stacks the datasets to one tibble and adds columns
file
andadm
; - Removes records without a
GSED_ID
.
Item renaming with rename_variables()
relies on the item translation table at https://github.com/D-score/gsedread/blob/main/inst/extdata/itemnames_translate.tsv.
Acknowledgement
This study was supported by the Bill & Melinda Gates Foundation. The contents are the sole responsibility of the authors and may not necessarily represent the official views of the Bill & Melinda Gates Foundation or other agencies that may have supported the primary data studies used in the present study.