5 Data
The focus of your technical work outside clusteR will be the setup and implementation of your data collection tool. clusteR can’t help with that (see What clusteR can do), but it can connect to your data automatically.
This section focuses on obtaining, cleaning, and exporting data. For establishing a data connection, see setup_get. For analyzing data, see Analysis.
5.1 get_data
get_data downloads and archives survey responses from your data source.
clusteR natively supports Alchemer and CSV data
sources. get_data itself archives data and both invisibly returns the file
path to that archive and saves it to the config as .cluster$cfg$last_data.
get_csv, a native function, obtains data from a CSV according to
setup and returns it. If you need to import data to R for manual
work and you are using a CSV file, run df <- get_csv().
get_alc, a native function, obtains data from Alchemer according to
setup, cleans up and standardizes the JSON, and returns it. If you
need to import data to R for manual work and you are using Alchemer, run
df <- get_alc.
clusteR is highly extensible, including providing a framework for
custom data sources. To support a custom data source, get_data needs to
find a setup_get list in config with any necessary (named) variables to
run your get_x function. setup_get should be created in setup
using a setup_get_x function. get_x, like get_csv and get_alc, should
return a tidy dataframe with formatted (though not necessarily cleaned) data,
which get_data will handle.
5.2 Cleaning
clusteR provides a simple, friendly way to clean survey data: clean_data.
By default, clean_data will use the most recent data file archived by
get_data, but you can specify a different archive file using
.x.
You can programmatically weight data using .wt. clusteR natively supports
inverse probability weighting according to a generalized linear model with
clean_weight. .wt is very robust, so if a simpler solution
is desired (e.g. calculating weights using simple mutates), create a weight
variable using regular cleaning mutates instead.
Arguments passed to clean_data are passed on to dplyr::mutate
to modify existing columns and create new ones. Columns are set by your get_x
function: get_csv will use the first row of your input CSV as names, while
get_alc will use a codebook, for instance.
clean_data archives its output according to the name of the input file, which
by default is the time the data was downloaded. Running clean_data on the
same raw data archive twice will overwrite the previously cleaned version of
the same archive.
5.3 Weighting
Weighting is usually operationalized as part of cleaning (.wt) and is
performed before cleaning mutates.
5.3.1 Simple weighting schemes
If you are calculating simple weights based on columns in survey responses or
the cohort file, you should create your weight variable within the dynamic dots
(clean_data(...)). If you are depending on columns that you have not yet
created, create the weight column later in your clean_data call.
For example, if using
CDC’s CASPER,
you will generally be able to calculate weights based on cluster-level
information you can keep in the cohort file or refer to in your clean_data
call.
5.3.2 clean_weight
The author’s weighting approach is based on each individual participant’s estimated, adjusted probability of completing the survey if selected to participate. This approach relies on data in the cohort file, data in survey responses, data obtained from external sources (such as the U.S. Census), and calculated fields. It also requires a generalized linear model based on these variables.
clean_weight returns a customized function that takes the survey responses
and cohort file, performs temporary mutates, runs a logistic regression to
estimate adjusted probabilities of participation for all cohort members, and
outputs a probability (prob) and inverse probability weight (ipw) for
each respondent. clean_data calls this function and joins the
weight variables before performing its own (persistent) mutates and archiving
cleaned data.
formula is passed to
stats::glm.
glm is set to use a logit link and the binomial family. All other glm
options are set to the default.
.import allows you to provide a file path to a cluster-level data file that
is joined to the survey data for weighting according to the cluster identifier.
The file must contain a column titled Cluster with the cluster identifiers as
found in the cohort file. The author, for example, uses .import to include
Census demographic data for each cluster.
.filt allows you to filter data before weighting to limit which cohort members
are included in weighting. Filters are applied before weighting mutates or
modeling.
Arguments passed to ... are passed on to dplyr::mutate.
These are weighting mutates, so they are accessible to stats::glm for modeling
but are not included in clean_data output.
5.3.3 Custom weighting schemes
clusteR is highly extensible, including providing a framework to
create complex, custom weighting schemes and apply those weights to cleaned
data. If you develop a custom weighting function, you must source it (or
otherwise attach it) before passing it to .wt in clean_data.
Your custom weighting function must return a function with exactly one argument
(x), the dataframe of raw data, that itself returns a dataframe with at least
two columns: ID, a column of participant IDs, and at least one numeric column
of weights to be included with clean_data output. clean_data will join
weights according to ID.
5.4 export_data
You may wish to export your (raw or cleaned) survey responses to a file. You
can use export_data to create a CSV from archived survey data.
Arguments passed to ... are passed on to dplyr::select. Unlike
dplyr::select, leaving ... blank will include all columns, not no columns.
Otherwise, however, default dplyr::select behavior applies; if, for instance,
you wish to rename Q1 to Q01, running:
export_data("Q01" = Q1)
will export only Q01. Instead, you may need to run:
export_data(everything(), "Q01" = Q1)
.name allows you to specify the export file name. The file will be saved in
the Survey Data folder.
.dat allows you to specify a data archive to export. By default, export_data
uses the most recently cleaned data (preferred) or the most recently downloaded
data (backup).
.deid deidentifies the data export, removing Address and all columns in the
cohort file other than ID. If you need finer control of column inclusion,
pass these requirements to ... rather than enabling .deid.