3 Setup
Setup is the trickiest and least forgiving step in using clusteR. The most
critical information is available in R with ?setup. This page goes into
additional detail.
3.1 Location
setup generates an environment for clusteR. Before beginning setup, you need
to set your working directory to a dedicated folder for your project. You
should not reuse the same folder as past clusteR projects. You can, however,
include additional files and folders in this dedicated folder.
The easiest way to do this is to create an R project using RStudio. At the top right of RStudio, click the project dropdown and select New Project…. If you have already created a folder, select Existing Directory and specify it; otherwise, select New Directory and place it somewhere safe.
clusteR does not encrypt or obscure cohort files or survey data, and does not implement any HIPAA, FERPA, or other requirements, so you will need to set up clusteR in an appropriately secure location that your R installation can access.
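If you are not using RStudio projects, the same preparation can be done from the console. The following is a minimal sketch using only base R. The folder name is a placeholder, and tempdir() is used only so the example runs anywhere; a real project belongs in a secure, permanent location.

```r
# Hypothetical project folder. tempdir() is for illustration only;
# real cohort data should live somewhere secure and permanent.
project_dir <- file.path(tempdir(), "clusteR-example-project")

# Create the folder if it does not exist yet.
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Make it the working directory, remembering the previous one.
old_wd <- setwd(project_dir)

# A folder that already holds Cohort/config.rds belongs to an existing
# clusteR project and should not be reused for a new one.
has_config <- file.exists(file.path("Cohort", "config.rds"))
if (has_config) {
  warning("This folder already contains a clusteR configuration.")
}
```

Calling setwd(old_wd) afterwards returns you to the directory you started in.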
clusteR creates (but does not overwrite, if they already exist) Cohort, Contacts, Scripts, and Survey Data folders in the main working directory. clusteR will only write or modify files inside these folders, except a handful of HTML outputs written to the main working directory.
The most important file for your clusteR setup is Cohort/config.rds. Unless
you wish to reset your clusteR configuration and start over, you should not
delete or tamper with this file.
3.2 name, short_name
These are your project identifiers. Both should be strings. name will be used
primarily for formatted titles on reports, while short_name will be used
primarily for file name prefixes.
Importantly, your cohort data will be saved as a plain text file in the
Cohort folder named after short_name. It is rarely, if ever, a good idea
to tamper directly with the cohort file.
These are the only variables required for setup to run and exit successfully.
However, because so many other configured options are required for clusteR to
actually work, setup will provide warnings for other options you do not set.
3.3 state, county
These variables allow you to specify FIPS/INCITS codes, which clusteR uses to download U.S. Census TIGER/Line shapefiles for your region of interest. State codes and county codes are available online. These codes are required for all mapping functions.
You may specify multiple county codes, but you may only specify one state code, and all county codes must be within the same state. This feature is intended to accommodate North Carolina district health departments (see information from the UNC School of Government) but may be useful in other states.
As of version 0.1, clusteR is hard-coded to retrieve 2020 TIGER/Line shapefiles. Support for other years is planned in later versions.
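State FIPS codes are two digits and county FIPS codes are three digits, often with leading zeros, so it is safest to handle them as strings. The sketch below is a hypothetical helper, not part of clusteR, that checks the one-state, many-counties rule described above. Whether setup itself expects strings or numbers is not stated on this page, so check ?setup. The example uses North Carolina (37) with Durham (063) and Orange (135) counties.

```r
# Hypothetical helper: check that one state code and one or more county
# codes are well formed. Codes are strings so leading zeros survive.
check_fips <- function(state, county) {
  stopifnot(
    is.character(state), length(state) == 1,
    is.character(county), length(county) >= 1
  )
  if (!grepl("^[0-9]{2}$", state)) {
    stop("state must be a single two-digit code, e.g. \"37\"")
  }
  bad <- county[!grepl("^[0-9]{3}$", county)]
  if (length(bad) > 0) {
    stop("county codes must be three digits: ", paste(bad, collapse = ", "))
  }
  # Five-digit county identifiers: state code followed by county code.
  paste0(state, county)
}

check_fips(state = "37", county = c("063", "135"))
# returns "37063" "37135"
```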
3.4 geoids
If you have already selected U.S. Census blocks to serve as your clusters, you
will need a file to link cluster identifiers to U.S. Census GEOIDs. If you have
not selected them yet, you can do so using make_clusters or a similar
function.
3.4.1 make_clusters
make_clusters will generate a file for you that matches cluster identifiers
to GEOIDs. If you have already run setup previously, make_clusters will
obtain your state and county FIPS/INCITS codes from your configuration, if
available. Otherwise, you will need to specify them, or provide the names of
your state and county/counties of interest. Providing names will display
a message from tidycensus with the relevant FIPS/INCITS codes so you don’t
need to look them up.
3.4.2 Custom clusters
clusteR is highly extensible, and you can create custom functions or scripts to manage cluster selection for you. Your function or script must save a delimited text file with two columns: “geoid” (specifying the 15-digit Census block ID) and “cluster” (matching the cluster identifiers in your cohort file).
Most commonly, you will simply have your clusters in advance of using clusteR. In this case, save a delimited text file (probably a CSV) with columns as specified above. Your cluster identifiers can be anything you choose, but they must match the “Cluster” column in your cohort file.
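As a minimal sketch of the file described above, the following base R code writes a two-column CSV and checks that every GEOID has 15 digits. The GEOIDs and cluster identifiers are made-up placeholders, and the file name and location are assumptions for illustration.

```r
# Placeholder values, not real selections. A block GEOID is
# 2-digit state + 3-digit county + 6-digit tract + 4-digit block.
clusters <- data.frame(
  geoid   = c("370630001011000", "370630001011001", "370630002002003"),
  cluster = c("A01", "A02", "A03"),
  stringsAsFactors = FALSE
)

# Every GEOID must be exactly 15 digits.
stopifnot(all(grepl("^[0-9]{15}$", clusters$geoid)))

# Hypothetical output location; use your project folder in practice.
cluster_file <- file.path(tempdir(), "clusters.csv")
write.csv(clusters, cluster_file, row.names = FALSE)

# Read back as character so GEOIDs are not converted to numbers.
check <- read.csv(cluster_file, colClasses = "character")
```

Keeping GEOIDs as character strings matters because a numeric column would drop leading zeros for states with codes below 10 and may print long identifiers in scientific notation.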
3.5 setup_cohort
For clusteR to perform management tasks, it needs to maintain a master file of cohort members, their information, and their participation status. clusteR maintains a raw text file in the Cohort subdirectory with this information and automatically updates some of the information as survey responses come in.
3.5.1 Required fields
clusteR requires certain fields in the cohort file. These are:
- ID: A unique identifier, which can be in any format.
- Cluster: An identifier of the cluster a participant is in.
- Name: The participant’s name, which defaults to “Current Resident”.
- Mailing: The participant’s mailing street address.
- Physical: When the participant’s mailing address is not their residence, the participant’s actual street address of residence.
- City: The city of the mailing address.
- State: The state of the mailing address.
- ZIP: The ZIP code of the mailing address.
- Phone: The participant’s phone number.
- Email: The participant’s email address.
- Consent: The participant’s consent status (“Yes”, “No”, “Unknown”, “Do not contact”, or NA), defaulting to NA.
- Status: The participant’s completion status with enrollment information, which is automatically determined.
Most of these fields can be formatted however you like, but Consent is picky. You must use only the following options:
- Yes: The participant has participated and consented to be surveyed.
- No: The participant has participated or refused and did not consent to be surveyed.
- Unknown: The participant has participated, but did not return a consent.
- Do not contact: The participant wishes to be removed from the cohort; this is optional and “No” can be used instead.
- NA: The system default, which should be used when a prospective participant has not yet been reached.
update_cohort will throw an error if anything other than these options is
used. For an Alchemer survey (but, importantly, not a CSV file), you can enter
a key in setup_get_alc to convert your responses for you.
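To catch problems before update_cohort does, you can check a Consent column yourself. This is a minimal sketch of such a check, not clusteR's own validation; it assumes matching is exact and case-sensitive, and that missing values are real NA rather than the string "NA".

```r
# The four allowed strings; NA is also allowed and handled separately.
valid_consent <- c("Yes", "No", "Unknown", "Do not contact")

# Return the distinct values that are neither NA nor an allowed string.
invalid_consent <- function(x) {
  unique(x[!is.na(x) & !(x %in% valid_consent)])
}

# Example column with two problems: wrong case and an unlisted value.
consent <- c("Yes", "no", NA, "Unknown", "Do not contact", "Declined")
invalid_consent(consent)
# returns "no" "Declined"
```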
3.5.2 Input
During initial setup, you will need to specify a cohort input file, which must be a:
- .csv text file
- .xls or .xlsx Excel file
- .rds R data file
No other format is supported. There are two ways to provide this input file.
Pre-generate and provide (easy)
The easy way is to generate the file in advance and provide its file path to
setup as a string. The file should be pre-filled with as much participant
information as you have available.
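A pre-generated input file could be built like this. The sketch uses base R only, the participant rows are invented, and the file name is an assumption. This page does not say whether empty columns may be omitted, so all twelve required fields are included, with Consent and Status left as NA.

```r
# Two invented participants with every required field present.
cohort <- data.frame(
  ID       = c("P001", "P002"),
  Cluster  = c("A01", "A02"),
  Name     = c("Current Resident", "Current Resident"),
  Mailing  = c("100 Example St", "PO Box 200"),
  Physical = c(NA, "200 Sample Rd"),
  City     = c("Durham", "Durham"),
  State    = c("NC", "NC"),
  ZIP      = c("27701", "27701"),
  Phone    = NA_character_,
  Email    = NA_character_,
  Consent  = NA_character_,  # not yet reached
  Status   = NA_character_,  # determined automatically later
  stringsAsFactors = FALSE
)

# Hypothetical location; this path is what you would give to setup.
cohort_input <- file.path(tempdir(), "cohort_input.csv")
write.csv(cohort, cohort_input, row.names = FALSE)
```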
Generate during setup (harder)
The more complicated way is to write and run a function as part of setup that
will generate your cohort input file for you. This function can work however you
like, but must be available when you run setup and must return only the file
path, as a string, to the properly formatted input file. If you choose to
generate this file during setup, it is recommended to write your setup call in
an R script that you source rather than running setup interactively in the
console.
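A generator function along these lines would satisfy the requirement to return only a file path. This is a sketch under assumptions: the function name, its argument, and the invented rows are all hypothetical, and how the function is handed to setup is described in ?setup rather than here.

```r
# Hypothetical generator: builds the cohort, saves it as .rds, and
# returns nothing but the path to the saved file.
make_cohort_input <- function(out_dir = tempdir()) {
  cohort <- data.frame(
    ID       = sprintf("P%03d", 1:3),
    Cluster  = c("A01", "A01", "A02"),
    Name     = "Current Resident",
    Mailing  = c("100 Example St", "102 Example St", "7 Sample Rd"),
    Physical = NA_character_,
    City     = "Durham",
    State    = "NC",
    ZIP      = "27701",
    Phone    = NA_character_,
    Email    = NA_character_,
    Consent  = NA_character_,
    Status   = NA_character_,
    stringsAsFactors = FALSE
  )
  path <- file.path(out_dir, "cohort_input.rds")
  saveRDS(cohort, path)
  path  # a single string, as setup requires
}

input_path <- make_cohort_input()
```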
3.5.3 Cohort file storage
The cohort data file will be stored in the automatically generated Cohort
subdirectory and will be named after the short_name you provided in setup.
Editing this file by hand is not recommended. However, should you need to, it
is a CSV file that can be edited in a text editor or Excel.
Because this file is plain text, it should be stored in a secure location. clusteR does not have built-in protection for confidential information.
3.6 setup_get
A survey generates survey data, and in order to manage the survey, clusteR needs to connect to this data. clusteR natively supports two data sources, but other packages can add support for additional data sources, and end users can write functions to support a custom data source. The two default sources are:
- .csv files, connected with setup_get_csv.
- Alchemer surveys, connected with setup_get_alc.
Additional native sources will be documented here in future releases.
3.6.1 Comma-delimited (.csv) data source
The simplest data source in existence is the humble .csv file: a text file with values separated by commas and observations separated by newlines. These files are easy to create, modify, and view.
clusteR accepts comma-delimited files as data sources via setup_get_csv and get_csv. You will need:
- A permanent file path (given as a string) denoting the location of the input .csv file.
- Any arguments you wish to pass to read_csv to read your file correctly. (col_types is recommended!)
To set up a comma-delimited data source, you will need to run setup and
include this argument:
setup_get = setup_get_csv(file, ...)
Once setup is completed, clusteR saves this information. You will not need to do this again.
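To illustrate why col_types is worth specifying, the sketch below reads a small stand-in file with readr directly. The file contents and column choices are invented for the example. Since extra arguments are passed on to read_csv, a specification like survey_types could presumably be supplied as col_types in the call to setup_get_csv, but confirm this against the function's help page.

```r
library(readr)

# Small stand-in survey export written to a temporary file.
survey_file <- file.path(tempdir(), "survey_export.csv")
writeLines(
  c("ID,ZIP,Consent",
    "001,02134,Yes",
    "002,27701,No"),
  survey_file
)

# Declare identifier-like columns as character so nothing is guessed.
survey_types <- cols(
  ID      = col_character(),
  ZIP     = col_character(),
  Consent = col_character()
)

survey <- read_csv(survey_file, col_types = survey_types)
survey$ZIP
# returns "02134" "27701"
```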
3.6.2 Alchemer data source
Alchemer is a survey tool used widely by local public health agencies in the United States. Alchemer surveys are handled natively by clusteR, not least because the author uses Alchemer at his local health department.
clusteR accepts Alchemer survey data using Alchemer’s SurveyResponse API call via setup_get_alc and get_alc. You will need:
- An Alchemer account, with your survey already created.
- The Survey ID, which is the number that appears immediately after /id/ in the URL while you are building the survey.
- An API token passed as a string.
- An API token secret (password) passed as a string.
- A codebook (as a tab-delimited file) to convert Alchemer questions to dataframe columns.
To set up an Alchemer survey as your data source, you will need to run setup
and include this argument:
setup_get = setup_get_alc(survey_id, api_token, api_token_secret, codebook)
Once setup is completed, clusteR saves this information. You will not need to do this again.
3.6.3 Custom data sources
In general, because each data source and its format are unique, clusteR is very flexible with custom data sources. However, there are a few requirements (replace x with a shorthand for your data source):
- Your package or custom script must include a setup_get_x function that can be called in setup.
- Your setup_get_x function must return a list of named objects, which will be included in config.rds and loaded by clusteR as .cluster$cfg$setup_get.
- If your get function is found in a custom script (or if your get function requires additional, non-standard setup), you must include the file path to the script as a string named run_before in this list.
- Your package or custom script must include a get_x function that can be called by get_data and retrieve all its arguments (such as URLs, file paths, and API keys) from .cluster$cfg$setup_get.
- Your get_x function must be named (as a string) at .cluster$cfg$setup_get$get.
- Your get_x function must return a plain dataframe, tibble, or similar with at least the required cohort fields properly named (see vignette('setup_cohort')). get_data will handle the rest. Note that this dataframe does not have to be particularly clean, just meet these basic requirements.
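The requirements above can be sketched as a pair of functions. This is an illustration under assumptions, not a tested clusteR extension: the .cluster object is a stand-in created here so the code runs without clusteR, the path argument is hypothetical, and get_x is assumed to be called with no arguments.

```r
# Stand-in for the object clusteR loads when attached; defined here
# only so the sketch runs on its own.
.cluster <- new.env()

required <- c("ID", "Cluster", "Name", "Mailing", "Physical", "City",
              "State", "ZIP", "Phone", "Email", "Consent", "Status")

# Called inside setup: returns the named list stored in the configuration.
setup_get_x <- function(path) {
  list(
    get  = "get_x",  # name of the retrieval function, as a string
    path = path      # hypothetical argument that get_x needs later
    # run_before = "Scripts/get_x.R" would be added if get_x lived in a script
  )
}

# Called by get_data: takes its arguments from the stored configuration.
get_x <- function() {
  cfg <- .cluster$cfg$setup_get
  out <- read.csv(cfg$path, stringsAsFactors = FALSE)
  stopifnot(all(required %in% names(out)))
  out
}

# Simulate the round trip with a one-row example source file.
example <- as.data.frame(
  matrix(NA_character_, nrow = 1, ncol = length(required),
         dimnames = list(NULL, required)),
  stringsAsFactors = FALSE
)
example$ID <- "P001"
example_file <- file.path(tempdir(), "x_source.csv")
write.csv(example, example_file, row.names = FALSE)

.cluster$cfg <- list(setup_get = setup_get_x(example_file))
responses <- do.call(.cluster$cfg$setup_get$get, list())
```

Storing the function name as a string is what lets the retrieval step look it up later without knowing anything else about the data source.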
While there is nothing stopping you from implementing an alternative data
structure to the system default, it is not recommended. get_data will save and
archive data in an appropriate format that is compatible across data source
types and with all native clusteR functions.
3.6.4 Why not Excel?
You may be wondering why the author hasn’t created functions to get data from a Microsoft Excel source file, even though clusteR supports Excel for other tasks (like importing a cohort file). The reasons are:
- You can easily save your Excel file as a CSV.
- If you can’t easily save your Excel file as a CSV, it is probably because you have formatted your Excel file in a way that R (and therefore clusteR) will not handle well.
To prevent the confusing, difficult-to-troubleshoot issues that would
inevitably result from using clusteR with messy Excel files, clusteR requires
you to either make a CSV, which will make formatting problems more obvious to
you and force you to fix them sooner, or learn to write and implement your own
setup_get_excel and get_excel functions, which is considerably more
time-consuming.
3.7 Additional variables
As mentioned previously, clusteR is highly extensible, and your
extensions may need persistent variables. While setup will not perform any
validation on these additional variables, it will save them to your configuration.
If you have created (or required) custom variables, you can refer to them using
their names in the list .cluster$cfg, which is loaded whenever clusteR is
attached. For example, you can access the project name using
.cluster$cfg$name.
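Because setup does not validate custom variables, an extension may want a fallback when one is missing. The sketch below shows one way to do that. The .cluster object is a stand-in built here so the code runs without clusteR, and reminder_days and incentive_amount are hypothetical custom variables.

```r
# Stand-in configuration; in real use clusteR loads .cluster$cfg itself.
.cluster <- new.env()
.cluster$cfg <- list(
  name          = "Example County Health Survey",
  short_name    = "echs",
  reminder_days = 14  # hypothetical custom variable
)

# Look up a configured value, falling back to a default when it is absent.
cfg_value <- function(key, default = NULL) {
  value <- .cluster$cfg[[key]]
  if (is.null(value)) default else value
}

cfg_value("reminder_days")        # returns 14
cfg_value("incentive_amount", 0)  # returns 0, since it was never set
```

Using [[ ]] rather than $ here avoids R's partial matching of list names, so a misspelled or shortened key returns the default instead of silently matching another variable.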