This repository provides instructions for loading the CDC's BRFSS data for 2022 and 2023 into a PostgreSQL database and accessing it from RStudio for analysis.
Before getting into that, let me jump right into what is, perhaps, of greatest utility: the codebooks for 2023 and 2022 re-formatted in markdown. Markdown's navigable outline feature makes these documents much more manageable.
The schema presented here divides the BRFSS into multiple tables grouped by section name (as indicated in the LLCP codebook). The resulting SQL db for each year is around half the size of the XPT file. This size reduction, along with the ability to pull only the table or column you need into R, means a faster, more responsive analysis environment.
The CDC hosts the data file in either `.xpt` or `.asc` format. Both can be downloaded, zipped, from this location. Scroll down to the section entitled Data Files and choose the format you prefer. I worked with the `.xpt` file, as the `.asc` format requires a separate file (known as a dictionary file) to correctly assign column names and the like.
I found two methods for converting the XPT file to CSV:

1. Using a simple `.r` script in RStudio

   The `.r` file I used to convert the XPT file to CSV is included in this repository (a minimal sketch of the conversion follows this list). The csv it produced was slightly larger, but didn't contain extra rows, which made dealing with it slightly easier.
2. SAS Universal Viewer

   SAS produces the free SAS Universal Viewer, which will open `.xpt` files and export `.csv`. You'll need an account to download the tool. The csv file it produced was nearly 100 MB smaller than the version produced by RStudio; however, it contained 78 extra rows added to the bottom of the file.[^1]
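As promised above, here is a minimal sketch of the method-1 conversion. The `haven` package and the file names are assumptions for illustration; the script included in this repository may differ in its details.

```r
# Sketch: read the SAS transport file and write it back out as csv.
# Assumes the haven package is installed; file names are illustrative.
library(haven)

brfss <- read_xpt("LLCP2023.XPT")                    # ~433,323 rows for 2023
write.csv(brfss, "LLCP2023.csv", row.names = FALSE)  # blank values become "NA"
```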
As of the publication of this readme, both years represented here have columns present in the dataset but missing from the codebook (all are listed at the bottom of this readme). Dealing with these columns means never extracting them from the master csv in the first place. 2023 also has a variable in the codebook that is not in the dataset (`rcsborg1`); it was simply ignored when making the individual csv files.
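If you find yourself with the full dataset loaded in R, an equivalent approach is to drop the offending columns by name before any section files are made. A hedged sketch, using the 2022 offenders from the table at the bottom of this readme (`master` is an assumed data frame holding the full csv):

```r
# Columns present in the 2022 dataset but absent from the 2022 codebook
not_in_codebook <- c("diabage4", "numphon4", "cpdemo1c", "usemrjn4")

# Keep everything except those columns
master <- master[, !(names(master) %in% not_in_codebook)]
```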
My Postgres schema is based on the codebook, which lists the SAS Variable names along with the Section Name. Each Section is its own table, and each column is a SAS Variable in that section.
Taking the 2023 dataset as an example:
- Using the `brfss-var-sec-type-2023.csv` file provided in the supplemental folder, cut the master csv file into sections. In Terminal,

  ```bash
  cut -d ',' -f[column number(s)] input-file.csv > output-file.csv
  ```

  does the job. So

  ```bash
  cut -d ',' -f216,217,218,219,220,221,222,223,224,225,226,227,228 input-file.csv > ace.csv
  ```

  returns what will become the `ace` table as a csv. (An R alternative to cut is sketched after this list.)
- Once all individual csv files have been created and placed in the `/data` folder, run the `create_table_and_co.py` script, making sure the paths to your files and output are correct. This will produce the commands you'll use to create the tables in SQL and copy the data in using psql, or you can use the `.sql` files I generated.
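If you'd rather stay in R than shell out to cut, the same slice can be taken with ordinary column indexing. A sketch under the assumption that the master csv is already loaded as a data frame named `master` (hypothetical name):

```r
# Alternative to cut: pull the 2023 ace-section columns (216 through 228)
ace <- master[, 216:228]
write.csv(ace, "ace.csv", row.names = FALSE)
```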
The conversion from XPT to CSV converted the blank values to `NA`. Most of the columns in the SQL db are of datatype INTEGER, which means `NA` isn't compatible. I chose to convert each `NA` to `NULL` when loading the data into Postgres, and my psql commands reflect this (note the `NULL 'NA'` option in the `\copy` command below).
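If you generate the csv files yourself, an alternative is to write blanks instead of the literal string NA, since Postgres's CSV `COPY` treats an unquoted empty string as NULL by default. A sketch, assuming the readr package and a hypothetical data frame `ace`:

```r
# Write empty strings instead of "NA" so COPY's default NULL handling applies
library(readr)
write_csv(ace, "ace.csv", na = "")
```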
Install Docker Desktop and get the latest PostgreSQL image. The `docker-compose.yml` files found in the `brfss-2022` and `brfss-2023` directories are the instructions for starting up your Postgres db. The compose file is structured to give you a persistent db even if the container is stopped and restarted.
Caution
If you want to keep your Postgres data, spin the container down with `docker-compose down`. Do not use `docker-compose down -v` unless you want to scrub completely and re-initialize: the `-v` flag removes the persistent volumes. You can see these volumes with `docker volume ls`; nothing listed means no persistent docker volumes are present.
Since you started the docker container without an init script, the Postgres db exists but has no table or variable data in it. I solved this by adding the tables in DBeaver, then using psql in terminal to copy the data over. Here are the psql commands:
```bash
# first: find your container's name
docker ps

# second: open a psql session inside the container
docker exec -it your_container_name psql -U your_user_name -d your_db_name

# third: inside psql, copy each csv into its table, converting NA to NULL
\copy table_name ("column-name","column-name") FROM '/docker-entrypoint-initdb.d/file_name.csv' WITH (FORMAT CSV, HEADER, NULL 'NA');
```
Check to see if any of your tables are empty:

```sql
-- A 0 indicates no rows have been inserted; thus, an empty table
SELECT schemaname, relname, n_tup_ins
FROM pg_stat_all_tables
WHERE schemaname = 'public'
ORDER BY n_tup_ins;
```
Backup and restore the Postgres db:

```bash
# back up the sql db
docker exec -t your-container-name pg_dump -U username your-db-name > db_name_backup.sql

# restore the db (note: the restore pipes the dump back in through psql, not pg_dump)
docker exec -i your-container-name psql -U username -d your-db-name < db_name_backup.sql
```
Establish the db connection:

```r
# install needed packages
install.packages("DBI")
install.packages("RPostgres")

# call packages on script execution
library(DBI)
library(RPostgres)

# Replace with your actual database credentials
con <- dbConnect(RPostgres::Postgres(),
                 dbname = "your_database_name",
                 host = "your_host",       # e.g., "localhost" or an IP address
                 port = 5432,              # default PostgreSQL port
                 user = "your_username",
                 password = "your_password")
```
Call tables or variables as needed:

```r
# Replace 'your_table_name' with the actual table name
data <- dbReadTable(con, "your_table_name")

# Replace 'your_table_name' with the actual table name and specify the columns you want
query <- "SELECT column1, column2 FROM your_table_name"
data <- dbGetQuery(con, query)
```
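DBI also supports parameterized queries, which keep values out of the SQL string when filtering rows. A small sketch using the same placeholder names (RPostgres uses `$1`-style placeholders):

```r
# Fetch only the rows where column1 matches a bound value
query <- "SELECT column1, column2 FROM your_table_name WHERE column1 = $1"
data  <- dbGetQuery(con, query, params = list(1))
```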
Once finished, disconnect:

```r
dbDisconnect(con)
```
The following supplemental material is provided for each year.
Codebook: The codebook gives the variable name, location, and frequency of values for all reporting areas combined for the landline and cell phone data set. The CDC distributes the codebook as an `.html` file, which can be found on their site. Using Python scripts, I've converted the codebook into a `.md` file with section and question headers, which greatly reduces seek time when looking up a variable, its SAS code, and its answer codes.
Discrepancies between the CDC's codebook and the dataset existed for both years. I omitted from my final SQL database the columns that didn't appear in the codebook.
For 2022:

SAS Variable | 1-Based Column No. | In Codebook | In Dataset | Remediation
---|---|---|---|---
`diabage4` | 57 | No | Yes | Not included in db
`numphon4` | 62 | No | Yes | Not included in db
`cpdemo1c` | 63 | No | Yes | Not included in db
`usemrjn4` | 210 | No | Yes | Not included in db
For 2023:

SAS Variable | 1-Based Column No. | In Codebook | In Dataset | Remediation
---|---|---|---|---
`rcsborg1` | NA | Yes | No | Omitted
`lndsxbrt` | 19 | No | Yes | Not included in db
`celsxbrt` | 25 | No | Yes | Not included in db
`birthsex` | 205 | No | Yes | Not included in db
`trnsgndr` | 208 | No | Yes | Not included in db
`usemrjn4` | 215 | No | Yes | Not included in db
`rcsgend1` | 252 | No | Yes | Not included in db
`rcsxbrth` | 253 | No | Yes | Not included in db
[^1]: Taking the LLCP2023.XPT file as an example: the original XPT file, when read into RStudio, contains 433,323 rows. The csv file produced by SAS Universal Viewer contains 78 more rows (lines 433,325 through 433,402), most likely due to metadata being written in as extra rows. A quick check in terminal of the csv created by the SAS viewer using `sed -n '433325p' file-name.csv` shows empty data for that row. Use `sed -n '433324p' file-name.csv` to see the difference. Line 433,324 is the last line of BRFSS data (the csv's line count includes the header row, while R doesn't count it when reading in an XPT, hence the one-line difference). These extra rows need to be trimmed off before going further. For this reason I stuck with the R script.