This repository provides instructions for loading the CDC's BRFSS data for 2022 and 2023 into a PostgreSQL database and accessing it from RStudio for analysis.
Before getting into that, let me jump right into what is, perhaps, of greatest utility: the codebooks for 2023 and 2022 re-formatted in markdown. Markdown's navigable outline feature makes these documents much more manageable.
The schema presented here divides the BRFSS into multiple tables grouped by section name (as indicated in the LLCP codebook). The resulting SQL db for each year is around half the size of the XPT file. This size reduction, along with the ability to pull only the table or column you need into R, means a faster, more responsive analysis environment.
The CDC hosts the data file in either `.xpt` or `.asc` format. Both can be downloaded, zipped, from this location. Scroll down to the section entitled Data Files and choose the format you prefer. I worked with the `.xpt` file, as the `.asc` format requires a separate file (known as a dictionary file) to correctly assign column names and the like.
I found two methods for converting the XPT file to CSV:

1. Using a simple `.r` script in RStudio

   The `.r` file I used to convert the XPT file to CSV is included in this repository (a minimal sketch of the conversion follows this list). The csv it produced was slightly larger, but didn't contain extra rows, which made dealing with it slightly easier.
2. SAS Universal Viewer

   SAS produces the free SAS Universal Viewer, which will open `.xpt` files and export `.csv`. You'll need an account to download the tool. The csv file it produced was nearly 100 MB smaller than the version produced by RStudio; however, it contained 78 extra rows added to the bottom of the file.[^1]
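As promised above, here is a minimal sketch of the method-1 conversion. The `haven` package and the file names are assumptions for illustration; the script included in this repository may differ in its details.

```r
# Sketch: read the SAS transport file and write it back out as csv.
# Assumes the haven package is installed; file names are illustrative.
library(haven)

brfss <- read_xpt("LLCP2023.XPT")                    # ~433,323 rows for 2023
write.csv(brfss, "LLCP2023.csv", row.names = FALSE)  # blank values become "NA"
```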
As of the publication of this readme, both years represented here have columns present in the dataset but missing from the codebook (all are listed at the bottom of this readme). Dealing with these columns means never extracting them from the master csv in the first place. 2023 also has a variable in the codebook that is not in the dataset (`rcsborg1`); it was simply ignored when making the individual csv files.
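If you find yourself with the full dataset loaded in R, an equivalent approach is to drop the offending columns by name before any section files are made. A hedged sketch, using the 2022 offenders from the table at the bottom of this readme (`master` is an assumed data frame holding the full csv):

```r
# Columns present in the 2022 dataset but absent from the 2022 codebook
not_in_codebook <- c("diabage4", "numphon4", "cpdemo1c", "usemrjn4")

# Keep everything except those columns
master <- master[, !(names(master) %in% not_in_codebook)]
```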
My Postgres schema is based on the codebook, which lists the SAS Variable names along with the Section Name. Each Section is its own table, and each column is a SAS Variable in that section.
Taking the 2023 dataset as an example:
- Using the `brfss-var-sec-type-2023.csv` file provided in the supplemental folder, cut the master csv file into sections. In Terminal,

  ```bash
  cut -d ',' -f[column number(s)] input-file.csv > output-file.csv
  ```

  does the job. So

  ```bash
  cut -d ',' -f216,217,218,219,220,221,222,223,224,225,226,227,228 input-file.csv > ace.csv
  ```

  returns what will become the `ace` table as a csv. (An R alternative to cut is sketched after this list.)
- Once all individual csv files have been created and placed in the `/data` folder, run the `create_table_and_co.py` script, making sure the paths to your files and output are correct. This will produce the commands you'll use to create the tables in SQL and copy the data in using psql, or you can use the `.sql` files I generated.
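If you'd rather stay in R than shell out to cut, the same slice can be taken with ordinary column indexing. A sketch under the assumption that the master csv is already loaded as a data frame named `master` (hypothetical name):

```r
# Alternative to cut: pull the 2023 ace-section columns (216 through 228)
ace <- master[, 216:228]
write.csv(ace, "ace.csv", row.names = FALSE)
```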
The conversion from XPT to CSV converted the blank values to `NA`. Most of the columns in the SQL db are of datatype INTEGER, which means `NA` isn't compatible. I chose to convert each `NA` to `NULL` when loading the data into Postgres, and my psql commands reflect this (note the `NULL 'NA'` option in the `\copy` command below).
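If you generate the csv files yourself, an alternative is to write blanks instead of the literal string NA, since Postgres's CSV `COPY` treats an unquoted empty string as NULL by default. A sketch, assuming the readr package and a hypothetical data frame `ace`:

```r
# Write empty strings instead of "NA" so COPY's default NULL handling applies
library(readr)
write_csv(ace, "ace.csv", na = "")
```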
Install Docker Desktop and get the latest PostgreSQL image. The `docker-compose.yml` files found in the `brfss-2022` and `brfss-2023` directories are the instructions for starting up your Postgres db. The compose file is structured to give you a persistent db even if the container is stopped and restarted.
Caution
If you want to keep your Postgres data, spin the container down with `docker-compose down`. Do not use `docker-compose down -v` unless you want to scrub completely and re-initialize: the `-v` flag removes the persistent volumes. You can see these volumes with `docker volume ls`; nothing listed means no persistent docker volumes are present.
Since you started the docker container without an init script, the Postgres db exists but has no table or variable data in it. I solved this by adding the tables in DBeaver, then using psql in terminal to copy the data over. Here are the psql commands:
```bash
# first: find your container's name
docker ps

# second: open a psql session inside the container
docker exec -it your_container_name psql -U your_user_name -d your_db_name

# third: inside psql, copy each csv into its table, converting NA to NULL
\copy table_name ("column-name","column-name") FROM '/docker-entrypoint-initdb.d/file_name.csv' WITH (FORMAT CSV, HEADER, NULL 'NA');
```
Check to see if any of your tables are empty:

```sql
-- A 0 indicates no rows have been inserted; thus, an empty table
SELECT schemaname, relname, n_tup_ins
FROM pg_stat_all_tables
WHERE schemaname = 'public'
ORDER BY n_tup_ins;
```
Backup and restore the Postgres db:

```bash
# back up the sql db
docker exec -t your-container-name pg_dump -U username your-db-name > db_name_backup.sql

# restore the db (note: the restore pipes the dump back in through psql, not pg_dump)
docker exec -i your-container-name psql -U username -d your-db-name < db_name_backup.sql
```
Establish the db connection:

```r
# install needed packages
install.packages("DBI")
install.packages("RPostgres")

# call packages on script execution
library(DBI)
library(RPostgres)

# Replace with your actual database credentials
con <- dbConnect(RPostgres::Postgres(),
                 dbname = "your_database_name",
                 host = "your_host",       # e.g., "localhost" or an IP address
                 port = 5432,              # default PostgreSQL port
                 user = "your_username",
                 password = "your_password")
```
Call tables or variables as needed:

```r
# Replace 'your_table_name' with the actual table name
data <- dbReadTable(con, "your_table_name")

# Replace 'your_table_name' with the actual table name and specify the columns you want
query <- "SELECT column1, column2 FROM your_table_name"
data <- dbGetQuery(con, query)
```
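DBI also supports parameterized queries, which keep values out of the SQL string when filtering rows. A small sketch using the same placeholder names (RPostgres uses `$1`-style placeholders):

```r
# Fetch only the rows where column1 matches a bound value
query <- "SELECT column1, column2 FROM your_table_name WHERE column1 = $1"
data  <- dbGetQuery(con, query, params = list(1))
```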
Once finished, disconnect:

```r
dbDisconnect(con)
```
The following supplemental material is provided for each year.
Codebook: The codebook gives the variable name, location, and frequency of values for all reporting areas combined for the landline and cell phone data set. The CDC distributes the codebook as an `.html` file, which can be found on their site. Using Python scripts, I've converted the codebook into a `.md` file with section and question headers, which greatly reduces seek time when looking up a variable, its SAS code, and its answer codes.
Discrepancies between the CDC's codebook and the dataset existed for both years. I omitted from my final SQL database the columns that didn't appear in the codebook.
For 2022:

SAS Variable | 1-Based Column No. | In Codebook | In Dataset | Remediation
---|---|---|---|---
`diabage4` | 57 | No | Yes | Not included in db
`numphon4` | 62 | No | Yes | Not included in db
`cpdemo1c` | 63 | No | Yes | Not included in db
`usemrjn4` | 210 | No | Yes | Not included in db
For 2023:

SAS Variable | 1-Based Column No. | In Codebook | In Dataset | Remediation
---|---|---|---|---
`rcsborg1` | NA | Yes | No | Omitted
`lndsxbrt` | 19 | No | Yes | Not included in db
`celsxbrt` | 25 | No | Yes | Not included in db
`birthsex` | 205 | No | Yes | Not included in db
`trnsgndr` | 208 | No | Yes | Not included in db
`usemrjn4` | 215 | No | Yes | Not included in db
`rcsgend1` | 252 | No | Yes | Not included in db
`rcsxbrth` | 253 | No | Yes | Not included in db
[^1]: Taking the LLCP2023.XPT file as an example: the original XPT file, when read into RStudio, contains 433,323 rows. The csv file produced by SAS Universal Viewer contains 78 more rows (lines 433,325 through 433,402), most likely due to metadata being written in as extra rows. A quick check in terminal of the csv created by the SAS viewer using `sed -n '433325p' file-name.csv` shows empty data for that row. Use `sed -n '433324p' file-name.csv` to see the difference. Line 433,324 is the last line of BRFSS data (the csv's line count includes the header row, while R doesn't count it when reading in an XPT, hence the one-line difference). These extra rows need to be trimmed off before going further. For this reason I stuck with the R script.