Time Series Analysis of US COVID-19 Data

STA141B Final Project - UC Davis

Fall Quarter 2021

Professor Farris

Contributors to this repo: Niraj Bangari @ndbang, Marvin Pepito @mjpepito, Sophia Tierney @sophiatierney

Selected categories represented by this project

1. Project organization, writeup readability, overall conclusions
3. Custom and efficient data processing scripts
4. Data munging: pandas filtering, joining, grouping, transforming; handle missing values, timestamps; set indices, extract custom features
5. Data visualization
7. Data storage: SQLite database

Overview

Project Directory Structure

/notebooks/ directory contains statistical analysis, time series forecasting, and visualizations
/code/ directory contains source code for data extraction, data cleaning, and data manipulation
/data/ contains the cleaned data in csv format, ready to import into a jupyter notebook, and also contains an in-memory sqlite database

All data used for this project was sourced from the COVID-19 github repository, owned by the Center for Systems Science and Engineering at Johns Hopkins University which can be found here.

We gathered, transformed, and merged a total of 4 time series datasets which are updated with new data daily. First we combined 2 vaccine data sources, this one containing the number of vaccine doses, broken down into 1st_dose and 2nd_dose, with this containing the daily reports of vaccines administered in wide format. Then, we combined the confirmed cases and cumulative deaths. After processing, cleaning, and transforming the 4 data sources, we merged them into a pandas dataframe and saved the result to a csv file and also loaded into an in memory sqlite database using sqlalchemy & sqlite.

The range of date observations were are interested in analyzing is from 01/03/21 to date, 12/09/21. Note that 12/17/20 is the date when vaccines began rolling out to the public in the United states, however many states did not begin reporting vaccine counts until about a month later.

When merging the cases and deaths data with vaccine data, we filled in na values with 0 for date observations before states began reporting vaccine administration, for easier processing and computation.

On 11/19/21, booster shots began to be administered and reported. So note that from that date forward, the sum of the 1st dose and 2nd dose counts do not total to the total administered. We attempt to estimate the number of booster shots by summing 1st doses and 2nd doses and subtracting this from the total doses administered.

To access the data, you can either download the csv file, or you can query the database. The columns contained in the database table are as follows:

fips
state
pop (based of 2019 Census Data)
date
total_doses_state_level
cum_deaths
confirmed
first_dose
sec_dose
est_booster_doses

Say we want to extract all rows from our sql table named covid, and we want them sorted in descending order by date, and grouped uniquely by state.

An example query to the database would be:

SELECT COUNT(*), 
   FROM covid,
   GROUP BY state,
   ORDER BY date DESC;

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
code		code
data		data
notebooks		notebooks
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Time Series Analysis of US COVID-19 Data

STA141B Final Project - UC Davis

Fall Quarter 2021

Professor Farris

Selected categories represented by this project

Overview

Project Directory Structure

About

Releases

Packages

Languages

soph-py/sta141b_final

Folders and files

Latest commit

History

Repository files navigation

Time Series Analysis of US COVID-19 Data

STA141B Final Project - UC Davis

Fall Quarter 2021

Professor Farris

Selected categories represented by this project

Overview

Project Directory Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages