Skip to content

soph-py/sta141b_final

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Time Series Analysis of US COVID-19 Data

STA141B Final Project - UC Davis

Fall Quarter 2021

Professor Farris

Contributors to this repo: Niraj Bangari @ndbang, Marvin Pepito @mjpepito, Sophia Tierney @sophiatierney

Selected categories represented by this project

  • 1. Project organization, writeup readability, overall conclusions
  • 3. Custom and efficient data processing scripts
  • 4. Data munging: pandas filtering, joining, grouping, transforming; handle missing values, timestamps; set indices, extract custom features
  • 5. Data visualization
  • 7. Data storage: SQLite database

Overview

Project Directory Structure

  • /notebooks/ directory contains statistical analysis, time series forecasting, and visualizations
  • /code/ directory contains source code for data extraction, data cleaning, and data manipulation
  • /data/ contains the cleaned data in csv format, ready to import into a jupyter notebook, and also contains an in-memory sqlite database

All data used for this project was sourced from the COVID-19 github repository, owned by the Center for Systems Science and Engineering at Johns Hopkins University which can be found here.

We gathered, transformed, and merged a total of 4 time series datasets which are updated with new data daily. First we combined 2 vaccine data sources, this one containing the number of vaccine doses, broken down into 1st_dose and 2nd_dose, with this containing the daily reports of vaccines administered in wide format. Then, we combined the confirmed cases and cumulative deaths. After processing, cleaning, and transforming the 4 data sources, we merged them into a pandas dataframe and saved the result to a csv file and also loaded into an in memory sqlite database using sqlalchemy & sqlite.

The range of date observations were are interested in analyzing is from 01/03/21 to date, 12/09/21. Note that 12/17/20 is the date when vaccines began rolling out to the public in the United states, however many states did not begin reporting vaccine counts until about a month later.

When merging the cases and deaths data with vaccine data, we filled in na values with 0 for date observations before states began reporting vaccine administration, for easier processing and computation.

On 11/19/21, booster shots began to be administered and reported. So note that from that date forward, the sum of the 1st dose and 2nd dose counts do not total to the total administered. We attempt to estimate the number of booster shots by summing 1st doses and 2nd doses and subtracting this from the total doses administered.

To access the data, you can either download the csv file, or you can query the database. The columns contained in the database table are as follows:

  • fips
  • state
  • pop (based of 2019 Census Data)
  • date
  • total_doses_state_level
  • cum_deaths
  • confirmed
  • first_dose
  • sec_dose
  • est_booster_doses

Say we want to extract all rows from our sql table named covid, and we want them sorted in descending order by date, and grouped uniquely by state.

An example query to the database would be:

SELECT COUNT(*), 
   FROM covid,
   GROUP BY state,
   ORDER BY date DESC; 

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published