Contributors to this repo: Niraj Bangari @ndbang, Marvin Pepito @mjpepito, Sophia Tierney @sophiatierney
- 1. Project organization, writeup readability, overall conclusions
- 3. Custom and efficient data processing scripts
- 4. Data munging: pandas filtering, joining, grouping, transforming; handle missing values, timestamps; set indices, extract custom features
- 5. Data visualization
- 7. Data storage: SQLite database
- /notebooks/ directory contains statistical analysis, time series forecasting, and visualizations
- /code/ directory contains source code for data extraction, data cleaning, and data manipulation
- /data/ contains the cleaned data in csv format, ready to import into a jupyter notebook, and also contains an in-memory sqlite database
All data used for this project was sourced from the COVID-19 github repository, owned by the Center for Systems Science and Engineering at Johns Hopkins University which can be found here.
We gathered, transformed, and merged a total of 4
time series datasets which are updated with new data daily. First we combined 2
vaccine data sources, this one containing the number of vaccine doses, broken down into 1st_dose and 2nd_dose, with this containing the daily reports of vaccines administered in wide format. Then, we combined the confirmed cases and cumulative deaths. After processing, cleaning, and transforming the 4
data sources, we merged them into a pandas dataframe and saved the result to a csv file and also loaded into an in memory sqlite database using sqlalchemy & sqlite.
The range of date observations were are interested in analyzing is from 01/03/21 to date, 12/09/21. Note that 12/17/20 is the date when vaccines began rolling out to the public in the United states, however many states did not begin reporting vaccine counts until about a month later.
When merging the cases and deaths data with vaccine data, we filled in na values with 0
for date observations before states began reporting vaccine administration, for easier processing and computation.
On 11/19/21, booster shots began to be administered and reported. So note that from that date forward, the sum of the 1st dose and 2nd dose counts do not total to the total administered. We attempt to estimate the number of booster shots by summing 1st doses and 2nd doses and subtracting this from the total doses administered.
To access the data, you can either download the csv file, or you can query the database. The columns contained in the database table are as follows:
- fips
- state
- pop (based of 2019 Census Data)
- date
- total_doses_state_level
- cum_deaths
- confirmed
- first_dose
- sec_dose
- est_booster_doses
Say we want to extract all rows from our sql table named covid, and we want them sorted in descending order by date, and grouped uniquely by state.
An example query to the database would be:
SELECT COUNT(*),
FROM covid,
GROUP BY state,
ORDER BY date DESC;