The project uses descriptive statistics, dimensionality reduction with PCA, pattern mining, and clustering with unsupervised machine learning algorithms such as K-means and K-modes. SARIMAX was used for time series prediction of air pollutant values.
- To identify the states with the highest NO2, SO2, O3, and CO values, we will use statistical ranking methods such as the mean to find the top 10 states by pollutant value from 2000 to 2016.
- We will use either Spearman rank correlation or Pearson correlation to check whether there is any correlation among the top 10 ranked states in the US.
- We will use K-means clustering to check for clustering patterns among NO2, CO, SO2, and O3, to see whether such a pattern gives insight into the state of US air pollution.
- We will also cluster US cities accordingly and plot them on a US map. This will help us visualize which cities are clustered together and whether location influences air pollution.
- To check whether regions with higher pollutant values affect neighboring regions, we will plot the average pollutant value:
  - State-wise from 2000 to 2016, to see the variation among all pollutants.
  - Month by month, to visualize how air pollution values disperse over time and check for a neighboring effect, which we strongly expect to find.
- We will use regression to predict pollutant values, and multiple regression to measure the strength of the relationship between the pollutant variable (the dependent variable in this case) and the other independent variables.
- We will be doing a time series analysis to check if there is a seasonal pattern for air pollutant values.
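
The ranking and correlation steps in the list above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the record layout `(state, pollutant, value)`, the helper names `top_states` and `spearman_rho`, and all sample numbers are assumptions made up for the example. The Spearman helper uses the simple no-ties formula, so it assumes both rankings cover the same states with no tied ranks.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (state, pollutant, AQI value). All values are made up.
records = [
    ("Arizona", "NO2", 30.0), ("Arizona", "NO2", 34.0), ("Arizona", "CO", 9.0),
    ("Texas", "NO2", 20.0), ("Texas", "NO2", 22.0), ("Texas", "CO", 5.0),
    ("Ohio", "NO2", 25.0), ("Ohio", "NO2", 27.0), ("Ohio", "CO", 7.0),
    ("Utah", "NO2", 18.0), ("Utah", "CO", 8.0),
]


def top_states(records, pollutant, n=10):
    """Rank states by the mean value of one pollutant, highest first."""
    by_state = defaultdict(list)
    for state, name, value in records:
        if name == pollutant:
            by_state[state].append(value)
    means = {state: mean(values) for state, values in by_state.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:n]


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same items.

    Uses the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    n = len(rank_a)
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


no2_top = top_states(records, "NO2")
co_top = top_states(records, "CO")

# Convert each ordered list into {state: rank}, where rank 1 is the highest mean.
no2_rank = {state: i for i, (state, _) in enumerate(no2_top, start=1)}
co_rank = {state: i for i, (state, _) in enumerate(co_top, start=1)}

rho = spearman_rho(no2_rank, co_rank)
print(no2_top)
print(round(rho, 3))
```

A value of `rho` near 1 would mean that the states rank similarly for both pollutants, while a value near 0 would mean the two rankings are largely unrelated.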
Air pollutant trends
All pollutants show a downward trend from 2000 to 2016. This is encouraging, as it suggests that the United States has been taking effective measures to curb air pollutant levels over the years.
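
One simple way to quantify such a trend is the slope of a least-squares line fitted to yearly averages. The sketch below is only an illustration under stated assumptions: the helper `linear_trend` is a hypothetical name, and the yearly values are made-up numbers, not results from this project.

```python
def linear_trend(years, values):
    """Return the least-squares slope and intercept of values over years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical yearly mean NO2 AQI values (made-up numbers, not project results).
years = [2000, 2004, 2008, 2012, 2016]
no2_yearly_mean = [30.0, 27.0, 25.0, 22.0, 20.0]

slope, intercept = linear_trend(years, no2_yearly_mean)
# A negative slope indicates a downward trend over the period.
print(f"slope per year: {slope:.3f}")
```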
Top pollutant states
Air Pollutant values are affected by nearby regions
As we can see, when the pollutant level rises in one area, it starts rising in nearby areas as well. Likewise, when the pollutant level starts decreasing, nearby areas reflect this change.
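
A rough numerical check of this co-movement is to correlate the month-over-month changes of two neighboring regions. The sketch below assumes two hypothetical monthly series with made-up values; `pearson` and `monthly_changes` are illustrative helper names, not part of the project. A high correlation shows that the two series move together, but on its own it does not show that one region causes the change in the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def monthly_changes(series):
    """Difference between each month and the month before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical monthly mean NO2 AQI for two neighboring states (made-up values).
state_a = [20.0, 24.0, 29.0, 27.0, 22.0, 18.0]
state_b = [15.0, 18.0, 23.0, 22.0, 17.0, 14.0]

r = pearson(monthly_changes(state_a), monthly_changes(state_b))
print(f"correlation of monthly changes: {r:.3f}")
```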
City Clustering
The map shows the distribution of the different clusters across the US.
- Cluster 0, shown in red, has the majority of its points on the East Coast. There is a neighboring effect within this cluster, and this study indicates that air pollutant values on the East Coast are somewhat higher.
- Cluster 1, shown in violet, is scattered across the entire US.
- Similarly, Cluster 2, shown in light blue, is scattered across the US like Cluster 1.
- Finally, Cluster 3, shown in green, falls mostly on the West Coast.
Moreover, Cluster 0 denoted by Red
- Has 2nd Highest average NO2 AQI values (less than Cluster 2)
- Has Highest average O3 AQI values
- Has Highest average SO2 AQI values
- Has 3rd Highest average CO AQI values (less than Cluster 2 and Cluster 3)
Cluster 1 denoted by Violet
- Has Lowest average NO2 AQI values
- Has 2nd Highest average O3 AQI values (less than Cluster 0)
- Has Lowest average SO2 AQI values
- Has Lowest average CO AQI values
Cluster 2 denoted by Light Blue
- Has Highest average NO2 AQI values
- Has 3rd Highest average O3 AQI values
- Has 3rd Highest average SO2 AQI values
- Has 2nd Highest average CO AQI values (less than Cluster 3)
Cluster 3 denoted by Green
- Has 3rd Highest average NO2 AQI values
- Has Lowest average O3 AQI values
- Has 2nd Highest average SO2 AQI values
- Has Highest average CO AQI values
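
Cluster profiles like the ones above can be derived by averaging each pollutant within each cluster and then ordering the clusters by that average. The following is a minimal sketch under stated assumptions: the row layout `(cluster label, {pollutant: AQI})`, the helper names `cluster_means` and `rank_clusters`, and all values are made up for illustration and are not the project's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (cluster label, {pollutant: AQI value}). Values are made up.
rows = [
    (0, {"NO2": 30.0, "O3": 40.0}),
    (0, {"NO2": 32.0, "O3": 44.0}),
    (1, {"NO2": 10.0, "O3": 35.0}),
    (1, {"NO2": 12.0, "O3": 37.0}),
    (2, {"NO2": 40.0, "O3": 30.0}),
    (3, {"NO2": 20.0, "O3": 20.0}),
]


def cluster_means(rows):
    """Average each pollutant within each cluster."""
    grouped = defaultdict(lambda: defaultdict(list))
    for label, values in rows:
        for pollutant, value in values.items():
            grouped[label][pollutant].append(value)
    return {
        label: {pollutant: mean(vals) for pollutant, vals in pollutants.items()}
        for label, pollutants in grouped.items()
    }


def rank_clusters(means, pollutant):
    """Return cluster labels ordered from highest to lowest mean value."""
    return sorted(means, key=lambda label: means[label][pollutant], reverse=True)


means = cluster_means(rows)
for pollutant in ("NO2", "O3"):
    print(pollutant, rank_clusters(means, pollutant))
```

The first label in each printed list is the cluster with the highest average for that pollutant, and the last is the one with the lowest.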