Analysis: Air Quality in India
Let’s explore air quality data of India. This dataset has daily pollutant readings for the period 2015-2020.
Goal of this analysis:
Understand how is the trend of these pollutants over the years
How are the two-tier cities coping up with the increased level of infrastructure development and vehicular pollution due to increased human migration.
Source of the Data:
https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=city_day.csv
Process of calculating air quality index:
https://www.kaggle.com/rohanrao/calculating-aqi-air-quality-index-tutorial
What are the safe limits of these pollutants in air for India?
http://www.arthapedia.in/index.php?title=Ambient_Air_Quality_Standards_in_India
About the Dataset:
The dataset has air quality readings of major India cities for the period 2015-2020.
This dataset captures the air quality parameters of PM2.5, PM10, Nitrogen Oxide (NO), Nitrogen Dioxide (NO2), Ammonia (NH3), Carbon Monoxide (CO), Sulphur Dioxide (SO2), Ozone (O3), Benzene, Toulene, Xylene.
Computed Air Quality Index included in the dataset
Population estimates of these cities were collected separately and added to the dataset to understand the role of population on air quality.
Load Datasets
Load Pollution and Population Estimates datasets.
# Load pollution data
pollution <- read.csv('https://raw.githubusercontent.com/learning-monk/datasets/master/ENVIRONMENT/Indian_cities_daily_pollution_2015-2020.csv',
stringsAsFactor=FALSE, na.strings=c(""))
# Load population estimates data
pop_est <- read.csv('https://raw.githubusercontent.com/learning-monk/datasets/master/ENVIRONMENT/cities%20with%20population%20estimates%202019.csv',
stringsAsFactor=FALSE)
# Let's see the structure of the dataset
str(pollution)
## 'data.frame': 29531 obs. of 16 variables:
## $ City : chr "Ahmedabad" "Ahmedabad" "Ahmedabad" "Ahmedabad" ...
## $ Date : chr "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
## $ PM2.5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ PM10 : num NA NA NA NA NA NA NA NA NA NA ...
## $ NO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ NO2 : num 18.2 15.7 19.3 18.5 21.4 ...
## $ NOx : num 17.1 16.5 29.7 18 37.8 ...
## $ NH3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ SO2 : num 27.6 24.6 29.1 18.6 39.3 ...
## $ O3 : num 133.4 34.1 30.7 36.1 39.3 ...
## $ Benzene : num 0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
## $ Toluene : num 0.02 5.5 16.4 10.14 18.89 ...
## $ Xylene : num 0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
## $ AQI : int NA NA NA NA NA NA NA NA NA NA ...
## $ AQI_Bucket: chr NA NA NA NA ...
# Convert date column to Date format
pollution$Date <- as.Date(pollution$Date)
Merge population estimates and pollution datasets into a Combined dataset
combined <- merge(pollution, pop_est, by = "City")
str(combined)
## 'data.frame': 29531 obs. of 21 variables:
## $ City : chr "Ahmedabad" "Ahmedabad" "Ahmedabad" "Ahmedabad" ...
## $ Date : Date, format: "2015-01-01" "2015-01-02" ...
## $ PM2.5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ PM10 : num NA NA NA NA NA NA NA NA NA NA ...
## $ NO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ NO2 : num 18.2 15.7 19.3 18.5 21.4 ...
## $ NOx : num 17.1 16.5 29.7 18 37.8 ...
## $ NH3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ SO2 : num 27.6 24.6 29.1 18.6 39.3 ...
## $ O3 : num 133.4 34.1 30.7 36.1 39.3 ...
## $ Benzene : num 0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
## $ Toluene : num 0.02 5.5 16.4 10.14 18.89 ...
## $ Xylene : num 0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
## $ AQI : int NA NA NA NA NA NA NA NA NA NA ...
## $ AQI_Bucket : chr NA NA NA NA ...
## $ State : chr "Gujarat" "Gujarat" "Gujarat" "Gujarat" ...
## $ Part_of_India : chr "West India" "West India" "West India" "West India" ...
## $ Metropolitan_Area : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Population_as_per_2011_census: int 5570585 5570585 5570585 5570585 5570585 5570585 5570585 5570585 5570585 5570585 ...
## $ Population_Estimate_2019 : int 8644400 8644400 8644400 8644400 8644400 8644400 8644400 8644400 8644400 8644400 ...
# Convert newly added columns to factors
combined$State <- as.factor(combined$State)
combined$Part_of_India <- as.factor(combined$Part_of_India)
combined$Metropolitan_Area <- as.factor(combined$Metropolitan_Area)
# Convert City and AQI_Bucket to factor variables and Date column to Date
combined$City <- as.factor(combined$City)
combined$AQI_Bucket <- as.factor(combined$AQI_Bucket)
combined$Date <- as.Date(combined$Date, format = '%Y-%m-%d')
# Add Month and Year to combined
combined$Month <- as.factor(format(combined$Date, '%b'))
combined$Year <- as.factor(as.numeric(format(combined$Date, '%Y')))
# Re-order Month Values
combined$Month <- factor(combined$Month, levels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
'Oct', 'Nov', 'Dec'))
str(combined)
## 'data.frame': 29531 obs. of 23 variables:
## $ City : Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2015-01-01" "2015-01-02" ...
## $ PM2.5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ PM10 : num NA NA NA NA NA NA NA NA NA NA ...
## $ NO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ NO2 : num 18.2 15.7 19.3 18.5 21.4 ...
## $ NOx : num 17.1 16.5 29.7 18 37.8 ...
## $ NH3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ SO2 : num 27.6 24.6 29.1 18.6 39.3 ...
## $ O3 : num 133.4 34.1 30.7 36.1 39.3 ...
## $ Benzene : num 0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
## $ Toluene : num 0.02 5.5 16.4 10.14 18.89 ...
## $ Xylene : num 0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
## $ AQI : int NA NA NA NA NA NA NA NA NA NA ...
## $ AQI_Bucket : Factor w/ 6 levels "Good","Moderate",..: NA NA NA NA NA NA NA NA NA NA ...
## $ State : Factor w/ 20 levels "Andhra Pradesh",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Part_of_India : Factor w/ 6 levels "Central India",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Metropolitan_Area : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Population_as_per_2011_census: int 5570585 5570585 5570585 5570585 5570585 5570585 5570585 5570585 5570585 5570585 ...
## $ Population_Estimate_2019 : int 8644400 8644400 8644400 8644400 8644400 8644400 8644400 8644400 8644400 8644400 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 6 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
# Find missing values in the dataset
na_count <-sapply(combined, function(y) sum(length(which(is.na(y)))))
na_count <- data.frame(na_count)
na_count
## na_count
## City 0
## Date 0
## PM2.5 4598
## PM10 11140
## NO 3582
## NO2 3585
## NOx 4185
## NH3 10328
## CO 2059
## SO2 3854
## O3 4022
## Benzene 5623
## Toluene 8041
## Xylene 18109
## AQI 4681
## AQI_Bucket 4681
## State 0
## Part_of_India 0
## Metropolitan_Area 0
## Population_as_per_2011_census 0
## Population_Estimate_2019 0
## Month 0
## Year 0
Except City and Date columns, rest of the columns have NAs. Let’s not impute these missing values as they are not evenly missing. We will ignore these values in the plots for now.
# Re-order bucket values in logical order in AQI_Bucket columns
combined$AQI_Bucket <- factor(combined$AQI_Bucket, levels = c('Severe', 'Very Poor', 'Poor', 'Moderate',
'Satisfactory', 'Good'))
# How is AQI_Bucket distributed?
table(combined[!is.na('AQI_Bucket'), 'AQI_Bucket'])
##
## Severe Very Poor Poor Moderate Satisfactory Good
## 1338 2337 2781 8829 8224 1341
# Plot the distribution of AQI levels
library(ggplot2)
ggplot(subset(combined, !is.na(AQI_Bucket)), aes(x=AQI_Bucket)) +
geom_bar(fill = "cornflowerblue") +
labs(x = 'Air quality levels', title = 'Distribution of Air Quality levels') +
theme(plot.title = element_text(hjust=0.5))
# Using treemap, let's visualize Air quality status
# install.packages('treemapify')
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(treemapify)
AQI_nona <- na.omit(combined[, c('City', 'Date', 'AQI_Bucket')])
treedata <- AQI_nona %>%
count(AQI_Bucket)
ggplot(treedata, aes(fill = AQI_Bucket, area = n, label = AQI_Bucket)) +
geom_treemap(show.legend = FALSE) +
geom_treemap_text(colour='white', place='centre') +
labs(title='Air quality levels') +
theme(plot.title = element_text(hjust=0.5))
As different pollutants are measured on different scales and their range of values are fluctuating, it is important that we scale their values so that they can be brought under a common range and can be compared with each other in the same visualization.
How is Scaling/Normalization different from Standardization?
Scaling doesn’t change the distribution but only changes the range of values. Whereas Standardization is a technique used to normalize underlying distrbution so that machine learning algorithms / statistical tests which assumes data to be normal can be applied.
In our case, let’s scale our pollutant measurements to the range of 0 to 1
. This process makes comparison easy and also helps us in plotting all the pollutants in a single chart to see the changes over the years. The downside of this technique in contrast to Standardization
is we will end up with smaller standard deviations which can suppress the effect of outliers.
There are lot of scaling techniques available for use. In our case, let’s use MinMax Scaler.
MinMax Scaler:
- Retains the shape of the original distribution
- Doesn’t change the information in the data
- Doesn’t effect outliers
- The default range is 0 to 1
Scaling our data
Each pollutant measurement is brought to a common scale of 0 to 1 so that comparison would make sense.
pollutants <- c('PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene')
# Copy combined dataset to a new dataframe
scaled_data <- combined
# Get min value from each pollutant column
getMin <- c()
for (p in pollutants) {
getMin <- c(getMin, min(scaled_data[,p], na.rm=TRUE))
}
# Get max value from each pollutant column
getMax <- c()
for (p in pollutants) {
getMax <- c(getMax, max(scaled_data[,p], na.rm=TRUE))
}
# This code will loop through each column value to compute scaled value
for (i in 1:nrow(scaled_data)) {
for (p in 1:length(pollutants)) {
if (!is.na(scaled_data[i,pollutants[p]])) {
scaled_data[i,pollutants[p]] <- (scaled_data[i,pollutants[p]] - getMin[p])/(getMax[p]-getMin[p])
} else {
scaled_data[i,pollutants[p]] <- NA
}
}
}
summary(scaled_data[,pollutants])
## PM2.5 PM10 NO NO2
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.030 1st Qu.:0.056 1st Qu.:0.014 1st Qu.:0.032
## Median :0.051 Median :0.096 Median :0.025 Median :0.060
## Mean :0.071 Mean :0.118 Mean :0.045 Mean :0.079
## 3rd Qu.:0.085 3rd Qu.:0.150 3rd Qu.:0.051 3rd Qu.:0.104
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## NA's :4598 NA's :11140 NA's :3582 NA's :3585
## NOx NH3 CO SO2
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.027 1st Qu.:0.024 1st Qu.:0.0029 1st Qu.:0.029
## Median :0.050 Median :0.045 Median :0.0051 Median :0.047
## Mean :0.069 Mean :0.067 Mean :0.0128 Mean :0.075
## 3rd Qu.:0.086 3rd Qu.:0.085 3rd Qu.:0.0082 3rd Qu.:0.078
## Max. :1.000 Max. :1.000 Max. :1.0000 Max. :1.000
## NA's :4185 NA's :10328 NA's :2059 NA's :3854
## O3 Benzene Toluene Xylene
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.073 1st Qu.:0.000 1st Qu.:0.001 1st Qu.:0.001
## Median :0.120 Median :0.002 Median :0.007 Median :0.006
## Mean :0.134 Mean :0.007 Mean :0.019 Mean :0.018
## 3rd Qu.:0.177 3rd Qu.:0.007 3rd Qu.:0.020 3rd Qu.:0.020
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## NA's :4022 NA's :5623 NA's :8041 NA's :18109
Standardization:
The below code normalizes our pollutant columns by subtracting mean
from each value and dividing by standard deviation
so that the data is normalized and takes the shape of bell curve.
# The code below standardizes the pollutant measurement data (by subracting mean and divide by standard deviation)
pollutants <- c('PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene')
# Copy pollution data to a new dataframe
standardized_data <- combined
# Compute mean of each of these pollutant columns
mean_list <- c()
for (p in pollutants) {
mean_list <- c(mean_list, round(mean(standardized_data[,p], na.rm=TRUE),2))
}
# Compute standard deviation of each of these pollutant columns
sd_list <- c()
for (p in pollutants) {
sd_list <- c(sd_list, round(sd(standardized_data[,p], na.rm=TRUE),2))
}
# This code will loop through each column value to compute normalized value
for (i in 1:nrow(standardized_data)) {
for (p in 1:length(pollutants)) {
if (!is.na(standardized_data[i,pollutants[p]])) {
standardized_data[i,pollutants[p]] <- (standardized_data[i,pollutants[p]] - mean_list[p])/sd_list[p]
} else {
standardized_data[i,pollutants[p]] <- NA
}
}
}
summary(standardized_data[,pollutants])
## PM2.5 PM10 NO NO2
## Min. :-1.043 Min. :-1.304 Min. :-0.770 Min. :-1.167
## 1st Qu.:-0.597 1st Qu.:-0.683 1st Qu.:-0.524 1st Qu.:-0.687
## Median :-0.292 Median :-0.248 Median :-0.337 Median :-0.281
## Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 0.203 3rd Qu.: 0.349 3rd Qu.: 0.104 3rd Qu.: 0.370
## Max. :13.649 Max. : 9.733 Max. :16.372 Max. :13.635
## NA's :4598 NA's :11140 NA's :3582 NA's :3585
## NOx NH3 CO SO2
## Min. :-1.021 Min. :-0.914 Min. :-0.3233 Min. :-0.801
## 1st Qu.:-0.616 1st Qu.:-0.580 1st Qu.:-0.2500 1st Qu.:-0.489
## Median :-0.278 Median :-0.297 Median :-0.1954 Median :-0.296
## Mean : 0.000 Mean : 0.000 Mean :-0.0002 Mean : 0.000
## 3rd Qu.: 0.247 3rd Qu.: 0.255 3rd Qu.:-0.1149 3rd Qu.: 0.038
## Max. :13.754 Max. :12.827 Max. :24.9368 Max. : 9.891
## NA's :4185 NA's :10328 NA's :2059 NA's :3854
## O3 Benzene Toluene Xylene
## Min. :-1.590 Min. :-0.207 Min. :-0.436 Min. :-0.486
## 1st Qu.:-0.721 1st Qu.:-0.200 1st Qu.:-0.406 1st Qu.:-0.464
## Median :-0.168 Median :-0.140 Median :-0.287 Median :-0.331
## Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 0.511 3rd Qu.:-0.013 3rd Qu.: 0.023 3rd Qu.: 0.044
## Max. :10.292 Max. :28.574 Max. :22.341 Max. :26.472
## NA's :4022 NA's :5623 NA's :8041 NA's :18109
ggplot(subset(scaled_data, Metropolitan_Area==1), aes(x=PM2.5, y=PM10, color=City)) +
geom_point() +
labs(title="PM2.5 Vs PM10 by City") +
theme(plot.title = element_text(hjust=0.5))
The two variables PM2.5 and PM10 are positively and strongly correlated. There are visible outliers.
Scatter plot matrix:
Let’s visualize correlation between different pollutants using scatter plot matrix.
What is a scatter plot matrix?
A scatter plot matrix is collection of scatter plots displayed as a grid.
- It is similar to correlation plot but instead of displaying correlations, displays underlying data.
- Scatter plot matrix is created using
ggpairs
function fromGGally
package.
library(GGally)
library(dplyr)
corr_df <- scaled_data %>%
select(PM2.5, PM10, CO, NOx, SO2)
ggpairs(corr_df) +
labs(title = 'Understanding relationship between PM2.5, PM10, CO, NOx and SO2')
- Except PM2.5 and PM10 rest of the pollutants plotted above have weak positive correlation among themselves Whereas PM2.5 and PM10 are strongly positively correlated above 0.8.
library(GGally)
library(dplyr)
corr_df <- scaled_data %>%
select(Benzene, Toluene, Xylene)
ggpairs(corr_df) +
labs(title = 'Understanding relationship between Benzene, Toluene, Xylene')
Benzene
andToluene
shows somewhat stronger positive relationship correlation coefficient above0.7
.
Let’s see the distribution of pollutants over the years by City.
We will use Violin plots for this purpose.
What are Violin plots?
Violin plots are similar to kernel-density plots
but are mirrored and rotated 90 degrees.
To understand more about Violin plots, check this interesting blog:
Distribution of PM2.5 by City over the years 2015-2020
# Distribution of PM2.5 by City over the years 2015-2020
ggplot(subset(scaled_data, !is.na(PM2.5)), aes(x=Date, y=PM2.5)) +
geom_violin(color="black", fill="red") +
facet_wrap(~City) +
labs(title = "PM2.5 pollutant distribution by City") +
theme(plot.title = element_text(hjust=0.5), axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
Cities which caught my attention with PM2.5 pollution levels close to or over 0.50: Delhi, Gurugram, Lucknow, Patna
- Visible outliers found in Amritsar, Guwahati, Shillong
Let’s see how the PM2.5 pollutant is distributed by months for each year for the above cities (Delhi, Gurugram, Lucknow, Patna)
# lets see the distribution of PM2.5 by month for each year for Delhi
ggplot(subset(scaled_data, City=='Delhi' & !is.na(PM2.5)), aes(x=Month, y=PM2.5)) +
geom_boxplot(fill="dark blue") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM2.5 by month for each year - Delhi") +
theme(plot.title = element_text(hjust=0.5))
As suspected, winter (Nov, Dec, Jan) months recorded high levels of PM2.5 for Delhi
# lets see the distribution of PM2.5 by month for each year for Gurugram
ggplot(subset(scaled_data, City=='Gurugram' & !is.na(PM2.5)), aes(x=Month, y=PM2.5)) +
geom_boxplot(fill="red") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM2.5 by month for each year - Gurugram") +
theme(plot.title = element_text(hjust=0.5))
Gurugram seems to show high levels during winter. But, interestingly 2017 year showed high levels of PM2.5 during mid year.
# lets see the distribution of PM2.5 by month for each year for Lucknow
ggplot(subset(scaled_data, City=='Lucknow' & !is.na(PM2.5)), aes(x=Month, y=PM2.5)) +
geom_boxplot(fill="cyan") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM2.5 by month for each year - Lucknow") +
theme(plot.title = element_text(hjust=0.5))
Lucknow seems to be following similar patterns as Delhi with high PM2.5 during winter.
# lets see the distribution of PM2.5 by month for each year for Patna
ggplot(subset(scaled_data, City=='Patna' & !is.na(PM2.5)), aes(x=Month, y=PM2.5)) +
geom_boxplot(fill="yellow") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM2.5 by month for each year - Patna") +
theme(plot.title = element_text(hjust=0.5))
Patna also seems to be following similar patterns as Delhi with high PM2.5 during winter.
# PM2.5 pollutant yearly distribution for Delhi, Gurugram, Lucknow and Patna
ggplot(subset(combined, City == c('Delhi', 'Gurugram', 'Lucknow', 'Patna') & !is.na(PM2.5)), aes(x=Date, y=PM2.5)) +
geom_boxplot(color="black", fill="pink") +
facet_wrap(~City) +
labs(title = "PM2.5 pollutant yearly distribution for Delhi, Gurugram, Lucknow and Patna") +
theme(plot.title = element_text(hjust=0.5))
- The daily permissible limits of
PM2.5
is between40 - 60
- The
median
levels ofDelhi
,Gurugram
,Lucknow
,Patna
are close to125
which is way above the max permissible limit
Distribution of PM10 by City over the years 2015-2020
# Distribution of PM10 by City over the years 2015-2020
ggplot(subset(scaled_data, !is.na(PM10)), aes(x=Date, y=PM10)) +
geom_violin(color="black", fill="darkblue") +
facet_wrap(~City) +
labs(title = "PM10 pollutant yearly distribution by City") +
theme(plot.title = element_text(hjust=0.5))
Cities which caught my attention with PM10 levels close to and over 0.50 through the years:
Delhi, Gurugram, Jorapokhar, Kolkata, Talcher.
- Interestingly metros like Bengaluru, Hyderabad seems to be consistent and median levels under 0.50 over the years.
- There are visible outliers in Ahmedabad, Amritsar, Gawahati and Shillong.
Note: Jorapokhar in Jharkhand state and Talcher in Odisha state are not metro cities
Let’s see how PM10 is distributed by month over the years for the cities Delhi, Gurugram, Jorapokhar, Kolkata and Talcher
# lets see the distribution of PM10 by month for each year for Delhi
ggplot(subset(scaled_data, City=='Delhi' & !is.na(PM10)), aes(x=Month, y=PM10)) +
geom_boxplot(fill="red") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM10 by month for each year - Delhi") +
theme(plot.title = element_text(hjust=0.5))
PM10 is also following similar patters to PM2.5 for Delhi with visible increase in levels during winter months (Nov, Dec, Jan)
# lets see the distribution of PM10 by month for each year for Gurugram
ggplot(subset(scaled_data, City=='Gurugram' & !is.na(PM10)), aes(x=Month, y=PM10)) +
geom_boxplot(fill="cyan") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM10 by month for each year - Gurugram") +
theme(plot.title = element_text(hjust=0.5))
PM10 was recorded only for the years 2018, 2019 and 2020 for Gurugram. It is following winter pattern (with high levels during winter months) as well.
# lets see the distribution of PM10 by month for each year for Jorapokhar
ggplot(subset(scaled_data, City=='Jorapokhar' & !is.na(PM10)), aes(x=Month, y=PM10)) +
geom_boxplot(fill="green") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="PM10 by month for each year - Jorapokhar") +
theme(plot.title = element_text(hjust=0.5))
Jorapokhar is also following winter pattern (with high levels during winter months) as well.
# PM10 pollutant yearly distribution for Delhi, Gurugram, Jorapokhar, Kolkata and Talcher
ggplot(subset(combined, City == c('Delhi', 'Gurugram', 'Jorapokhar', 'Kolkata', 'Talcher') & !is.na(PM10)), aes(x=Date, y=PM10)) +
geom_boxplot(color="black", fill="blue") +
facet_wrap(~City) +
labs(title = "PM10 pollutant yearly distribution for Delhi, Gurugram, Jorapokhar, Kolkata and Talcher") +
theme(plot.title = element_text(hjust=0.5))
- Daily permissible levels of
PM10
is between60 - 100
- The
median
PM10 level of Delhi is over200
which is alarming. - The
median
levels ofGurugram
,Jorapokhar
andTalcher
is over100
which is over the permissible limit - The
median
level ofKolkata
is under and close to100
Distribution of CO by City over the years 2015-2020
# Distribution of CO by City over the years 2015-2020
ggplot(subset(scaled_data, !is.na(CO)), aes(x=Date, y=CO)) +
geom_violin(color="black", fill="cyan") +
facet_wrap(~City) +
labs(title = "CO pollutant yearly distribution by City") +
theme(plot.title = element_text(hjust=0.5), axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
High CO (Carbon Monoxide) levels close to 0.50 are observed in Ahmedabad.
Let’s see the distribution of CO by month for each year for Ahmedabad
# lets see the distribution of CO by month for each year for Ahmedabad
ggplot(subset(scaled_data, City=='Ahmedabad' & !is.na(CO)), aes(x=Month, y=CO)) +
geom_boxplot(fill="cyan") +
facet_wrap(~ Year, nrow=3, ncol=3) +
labs(title="CO by month for each year - Ahmedabad") +
theme(plot.title = element_text(hjust=0.5))
No clear visible pattern observed
# CO pollutant yearly distribution for Ahmedabad
ggplot(subset(combined, City == 'Ahmedabad' & !is.na(CO)), aes(x=Date, y=CO)) +
geom_boxplot(color="black", fill="cyan") +
facet_wrap(~City) +
labs(title = "CO pollutant yearly distribution for Ahmedabad") +
theme(plot.title = element_text(hjust=0.5))
- The daily permissible levels of
CO
is between2 - 4
- The median level of
CO
forAhmedabad
is over10
which is alarming. - Visible outliers observed over
40
Distribution of NOx by City over the years 2015-2020
# Distribution of NOx by City over the years 2015-2020
ggplot(subset(scaled_data, !is.na(NOx)), aes(x=Date, y=NOx)) +
geom_violin(color="black", fill="yellow") +
facet_wrap(~City) +
labs(title = "NOx pollutant yearly distribution by City") +
theme(plot.title = element_text(hjust=0.5))
NOx levels seems to be high for Ahmedabad, Mumbai, Delhi and Patna with levels closer or over 0.50
# NOx pollutant yearly distribution for Ahmedabad, Mumbai, Delhi and Patna
ggplot(subset(combined, City == c('Ahmedabad', 'Mumbai', 'Delhi', 'Patna') & !is.na(NOx)), aes(x=Date, y=NOx)) +
geom_boxplot(color="black", fill="yellow") +
facet_wrap(~City) +
labs(title = "NOx pollutant yearly distribution for Ahmedabad, Mumbai, Delhi and Patna") +
theme(plot.title = element_text(hjust=0.5))
- The daily permissible levels of
NOx
is between40 - 80
- The
median
levels ofNOx
for these cities seems to be under80
. - Visible outliers observed which are over
200
Distribution of SO2 by City over the years 2015-2020
# Distribution of SO2 by City over the years 2015-2020
ggplot(subset(scaled_data, !is.na(SO2)), aes(x=Date, y=SO2)) +
geom_violin(color="black", fill="green") +
facet_wrap(~City) +
labs(title = "SO2 pollutant yearly distribution by City") +
theme(plot.title = element_text(hjust=0.5))
SO2 levels are high for Ahmedabad, Jorapokhar, Patna
# Distribution of SO2 over the years 2015-2020 for Ahmedabad, Jorapokhar and Patna
ggplot(subset(combined, City == c('Ahmedabad', 'Jorapokhar', 'Patna') & !is.na(SO2)), aes(x=Date, y=SO2)) +
geom_boxplot(color="black", fill="green") +
facet_wrap(~City) +
labs(title = "SO2 pollutant yearly distribution for Ahmedabad, Jorapokhar and Patna") +
theme(plot.title = element_text(hjust=0.5))
- The daily permissible levels of
SO2
is between50 - 80.
- Though the median levels are less than
50
, it is important to keep an eye on this pollutant. - There are visible outliers recorded over
80
(which is the max permissible limit).
Let’s use plotly
to visualize PM2.5, PM10, NOx, CO and SO2 over the years
I have choosen metros Delhi, Hyderabad, Bengaluru, Chennai for indepth analysis.
Note: You may double click on the legend values to filter a particular pollutant
# Major Pollutants in Delhi over the years (2015-2020)
#install.packages("plotly")
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
del <- scaled_data %>%
filter(City == 'Delhi')
fig <- plot_ly(del, x=~Date) %>%
add_lines(y = ~PM2.5, name="PM2.5", visible=T) %>%
add_lines(y = ~PM10, name="PM10", visible=T) %>%
add_lines(y = ~NOx, name="NOx", visible=T) %>%
add_lines(y = ~CO, name="CO", visible=T) %>%
add_lines(y = ~SO2, name="SO2", visible=T) %>%
layout(title = "Major pollutants in Delhi over the years (2015-2020)",
xaxis = list(title = "Date"),
yaxis = list(title = "Pollutant")) %>%
config(displayModeBar=FALSE)
fig
Observations: Delhi Pollution levels
- PM2.5: Clear visible peaks observed during late 2016, 2017 and 2019.
- PM10: PM10 levels followed similar patterns as PM2.5. One visible peak observed during mid 2018.
- NOx: Considerable peaks observed during 2015, 2016 and end of 2018
- CO: Started off 2015 with high peaks and came under control during later years. One abnormal visible peak observed in
Dec 2015
- SO2:
SO2
levels recorded high during 2016 with visible high peaks towards the end of 2016. Missing values observed during mid 2017.
library(dplyr)
library(plotly)
mum <- scaled_data %>%
filter(City == 'Mumbai')
fig <- plot_ly(mum, x=~Date) %>%
add_lines(y = ~PM2.5, name="PM2.5", visible=T) %>%
add_lines(y = ~PM10, name="PM10", visible=T) %>%
add_lines(y = ~NOx, name="NOx", visible=T) %>%
add_lines(y = ~CO, name="CO", visible=T) %>%
add_lines(y = ~SO2, name="SO2", visible=T) %>%
layout(title = "Major pollutants in Mumbai over the years (2015-2020)",
xaxis = list(title = "Date"),
yaxis = list(title = "Pollutant")) %>%
config(displayModeBar=FALSE)
fig
Observations: Mumbai Pollution levels
- PM2.5: Peaks observed during Jan 2019 and Jan 2020.
- PM10: Peaks observed during Jan 2019.
- NOx: Missing values observed during the period mid 2016 till end of 2017. Abnormal peaks observed during mid 2016 before values went missing.
- CO: Peaks observed during the end of 2018 and starting 2019.
- SO2: No visible abnormal peaks observed. Levels seems to be consistent.
library(dplyr)
library(plotly)
hyd <- scaled_data %>%
filter(City == 'Hyderabad')
fig <- plot_ly(hyd, x=~Date) %>%
add_lines(y = ~PM2.5, name="PM2.5", visible=T) %>%
add_lines(y = ~PM10, name="PM10", visible=T) %>%
add_lines(y = ~NOx, name="NOx", visible=T) %>%
add_lines(y = ~CO, name="CO", visible=T) %>%
add_lines(y = ~SO2, name="SO2", visible=T) %>%
layout(title = "Major pollutants in Hyderabad over the years (2015-2020)",
xaxis = list(title = "Date"),
yaxis = list(title = "Pollutant")) %>%
config(displayModeBar=FALSE)
fig
Observations: Hyderabad Pollution levels
- PM2.5: PM2.5 levels are abnormally high during 2015 & 2016 and came under control in the subsequent years. There are visible peaks only during these years.
- PM10: PM10 levels are high during winter months of each year. There is an abnormal peak during the year 2017.
- NOx: NOx levels started off on a high note in 2015 but reduced gradually. There were high levels during winter and early months of 2017 till 2020 dipping during mid years.
- CO: CO level seems to be high during 2015 and 2016. There is one abnormal visible peak recorded during the end of 2016. After that abnormal peak, levels seems to be low will end of 2017. From 2018, levels are more or less consistent.
- SO2: SO2 levels were high from late 2016 till early 2017. From 2018, levels seems to be more or less under control.
library(dplyr)
library(plotly)
ben <- scaled_data %>%
filter(City == 'Bengaluru')
fig <- plot_ly(ben, x=~Date) %>%
add_lines(y = ~PM2.5, name="PM2.5", visible=T) %>%
add_lines(y = ~PM10, name="PM10", visible=T) %>%
add_lines(y = ~NOx, name="NOx", visible=T) %>%
add_lines(y = ~CO, name="CO", visible=T) %>%
add_lines(y = ~SO2, name="SO2", visible=T) %>%
layout(title = "Major pollutants in Bengaluru over the years (2015-2020)",
xaxis = list(title = "Date"),
yaxis = list(title = "Pollutant")) %>%
config(displayModeBar=FALSE)
fig
Observations: Bengaluru Pollution Levels
- PM2.5: Visible peaks observed over the years and levels seems to be high during winter of each year
- PM10: PM10 levels were recorded from late 2015. PM10 levels are following similar patterns as PM2.5 with elevated levels during winter months of each year.
- NOx: Looks like NOx levels were not recorded from late 2016 till late 2017. High peaks observed during early 2016. Levels dipped past early months of 2020.
- CO: CO levels were high between 2015 and 2016 and largely under control after 2016 except for few inconsistent visible peaks here and there.
- SO2: SO2 levels were high between 2015 to 2016 and largely under control post 2016.
library(dplyr)
library(plotly)
che <- scaled_data %>%
filter(City == 'Chennai')
fig <- plot_ly(che, x=~Date) %>%
add_lines(y = ~PM2.5, name="PM2.5", visible=T) %>%
add_lines(y = ~PM10, name="PM10", visible=T) %>%
add_lines(y = ~NOx, name="NOx", visible=T) %>%
add_lines(y = ~CO, name="CO", visible=T) %>%
add_lines(y = ~SO2, name="SO2", visible=T) %>%
layout(title = "Major pollutants in Chennai over the years (2015-2020)",
xaxis = list(title = "Date"),
yaxis = list(title = "Pollutant")) %>%
config(displayModeBar=FALSE)
fig
Observations: Chennai Pollution levels
- PM2.5 has been consistent over the years, on a decreasing trend from late 2019 through 2020. But occasional peaks are visible.
- PM10 levels are recorded from mid 2019 for Chennai. Visible peaks are observed.
- NOx levels are high from mid 2018 with a visible peak during the mid of 2018.
- CO levels are bit high compared to Hyderabad and Bengaluru during 2015 and till late 2016.
- SO2 levels were fluctuating over the years and recorded peaks during 2015, late 2017, early 2019 and 2020.