It’s really spellbinding: every click and tap a user makes produces data, and with data analysis that data can be turned into a customized experience for each user. That’s a big deal, right? The process of analyzing data breaks down into several steps, and Exploratory Data Analysis (EDA) sits at the top of the list.
Exploratory Data Analysis is an approach to analyzing data sets to summarize their main characteristics, such as mean, standard deviation, and count, often with visual methods. It’s where the researcher takes a bird’s-eye view of the data and tries to make sense of it, and it’s usually the first step in data analysis, performed before any formal statistical techniques are applied.
India Air Quality Dataset

As always, learning by doing is the best way to understand something deeply. So now, we are going to get our hands dirty by analyzing the India Air Quality Dataset (not the recently updated one, but it still works fine as an example) downloaded from Kaggle.
Wherever you get the dataset, you first need to analyze the structure, domain, and contents of it thoroughly.
Dataset: Basic Info

What is it about?
This data is released by the Ministry of Environment and Forests and the Central Pollution Control Board of India under the National Data Sharing and Accessibility Policy (NDSAP). Using it, one can explore India’s air pollution levels at a fairly granular scale.
What information does it have?
The dataset has 13 columns:

stn_code: Station code
sampling_date: Date of sampling (note how this is formatted)
state: State
location: Location of recording
agency: Agency
type: Type of area
so2: Sulphur dioxide (µg/m³)
no2: Nitrogen dioxide (µg/m³)
rspm: Respirable Suspended Particulate Matter (µg/m³)
spm: Suspended Particulate Matter (µg/m³)
location_monitoring_station: Location of data collection
pm2_5: Particulate matter up to 2.5 µm in size (µg/m³)
date: Date of sampling

Why do we need these details?
SPM, RSPM, and PM2.5 are the parameters used to measure air quality based on the number of particles present in it. Using these values, we are going to track air quality over time across different states of India.
But how? That can only be answered once we analyze the data.

Let’s start: Import the dataset
Download the dataset from this link — data.
Create a new Jupyter Notebook and import the packages needed.
Read the .csv file into a pandas DataFrame:
Now, read the data into a pandas DataFrame using read_csv() and display the first few rows with head(). By default, head() returns the first 5 rows of the dataset, but you can request any number, e.g. head(10).
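A minimal sketch of this step. The filename and encoding below are assumptions (check your own copy of the Kaggle file); a tiny inline sample stands in for the real download so the snippet is self-contained:

```python
import io
import pandas as pd

# On the real dataset you would do something like:
#   data = pd.read_csv("data.csv", encoding="ISO-8859-1")
# ("data.csv" is an assumed filename; the encoding hint is a common choice
#  for this file, but verify it on your copy.)
# Tiny inline sample so this snippet runs on its own:
csv_text = """stn_code,state,location,type,so2,no2,rspm,spm,pm2_5,date
150,Andhra Pradesh,Hyderabad,Residential,4.8,17.4,,,,1990-02-01
151,Andhra Pradesh,Hyderabad,Industrial Area,3.1,7.0,,,,1990-02-01
"""
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())    # head() shows the first 5 rows by default
print(data.head(1))   # or ask for a specific number of rows
```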
Sample rows from the India Air Quality dataset. Yes, we’ve got the data!

Check the dataset info
As discussed earlier, the dataset has 13 columns. So how many rows are there? Many of the cells are filled with NaN, an unknown value that cannot contribute to our analysis. How many such values are there, and how can we get rid of them? Let’s find the answers to these questions.
To proceed with EDA, an initial investigation of the data can be done with a few commands.

data.shape
Returns the number of rows and columns in the dataset.

data.isnull().sum()
Returns the number of null values in each column:

stn_code                       144077
sampling_date                       3
state                               0
location                            3
agency                         149481
type                             5393
so2                             34646
no2                             16233
rspm                            40222
spm                            237387
location_monitoring_station     27491
pm2_5                          426428
date                                7
dtype: int64

data.info()
Returns the index range, column names, number of non-null values per column, dtypes, and memory usage:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435742 entries, 0 to 435741
Data columns (total 13 columns):
stn_code                       291665 non-null object
sampling_date                  435739 non-null object
state                          435742 non-null object
location                       435739 non-null object
agency                         286261 non-null object
type                           430349 non-null object
so2                            401096 non-null float64
no2                            419509 non-null float64
rspm                           395520 non-null float64
spm                            198355 non-null float64
location_monitoring_station    408251 non-null object
pm2_5                            9314 non-null float64
date                           435735 non-null object
dtypes: float64(5), object(8)
memory usage: 43.2+ MB

data.count()
Returns the number of non-null values in each column (note: non-null, the complement of isnull().sum()):

stn_code                       291665
sampling_date                  435739
state                          435742
location                       435739
agency                         286261
type                           430349
so2                            401096
no2                            419509
rspm                           395520
spm                            198355
location_monitoring_station    408251
pm2_5                            9314
date                           435735
dtype: int64

Summarised details
data.describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution, excluding NaN values:

count: number of non-NA/null observations
mean: mean of the values
std: standard deviation of the observations
min: minimum of the values
max: maximum of the values

With its include/exclude parameters, describe() can also be limited to a subset of columns based on their dtype.
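All of these commands can be tried on a toy frame like this (the values below are made up, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the air-quality frame (values are made up)
data = pd.DataFrame({
    "state": ["Andhra Pradesh", "Kerala", "Kerala"],
    "so2": [4.8, np.nan, 3.1],
    "no2": [17.4, 7.0, np.nan],
})

print(data.shape)           # (rows, columns)
print(data.isnull().sum())  # NaN count per column
data.info()                 # non-null counts, dtypes, memory usage
print(data.count())         # NON-null count per column
print(data.describe())      # count/mean/std/min/quartiles/max (numeric cols)
```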
Cleansing the dataset
In this step, we clean the data by keeping what we need and dropping what we don’t.

Dropping less valuable columns: stn_code, agency, sampling_date, and location_monitoring_station do not add much value to the analysis in terms of information (date already covers the sampling date). Therefore, we can drop those columns.
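A sketch of the drop, using the column names listed earlier on a toy frame (the cell values are made up):

```python
import pandas as pd

# Toy frame holding a few of the dataset's columns (values are made up)
data = pd.DataFrame({
    "stn_code": ["150"],
    "agency": ["some agency"],
    "sampling_date": ["some date string"],
    "location_monitoring_station": ["some station"],
    "state": ["Andhra Pradesh"],
    "so2": [4.8],
})

# Drop the low-value columns
data = data.drop(columns=["stn_code", "agency", "sampling_date",
                          "location_monitoring_station"])
print(data.columns.tolist())  # ['state', 'so2']
```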
Changing the types to a uniform format: When you look at the dataset, you may notice that the ‘type’ column has values such as ‘Industrial Area’ and ‘Industrial Areas’, which mean the same thing, so let’s collapse such near-duplicates into uniform labels.
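One way this cleanup might look. The mapping below is only illustrative; on the real data you would build the full map from data["type"].unique():

```python
import pandas as pd

data = pd.DataFrame({"type": ["Industrial Area", "Industrial Areas",
                              "Sensitive Areas", "Residential"]})

# Collapse near-duplicate labels (illustrative mapping, an assumption;
# derive the complete one from data["type"].unique() on the real data)
uniform = {"Industrial Areas": "Industrial Area",
           "Sensitive Areas": "Sensitive Area"}
data["type"] = data["type"].replace(uniform)
print(sorted(data["type"].unique()))
```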
Creating a year column
To view the trend over time, we need a year value for each row; the year component of the date column is all we really use for that. So, let’s create a new column holding year values.
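A possible sketch of this step using pandas’ to_datetime (the sample dates are made up):

```python
import pandas as pd

data = pd.DataFrame({"date": ["1990-02-01", "2015-12-24", None]})

# Parse the dates; errors="coerce" turns unparseable entries into NaT
# instead of raising, so bad rows survive as missing years
data["date"] = pd.to_datetime(data["date"], errors="coerce")
data["year"] = data["date"].dt.year
print(data["year"].tolist())
```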
Handling missing values
Columns such as so2, no2, rspm, spm, and pm2_5 are the ones that contribute most to our analysis, so we need to deal with the nulls in them to avoid inaccuracy in the results. We use the Imputer from sklearn.preprocessing (renamed SimpleImputer, in sklearn.impute, in modern scikit-learn) to fill the missing values in each column with that column’s mean.
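A minimal sketch of the mean imputation; here a pure-pandas fillna stands in for sklearn’s imputer, which performs the same mean-fill:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"so2": [4.0, np.nan, 8.0],
                     "no2": [10.0, 20.0, np.nan]})

# Pandas equivalent of sklearn's mean imputation; with scikit-learn you
# would use SimpleImputer(strategy="mean") from sklearn.impute (the old
# Imputer class was removed in scikit-learn 0.22)
cols = ["so2", "no2"]
data[cols] = data[cols].fillna(data[cols].mean())

print(data.isnull().sum().sum())  # 0
print(data["so2"].tolist())       # [4.0, 6.0, 8.0]
```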
Now, check the number of null values in each column.
state       0
location    0
type        0
so2         0
no2         0
rspm        0
spm         0
pm2_5       0
date        0
dtype: int64

Yes! There are no missing values left; we filled them with mean values.
Now, our dataset looks like this.

All set, ready to go! Every preprocessing step is done; let’s find some higher-level information in the data.
As I said earlier, so2, no2, rspm, and spm are the parameters that determine air quality in a particular locality.
Now, let's frame a question and get the answer from data.
Which state has the highest SO2 content?
Group the data by state, find the median so2 content over the whole period, and sort it; that gives us the states with the highest and lowest SO2 levels.
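The grouping and sorting just described could be sketched like this (toy numbers, not the real medians):

```python
import pandas as pd

# Toy measurements (real values come from the Kaggle data)
data = pd.DataFrame({
    "state": ["Uttaranchal", "Uttaranchal", "Meghalaya", "Meghalaya"],
    "so2":   [25.0, 35.0, 2.0, 4.0],
})

# Median SO2 per state, highest first
by_state = (data.groupby("state")["so2"]
                .median()
                .sort_values(ascending=False))
print(by_state)
# by_state.plot.barh() would draw the ranking as a horizontal bar chart
```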
Observation: Plotting for SO2, we can see that the top state is Uttaranchal, while the bottom state is Meghalaya.
Which state has the highest NO2 content?
Again the same process, but now for NO2: group the data by state, find the median no2 content over the whole period, sort it, and we get the states with the highest and lowest NO2 levels.
Observation: Plotting for NO2, we can see that the top state is West Bengal, while the bottom state is Mizoram.
Exercise: In the same way, generate graphs for the rspm and spm values and find the states with the maximum and minimum rspm and spm.
What is the yearly trend in a particular state, say Andhra Pradesh?
We create a new DataFrame containing the NO2, SO2, rspm, and spm data for the state of Andhra Pradesh only and group it by year. Now plot the data.

It is clear that SO2 peaked in 1997 and has since been maintained at lower levels from 2005 to 2015.
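The filtering and grouping for a single state can be sketched as follows (toy numbers):

```python
import pandas as pd

# Toy rows; the real frame holds one row per measurement
data = pd.DataFrame({
    "state": ["Andhra Pradesh", "Andhra Pradesh", "Kerala"],
    "year":  [1997, 1998, 1997],
    "so2":   [30.0, 12.0, 5.0],
    "no2":   [18.0, 15.0, 9.0],
})

# Keep one state, then average each pollutant per year
ap = data[data["state"] == "Andhra Pradesh"]
trend = ap.groupby("year")[["so2", "no2"]].mean()
print(trend)
# trend.plot() draws one line per pollutant across the years
```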
Presumably, measures were taken to reduce it.
In addition to that, the value of NO2 is also only slightly greater than the minimum value.
While doing this, I thought of plotting rspm and spm values too.
This revealed an alarming signal: the spm value in Andhra Pradesh has been climbing, sitting around 220 µg/m³ for the past 6 years (2010–2015). That’s genuinely worrying.
I found a news article regarding this hike and the state government’s action to reduce it.
Like this, you can dig into the data at any depth and find surprising facts. Data analysis is all about unraveling hidden information. This article is just meant to pave a path for you to start analyzing data and to understand its importance.
Transform data into insights!

If you liked it, don’t forget to clap.

#100daysofMLcoding
End of Day #8.