Exploratory Data Analysis (EDA) techniques for Kaggle competition beginners

The steps involved in EDA are:

- Data Collection
- Data Cleaning
- Data Preprocessing
- Data Visualisation

Data Collection

Data collection is the process of gathering information in an established, systematic way that makes it easy to test hypotheses and evaluate outcomes.

Once we have the data, we need to check the data type of each feature. Features fall into the following types:

- numeric
- categorical
- ordinal
- datetime
- coordinates

To see the data type of each feature, run either of the following commands:

    train_data.dtypes

or

    train_data.info()

Let's have a look at the statistical summary of our dataset. By default, describe() summarises only the numeric columns.

    train_data.describe()

Data Cleaning

Data cleaning is the process of making sure your data is correct and usable by identifying errors or missing values and then correcting or deleting them. Refer to the linked guide for more on data cleaning.

Once the data is clean, we can move on to data preprocessing.

Data Preprocessing

Data preprocessing is a data mining technique that transforms raw data into an understandable format. It includes normalisation and standardisation, transformation, feature extraction and selection, and so on. The product of data preprocessing is the final training dataset.

Data Visualisation

Data visualisation is the graphical representation of information and data. It uses statistical graphics, plots, information graphics and other tools to communicate information clearly and efficiently.

Here we will focus on commonly used Seaborn visualisations. Seaborn is a Python data visualisation library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The commonly used seaborn visualisations are:

- Scatter Plot
- Box Plot
- Histogram
- Cat Plot
- Violin Plot
- Pair Plot
- Joint Plot
- Heat Map

    # import seaborn library
    import seaborn as sns

Scatter Plot

A scatter plot is a set of points plotted on horizontal and vertical axes. The scatter plot below shows the relationship between passenger age and passenger fare, coloured by Pclass (ticket class), using data from the Titanic dataset.

    sns.scatterplot(x="Age", y="Fare", hue="Pclass", data=train_data)

Box Plot

A box plot is a simple way of representing statistical data in which a rectangle is drawn to span the second and third quartiles (from the lower quartile to the upper quartile), with a line inside to mark the median. Whiskers extend from either side of the rectangle to show the spread of the data below the lower quartile and above the upper quartile. The box plot below shows how passenger fare varies by ticket class.

    sns.boxplot(x="Pclass", y="Fare", data=train_data)

Histogram

A histogram represents the distribution of numerical data by counting how many values fall into each bin. It gives an estimate of the probability distribution of a continuous variable. The example below counts passengers in each ticket class.

    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(train_data["Pclass"], kde=False)

Cat Plot

catplot provides access to several axes-level functions that show the relationship between a numerical variable and one or more categorical variables using one of several visual representations. The kind parameter selects which plot to draw and corresponds to the name of a categorical plotting function. Options include "point", "bar", "strip", "swarm", "box" and "violin". More details are in the seaborn documentation for catplot.

Below is a cat plot with kind="bar":

    sns.catplot(x="Embarked", y="Survived", hue="Sex", col="Pclass", kind="bar", data=train_data, palette="rainbow")

Let's look at the same cat plot with kind="violin":

    sns.catplot(x="Embarked", y="Survived", hue="Sex", col="Pclass", kind="violin", data=train_data, palette="rainbow")

Violin Plot

A violin plot plays a similar role to a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables so that those distributions can be compared. More details are in the seaborn documentation for violinplot.
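A violin plot like the one just described can be drawn with seaborn's violinplot function. The sketch below is only an illustration and not the original author's example: the column choice (Fare by Pclass) mirrors the box plot earlier in the post, and toy_train_data is a small made-up frame standing in for the Titanic training set so that the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Made-up toy rows standing in for the Titanic training set (an assumption,
# not real competition data). Column names follow the Titanic dataset.
toy_train_data = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Fare": [71.3, 53.1, 51.9, 83.5, 13.0, 26.0, 21.0, 10.5, 7.25, 7.9, 8.05, 8.5],
})

# One violin per ticket class; the width of each violin follows the
# estimated density of fares within that class.
ax = sns.violinplot(x="Pclass", y="Fare", data=toy_train_data)
ax.set_title("Fare distribution by ticket class (toy data)")
```

With the real data, passing data=train_data instead of the toy frame should give the same kind of plot, and it can be compared side by side with the box plot of the same columns.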
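Looking back at the Data Preprocessing step, normalisation and standardisation are the two rescalings named there. The following is a minimal sketch in plain pandas, assuming min-max scaling for normalisation and z-scores for standardisation (the post does not say which variants it has in mind). The fare values are made-up toy numbers and the helper names are hypothetical.

```python
import pandas as pd


def min_max_normalise(values: pd.Series) -> pd.Series:
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())


def standardise(values: pd.Series) -> pd.Series:
    """Rescale values to zero mean and unit (population) standard deviation."""
    return (values - values.mean()) / values.std(ddof=0)


# Made-up toy fares (assumption), not taken from the real dataset.
toy_fares = pd.Series([7.25, 8.05, 13.0, 26.0, 71.3], name="Fare")

fare_normalised = min_max_normalise(toy_fares)
fare_standardised = standardise(toy_fares)

print(fare_normalised.round(3).tolist())
print(fare_standardised.round(3).tolist())
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers such as very high fares, while z-scores are expressed in standard deviations from the mean. Which one suits a given model is a judgment call.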
