Exploratory data analysis (EDA) is an important step when working with data sets. Visualization techniques give you a clear picture of how the data is distributed, including whether it is skewed.
In this article, I am going to show you how to use visualization techniques to perform exploratory data analysis.
step 1. What is Exploratory Data Analysis?
Exploratory Data Analysis is a process of discovering important insights and patterns from data for further analysis using statistical and visualization techniques.
Why is EDA important?
EDA helps us ensure that data is clean with no obvious errors before diving into statistical modeling or machine learning.
In this article, we will use the iris data set, which you can download here, and see how to gain insights from it.
step 2. Check Basic data details
We will start by importing the required libraries and then load the data set. After loading the data file, it is important to check introductory details such as the number of rows, the number of columns, the types of features (numerical or categorical), and the data types of the column entries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from seaborn import load_dataset
%matplotlib inline
df = pd.read_csv("/kaggle/input/performing-eda-on-iris-dataset/iris_csv.csv")
df.head()
You can display the last five rows by using the following line of code:
df.tail()  # display the last five rows
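The introductory checks mentioned above (row and column counts, feature types, data types) can be sketched as follows. To keep the snippet runnable on its own, it builds a small stand-in DataFrame instead of reading the CSV; the column names sepallength and petallength are assumptions, chosen to match the naming of the columns used later in this article.

```python
import pandas as pd

# Small stand-in for the iris CSV; column names are assumed to match the file.
df = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Number of rows and columns.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Data type of each column.
print(df.dtypes)

# Split the columns into numerical and categorical features.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(categorical_cols)
```

On the real data set, the same calls work unchanged on the df loaded with pd.read_csv.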
step 3. Statistical Insights
This step aims at getting various statistical data like mean, standard deviation, median, maximum value and minimum value.
# show summary statistics
df.describe()
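In the describe() output the median appears as the 50% row rather than under its own name. The sketch below shows that, and adds a per-class mean as one possible next step. It uses a small stand-in DataFrame with assumed column names so that it runs without the CSV file.

```python
import pandas as pd

# Small stand-in for the iris data; column names are assumed to match the file.
df = pd.DataFrame({
    "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# describe() summarises the numeric columns; the "50%" row is the median.
summary = df.describe()
print(summary.loc["50%"])

# The same medians, computed directly.
medians = df.median(numeric_only=True)
print(medians)

# Mean of each numeric column per class, to compare the species.
class_means = df.groupby("class").mean(numeric_only=True)
print(class_means)
```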
step 4. Data cleaning
This step is vital in EDA because it involves removing duplicate rows or columns, filling missing values with values like the mean or median of the data, dropping various values and removing null entries.
# Display the number of missing values for each variable
df.isnull().sum()
If there are null entries, they can be filled with the mean, the median, or a constant value.
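As a minimal sketch of the cleaning steps described in this section, the snippet below counts missing values and duplicates, drops the duplicate row, and fills the remaining gap with the column median. The DataFrame named raw is a hypothetical example with one duplicate and one missing value planted on purpose; it is not taken from the iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: row 1 duplicates row 0, and row 3 has a missing value.
raw = pd.DataFrame({
    "sepalwidth": [3.5, 3.5, 3.0, np.nan, 3.2],
    "petalwidth": [0.2, 0.2, 0.2, 1.4, 1.5],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor"],
})

# Count missing values per column and fully duplicated rows.
print(raw.isnull().sum())
print(raw.duplicated().sum())

# Drop duplicate rows, then fill the missing sepal width with the column median.
clean = raw.drop_duplicates().copy()
clean["sepalwidth"] = clean["sepalwidth"].fillna(clean["sepalwidth"].median())
print(clean)
```

Whether the median, the mean, or dropping the row is the better choice depends on the data; the median is used here only because it is less sensitive to outliers.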
step 5. Data Visualization
Data visualization is a method of converting raw data into a graphical form, such as a chart, so that the data is easier to understand and useful insights are easier to extract.
Types of Visualization Analysis
a. Univariate analysis
This shows the distribution of a single variable in the data. It can be shown with plots such as histograms, box plots, or violin plots.
# Distribution of petal width (histplot replaces the deprecated distplot)
sns.histplot(data=df, x='petalwidth', bins=40, color='b', kde=True)
plt.title('petal width distribution plot')
b. Bi-Variate analysis
Bi-variate analysis reveals the relationship between two data variables. It can be shown with the help of scatter plots, histograms, heat maps, box plots and violin plots.
# Illustration using Box plot
plt.figure(figsize=(8,4))
sns.boxplot(x='class',y='sepalwidth',data=df ,palette='YlGnBu')
c. Multi-variate analysis
As the name suggests, multi-variate analysis reveals the relationships among more than two variables.
# Correlation heatmap of the numeric columns (numeric_only skips the text 'class' column)
sns.heatmap(df.corr(numeric_only=True), linecolor='white', linewidths=1)
step 6. Extracting valuable insights from data.
You can gain insights about the dataset from the visualizations and your research.
You should keep in mind that EDA is an iterative process, so you may keep improving your analysis as you gain new insights. A useful technique for understanding and communicating the patterns and trends in your dataset is data visualization. Happy learning!!