Exploratory Data Analysis Using Data Visualization Techniques

Exploratory data analysis (EDA) is an important first step when working with a data set. Visualization techniques give you a clear picture of how the data is distributed, including any skew, outliers or gaps.

In this article, I am going to show you how to use visualization techniques to perform exploratory data analysis.

step 1. What is Exploratory Data Analysis?

Exploratory Data Analysis is the process of discovering important insights and patterns in data, using statistical and visualization techniques, before moving on to further analysis.

Why is EDA important?

EDA helps us ensure that data is clean with no obvious errors before diving into statistical modeling or machine learning.

In this article, we will use the iris data set, which you can download here, and see how to gain insights from it.

step 2. Check Basic data details

We will start by importing the required libraries and then loading the data set. After loading the data file, it is important to check introductory details such as the number of rows and columns, the type of each feature (numerical or categorical), and the data types of the column entries.

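import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Notebook magic (Jupyter/Kaggle): render plots inline. Remove it in a plain script.
%matplotlib inline

# Load the iris data set from the Kaggle input directory
df = pd.read_csv("/kaggle/input/performing-eda-on-iris-dataset/iris_csv.csv")
df.head()  # display the first five rows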

You can display the last five rows by using the following line of code:

df.tail()  # display the last five rows
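The other details mentioned above (row and column counts, numerical versus categorical features, data types) can be checked with a few more calls. Here is a minimal sketch. It uses a small hand-typed stand-in for the data set so that it runs anywhere. The name `sample` and the column names `sepallength` and `petallength` are my assumptions; with the real file you would run the same calls on `df`.

```python
import pandas as pd

# Small stand-in for the iris file (six rows typed in by hand for illustration).
# Column names other than 'sepalwidth', 'petalwidth' and 'class' are assumed.
sample = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Number of rows and columns
n_rows, n_cols = sample.shape

# Split the columns into numerical and categorical features
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows, {n_cols} columns")
print(sample.dtypes)  # data type of each column
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

`sample.info()` prints much of the same information in one call, along with the non-null count per column.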

step 3. Statistical Insights

This step aims at getting summary statistics for each numerical column, such as the mean, standard deviation, median (reported as the 50% row), maximum value and minimum value.

# show summary statistics
df.describe()
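One detail that is easy to miss: the output of `describe()` has no row called "median". The median is the `50%` row. A short sketch on a hand-typed stand-in (the name `sample` and its six rows are only for illustration) shows the two agree:

```python
import pandas as pd

# Hand-typed stand-in with two of the numerical columns
sample = pd.DataFrame({
    "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

summary = sample.describe()

# The median is reported under the '50%' label
median_from_describe = summary.loc["50%"]
median_direct = sample.median()

print(summary)
print(median_from_describe)
print(median_direct)
```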

step 4. Data cleaning

This step is vital in EDA because it involves removing duplicate rows or columns, removing null entries, and filling missing values with a substitute such as the mean or median of the column.

# Display the number of missing values for each column
df.isnull().sum()

If null entries exist, they can be filled with the mean, the median or a constant value.
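As a rough sketch of these cleaning steps, the snippet below removes a duplicate row and fills a missing value with the column median. The frame `raw` and its values are invented for illustration, with one missing entry and one duplicate row planted on purpose. Whether the median is the right fill value depends on your data, so treat this as one option, not a rule.

```python
import numpy as np
import pandas as pd

# Invented example: one missing sepal width and one exact duplicate row
raw = pd.DataFrame({
    "sepalwidth": [3.5, 3.0, np.nan, 3.2, 3.2],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 1.5],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-versicolor"],
})

# Count missing values per column
missing_per_column = raw.isnull().sum()

# Drop exact duplicate rows (keeps the first occurrence)
deduplicated = raw.drop_duplicates()

# Fill the missing sepal width with the median of that column
median_width = deduplicated["sepalwidth"].median()
cleaned = deduplicated.fillna({"sepalwidth": median_width})

print(missing_per_column)
print(cleaned)
```

If you prefer to discard incomplete rows instead of filling them, `deduplicated.dropna()` does that.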

step 5. Data Visualization

Data Visualization is a method of converting raw data into a graphical form, such as a chart or plot, which makes the data easier to understand and useful insights easier to extract.

Types of Visualization Analysis

a. Univariate analysis

This shows the distribution of the observations for a single data variable. It can be shown with the help of various plots like histograms, density plots, box plots or violin plots.

# Distribution of petal width (sns.distplot is deprecated; histplot with kde=True replaces it)
sns.histplot(data=df, x='petalwidth', bins=40, color='b', kde=True)
plt.title('petal width distribution plot')

b. Bi-Variate analysis

Bi-variate analysis reveals the relationship between two data variables. It can be shown with the help of scatter plots, histograms, heat maps, box plots and violin plots.

# Illustration using a box plot: sepal width for each class
plt.figure(figsize=(8, 4))
sns.boxplot(x='class', y='sepalwidth', data=df, palette='YlGnBu')
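A scatter plot is the other common bi-variate display, this time for two numerical variables. Below is a minimal sketch on a hand-typed stand-in; the name `sample` and the column name `petallength` are my assumptions. Colouring the points by class shows whether the relationship differs between species.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (column name 'petallength' is assumed)
sample = pd.DataFrame({
    "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Petal length against petal width, one colour per class
fig, ax = plt.subplots(figsize=(8, 4))
sns.scatterplot(data=sample, x="petallength", y="petalwidth", hue="class", ax=ax)
ax.set_title("petal length vs petal width")
```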

c. Multi-variate analysis

As the name suggests, multi-variate analysis reveals the relationships among more than two variables.

# Correlation map using a heatmap matrix (numeric columns only, since 'class' holds text)
sns.heatmap(df.corr(numeric_only=True), linecolor='white', linewidths=1)
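A heatmap is easier to read when you also look at the numbers behind it. The sketch below, on a hand-typed stand-in, computes the correlation matrix and lists each pair of features once, strongest correlation first. The names `sample` and `pairs` and the column names `sepallength` and `petallength` are my assumptions.

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for the iris data
sample = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Pearson correlation between the numerical columns only
corr = sample.corr(numeric_only=True)

# Keep the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Sort by absolute correlation, strongest first
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```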

step 6. Extracting valuable insights from data.

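You can gain insights about the dataset by combining what the visualizations show with the summary statistics and your own research, for example by noting which features differ most between classes.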
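One way to back up a visual impression with numbers is a per-class summary. This is a small sketch on a hand-typed stand-in (the name `sample` is only for illustration) that compares petal width across the classes, which is the kind of difference a box plot makes visible.

```python
import pandas as pd

# Hand-typed stand-in for the iris data
sample = pd.DataFrame({
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Mean, minimum and maximum petal width for each class
per_class = sample.groupby("class")["petalwidth"].agg(["mean", "min", "max"])

# Class with the largest average petal width in this sample
widest_class = per_class["mean"].idxmax()

print(per_class)
print("largest mean petal width:", widest_class)
```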

Keep in mind that EDA is an iterative process, so you can keep refining your analysis as you gain new insights. Data visualization is a useful technique for understanding and communicating the patterns and trends in your dataset. Happy learning!!