Data Visualization in Python – Numpy, Pandas, Matplotlib, and Seaborn

Python Data Visualization techniques

What is Data Visualization?

Data Visualization techniques are one of the key components of any analytics project. An end-to-end analytics use case involves ideation, requirement gathering, getting the raw data, analyzing the data, building a predictive model, deploying the model, and communicating the end result to the business.

Throughout this process, both analyzing the data and communicating results to the business require visualizing the raw data and understanding the relationships among its features. Python is a popular language for this work, with libraries such as pandas, NumPy, Matplotlib, and Seaborn that support data handling and plotting.

We have another detailed tutorial, covering the Data Visualization libraries in Python.

Below are some data visualization examples using Python on real data.

Data Visualization Examples

Example 1:

Data Visualization dataset: Iris Dataset

# Importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white", color_codes=True)
# Jupyter notebook magic; omit this line when running as a plain script
%matplotlib inline

After all the libraries are imported, we load the data using the read_csv command of pandas and store it into a data frame.

df = pd.read_csv("./iris.csv")

To understand the structure of the data, the .head() function is used in pandas.

df.head()

Data Visualization in Python - Reading Data from CSV
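Beyond .head(), a few other pandas calls help confirm the structure of the data before plotting. The sketch below is only an illustration: it builds a tiny stand-in frame with the same column names as the Iris CSV used here, and its values are not read from the real file.

```python
import pandas as pd

# Tiny stand-in for the Iris CSV; the values are for illustration only.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

first_rows = sample.head(2)                         # first two rows
n_rows, n_cols = sample.shape                       # (rows, columns)
species_counts = sample["Species"].value_counts()   # samples per class
numeric_cols = sample.select_dtypes("number").columns.tolist()
```

On the real data frame, the same calls on df show how many samples each species has and which columns can go straight into a numeric plot.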

The pandas library has a .plot() method that is handy for quick visual analysis. The scatter plot of sepal length against sepal width is displayed below.

df.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")

Python Data Visualization - Scatter plot of iris dataset

Seaborn can generate similar plots. Its joint plot shows the bivariate scatter plot together with the univariate histograms of both features.

sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=df, height=5)

Univariate histogram and bivariate scatter plot using joint plot in seaborn

To see which species each point belongs to, FacetGrid in Seaborn can color the scatter plot by species.

sns.FacetGrid(df, hue="Species", height=5) \
   .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
   .add_legend()

Scatterplot color of species using FacetGrid in seaborn

A boxplot in Seaborn gives individual feature details.

sns.boxplot(x="Species", y="PetalLengthCm", data=df)

Building interactive dashboards in Python - Creating a box plot in Seaborn
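The numbers behind a box plot can also be computed directly. The sketch below is a minimal illustration on invented values (the species labels "A" and "B" are placeholders): the box spans the first to third quartile, the line inside it is the median, and with default settings the whiskers extend to the furthest points within 1.5 times the interquartile range.

```python
import pandas as pd

# Invented petal lengths for two placeholder species, for illustration only.
toy = pd.DataFrame({
    "Species": ["A"] * 5 + ["B"] * 5,
    "PetalLengthCm": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Quartiles per species: the box spans q1..q3, the inner line is the median.
quartiles = (
    toy.groupby("Species")["PetalLengthCm"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
```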

A layer of individual points is added to this plot using the strip plot in Seaborn. To avoid all points falling on a single vertical line, jitter=True is used.

ax = sns.boxplot(x="Species", y="PetalLengthCm", data=df)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=df, jitter=True, 
edgecolor="gray")

Advanced Data Visualization tools in Python - Strip plot in Seaborn
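What jitter does can be sketched in a few lines of NumPy: each point keeps its y value but gets a small random horizontal offset around its category position. The name jitter_width and its value are assumptions made for this sketch, not Seaborn's internal defaults.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Category positions on the x-axis: 0, 1, 2 for three species, 50 points each.
category_index = np.repeat([0, 1, 2], 50)

# Without jitter every point of a category shares the same x value.
x_plain = category_index.astype(float)

# With jitter a small uniform offset is added. jitter_width is a hypothetical
# parameter chosen for this sketch, not a value taken from Seaborn.
jitter_width = 0.1
x_jittered = x_plain + rng.uniform(-jitter_width, jitter_width, size=x_plain.size)
```

Plotting the petal lengths against x_plain would stack the points in three vertical lines, while x_jittered spreads them into narrow bands that stay inside their own category.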

The violin plot combines the benefits of the previous two plots.

sns.violinplot(x="Species", y="PetalLengthCm", data=df)

Data Visualization in python using violin plot

A kernel density estimation (KDE) plot shows the distribution of a single feature by plotting a smooth estimate of its density.

sns.FacetGrid(df, hue="Species", height=6) \
   .map(sns.kdeplot, "PetalLengthCm") \
   .add_legend()

Data visualization techniques in Python - Creating Kernel Density Estimation plot
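The idea behind a kernel density estimate can be written out by hand: place a small Gaussian bump on every sample and average the bumps. This is a minimal sketch with a hand-picked bandwidth and invented sample values. Seaborn chooses a bandwidth automatically, so its curves will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# Invented petal lengths; the bandwidth is picked by hand for this sketch.
petal_lengths = [1.3, 1.4, 1.5, 1.5, 1.7]
grid = np.linspace(0.0, 3.0, 601)
density = gaussian_kde_1d(petal_lengths, grid, bandwidth=0.2)
```

A smaller bandwidth gives a spikier curve that follows individual samples, and a larger one gives a smoother curve that can hide real structure.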

To show the bivariate relation between each feature, the pair plot is used in Seaborn. In the below plots, the Iris-setosa species is separated from the other two species.

sns.pairplot(df.drop("Id", axis=1), hue="Species", height=3)

Data Visualization dashboards in Python - Pair plot in Seaborn

To show kernel density estimates instead of histograms on the diagonal of the pair plot, set diag_kind="kde".

sns.pairplot(df.drop("Id", axis=1), hue="Species", height=3, diag_kind="kde")

Creating a pair plot using kernel density estimation

So far, we have covered some of the visualizations available in Seaborn. Now let's explore some with the pandas library as well. Below is a boxplot using pandas.

df.drop("Id", axis=1).boxplot(by="Species", figsize=(12, 6))

Data Visualization in Python - Creating a boxplot in Pandas

The next plot is of Andrews Curves which uses sample attributes as coefficients for the Fourier series.

from pandas.plotting import andrews_curves
andrews_curves(df.drop("Id", axis=1), "Species")

Data Visualization techniques in Python - Creating Andrews curve in Matplotlib | Data visualization tutorial
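To make the Fourier-series idea concrete, the sketch below evaluates one Andrews curve by hand using the usual formula f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..., where x1, x2, ... are the feature values of one sample. The helper name andrews_curve is hypothetical and the sample values are chosen only for illustration.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for k, coeff in enumerate(row[1:], start=1):
        harmonic = (k + 1) // 2                   # 1, 1, 2, 2, 3, 3, ...
        wave = np.sin if k % 2 == 1 else np.cos   # alternate sine and cosine
        result = result + coeff * wave(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
sample = [5.1, 3.5, 1.4, 0.2]   # one four-feature sample, illustrative values
curve = andrews_curve(sample, t)
```

Samples with similar feature values produce curves that stay close together, which is why species tend to form visible bundles in the plot.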

Parallel coordinates is another multivariate visualization technique in pandas: each feature gets its own vertical axis, and each data sample is drawn as a line connecting its values across those axes.

from pandas.plotting import parallel_coordinates
parallel_coordinates(df.drop("Id", axis=1), "Species")

Multivariate visualization technique in pandas - Parallel coordinates | Data Visualization tutorial
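The same picture can be drawn by hand with Matplotlib, which shows what the helper does: one line per sample, with the features as evenly spaced x positions. This is a rough sketch on three invented rows, and the colour mapping is an arbitrary choice made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Three invented rows, one per species, for illustration only.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})
features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm"]
colors = {
    "Iris-setosa": "tab:blue",
    "Iris-versicolor": "tab:orange",
    "Iris-virginica": "tab:green",
}

fig, ax = plt.subplots()
positions = list(range(len(features)))
for _, row in toy.iterrows():
    # One polyline per sample, visiting each feature axis in turn.
    ax.plot(positions, [row[name] for name in features], color=colors[row["Species"]])
ax.set_xticks(positions)
ax.set_xticklabels(features)
```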

Radviz is another multivariate plotting technique in pandas. Each feature is placed as an anchor point on a 2D plane, and each sample is positioned as if it were attached to every anchor by a spring whose stiffness is weighted by the sample's value for that feature.

from pandas.plotting import radviz
radviz(df.drop("Id", axis=1), "Species")

Data visualization technique in pandas - Radviz | Data Visualization in Pandas
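The spring analogy reduces to a weighted average. In the sketch below, which is a simplified illustration rather than the pandas source, each feature is min-max scaled to [0, 1], the anchors sit evenly spaced on the unit circle, and a sample settles at the weighted mean of the anchors. The function name radviz_positions is hypothetical.

```python
import numpy as np

def radviz_positions(values):
    """Project rows of `values` (n_samples x n_features) onto the unit disc."""
    values = np.asarray(values, dtype=float)
    # Min-max scale each feature to [0, 1] so it can act as a spring weight.
    lo, hi = values.min(axis=0), values.max(axis=0)
    weights = (values - lo) / np.where(hi > lo, hi - lo, 1.0)
    # One anchor per feature, evenly spaced on the unit circle.
    angles = 2.0 * np.pi * np.arange(values.shape[1]) / values.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Equilibrium point: weighted average of the anchors.
    total = weights.sum(axis=1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)  # avoid dividing by zero
    return (weights @ anchors) / safe_total

toy = np.array([
    [1.0, 0.0, 0.0, 0.0],   # pulled entirely towards the first anchor
    [0.0, 0.0, 0.0, 0.0],   # no pull at all
    [1.0, 1.0, 1.0, 1.0],   # equal pull from every anchor
])
points = radviz_positions(toy)
```

A sample dominated by one feature lands near that feature's anchor, while a sample with balanced features lands near the centre.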

These were some of the data visualization techniques applied to the Iris dataset.

Example 2:

Data Visualization dataset: San Francisco Salaries

The very first step is to read the data.

salaries = pd.read_csv('./Salaries.csv')

Checking the columns present using the .info() function in pandas.

salaries.info()

Checking the columns present using info() function in pandas

Converting the pay columns to numeric. With errors='coerce', values that cannot be parsed become NaN.

for col in ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']:
    salaries[col] = pd.to_numeric(salaries[col], errors='coerce')
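The effect of errors='coerce' is easiest to see on a small example. The values below are invented for illustration: two numbers stored as text, one non-numeric placeholder string, and one missing entry.

```python
import pandas as pd

# Invented values: numbers stored as text plus one non-numeric placeholder.
raw = pd.Series(["1200.50", "300", "Not Provided", None])

cleaned = pd.to_numeric(raw, errors="coerce")   # unparseable entries become NaN
n_missing = int(cleaned.isna().sum())           # count of entries lost to NaN
```

Checking n_missing after the conversion is a cheap way to see how much of a column could not be parsed before plotting it.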

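Next, the pay columns are selected so that they can all be drawn in one figure. The slice takes every column from position 3 up to, but not including, Year.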

pay_columns = salaries.columns[3:salaries.columns.get_loc('Year')]
pay_columns

Pay columns in Pandas Data frame Index | Data visualization in Pandas

The pay columns are grouped into rows of three so that their histograms can be laid out on a 2×3 grid.

pays_arrangement = list(zip(*(iter(pay_columns),) * 3))
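The zip(*(iter(...),) * 3) expression is compact but easy to misread. The sketch below unpacks it on six placeholder names that stand in for the pay columns: the same iterator object is handed to zip three times, so each output tuple consumes three consecutive items.

```python
columns = ["a", "b", "c", "d", "e", "f"]   # stand-ins for the six pay columns

shared = iter(columns)                        # one iterator object
grouped = list(zip(shared, shared, shared))   # zip pulls from it three times per row

# Equivalent one-liner in the same form as the expression above.
grouped_again = list(zip(*(iter(columns),) * 3))

# With seven items the leftover element is silently dropped by zip.
truncated = list(zip(*(iter(columns + ["g"]),) * 3))
```

The silent truncation matters if the number of columns is not a multiple of three, because the leftover columns would never be plotted.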

The plt.subplots command gives a figure and a 2×3 array of axes.

fig, axes = plt.subplots(2, 3)
for i in range(len(pays_arrangement)):
    for j in range(len(pays_arrangement[i])):
        # pass in axes to pandas hist
        salaries[pays_arrangement[i][j]].hist(ax=axes[i, j])
        # axis objects have a lot of methods for customizing the look of a plot
        axes[i, j].set_title(pays_arrangement[i][j])
plt.show()

Building subplots in Pandas | Data Visualization tutorial
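The nested index loop can also be written by walking the axes array directly. The sketch below is one possible alternative, using random numbers and hypothetical column names in place of the salary data. It relies on axes being a 2×3 NumPy array whose .flat iterator visits the panels row by row.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
names = ["c0", "c1", "c2", "c3", "c4", "c5"]   # hypothetical column names
data = {name: rng.normal(size=100) for name in names}

fig, axes = plt.subplots(2, 3)
# axes.flat walks the grid row by row, so zip pairs each column with one panel.
for ax, name in zip(axes.flat, names):
    ax.hist(data[name], bins=10)
    ax.set_title(name)
titles = [ax.get_title() for ax in axes.flat]
```

This form needs no pre-grouped list of columns, at the cost of making the row and column position of each panel implicit.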

To make the plot more readable, a combination of figure height, width, and subplot spacing could be used.

fig, axes = plt.subplots(2,3)

# set the figure height and width (in inches)
fig.set_figheight(5)
fig.set_figwidth(12)

for i in range(len(pays_arrangement)):
    for j in range(len(pays_arrangement[i])):
        # pass in axes to pandas hist
        salaries[pays_arrangement[i][j]].hist(ax=axes[i,j])
        axes[i,j].set_title(pays_arrangement[i][j])
        
# add vertical space between the two rows
plt.subplots_adjust(hspace=1)
# add horizontal space between the columns
plt.subplots_adjust(wspace=1)
plt.show()

Implementing figure height in subplots in Pandas | Python Data Visualization tutorial

On top of this, the ticks could be rotated.

# and here is a cleaner version using tick rotation and plot spacing
fig, axes = plt.subplots(2,3)

# set the figure height and width (in inches)
fig.set_figheight(5)
fig.set_figwidth(12)

for i in range(len(pays_arrangement)):
    for j in range(len(pays_arrangement[i])):
        salaries[pays_arrangement[i][j]].hist(ax=axes[i,j])
        axes[i,j].set_title(pays_arrangement[i][j])
        
        # set xticks with these labels,
        axes[i,j].set_xticklabels(labels=axes[i,j].get_xticks(), 
                                  # with this rotation
                                  rotation=30)
        
plt.subplots_adjust(hspace=1)
plt.subplots_adjust(wspace=1)
plt.show()

Subplots in Pandas | Python Data Visualization techniques
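In recent Matplotlib releases, calling set_xticklabels without first fixing the tick positions can emit a warning, and it replaces the formatted labels with raw tick values. A possible alternative, sketched here on made-up pay values, is tick_params with labelrotation, which rotates the existing labels and leaves their text alone.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# figsize sets width and height in one call instead of two setter calls.
fig, axes = plt.subplots(2, 3, figsize=(12, 5))
for ax in axes.flat:
    ax.hist([1000, 25000, 50000, 50000, 120000])   # made-up pay values
    # Rotate the existing x tick labels without replacing their text.
    ax.tick_params(axis="x", labelrotation=30)

# Both spacings can be set in a single call on the figure.
fig.subplots_adjust(hspace=1, wspace=1)
```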

The official Matplotlib and Seaborn documentation lists many more plot types. Another widely used library is Plotly, which produces interactive, browser-friendly plots.

Conclusion

In this blog, we covered some of the data visualization techniques available in Python. TechLearn offers live sessions on Python, data visualization, data science, and related topics to help you build the skills the industry expects.
