Detecting Outliers With Seaborn Boxplot

Josiah Adesola
5 min read · Jun 15, 2022


Image from besthqwallpapers

Introduction

Have you ever imagined yourself as a final-year student in a tertiary institution, writing common entrance exams with Year 6 (primary) school pupils?
Well… that’s a dream you shouldn’t have. Just kidding 😂😂
You would actually be considered an outlier in such a class. Check out the yellow ball in the midst of the evenly sized blue balls.

Read through to the end; I will be sharing the notebook I used, which is on my GitHub page.

Outliers can be thought of as the odd or weird data points.

What are Outliers?

Outliers are observations or data points that lie far away from the other data points. They are often a threat to the quality of a dataset. Just imagine you ran a survey of students aged 12–18, and three of the students decided to be naughty, filling in 2, 78, and 100 years respectively.

These are considered outliers. As the example shows, outliers can lie far below or far above the rest of the data points.

We’ll be considering practical ways to deal with outliers using Python.

Outliers usually increase the standard deviation of a dataset. In simple terms, they make the data more spread out instead of clustered in one place. Understanding the standard deviation is crucial for discovering outliers.
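To make the effect on the standard deviation concrete, here is a small sketch using the survey example above. The ten "honest" ages are made up for illustration; only the three naughty answers (2, 78, and 100) come from the example.

```python
import statistics

# Hypothetical honest answers from students aged 12-18 (made-up values).
honest_ages = [12, 13, 13, 14, 15, 15, 16, 17, 17, 18]

# The three naughty answers from the survey example.
naughty_ages = [2, 78, 100]

# Sample standard deviation without and with the outliers.
std_clean = statistics.stdev(honest_ages)
std_with_outliers = statistics.stdev(honest_ages + naughty_ages)

print(f"std without outliers: {std_clean:.2f}")
print(f"std with outliers:    {std_with_outliers:.2f}")
```

Just three odd values are enough to make the spread many times larger, which is why a suspiciously large standard deviation is a hint to go looking for outliers.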

Image from shutterstock

Detecting Outliers

Detecting outliers is one of the most important phases in data preparation: you have to detect outliers before you can deal with them. There are several techniques for finding outliers in a given dataset. I will be covering one of the major ones, which is data visualization.

Data Visualization

Data visualization remains a powerful technique for quickly identifying outliers and making the appropriate decisions. However, if you don’t understand how it works statistically, the plot might just seem meaningless.

Let me walk you through using seaborn, a Python library for data visualization, to discover outliers in a wine quality dataset from Kaggle. It’s time to get our hands dirty.

a. Import necessary libraries: import Python libraries such as NumPy, pandas, seaborn, and matplotlib to work on your dataset. You can always check their documentation.

b. Load your dataset: I loaded the wine dataset using pandas and previewed it. I renamed the file I downloaded from Kaggle when uploading it to my notebook.

data.head() result
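Steps a and b might look roughly like the sketch below. Since the renamed file is not named in this article, the example reads a tiny made-up CSV from memory instead; the column names type, alcohol, and quality are assumptions about the Kaggle file, and load_wine is just a helper name chosen for this sketch.

```python
import io

import pandas as pd


def load_wine(source):
    """Read the wine quality CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Stand-in for the downloaded file: a few made-up rows, assumed column names.
sample_csv = io.StringIO(
    "type,alcohol,quality\n"
    "white,8.8,6\n"
    "white,10.1,6\n"
    "red,9.4,5\n"
    "red,9.8,5\n"
)

# With the real dataset you would pass the path of your own CSV file instead.
data = load_wine(sample_csv)

# Preview the first rows, as in the screenshot above.
print(data.head())
```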

c. Descriptive analysis: run some descriptive analysis on your dataset to get a sense of whether it is likely to contain outliers.

We can easily read off the standard deviation (std), the quartiles (the 25%, 50%, and 75% percentiles), the median, and more here. I will be analysing the alcohol column specifically. It has a mean of 10.4918 and a standard deviation of 1.1927, and that’s worth studying. I’m quite inquisitive. Let’s check it out.

data.describe() result
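A minimal sketch of this step, again on a small made-up frame rather than the real dataset (the column names type and alcohol are assumptions). It shows the summary for the alcohol column alone and, as an optional extra, the same summary split by wine type.

```python
import pandas as pd

# Made-up values for illustration; the real notebook uses the Kaggle data.
data = pd.DataFrame(
    {
        "type": ["red", "red", "red", "white", "white", "white"],
        "alcohol": [9.4, 9.8, 10.5, 8.8, 10.1, 12.0],
    }
)

# count, mean, std, min, 25%, 50% (median), 75%, and max for one column.
summary = data["alcohol"].describe()
print(summary)

# The same statistics computed separately for each wine type.
by_type = data.groupby("type")["alcohol"].describe()
print(by_type)
```

A max or min that sits far from the 75% or 25% value, relative to the std, is a hint that the column may contain outliers.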

d. Visualization: use the seaborn boxplot function for easy outlier visualization; it’s one of the best tools for spotting outliers in a dataset.

Outlier Visualization
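A plot like this one can be sketched as follows. The data here is randomly generated just so the example runs on its own, and the column names type and alcohol are assumptions; with the real dataset you would pass your own DataFrame to sns.boxplot.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data (not the real wine measurements).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "type": ["red"] * 50 + ["white"] * 50,
        "alcohol": np.concatenate(
            [rng.normal(10.4, 1.0, 50), rng.normal(10.5, 1.2, 50)]
        ),
    }
)

# One box per wine type, with alcohol content on the vertical axis.
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(data=data, x="type", y="alcohol", ax=ax)
ax.set_title("Alcohol content by wine type")
```

Any points drawn on their own above or below the whiskers are the values seaborn treats as outliers.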

Boxplot Explanation

This part of the article is one of the most important.
It’s one thing to visualize your data; it’s another to understand how to make effective decisions with it.
I will walk you through it.

Let’s start from the bottom and work up to the top.

i. Min: this is the end of the lower whisker, the smallest value in the data that is not flagged as an outlier. When there are no outliers below it, it is simply the minimum of the dataset. I have labelled it min here.

For red wine, the min is an alcohol content of about 8.5, while for white wine it is about 8.

ii. Q1: this is the first quartile, also known as the lower quartile or the 25th percentile of a dataset. For someone with little or no statistical background: if a column holds 100 values sorted in ascending order, roughly the 25th value is the first quartile.

The lower quartile is an alcohol content of about 9.5 for red wine, and also about 9.5 for white wine.

iii. Q2: this is the second quartile, popularly called the median, which you can read directly off the box plot. It is drawn as a thick black line inside the main box.

The median is an alcohol content of about 10.2 for red wine, while for white wine it is about 10.5.

iv. Q3: this is also known as the upper quartile, or the 75th percentile. The same idea applies as for the lower quartile. Assume there are a hundred of you in a class, sorted in ascending order; the 75th person on the attendance list is roughly the upper quartile.

The upper quartile is an alcohol content of about 11.2 for red wine, while for white wine it is about 11.5.

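v. Max: this is the end of the upper whisker, the largest value in the data that is not flagged as an outlier. When there are no outliers above it, it is the highest number in the dataset.

For red wine, the max is an alcohol content of about 13.5, while for white wine it is about 14.2.

vi. Outliers: as explained throughout this article, outliers are the points far below the min or far above the max. In the boxplot they are drawn as individual points beyond the whiskers; by default, seaborn flags values that lie more than 1.5 times the interquartile range (Q3 - Q1) beyond the box. You can clearly see them here.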
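The six pieces above can be computed by hand, which helps when you want the actual numbers rather than an estimate read off the plot. This is a sketch under a few assumptions: the alcohol values are made up, boxplot_stats is a helper name chosen for this example, and it uses the common 1.5 × IQR whisker rule with NumPy's default percentile interpolation.

```python
import numpy as np


def boxplot_stats(values, whis=1.5):
    """Return the whisker ends, quartiles, and outliers of a 1-D sample."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1

    # Fences: anything outside them is treated as an outlier.
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr

    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]

    return {
        "min": inside.min(),  # end of the lower whisker
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": inside.max(),  # end of the upper whisker
        "outliers": outliers.tolist(),
    }


# Made-up alcohol values with one suspiciously strong wine at the end.
alcohol = [9.0, 9.4, 9.5, 9.8, 10.0, 10.2, 10.5, 11.0, 11.2, 11.5, 14.9]
stats = boxplot_stats(alcohol)
print(stats)
```

Notice that the max reported here is the top of the whisker, not the largest value in the list: the largest value sits beyond the upper fence, so it ends up in the outliers instead.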

Conclusion

Detecting outliers is the first stage, but handling them is more important.
My next article will be on how to detect and handle outliers with descriptive analytics tools such as IQR, Z-score, and standard deviation in Python.
It will also cover the best values and statistical methods to replace outliers with, not just random values of some sort.

I’m happy, and I hope you are too. Smile.

Till then, see you.

Leave a comment, and give some claps if you really love this content.

Oops, I almost forgot: I promised a GitHub link to the code.


Written by Josiah Adesola

Writes about machine learning, Data Science, Python. Creative. Thinker. Engineering. Twitter: @_JosiahAdesola
