Detecting Outliers With Seaborn Boxplot
Introduction
Have you ever imagined yourself, a final-year student in a tertiary institution, writing the common entrance exam alongside Year 6 (primary) school students?
Well…That’s a dream you shouldn’t have. Just kidding 😂😂
Actually, you would be considered an outlier in such a class. Check out the yellow ball in the midst of the evenly sized blue balls.
Read through to the end; I will share the notebook I used on my GitHub page.
Outliers can be considered the odd or weird data points.
What are Outliers?
Outliers are observations or data points that lie far away from the other data points. Outliers are seen as a big threat in a dataset. Just imagine you ran a survey among a group of students aged 12–18, and three of the students tried to be naughty, filling in 2, 78, and 100 years respectively.
These values are outliers. As the example shows, outliers can lie far below or far above the rest of the data points.
We'll be looking at a practical way to detect outliers using Python.
Outliers usually increase the standard deviation of a dataset. In simple terms, they make the data more spread out instead of clustered in one place. Understanding the standard deviation is crucial for discovering outliers.
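To make that concrete, here is a small sketch built on the survey example. The ten valid ages are made up for illustration (only the 12–18 range and the three prank answers come from the example), and it uses nothing beyond Python's built-in statistics module.

```python
import statistics

# Ages from the survey example: valid answers fall between 12 and 18.
# These ten individual values are made up for illustration.
valid_ages = [12, 13, 13, 14, 15, 15, 16, 17, 17, 18]

# The same answers plus the three prank entries from the example.
ages_with_outliers = valid_ages + [2, 78, 100]

# Sample standard deviation of each list.
std_clean = statistics.stdev(valid_ages)
std_with_outliers = statistics.stdev(ages_with_outliers)

print(f"std without outliers: {std_clean:.2f}")  # prints 2.00
print(f"std with outliers:    {std_with_outliers:.2f}")
```

Only three odd answers are enough to make the spread many times larger, which is why a surprisingly large standard deviation is a hint that outliers may be present.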
Detecting Outliers
Detecting outliers is a key step in data preparation: you have to find them before you can deal with them. There are several techniques for finding outliers in a given dataset. I will be covering one of the major ones, data visualization.
Data Visualization
Data visualization remains a powerful technique for quickly identifying outliers and making the appropriate decisions. However, if you don’t understand how it works statistically, a plot might just seem meaningless.
Let me walk you through using seaborn, a Python library for data visualization, to discover outliers in a wine quality dataset from Kaggle. It’s time to get our hands dirty.
a. Import necessary libraries — Import Python libraries such as NumPy, pandas, seaborn and Matplotlib to work on your dataset. You can always check their documentation.
b. Load your dataset — I loaded the wine dataset using pandas and previewed it. I renamed the file I downloaded from Kaggle when uploading it to my notebook.
c. Descriptive Analysis — Run some descriptive analysis on your dataset to get a sense of whether it is likely to contain outliers.
Here we can easily read off the mean, the standard deviation (std), the quartiles (percentiles), the median and more. I will be analysing the alcohol column specifically. It has a mean of 10.4918 and a standard deviation of 1.1927, which is worth studying. I’m quite inquisitive, so let’s check it out.
d. Visualization — Plot the data with the seaborn boxplot function, which makes outliers easy to see. It’s one of the best tools for spotting outliers in a dataset.
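Steps a to d can be sketched roughly as follows. This is not the exact notebook code: the filename wine.csv and the column name type are assumptions, and a tiny made-up sample stands in for the Kaggle file so the sketch runs on its own. Only the alcohol column name comes from the article.

```python
# a. Import the libraries.
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# b. Load the dataset. With the real file this would be something like
#    wine = pd.read_csv("wine.csv")   # hypothetical filename
# The made-up values below are for illustration only.
wine = pd.DataFrame({
    "type": ["red"] * 8 + ["white"] * 8,  # assumed column name
    "alcohol": [9.4, 9.8, 10.0, 10.2, 10.5, 11.0, 11.2, 14.9,
                8.8, 9.5, 10.1, 10.5, 11.0, 11.4, 12.0, 12.3],
})
print(wine.head())  # preview the first rows

# c. Descriptive analysis: count, mean, std, min, quartiles and max.
summary = wine["alcohol"].describe()
print(summary)

# d. One box per wine type; points beyond the whiskers are drawn separately.
ax = sns.boxplot(data=wine, x="type", y="alcohol")
ax.set_title("Alcohol content by wine type")
plt.savefig("alcohol_boxplot.png")  # in a notebook, plt.show() works too
```

With the real dataset, the numbers printed by describe() for the alcohol column are the ones to compare against the mean and standard deviation quoted above.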
Boxplot Explanation
This part of the article is one of the most important.
It’s one thing to visualize your data; it’s another to understand how to make effective decisions with it.
I will walk you through the plot, starting from the bottom and working up to the top.
i. Min — This marks the end of the lower whisker; I have labelled it min here. It is the smallest value in the dataset that is not flagged as an outlier.
The minimum alcohol content is about 8.5 for the red wine and about 8 for the white wine.
ii. Q1 — Q1 represents the first quartile, also known as the lower quartile or the 25th percentile of a dataset. For someone with little or no statistical background: if there are 100 data values in a column, sorted in ascending order, the 25th value is roughly the first quartile.
The lower quartile of the alcohol content is about 9.5 for both the red and the white wine.
iii. Q2 — This represents the second quartile, popularly called the median, which you can easily read off the box plot. It is shown as a thick black line inside the main rectangle.
The median alcohol content is about 10.2 for the red wine and about 10.5 for the white wine.
iv. Q3 — This is also known as the upper quartile or the 75th percentile. The same idea applies here as for the lower quartile. Assume there are a hundred of you in a class, sorted in ascending order; the 75th person on the attendance list is roughly the upper quartile.
The upper quartile of the alcohol content is about 11.2 for the red wine and about 11.5 for the white wine.
v. Max — This marks the end of the upper whisker. It is the largest value in the dataset that is not flagged as an outlier.
The max alcohol content is about 13.5 for the red wine and about 14.2 for the white wine.
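vi. Outliers — Outliers are the individual points drawn beyond the ends of the whiskers, that is, below the min or above the max marked on the plot. By default, seaborn extends each whisker to the most extreme data point within 1.5 times the interquartile range (Q3 minus Q1) of the box, and plots anything farther away as a separate point. You can clearly see them here.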
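If you want to check the parts of a box plot by hand, the sketch below works them out for a handful of made-up alcohol readings (not values from the wine dataset). It assumes the default 1.5 × IQR whisker rule and uses only Python's statistics module, so the exact quartiles may differ slightly from other tools on small samples.

```python
import statistics

# Made-up alcohol readings for illustration; 14.9 is deliberately extreme.
alcohol = [9.4, 9.8, 10.0, 10.2, 10.5, 11.0, 11.2, 14.9]

# Q1, Q2 (median) and Q3; method="inclusive" interpolates linearly
# between the sorted values.
q1, q2, q3 = statistics.quantiles(alcohol, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Fences sit 1.5 * IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers stop at the most extreme values still inside the fences.
inside = [x for x in alcohol if lower_fence <= x <= upper_fence]
whisker_min, whisker_max = min(inside), max(inside)

# Everything outside the fences is drawn as an individual outlier point.
outliers = [x for x in alcohol if x < lower_fence or x > upper_fence]

print(f"min={whisker_min}, Q1={q1:.2f}, median={q2:.2f}, "
      f"Q3={q3:.2f}, max={whisker_max}")
print("outliers:", outliers)  # prints outliers: [14.9]
```

Notice that the max on the plot (11.2 here) is not the largest number in the list: 14.9 is larger, but it falls outside the upper fence, so it is drawn as an outlier instead.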
Conclusion
Detecting outliers is the first stage, but handling them is more important.
My next article will be on how to detect and handle outliers with descriptive analytics tools such as IQR, Z-Score, and Standard Deviation in Python.
It will also cover the best values and statistical methods to replace outliers with, not just random values of some sort.
Till then, see you.
Leave a comment, and give some claps if you really love this content.
Oops, almost forgot: I promised a GitHub link to the code.