Dealing With Missing Values With Pandas
Every day, people collect data from many sources: forms, websites, and direct questions to other people. Whenever data is gathered this way, some respondents inevitably leave fields blank, whether because they don't have an answer at the moment or because the question simply doesn't apply to them.
For predictive analytics with machine learning, however, no model works well with missing values scattered across the data. Dealing with missing values is a necessary skill for any data newbie or expert.
Data needs to be cleaned and filled correctly; handling missing values is a very important part of data cleaning, which in turn is a core step in any data science methodology.
How Do We Deal With Missing Values?
Missing values can be dealt with easily in Python. There are several methods; I will explain two of them with proper examples.
Steps
1. Import the necessary Python libraries for a data science project.
2. Load your dataset — I used a WHO life expectancy dataset from Kaggle; you can download it from the link here. Then preview the first 5 rows with the life.head() function in pandas.
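Loading and previewing can be sketched as below. Since the Kaggle CSV isn't bundled with this article, the filename and the inline sample rows here are illustrative assumptions; with the real file you would simply call `pd.read_csv("your-downloaded-file.csv")`.

```python
import io
import pandas as pd

# The article's dataset is the WHO life expectancy CSV from Kaggle.
# The file itself isn't included here, so we mimic a few of its rows
# inline (column names match the article; the values are made up).
csv_data = io.StringIO(
    "Country,Year,Adult Mortality,Population\n"
    "Afghanistan,2015,263.0,33736494.0\n"
    "Albania,2015,74.0,28873.0\n"
    "Algeria,2015,19.0,39871528.0\n"
)
life = pd.read_csv(csv_data)

# head() previews the first 5 rows (here only 3 exist)
print(life.head())
```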
3. Check for missing values — It's very important to check for missing values before assuming or filling them with any value of your choice.
a. life.isnull().any() — Returns one boolean per column. True signifies that the column contains at least one missing value; False signifies it contains none.
b. life.isnull().sum() — Counts the number of missing values in each column (feature) of the life dataset. Adult Mortality has 10 missing values, and Population has the most, with 652.
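Both checks together look like this. The toy DataFrame below is an assumption built for illustration; it reuses two column names from the article but not the real counts.

```python
import numpy as np
import pandas as pd

# Toy frame with NaNs sprinkled in (values are illustrative only)
life = pd.DataFrame({
    "Adult Mortality": [263.0, np.nan, 19.0],
    "Population": [np.nan, 28873.0, np.nan],
})

# One boolean per column: does it contain at least one NaN?
has_missing = life.isnull().any()
print(has_missing)

# Count of NaNs per column
missing_counts = life.isnull().sum()
print(missing_counts)
```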
4. Approach — This is the most important part of this piece. I will cover two important approaches to dealing with missing values.
A. Dropping Missing Values
Dropping missing values means deleting rows or columns that contain them, depending on your choice. Personally, I find it an ineffective way of dealing with missing values. A first-year student asked to fill out a form containing a CGPA field will likely omit that part if they are only in their first semester, yet that doesn't make the rest of their data invalid.
Important data might be lost with this method. However, it can be useful if you want quick insight from your data. Let's get to it.
life.dropna(axis=0, inplace=True) — Let’s analyze this…
dropna() — Drops rows or columns that contain missing values; axis=0 targets rows, while axis=1 targets columns. inplace=True applies the change to the DataFrame itself instead of returning a modified copy.
When we call life.isnull().sum() again, every count is 0, which means there are no missing values left.
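A minimal sketch of the whole drop-and-verify sequence (the small DataFrame is an assumption standing in for the real dataset):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the life dataset
life = pd.DataFrame({
    "Adult Mortality": [263.0, np.nan, 19.0],
    "Population": [33736494.0, 28873.0, np.nan],
})

# axis=0 drops any ROW with at least one NaN;
# inplace=True mutates `life` directly instead of returning a copy
life.dropna(axis=0, inplace=True)

# Every per-column count is now 0
print(life.isnull().sum())
```

Only the first row survives here, because the other two rows each contained a NaN somewhere.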
B. Replace missing values — This is the most effective way of dealing with missing values.
i. Fill all missing values with a constant like 0 — Filling with an arbitrary number is not really effective on its own, but the same mechanism becomes a good choice when the fill value is a measure of central tendency such as the mean, median or mode. After filling, there are no longer missing values.
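Filling every NaN with a constant is a one-liner with fillna. The toy DataFrame below is an illustrative assumption:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the life dataset
life = pd.DataFrame({
    "Adult Mortality": [263.0, np.nan, 19.0],
    "Population": [np.nan, 28873.0, np.nan],
})

# Replace every NaN in the whole frame with 0
filled = life.fillna(0)
print(filled)
```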
ii. Replace missing values with the mean, mode or median — Replacement shouldn't be applied to the whole dataset at once, because some features contain non-numeric data. Instead, compute the mean (or mode) of a single column such as Population, then fill that feature with it. Just like this.
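Per-column mean imputation can be sketched as follows (the Population values here are small made-up numbers so the arithmetic is easy to follow):

```python
import numpy as np
import pandas as pd

# Illustrative column with two gaps
life = pd.DataFrame({
    "Population": [10.0, np.nan, 20.0, np.nan],
})

# mean() ignores NaNs: (10 + 20) / 2 = 15.0
mean_pop = life["Population"].mean()

# Fill only this one column with its own mean
life["Population"] = life["Population"].fillna(mean_pop)
print(life)
```

The same pattern works with `.median()` or `.mode()[0]` in place of `.mean()`.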
iii. Fill with the method argument — This fills each missing value from a neighbouring value in the same column. Personally, I don't like it: it bakes in a lot of assumptions, which can lead to bad predictions. Well, let's check the code out.
‘ffill’ — Forward filling: each missing value is filled with the last non-missing value that appears before it.
‘bfill’ — Backward filling, the direct opposite: each missing value is filled with the next non-missing value that appears after it.
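Both directions in one sketch, on a made-up four-value Series (newer pandas exposes these as the `.ffill()` and `.bfill()` methods, which do the same thing as the `method=` argument):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: copy the last seen value downward
forward = s.ffill()   # 1.0, 1.0, 1.0, 4.0

# Backward fill: copy the next seen value upward
backward = s.bfill()  # 1.0, 4.0, 4.0, 4.0
```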
iv. Imputation with Scikit-Learn — Scikit-learn is best known for machine learning modelling, but it also provides a great tool for dealing with missing values: the SimpleImputer.
The strategy parameter can be "mean", "median", "most_frequent" or "constant".
This method is effective too.
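A minimal SimpleImputer sketch, again on an assumed toy DataFrame (column names from the article, values invented so the means are easy to check):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative stand-in for the numeric part of the life dataset
life = pd.DataFrame({
    "Adult Mortality": [263.0, np.nan, 19.0],
    "Population": [30.0, np.nan, 10.0],
})

# strategy="mean" replaces each NaN with its column's mean;
# fit_transform returns a NumPy array, so we wrap it back up
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(life), columns=life.columns)
print(imputed)
```

Here the gap in Adult Mortality becomes (263 + 19) / 2 = 141.0 and the gap in Population becomes (30 + 10) / 2 = 20.0.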
Conclusion
Missing values, when properly managed, can still yield good data; badly managed, they lead to the opposite. Several approaches are listed above. Make use of the best one, taking domain knowledge of your data into consideration.
I hope, you found this insightful. If you did, give a thumbs up.
See you later.😎😎