How to handle missing values in a dataset?

Nidhi Bansal
3 min read · Jul 9, 2019


Handling missing values in Dataset

Data pre-processing is one of the most crucial steps in any machine learning workflow. In real-world scenarios, datasets often contain missing values.

Here are five methods for handling missing data.

Let's take an example dataset:

Here, NaN represents missing data or no data.

1. Remove data points with missing values

The simplest way is to remove the data points that have missing values. However, we may end up removing useful data along with them.

Pros: Easy to implement

Cons: Loss of data.
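As a minimal sketch of this approach, assuming the data lives in a pandas DataFrame: the rows and values below are hypothetical (they are not the article's original table), and only the column names Gender, Height and Weight are borrowed from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; not the article's original table.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "F"],
    "Height": [180.0, 160.0, np.nan, 165.0, 158.0],
    "Weight": [85.0, 60.0, 90.0, np.nan, 64.0],
})

# Drop every row that has at least one missing value.
cleaned = df.dropna()

# Alternatively, drop rows only when one specific column is missing.
cleaned_weight = df.dropna(subset=["Weight"])

print(len(df), len(cleaned), len(cleaned_weight))  # 5 3 4
```

Two of the five rows are lost here even though each is missing only a single value, which is the data loss listed under Cons.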

2. Imputation of missing values

Replace the missing values with the mean, median, or mode (the most frequently occurring value) of that feature.

For example, in the dataset above, the missing Height and Weight values can be filled with the mean of their respective columns.

Pros: Easy to implement. Works well with numerical data.

Cons: Mean and median do not work with categorical data. Only the mode works for categorical data, filling missing values with the most frequently occurring value.
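A small sketch of mean and mode imputation with pandas might look like the following. The values are hypothetical, and the HairColour column is added here only to show the categorical case.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; not the article's original table.
df = pd.DataFrame({
    "Height": [180.0, 160.0, np.nan, 165.0, 158.0],
    "Weight": [85.0, 60.0, 90.0, np.nan, 64.0],
    "HairColour": ["black", "brown", "black", None, "blonde"],
})

filled = df.copy()

# Numerical features: fill with the column mean (median() works the same way).
for col in ["Height", "Weight"]:
    filled[col] = filled[col].fillna(filled[col].mean())

# Categorical feature: fill with the mode, i.e. the most frequent value.
most_frequent = filled["HairColour"].mode()[0]
filled["HairColour"] = filled["HairColour"].fillna(most_frequent)

print(filled)
```

Both mean() and mode() skip missing entries by default, so the fill values are computed from the observed data only.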

3. Imputation of missing values based on class labels

Replace the missing values with the mean, median, or mode (the most frequently occurring value) of that feature, computed separately for each class label.

E.g. let's try to find the missing value of Weight for Test4.

First approach: fill it with the mean of all Weight values, i.e. 81. See the picture below: say we have a feature with values + and - distributed as shown. Their overall mean will be in the middle, as shown by the circle.

Second approach: let's take gender as the class label. Test4's gender is F, so let's compute the mean only over the rows with gender F; it comes out to be 62.5.

Take the example of + and - values below: if we compute the mean class-wise, the two means come out differently and are more appropriate for their own class.

Pros: It is a better approach than just filling with the overall mean, median, or mode.

Cons: Mean and median do not work with categorical data. Only the mode works for categorical data, filling missing values with the most frequently occurring value within each class label.
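The class-wise idea can be sketched with a pandas groupby. The data below is hypothetical, so the numbers do not reproduce the 81 and 62.5 quoted above; only the use of Gender as the class label follows the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; not the article's original table.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "F"],
    "Weight": [85.0, 60.0, 90.0, np.nan, 64.0],
})

# First approach: one global mean for everyone.
overall_mean = df["Weight"].mean()

# Second approach: the mean of each row's own class (Gender),
# broadcast back to the original row order by transform().
class_mean = df.groupby("Gender")["Weight"].transform("mean")
df["Weight_filled"] = df["Weight"].fillna(class_mean)

print(overall_mean)                # 74.75
print(df.loc[3, "Weight_filled"])  # 62.0
```

The missing row belongs to class F, so it receives the F-only mean (62.0) rather than the global mean (74.75), which is pulled up by the M rows.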

4. New missing value feature

Sometimes missing values themselves give us important information. For example, in a hair-colour feature, a missing value might mean that the person has no hair at all.

So, in such a case, we can add another feature, hair, with binary values 1 and 0 for hair or no hair.

Pros: Can improve model performance, because the fact that a value is missing is kept as a signal.

Cons: Needs domain knowledge.
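A minimal sketch of such an indicator feature, assuming a pandas DataFrame. The column names HairColour and HasHair and the values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "HairColour": ["black", None, "brown", None, "blonde"],
})

# New binary feature: 1 if a hair colour was recorded, 0 if it is missing.
df["HasHair"] = df["HairColour"].notna().astype(int)

print(df["HasHair"].tolist())  # [1, 0, 1, 0, 1]
```

Whether a missing value really means "no hair" rather than "not recorded" is exactly the domain knowledge this method depends on.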

5. Model based imputation

We can also use a model such as KNN to predict the missing values. Take colour (the feature with missing values) as the output and all the other features as inputs, train on the rows where colour is known, and predict it for the rows where it is missing.

Pros: Easy to implement; missing values are predicted from the other features.

Cons: Time-consuming for large datasets.
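The sketch below hand-rolls a tiny KNN vote with numpy and pandas to show the idea. The function knn_impute_label, the data and the choice k=3 are all hypothetical; in practice a library implementation would usually be preferred.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical data: Colour is missing in the last two rows.
df = pd.DataFrame({
    "Height": [180.0, 178.0, 160.0, 162.0, 179.0, 161.0],
    "Weight": [85.0, 88.0, 60.0, 62.0, 86.0, 61.0],
    "Colour": ["black", "black", "brown", "brown", None, None],
})


def knn_impute_label(frame, target, features, k=3):
    """Fill missing `target` labels by majority vote of the k nearest known rows."""
    out = frame.copy()
    known = out[out[target].notna()]
    missing = out[out[target].isna()]

    # Standardise the features so that no single column dominates the distance.
    mu = known[features].mean()
    sigma = known[features].std(ddof=0).replace(0, 1.0)
    x_known = ((known[features] - mu) / sigma).to_numpy()
    x_missing = ((missing[features] - mu) / sigma).to_numpy()
    labels = known[target].to_numpy()

    for idx, x in zip(missing.index, x_missing):
        dist = np.sqrt(((x_known - x) ** 2).sum(axis=1))  # Euclidean distance
        nearest = labels[np.argsort(dist)[:k]]            # labels of the k closest rows
        out.loc[idx, target] = Counter(nearest).most_common(1)[0][0]
    return out


filled = knn_impute_label(df, target="Colour", features=["Height", "Weight"])
print(filled["Colour"].tolist())
```

Each missing row is compared against every known row, which is why this approach gets slow on large datasets.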

Conclusion

It is fun to try different techniques for filling in missing values appropriately. Choose the technique that best suits each feature in your dataset.

So, read, try and implement. Enjoy!!!
