Practical Guide to Handle Missing Values in Datasets

Ever opened a dataset only to find it's full of holes like Swiss cheese? Don't worry - missing values are a common challenge in data science, and today we'll learn how to handle them effectively.

Understanding Missing Values

Missing values are gaps in your dataset where data should exist but doesn't. These gaps can significantly impact your analysis and machine learning models if not handled properly. They typically appear in several forms:

  • NA (Not Available)

  • NaN (Not a Number)

  • NULL

  • Empty or blank cells

  • Placeholder values (like -999 or 0)
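
Placeholder codes are easy to overlook because pandas does not treat them as missing. Below is a minimal sketch (the column names and the -999 sentinel are made-up examples) of converting such placeholders into proper NaN values:

import numpy as np
import pandas as pd

# Hypothetical example: -999 is used as a sentinel for missing temperature readings
df = pd.DataFrame({"temperature": [21.5, -999, 19.8, -999],
                   "humidity": [40, 55, 35, 60]})

# Replace the sentinel with NaN so pandas recognizes it as missing
df["temperature"] = df["temperature"].replace(-999, np.nan)
print(df.isnull().sum())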

Why Do Missing Values Occur?

Understanding why data is missing is crucial for choosing the right handling method:

  1. Data Collection Issues

    • Technical failures during data recording

    • Sensor malfunctions

    • Network connectivity problems

  2. Human Factors

    • Survey respondents skipping questions

    • Data entry errors

    • Incomplete records

  3. Structural Reasons

    • Information not applicable to all cases

    • Data not available for certain time periods

    • Merged datasets with non-overlapping fields

Types of Missing Data

Missing data generally falls into three categories:

  1. Missing Completely at Random (MCAR)

    • Missing values have no relationship with other variables

    • Like randomly scattered holes in your dataset

  2. Missing at Random (MAR)

    • Missing values can be explained by other variables

    • Example: Younger people being less likely to report income

  3. Missing Not at Random (MNAR)

    • Missing values are related to the value itself

    • Example: People with high incomes being less likely to report them
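
To make these mechanisms concrete, here is a small simulation sketch (the age/income data and the missingness probabilities are invented for illustration) showing how each type of missingness could arise:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
data = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.normal(50000, 15000, n),
})

# MCAR: every income value has the same 10% chance of being missing
mcar = data.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: younger respondents are more likely to skip the income question
mar = data.copy()
p_miss = np.where(mar["age"] < 30, 0.30, 0.05)
mar.loc[rng.random(n) < p_miss, "income"] = np.nan

# MNAR: high earners are more likely to withhold their income
mnar = data.copy()
p_miss = np.where(mnar["income"] > 70000, 0.40, 0.05)
mnar.loc[rng.random(n) < p_miss, "income"] = np.nan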

The Impact of Missing Values

Ignoring missing values can lead to serious problems:

  • Biased analysis results

  • Reduced model accuracy

  • Incorrect conclusions

  • Wasted computational resources

  • Compromised data integrity

Detecting Missing Values

Before treatment, you need to identify missing values:

Visual Methods

  • Data visualization tools

  • Heatmaps

  • Missing value correlation matrices

Programmatic Methods

# Basic checks in Python
df.isnull().sum()   # count of missing values in each column
df.info()           # column dtypes and non-null counts

App Name                  5
App Id                    0
Category                  0
Rating                22883
Rating Count          22883
...
Developer Website    760835
Developer Email          31
Released              71053
Last Updated              0
Content Rating            0
Privacy Policy       420953
...
dtype: int64

# Use matplotlib and seaborn to visualize the missing values
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (15, 6)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title("Missing values")
plt.show()

# In the resulting plot, the missing values appear as yellow strips

Handling Missing Values

1. Deletion Methods

When to use: For small amounts of missing data or when missing data is MCAR.

  • Complete Case Analysis (Listwise Deletion)

    • Remove entire rows with missing values

    • Best when data is MCAR and sample size is large

  • Column Deletion

    • Remove columns with excessive missing values (typically >50%)

    • Use when the variable isn't critical for analysis
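
As a quick sketch (assuming a generic pandas DataFrame named df), both deletion strategies take only a couple of lines:

# Listwise deletion: drop every row that has at least one missing value
df_listwise = df.dropna()

# Column deletion: keep only columns with at least 50% non-missing values
threshold = 0.5 * len(df)
df_reduced = df.dropna(axis=1, thresh=int(threshold))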

2. Imputation Methods

Simple Imputation

  • Mean/median replacement for numerical data

  • Mode replacement for categorical data

  • Last observation carried forward (LOCF)

  • Next observation carried backward (NOCB)

Advanced Imputation

  • KNN imputation

  • Multiple imputation by chained equations (MICE)

  • Maximum likelihood estimation

  • Deep learning methods (autoencoders)
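
For reference, here is a hedged sketch of KNN and MICE-style imputation using scikit-learn's KNNImputer and IterativeImputer, applied only to the numeric columns of a generic DataFrame df (restricting to numeric columns is an assumption made for this example):

from sklearn.experimental import enable_iterative_imputer  # required to unlock IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer

numeric_cols = df.select_dtypes(include="number").columns

# KNN imputation: fill each gap using the k most similar rows
knn = KNNImputer(n_neighbors=5)
df_knn = df.copy()
df_knn[numeric_cols] = knn.fit_transform(df[numeric_cols])

# MICE-style imputation: iteratively model each column with missing values from the others
mice = IterativeImputer(max_iter=10, random_state=0)
df_mice = df.copy()
df_mice[numeric_cols] = mice.fit_transform(df[numeric_cols])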

# 2. Drop columns with excessive missing data (>40% missing values)
missing_summary = df.isnull().sum()
threshold = 0.4 * len(df)
df_dropped_cols = df.drop(columns=missing_summary[missing_summary > threshold].index)
print("\nDataFrame after dropping columns with excessive missing values (>40%):\n", df_dropped_cols)

# Reason: Dropping columns with excessive missingness (like "Developer Website" if it were here) 
# ensures data quality and prevents bias from excessive imputation.

# 3. Impute 'Rating' with the mean (for continuous, numeric values)
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())
print("\nImputed 'Rating' with mean value:\n", df['Rating'])

# Reason: The mean is a reasonable fill for a continuous numeric column like ratings when its distribution is roughly symmetric.

# 4. Impute 'Rating Count' with the median
df['Rating Count'] = df['Rating Count'].fillna(df['Rating Count'].median())
print("\nImputed 'Rating Count' with median value:\n", df['Rating Count'])

# Reason: Median is less sensitive to outliers, ideal for "Rating Count" since it could be skewed.

# 5. Impute 'Installs' with forward-fill (fill from the previous value)
df['Installs'] = df['Installs'].ffill()
print("\nImputed 'Installs' using forward fill:\n", df['Installs'])

# Reason: Forward-fill is suitable for time-based or sequential data like "Installs," ensuring 
# continuity where missing values likely result from sequential recording gaps.

# 6. Impute 'Currency' with mode (most frequent value)
df['Currency'] = df['Currency'].fillna(df['Currency'].mode()[0])
print("\nImputed 'Currency' with mode (most frequent value):\n", df['Currency'])

# Reason: For categorical data like currency, using the mode ensures the most frequent value is applied.

# 7. Impute 'Size' with interpolation (linear interpolation)
df['Size'] = df['Size'].interpolate()
print("\nImputed 'Size' using interpolation:\n", df['Size'])

# Reason: Interpolation assumes a continuous progression between values and can smooth out missing 
# data, ideal for continuous measures like "Size."

# 8. Impute 'Minimum Android' with backward-fill (fill from next value)
df['Minimum Android'] = df['Minimum Android'].bfill()
print("\nImputed 'Minimum Android' using backward fill:\n", df['Minimum Android'])

# Reason: Backward-fill propagates the next observed value backwards, which can suit ordered fields 
# like "Minimum Android" versions when backward compatibility is assumed.

# 9. Impute 'Developer Id' using a placeholder
df['Developer Id'] = df['Developer Id'].fillna('Unknown Developer')
print("\nImputed 'Developer Id' with placeholder 'Unknown Developer':\n", df['Developer Id'])

# Reason: Using placeholders for identifiers like "Developer Id" retains structure and signals 
# missing data without distorting analysis.

# 10. Impute 'Released' with the earliest known date
df['Released'] = df['Released'].fillna(df['Released'].min())
print("\nImputed 'Released' with the earliest known date:\n", df['Released'])

# Reason: Filling "Released" with the earliest known date ensures temporal continuity and avoids 
# artificially inflating the app's age in subsequent analyses.

Make sure to choose imputation methods responsibly

The imputation strategies presented here are context-specific and should be adapted based on the nature of the dataset, the goals of the analysis, and the assumptions that can be reasonably made.

Best Practices

  1. Understand Your Data

    • Analyze missing data patterns

    • Identify the type of missingness

    • Document missing value distributions

  2. Choose Methods Wisely

    • Consider the impact on your analysis

    • Test multiple approaches

    • Validate results with domain experts

  3. Document Everything

    • Record your missing value handling strategy

    • Note any assumptions made

    • Track the impact on your results

Real-World Example

Let's say you're analyzing customer purchase data. You notice that income data is missing for 30% of customers. Before rushing to delete or impute:

  1. Check if younger customers are more likely to have missing income data

  2. Examine if missing income data correlates with purchase behavior

  3. Consider whether simple mean imputation might introduce bias

  4. Test multiple imputation methods and validate results
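
A minimal sketch of the first two checks, assuming hypothetical columns named age_group, income, and total_spend:

missing_income = df["income"].isnull()

# Step 1: does the missingness rate differ by age group?
print(df.groupby("age_group")["income"].apply(lambda s: s.isnull().mean()))

# Step 2: do customers with missing income spend differently?
print(df.groupby(missing_income)["total_spend"].describe())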

Conclusion

Handling missing values is a crucial skill in data science. While it might seem tempting to just drop incomplete rows (and sometimes that's okay!), understanding the nature of your missing data and choosing appropriate handling methods can make the difference between a good analysis and a great one.

Remember: There's no one-size-fits-all solution. The key is understanding your data, the context, and the implications of your chosen method.

If you found this guide helpful, you might also enjoy my other posts on handling outliers, or dive into my Beginner’s Guide to Data Science and Easy Guide to AI.

Thank you for reading. 🙂