Ever opened a dataset only to find it's full of holes like Swiss cheese? Don't worry - missing values are a common challenge in data science, and today we'll learn how to handle them effectively.
Understanding Missing Values
Missing values are gaps in your dataset where data should exist but doesn't. These gaps can significantly impact your analysis and machine learning models if not handled properly. They typically appear in several forms:
NA (Not Available)
NaN (Not a Number)
NULL
Empty or blank cells
Placeholder values (like -999 or 0)
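Placeholder codes are the sneakiest form, since pandas happily treats -999 as a real number. Here is a minimal sketch of normalizing them to NaN; the DataFrame and column names are made up purely for illustration:
import numpy as np
import pandas as pd

# Hypothetical data where -999 and 0 stand in for "unknown"
df = pd.DataFrame({'age': [34, -999, 51], 'income': [72000, 0, 58000]})

# Map the known placeholder code to NaN so pandas recognizes it as missing
df = df.replace(-999, np.nan)
# Treat 0 as missing only where a true 0 is impossible, such as income
df['income'] = df['income'].replace(0, np.nan)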
Why Do Missing Values Occur?
Understanding why data is missing is crucial for choosing the right handling method:
Data Collection Issues
Technical failures during data recording
Sensor malfunctions
Network connectivity problems
Human Factors
Survey respondents skipping questions
Data entry errors
Incomplete records
Structural Reasons
Information not applicable to all cases
Data not available for certain time periods
Merged datasets with non-overlapping fields
Types of Missing Data
Missing data generally falls into three categories:
Missing Completely at Random (MCAR)
Missing values have no relationship with other variables
Like randomly scattered holes in your dataset
Missing at Random (MAR)
Missing values can be explained by other variables
Example: Younger people being less likely to report income
Missing Not at Random (MNAR)
Missing values are related to the value itself
Example: People with high incomes being less likely to report them
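The difference is easiest to see in a toy simulation; the age and income figures below are invented purely for illustration:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 70, n)
income = 20000 + 1000 * age + rng.normal(0, 5000, n)
df = pd.DataFrame({'age': age, 'income': income})

# MCAR: every value has the same 10% chance of going missing
df['income_mcar'] = df['income'].mask(rng.random(n) < 0.10)
# MAR: younger people are less likely to report income (depends on another variable)
df['income_mar'] = df['income'].mask(rng.random(n) < np.where(age < 30, 0.30, 0.05))
# MNAR: high earners are less likely to report income (depends on the value itself)
df['income_mnar'] = df['income'].mask(rng.random(n) < np.where(income > 70000, 0.30, 0.05))
Only the MCAR column can be handled safely with simple deletion; the other two need more care, as we'll see below.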
The Impact of Missing Values
Ignoring missing values can lead to serious problems:
Biased analysis results
Reduced model accuracy
Incorrect conclusions
Wasted computational resources
Compromised data integrity
Detecting Missing Values
Before treatment, you need to identify missing values:
Visual Methods
Data visualization tools
Heatmaps
Missing value correlation matrices
Programmatic Methods
# Basic checks in Python
df.isnull().sum()   # count of missing values per column
df.info()           # non-null counts and dtypes

# Example output of df.isnull().sum() on the app-store dataset used below:
App Name                  5
App Id                    0
Category                  0
Rating                22883
Rating Count          22883
...
Developer Website    760835
Developer Email          31
Released              71053
Last Updated              0
Content Rating            0
Privacy Policy       420953
...
dtype: int64
# Use a seaborn heatmap to visualize where values are missing
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (15, 6)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title("Missing values")
plt.show()
# In the resulting plot, missing values show up as yellow strips
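The missing-value correlation matrix from the visual-methods list above comes from the same isnull() mask. A quick sketch, reusing df, sns, and plt from above:
mask = df.isnull()
# Keep only columns that actually have gaps (constant indicators have no defined correlation)
missing_corr = mask.loc[:, mask.any()].corr()
sns.heatmap(missing_corr, cmap='viridis')
plt.title("Missingness correlation")
plt.show()
# Values near 1 mean two columns tend to be missing together,
# like 'Rating' and 'Rating Count' in the output above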
Handling Missing Values
1. Deletion Methods
When to use: For small amounts of missing data or when missing data is MCAR.
Complete Case Analysis (Listwise Deletion)
Remove entire rows with missing values
Best when data is MCAR and sample size is large
Column Deletion
Remove columns with excessive missing values (typically >50%)
Use when the variable isn't critical for analysis
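Both deletion strategies are one-liners in pandas. A sketch, assuming df is the app-store DataFrame from above:
# Listwise deletion: keep only rows with no missing values at all
df_complete = df.dropna()

# Or require completeness only in the columns the analysis needs
df_complete = df.dropna(subset=['Rating', 'Installs'])

# Column deletion: drop any column missing in more than 50% of rows
df_thinned = df.dropna(axis=1, thresh=int(0.5 * len(df)))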
2. Imputation Methods
Simple Imputation
Mean/median replacement for numerical data
Mode replacement for categorical data
Last observation carried forward (LOCF)
Next observation carried backward (NOCB)
Advanced Imputation
KNN imputation
Multiple imputation by chained equations (MICE)
Maximum likelihood estimation
Deep learning methods (autoencoders)
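scikit-learn covers the first two advanced methods. A minimal sketch on the numeric columns (the n_neighbors and max_iter values are illustrative defaults, not tuned choices):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

num_cols = ['Rating', 'Rating Count']  # numeric columns from the dataset above

# KNN imputation: fill each gap using the 5 most similar rows
knn_filled = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])

# MICE-style imputation: model each column from the others and iterate
mice_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(df[num_cols])
For most everyday analyses, though, the simple methods are enough; the walkthrough below applies them column by column to the app-store dataset.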
# 2. Drop columns with excessive missing data (>40% missing values)
missing_summary = df.isnull().sum()
threshold = 0.4 * len(df)
cols_to_drop = missing_summary[missing_summary > threshold].index
df_dropped_cols = df.drop(columns=cols_to_drop)
print("\nDropped columns with excessive missing values (>40%):\n", list(cols_to_drop))
# Reason: Dropping columns with excessive missingness (such as "Developer Website",
# if it crosses the 40% threshold here) preserves data quality and avoids the bias
# that heavy imputation would introduce.
# 3. Impute 'Rating' with the mean (for continuous, numeric values)
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())
print("\nImputed 'Rating' with mean value:\n", df['Rating'])
# Reason: The mean suits continuous numeric values like ratings when their distribution
# is roughly symmetric; for heavily skewed data, prefer the median.
# 4. Impute 'Rating Count' with the median
df['Rating Count'] = df['Rating Count'].fillna(df['Rating Count'].median())
print("\nImputed 'Rating Count' with median value:\n", df['Rating Count'])
# Reason: Median is less sensitive to outliers, ideal for "Rating Count" since it could be skewed.
# 5. Impute 'Installs' with forward-fill (fill from the previous value)
df['Installs'] = df['Installs'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
print("\nImputed 'Installs' using forward fill:\n", df['Installs'])
# Reason: Forward-fill is suitable for time-based or sequential data like "Installs," ensuring
# continuity where missing values likely result from sequential recording gaps.
# 6. Impute 'Currency' with mode (most frequent value)
df['Currency'] = df['Currency'].fillna(df['Currency'].mode()[0])
print("\nImputed 'Currency' with mode (most frequent value):\n", df['Currency'])
# Reason: For categorical data like currency, using the mode ensures the most frequent value is applied.
# 7. Impute 'Size' with interpolation (linear interpolation)
df['Size'] = df['Size'].interpolate()  # assumes 'Size' has already been parsed to a numeric type
print("\nImputed 'Size' using interpolation:\n", df['Size'])
# Reason: Interpolation assumes a continuous progression between values and can smooth out missing
# data, ideal for continuous measures like "Size."
# 8. Impute 'Minimum Android' with backward-fill (fill from next value)
df['Minimum Android'] = df['Minimum Android'].bfill()  # fillna(method='bfill') is deprecated in recent pandas
print("\nImputed 'Minimum Android' using backward fill:\n", df['Minimum Android'])
# Reason: Backward-fill takes the next observed value, a reasonable choice for version fields
# like "Minimum Android" when neighboring rows tend to share similar requirements.
# 9. Impute 'Developer Id' using a placeholder
df['Developer Id'] = df['Developer Id'].fillna('Unknown Developer')
print("\nImputed 'Developer Id' with placeholder 'Unknown Developer':\n", df['Developer Id'])
# Reason: Using placeholders for identifiers like "Developer Id" retains structure and signals
# missing data without distorting analysis.
# 10. Impute 'Released' with the earliest known date
df['Released'] = df['Released'].fillna(df['Released'].min())
print("\nImputed 'Released' with the earliest known date:\n", df['Released'])
# Reason: Filling "Released" with the earliest known date keeps the column complete for
# time-based analysis (convert it to datetime first so min() is chronological, not alphabetical).
# Note this makes imputed apps look as old as the dataset's first release, so treat
# age-based conclusions for those rows with caution.
Make sure to choose imputation methods responsibly. The strategies presented here are context-specific and should be adapted to the nature of the dataset, the goals of the analysis, and the assumptions that can reasonably be made.
Best Practices
Understand Your Data
Analyze missing data patterns
Identify the type of missingness
Document missing value distributions
Choose Methods Wisely
Consider the impact on your analysis
Test multiple approaches
Validate results with domain experts
Document Everything
Record your missing value handling strategy
Note any assumptions made
Track the impact on your results
Real-World Example
Let's say you're analyzing customer purchase data. You notice that income data is missing for 30% of customers. Before rushing to delete or impute:
Check if younger customers are more likely to have missing income data
Examine if missing income data correlates with purchase behavior
Consider whether simple mean imputation might introduce bias
Test multiple imputation methods and validate results
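The first two checks take only a couple of lines each. A sketch with hypothetical column names income, age_group, and purchase_total:
# Share of customers with missing income per age group:
# large differences across groups point to MAR rather than MCAR
print(df['income'].isnull().groupby(df['age_group']).mean())

# Do customers with missing income behave differently at checkout?
print(df.groupby(df['income'].isnull())['purchase_total'].describe())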
Conclusion
Handling missing values is a crucial skill in data science. While it might seem tempting to just drop incomplete rows (and sometimes that's okay!), understanding the nature of your missing data and choosing appropriate handling methods can make the difference between a good analysis and a great one.
Remember: There's no one-size-fits-all solution. The key is understanding your data, the context, and the implications of your chosen method.
If you found this guide helpful, you might also enjoy my other posts on handling outliers, or dive into my Beginner’s Guide to Data Science and Easy Guide to AI.
Thank you for reading. 🙂