Missing values are unavoidable in any data science project, and handling them well takes experience and skill. Typical strategies include deleting the affected record where applicable, or dropping a column entirely if it contains a large number of missing values.
Handling missing values in time series data can be even more challenging, as dropping data points may itself introduce further problems.
Following is an example time series. Take the index (0-20) as any time interval.
Note the missing values (NaN) in the above observations. In fact, 9 observations out of 20 are missing. If we plot them on a graph, the gaps look like the following.
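The original post does not list the exact values, so the series below is an assumed stand-in with the same shape: 20 observations, 9 of them NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data only: the actual values from the post are not given,
# so these numbers are assumed for demonstration. 9 of 20 entries are NaN.
df = pd.DataFrame({'col_name': [1.0, 2.0, np.nan, 4.0, np.nan, np.nan,
                                7.0, 8.0, np.nan, 10.0, 11.0, np.nan,
                                13.0, np.nan, 15.0, np.nan, 17.0,
                                np.nan, 19.0, np.nan]})

print(df['col_name'].isna().sum())  # count of missing observations
```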
By Statistical Aggregates
Mean
df['new_col_name']=df['col_name'].fillna(df['col_name'].mean())
Median
df['new_col_name']=df['col_name'].fillna(df['col_name'].median())
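A minimal sketch of both aggregate fills on a toy frame (the column names are assumed, and the values are chosen so that mean and median differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': [1.0, np.nan, 2.0, np.nan, 9.0]})

# Mean imputation: every NaN becomes the column mean, here (1+2+9)/3 = 4.0.
df['filled_mean'] = df['col_name'].fillna(df['col_name'].mean())

# Median imputation: every NaN becomes the column median, here 2.0.
df['filled_median'] = df['col_name'].fillna(df['col_name'].median())

print(df)
```

Note that the skewed value 9.0 pulls the mean up to 4.0 while the median stays at 2.0, which is why median imputation is often preferred when outliers are present.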
By Existing Observations
Last Observation Carried Forward (ffill)
df['new_col_name']=df['col_name'].ffill()
Next Observation Carried Backward (bfill)
df['new_col_name']=df['col_name'].bfill()
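A short sketch showing both directions side by side on an assumed toy series (note the leading NaN, which ffill cannot fill, and the trailing NaN, which bfill cannot fill):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': [np.nan, 2.0, np.nan, np.nan, 5.0, np.nan]})

# LOCF: propagate the last known observation forward into the gaps.
df['locf'] = df['col_name'].ffill()

# NOCB: pull the next known observation backward into the gaps.
df['nocb'] = df['col_name'].bfill()

print(df)
```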
By Interpolation
Pad
df['new_col_name']=df['col_name'].interpolate(method='pad')
Linear
df['new_col_name']=df['col_name'].interpolate(method='linear')
Polynomial
df['new_col_name']=df['col_name'].interpolate(method='polynomial', order=2)
Spline
df['new_col_name']=df['col_name'].interpolate(method='spline', order=2)
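A runnable sketch of the interpolation variants on assumed data (values deliberately follow a quadratic so the methods visibly differ). The polynomial and spline methods delegate to SciPy, so they are guarded in case SciPy is not installed:

```python
import numpy as np
import pandas as pd

# Assumed toy series: values lie on y = x**2, with the point at index 2 missing.
df = pd.DataFrame({'col_name': [0.0, 1.0, np.nan, 9.0, 16.0, 25.0]})

# Linear: the gap falls on the straight line between its neighbours (1+9)/2 = 5.
df['linear'] = df['col_name'].interpolate(method='linear')

# Polynomial and spline require SciPy under the hood; skip gracefully if absent.
try:
    df['poly2'] = df['col_name'].interpolate(method='polynomial', order=2)
    df['spline2'] = df['col_name'].interpolate(method='spline', order=2)
except ImportError:
    pass

print(df)
```

The linear fill gives 5.0 at index 2, while the order-2 methods follow the curvature of the surrounding points and land closer to the true value of 4.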
Polynomial interpolation seems to produce the smoothest curve.
If we look at all observations together, we can compare the methods side by side.