Understanding Data Drift
Data drift occurs when the distribution of production data differs from a baseline distribution (e.g., the training data).
The model wasn't designed to handle this change in the feature space, so its predictions may no longer be reliable. Drift can be caused by changes in the real world or by data pipeline issues: missing data, new values, schema changes, and so on.
It's important to look at the data that has drifted and follow it back through its pipeline to find out when and where the drift started.
When should I retrain my model?
As the data begins to drift, we may not notice a significant degradation in our model's performance right away.
This early stage, however, is an excellent opportunity to retrain, before the drift has a negative impact on performance.
To measure how far two distributions have diverged, you can use a statistical distance: a metric that quantifies the difference between two distributions.
There are many different statistical distances, each suited to different scenarios.
By default, Aporia calculates a metric called Drift Score, which is a smart combination of statistical distances such as Hellinger Distance for categorical variables and Jensen-Shannon Divergence for numeric variables.
Besides the default drift score, you can customize and add your own statistical distances.
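As a rough illustration (not Aporia's exact implementation), the sketch below computes two such distances between a hypothetical baseline distribution and a production distribution using NumPy and SciPy. Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, i.e., the square root of the JS divergence.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

# Hypothetical baseline (training) and production distributions over the same bins/categories.
baseline = [0.7, 0.2, 0.1]
production = [0.5, 0.3, 0.2]

print("Hellinger distance:    ", hellinger_distance(baseline, production))
print("Jensen-Shannon distance:", jensenshannon(baseline, production, base=2))
```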
Let's say we have a categorical feature called pet_type with 2 possible values:
🐶 Dog
🐱 Cat
In our training set, the distribution of this feature was 100% 🐶 + 0% 🐱. This means that when we trained our model, we only had dogs and no cats.
Now, let's evaluate different scenarios in production and see what the drift score would be (a short verification sketch follows the list):
If the current distribution is 0% 🐶 + 100% 🐱, the drift score would be 1.0. Tons of drift!
If the current distribution is 50% 🐶 + 50% 🐱, the drift score would be 0.54.
If the current distribution is 60% 🐶 + 40% 🐱, the drift score would be 0.47.
If the current distribution is 100% 🐶 + 0% 🐱, the drift score would be 0.0. No drift at all!
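Since pet_type is categorical, the scores above coincide with the Hellinger distance between the training distribution and each production distribution. Here is a minimal sketch that reproduces them (the exact Drift Score computation may differ):

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

training = [1.0, 0.0]  # 100% 🐶 + 0% 🐱

for dogs, cats in [(0.0, 1.0), (0.5, 0.5), (0.6, 0.4), (1.0, 0.0)]:
    score = hellinger_distance(training, [dogs, cats])
    print(f"{dogs:.0%} 🐶 + {cats:.0%} 🐱 -> drift score ≈ {score:.2f}")

# Prints: 1.00, 0.54, 0.47, 0.00 — matching the scenarios above.
```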