Understanding Data Drift


What is Data Drift?

Data drift occurs when the distribution of production data differs from a baseline distribution (e.g., the training data).

The model wasn't designed to handle this change in the feature space, so its predictions may no longer be reliable. Drift can be caused by changes in the real world or by data pipeline issues: missing data, new values, schema changes, and so on.

It's important to look at the data that has drifted and follow it back through its pipeline to find out when and where the drift started.

When should I retrain my model?

As the data begins to drift, we may not see a significant degradation in our model's performance right away.

However, catching drift early is an excellent opportunity to retrain before it starts to hurt performance.

Measuring Data Drift

To measure how two distributions differ, you can use a statistical distance: a metric that quantifies how far apart the distributions are.

There are many different statistical distances for different scenarios.

Besides the default drift score, you can customize and add your own statistical distances.
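
To make this concrete, here's a minimal sketch, assuming only NumPy and SciPy and using synthetic feature values, that estimates drift for a numeric feature by binning the baseline and production samples into shared histograms and computing the Jensen-Shannon distance between them:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.normal(loc=50, scale=10, size=10_000)    # feature values at training time
production = rng.normal(loc=55, scale=12, size=10_000)  # the same feature in production

# Bin both samples on a shared grid so the histograms are comparable.
edges = np.histogram_bin_edges(np.concatenate([baseline, production]), bins=30)
p, _ = np.histogram(baseline, bins=edges)
q, _ = np.histogram(production, bins=edges)

# Jensen-Shannon distance: 0.0 for identical distributions, 1.0 (base 2) for disjoint ones.
print(f"JS distance: {jensenshannon(p, q, base=2):.3f}")
```

In practice, a drift monitor alerts when such a distance crosses a chosen threshold.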

Intuition for the Drift Score

Let's say we have a categorical feature called pet_type with two possible values:

  • 🐶 Dog

  • 🐱 Cat

In our training set, the distribution of this feature was 100% 🐶 + 0% 🐱. This means that when we trained our model, we only had dogs and no cats.

Now, let's evaluate different production scenarios and see what the drift score would be:

  • If the current distribution is 0% 🐶 + 100% 🐱, the drift score would be 1.0.

    • Tons of drift!

  • If the current distribution is 50% 🐶 + 50% 🐱, the drift score would be 0.54.

  • If the current distribution is 60% 🐶 + 40% 🐱, the drift score would be 0.47.

  • If the current distribution is 100% 🐶 + 0% 🐱, the drift score would be 0.0.

    • No drift at all!

By default, Aporia calculates a metric called Drift Score, which is a smart combination of statistical distances, such as the Hellinger Distance for categorical variables and the Jensen-Shannon Divergence for numeric variables.
(Image: Is there a data drift here? :)
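
As a sanity check on the numbers above: this is a minimal sketch, not Aporia's actual implementation, but for this two-category example a plain Hellinger distance between the training and production distributions reproduces the scores shown, up to rounding:

```python
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete distributions: 0.0 = identical, 1.0 = disjoint."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

training = np.array([1.0, 0.0])  # [P(dog), P(cat)]: the training set contained only dogs

for dogs, cats in [(0.0, 1.0), (0.5, 0.5), (0.6, 0.4), (1.0, 0.0)]:
    score = hellinger(training, np.array([dogs, cats]))
    print(f"{dogs:.0%} dog / {cats:.0%} cat -> drift score {score:.2f}")

# 0% dog / 100% cat -> drift score 1.00
# 50% dog / 50% cat -> drift score 0.54
# 60% dog / 40% cat -> drift score 0.47
# 100% dog / 0% cat -> drift score 0.00
```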