Understanding Data Drift


What is Data Drift?

Data drift occurs when the distribution of production data differs from a certain baseline (e.g., the training data).

The model wasn't designed to deal with this change in the feature space, so its predictions may not be reliable. Drift can be caused by changes in the real world or by data pipeline issues: missing data, new values, changes to the schema, and so on.

It's important to look at the data that has drifted and trace it back through its pipeline to find out when and where the drift started.

When should I retrain my model?

As the data begins to drift, we may not immediately notice a significant degradation in our model's performance.

However, this early window is an excellent opportunity to retrain before the drift starts to hurt performance.

Measuring Data Drift

To measure how two distributions differ, you can use a statistical distance: a metric that quantifies the difference between them.

There are many different statistical distances for different scenarios.

Besides the default drift score, you can customize and add your own statistical distances.
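
For example, to compare a numeric feature in production against its training baseline, you could bin both samples on a shared grid and compute the Jensen-Shannon distance between the histograms. Here's a minimal sketch using NumPy and SciPy; the synthetic data and binning scheme are illustrative assumptions, not Aporia's implementation:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Illustrative samples of a numeric feature: production is slightly shifted.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)    # training data
production = rng.normal(loc=0.5, scale=1.2, size=10_000)  # live traffic

# Bin both samples on a shared grid so the histograms are comparable.
edges = np.histogram_bin_edges(np.concatenate([baseline, production]), bins=50)
p, _ = np.histogram(baseline, bins=edges)
q, _ = np.histogram(production, bins=edges)

# Jensen-Shannon distance (base 2): 0.0 = identical, 1.0 = fully disjoint.
# SciPy normalizes the histogram counts to probability vectors internally.
print(f"JS distance: {jensenshannon(p, q, base=2):.3f}")
```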

Intuition for the Drift Score

Let's say we have a categorical feature called pet_type with 2 possible values:

  • 🐶 Dog

  • 🐱 Cat

In our training set, the distribution of this feature was 100% 🐶 + 0% 🐱. This means that when we trained our model, we only had dogs and no cats.

Now, let's evaluate different scenarios in production and see what the drift score would be in each case (a short sketch that reproduces these numbers follows the list):

  • If the current distribution is 0% 🐶 + 100% 🐱, the drift score would be 1.0.

    • Tons of drift!

  • If the current distribution is 50% 🐶 + 50% 🐱, the drift score would be 0.54.

  • If the current distribution is 60% 🐶 + 40% 🐱, the drift score would be 0.47.

  • If the current distribution is 100% 🐶 + 0% 🐱, the drift score would be 0.0.

    • No drift at all!
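
The scores above are exactly the Hellinger distance between the baseline distribution and each production distribution (the distance Aporia uses for categorical features, as noted below). A minimal sketch in plain NumPy, not Aporia's implementation, that reproduces these numbers:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions: 0.0 to 1.0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

baseline = [1.0, 0.0]  # training set: 100% dog, 0% cat

for dog, cat in [(0.0, 1.0), (0.5, 0.5), (0.6, 0.4), (1.0, 0.0)]:
    score = hellinger(baseline, [dog, cat])
    print(f"{dog:.0%} dog / {cat:.0%} cat -> drift score {score:.2f}")
# 0% dog / 100% cat -> drift score 1.00
# 50% dog / 50% cat -> drift score 0.54
# 60% dog / 40% cat -> drift score 0.47
# 100% dog / 0% cat -> drift score 0.00
```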

By default, Aporia calculates a metric called Drift Score, which is a smart combination of statistical distances, such as Hellinger Distance for categorical variables and Jensen-Shannon Divergence for numeric variables.