Data Drift

Why monitor data drift?

Data drift is one of the top reasons why model accuracy degrades over time. It is a change in the distribution of model input data that can lead to degraded model performance. Monitoring data drift helps you detect these performance issues.

Causes of data drift include:

  • Upstream process changes, such as a sensor being replaced that changes the units of measurement from inches to centimeters.
  • Data quality issues, such as a broken sensor always reading 0.
  • Natural drift in the data, such as mean temperature changing with the seasons.
  • A change in the relationships between features, or covariate shift.

Detection Methods

For this monitor type, you can select the following detection methods:

Data Drift Detection Methods

  • Anomaly Detection - The distribution of the inspected data is compared to the distribution of data from an earlier time period.
  • Compared To Segment - The distribution of the inspected data is compared to the distribution of a different data segment.
  • Compared To Training - The distribution of the inspected data is compared to the distribution of the reported training data.
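For Anomaly Detection, it can help to picture the two time windows being compared. The sketch below is only an illustration: it borrows the "1d" and "3w" period strings from the REST example later on this page, and it assumes that the baseline ends `skipPeriod` before the current time and extends `timePeriod` further back. The helper names `parse_period` and `anomaly_windows` are hypothetical and not part of Aporia's API.

```python
from datetime import datetime, timedelta

# Unit suffixes seen in the REST example ("1d", "3w"); "h" is an extra assumption.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_period(text):
    """Turn a period string such as '1d' or '3w' into a timedelta."""
    amount, unit = int(text[:-1]), text[-1]
    return timedelta(**{_UNITS[unit]: amount})


def anomaly_windows(now, focal_period, baseline_period, skip_period):
    """Return ((focal_start, focal_end), (baseline_start, baseline_end)).

    Assumption: the baseline ends `skip_period` before `now` and spans
    `baseline_period` backwards from that point.
    """
    focal = (now - parse_period(focal_period), now)
    baseline_end = now - parse_period(skip_period)
    baseline = (baseline_end - parse_period(baseline_period), baseline_end)
    return focal, baseline


now = datetime(2024, 1, 29, 12, 0)
focal, baseline = anomaly_windows(now, "1d", "3w", "1d")
print("focal:   ", focal)
print("baseline:", baseline)
```

Under these assumptions, a one-day skip period keeps the inspected day out of its own baseline, so a sudden change is not averaged into the reference distribution.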

Configuration

Start by choosing the features or raw inputs you'd like to monitor. You can select as many as you want :-)

The monitor compares the distributions of these fields between the inspection period and the baseline you chose. An alert is raised if the monitor finds a drift between these distributions.

Data Drift Configuration

Note that the monitor configuration may vary depending on the detection method you choose.

You can use the monitor preview to experiment with these thresholds until you get a healthy number of alerts. Note that the thresholds are different for numeric fields vs. categorical fields.
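As a rough illustration of per-type thresholds, the sketch below checks each field's drift score against the threshold for its type. The numeric value 0.2 matches the REST example later on this page, while the categorical value, the strict "greater than" comparison, and the function name `fields_to_alert` are assumptions made for this example only.

```python
# Hypothetical thresholds; tune the real ones with the monitor preview.
THRESHOLDS = {"numeric": 0.2, "categorical": 0.3}


def fields_to_alert(drift_scores, field_types, thresholds=THRESHOLDS):
    """Return the fields whose drift score exceeds the threshold for their type."""
    return [
        name
        for name, score in drift_scores.items()
        if score > thresholds[field_types[name]]
    ]


scores = {"age": 0.25, "country": 0.25, "income": 0.1}
types = {"age": "numeric", "country": "categorical", "income": "numeric"}
print(fields_to_alert(scores, types))
```

Here the same score of 0.25 triggers an alert for the numeric field but not for the categorical one, which is why the two thresholds are worth tuning separately.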

How are drifts calculated?

For numeric fields, Aporia detects drifts based on the Jensen–Shannon divergence metric. For categorical fields, it uses the Hellinger distance.
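To make the two metrics concrete, here is a minimal sketch of their textbook definitions, applied to two histograms that share the same bins or categories. It is not Aporia's implementation: the binning, the base-2 logarithm (which bounds the Jensen–Shannon divergence between 0 and 1), and the function names are assumptions for illustration.

```python
import math


def normalize(counts):
    """Convert raw bin or category counts into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2)


# Made-up counts: a numeric field binned into 4 buckets, and a 3-category field.
baseline_bins = normalize([10, 40, 40, 10])
focal_bins = normalize([5, 25, 45, 25])
print("JS divergence:", js_divergence(baseline_bins, focal_bins))

baseline_cats = normalize([70, 20, 10])
focal_cats = normalize([50, 30, 20])
print("Hellinger distance:", hellinger_distance(baseline_cats, focal_cats))
```

Both scores are 0 for identical distributions and reach 1 when the distributions do not overlap at all, which makes them convenient to compare against a fixed threshold.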

If you need to use other metrics, please contact us.

Creating this monitor using the REST API

POST https://app.aporia.com/v1beta/monitors
{
    "name": "Drift in Age feature",
    "type": "data_drift",
    "scheduling": "0 */4 * * *",
    "configuration": {
        "configuration": {
            "focal": {
                "source": "SERVING",
                "timePeriod": "1d",
                "alignBinsWithBaseline": true
            },
            "metric": {
                "type": "histogram"
            },
            "actions": [
                {
                    "type": "ALERT",
                    "schema": "v1",
                    "severity": "HIGH",
                    "alertType": "data_drift_anomaly",
                    "description": "A data drift was detected in feature <b>'{field}'</b>. A drift score of <b>{drift_score}</b> was detected. <br /> The drift was observed in the <b>{model}</b> model, in version <b>{model_version}</b> for the <b>last {focal_time_period} ({focal_times})</b> <b>{focal_segment}</b> compared to the <b>last {baseline_time_period} ({baseline_times})</b>. <br /><br /> Data drift can have a significant effect on model behavior and may lead to unexpected results.<br /><br /> Data drift might occur because: <ul><li>Natural changes in data</li><li>Data store / provider schema changes</li><li>Data store / provider issues</li><li>Data processing issues</li></ul>",
                    "notification": [
                        {
                            "type": "EMAIL",
                            "emails": [
                                "dev@aporia.com"
                            ]
                        }
                    ],
                    "visualization": "distribution_compare_chart"
                }
            ],
            "baseline": {
                "source": "SERVING",
                "skipPeriod": "1d",
                "timePeriod": "3w"
            },
            "logicEvaluations": [
                {
                    "name": "APORIA_DRIFT_SCORE",
                    "thresholds": {
                        "numeric": 0.2
                    }
                }
            ]
        },
        "identification": {
            "models": {
                "id": "seed-0000-fhpy"
            },
            "segment": {
                "group": null
            },
            "features": [
                "numeric_Age"
            ],
            "environment": null
        }
    }
}
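The request above can be sent from any HTTP client. The following sketch uses only the Python standard library and should be read as a starting point rather than an official client: the `Authorization: Bearer` header, the JSON response, and the helper names `make_request` and `create_monitor` are assumptions that are not confirmed on this page, so check your Aporia account settings for the actual authentication scheme.

```python
import json
import urllib.request

API_URL = "https://app.aporia.com/v1beta/monitors"


def make_request(payload, token):
    """Build (but do not send) the POST request for a monitor definition."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {token}",
        },
    )


def create_monitor(payload, token):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(make_request(payload, token)) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder payload: fill "configuration" with the full object shown above.
payload = {
    "name": "Drift in Age feature",
    "type": "data_drift",
    "scheduling": "0 */4 * * *",
    "configuration": {},
}
request = make_request(payload, token="YOUR_API_TOKEN")
print(request.get_method(), request.full_url)
```

Calling `create_monitor(payload, token)` performs the actual network call; building the request separately makes it easy to inspect the body and headers before anything is sent.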