Storing your Predictions

Overview

Monitoring your Machine Learning models begins with storing their inputs and outputs in production.

Oftentimes, this data is used not just for model monitoring, but also for retraining, auditing, and other purposes; therefore, it is crucial that you have complete control over it.

Aporia monitors your models by connecting directly to your data, in your format. This section discusses the fundamentals of storing model predictions.

Storage

Depending on your existing enterprise data lake infrastructure, performance requirements, and cloud cost constraints, your predictions can be stored in a variety of data stores.

Here are some common options:

  • Delta Lake / Databricks Lakehouse (a short write sketch follows this list)

  • Snowflake / BigQuery

  • Elasticsearch / OpenSearch

  • Parquet files on S3 / GCS / ABS

    • If you choose this option, a metastore such as Glue Data Catalog is recommended.
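As an illustration of the first option, here is a minimal sketch of appending a batch of predictions to a Delta Lake table with PySpark. It assumes a Spark session that is already configured for Delta Lake and S3 access (for example a Databricks cluster, or delta-spark set up locally); the table path and column names are placeholders, not a required schema.

from pyspark.sql import SparkSession

# Assumes an existing Spark session already configured for Delta Lake and S3.
spark = SparkSession.builder.getOrCreate()

# A small batch of predictions; in practice this comes from your serving code.
batch = spark.createDataFrame(
    [
        (1, "2022-10-19T14:21:08Z", 0.58),
        (2, "2022-10-19T14:21:08Z", 0.64),
    ],
    ["id", "timestamp", "prediction_score"],
)

# Append the batch to a Delta table that the monitoring platform can later
# connect to directly.
batch.write.format("delta").mode("append").save("s3://myorg-models/my-model/v1/serving_delta")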

Directory Structure

When storing your predictions, it's highly recommended to adopt a standardized directory structure (or SQL table structure) across all of your organization's models.

With a standardized structure, you can onboard all of your models to the monitoring system automatically.

Here is a very basic example:

s3://myorg-models/
└── my-model/
    ├── v1/
    │   ├── train.parquet
    │   ├── test.parquet
    │   ├── serving.parquet
    │   └── artifact.onnx
    └── v2/
        ├── train.parquet
        ├── test.parquet
        ├── serving.parquet
        └── artifact.onnx

Even though this section focuses on the storage of predictions, you should also consider saving the training and test sets of your models. They can serve as a monitoring baseline.
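As a rough sketch of how one model version could be written out under this layout, assuming pandas with pyarrow and s3fs installed (the DataFrames below are placeholders for your real training, test, and serving data):

import pandas as pd

# Placeholder DataFrames -- in practice these come from your training pipeline
# and your serving logs.
train_df = pd.DataFrame({"features.age": [64], "predictions.score": [0.58]})
test_df = pd.DataFrame({"features.age": [62], "predictions.score": [0.64]})
serving_df = pd.DataFrame({"features.age": [70], "predictions.score": [0.61]})

# Write each dataset under the standardized per-model, per-version prefix
# (s3fs lets pandas write directly to S3 paths).
base = "s3://myorg-models/my-model/v2"
train_df.to_parquet(f"{base}/train.parquet")
test_df.to_parquet(f"{base}/test.parquet")
serving_df.to_parquet(f"{base}/serving.parquet")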

Data Structure

Recommendations:

  • One row per prediction.

  • One column per feature, prediction, or raw input.

  • Use a prefix in column names to identify their group (e.g. features., raw_inputs., predictions., actuals.).

  • For serving data, add an ID column and a prediction timestamp column.

Example:

+-----+----------------------+-------------------+---------------+----------------+-------------------+-------------------------+--------------+----------------------+------------------------+
| id  |      timestamp       | predictions.score | actuals.score | raw_inputs.age | raw_inputs.gender | features.my_embeddings  | features.age | features.gender_male | features.gender_female |
+-----+----------------------+-------------------+---------------+----------------+-------------------+-------------------------+--------------+----------------------+------------------------+
|   1 | 2022-10-19T14:21:08Z |              0.58 |          0.59 |             64 | male              | [0.58, 0.19, 0.38, ...] |           64 |                    1 |                      0 |
|   2 | 2022-10-19T14:21:08Z |              0.64 |          0.66 |             62 | woman             | [0.48, 0.20, 0.42, ...] |           62 |                    0 |                      1 |
| ... | ...                  |               ... |           ... |            ... | ...               | ...                     |          ... |                  ... |                    ... |
+-----+----------------------+-------------------+---------------+----------------+-------------------+-------------------------+--------------+----------------------+------------------------+
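To make these recommendations concrete, here is a minimal sketch of assembling a serving table like the one above with pandas. The column names follow the prefix convention described earlier; the values, and the use of pandas itself, are illustrative choices rather than requirements.

import pandas as pd

# One row per prediction, with prefixed column groups plus an explicit
# prediction id and timestamp.
serving = pd.DataFrame(
    [
        {
            "id": 1,
            "timestamp": "2022-10-19T14:21:08Z",
            "predictions.score": 0.58,
            "actuals.score": 0.59,
            "raw_inputs.age": 64,
            "raw_inputs.gender": "male",
            "features.my_embeddings": [0.58, 0.19, 0.38],  # stored as a list-valued column
            "features.age": 64,
            "features.gender_male": 1,
            "features.gender_female": 0,
        },
    ]
)

# Parse the timestamp so downstream tools see a proper datetime type.
serving["timestamp"] = pd.to_datetime(serving["timestamp"])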