Aporia Documentation

Overview


Last updated 2 years ago

Aporia monitors your models by connecting directly to your data. If you don't store your predictions yet, see our guide on Storing Your Predictions (recommended), or just log them directly to Aporia.

Aporia currently supports the following data sources:

  • Amazon S3

  • BigQuery

  • Redshift

  • Athena

  • Snowflake

  • PostgreSQL

  • Delta Lake

  • Glue Data Catalog

If your storage or database is not listed here, please contact your Aporia account manager for further assistance.

Configure Data Source

Connecting to a data source begins with configuring its connection details. For example, to connect to a Postgres database, we can create the following data source object:

data_source = PostgresJDBCDataSource(
  url="jdbc:postgresql://<POSTGRES_HOSTNAME>/<DBNAME>",
  query="SELECT * FROM model_predictions",
  user="<DB_USER>",
  password="<DB_PASSWORD>"
)

Please refer to the documentation page of the relevant data source for a complete list of supported parameters and configuration options.

Connect Serving Data

After creating a data source, we can create a model version and connect it to the data source. For example:

apr_model = aporia.create_model_version(
  model_id="<MODEL_ID>",
  model_version="v1",
  model_type="binary",

  raw_inputs={
    "raw_text": "text",
  },

  features={
    "amount": "numeric",
    "owner": "string",
    "is_new": "boolean",
    "embeddings": {"type": "tensor", "dimensions": [768]},
  },

  predictions={
    "will_buy_insurance": "boolean",
    "proba": "numeric",
  },
)

apr_model.connect_serving(
  data_source=data_source,

  # Names of the prediction ID and prediction timestamp columns
  id_column="prediction_id",
  timestamp_column="prediction_timestamp",
)

By default, each raw input, feature, and prediction is mapped to the column with the same name in the query results.

As part of the connect_serving API, you must specify the following two additional columns:

  • id_column - A unique ID to represent this prediction.

  • timestamp_column - A column representing when the prediction occurred.
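For illustration, a row returned by the serving query in the Postgres example above might look like the following (all values here are made up):

```python
# Hypothetical row from the "model_predictions" query above; column names
# match the schema fields plus the two required ID/timestamp columns.
row = {
    "prediction_id": "3fa85f64-0000-4000-8000-000000000001",  # id_column
    "prediction_timestamp": "2023-01-15T09:30:00Z",           # timestamp_column
    "raw_text": "customer asked about family coverage",       # raw input
    "amount": 1250.0,                                          # feature (numeric)
    "owner": "alice",                                          # feature (string)
    "is_new": True,                                            # feature (boolean)
    "will_buy_insurance": True,                                # prediction
    "proba": 0.87,                                             # prediction
}

# Both required columns must be present in every row.
assert {"prediction_id", "prediction_timestamp"} <= row.keys()
```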

Integrating Delayed Actuals

Delayed actuals can be integrated using the labels argument of the connect_serving API, which maps each prediction to a column containing its actual (ground truth) value.

For example, let's assume we have two columns - will_buy_insurance (the model prediction) and did_buy_insurance (the ground truth). To integrate them into Aporia:

apr_model = aporia.create_model_version(
  ...
  predictions={
    "will_buy_insurance": "boolean"
  }
)

apr_model.connect_serving(
  data_source=data_source,

  id_column="prediction_id",
  timestamp_column="prediction_timestamp",

  labels={
    # Prediction name -> Column name representing its actual value
    "will_buy_insurance": "did_buy_insurance"
  }
)

The ground truth column can be NULL until the actual value becomes available, and that's okay.
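To sketch why NULL actuals are fine (illustrative Python, not Aporia's implementation): performance metrics are only meaningful over predictions whose actuals have already arrived, so pending rows can simply be excluded.

```python
# Predictions joined with their (possibly not-yet-available) actuals.
rows = [
    {"will_buy_insurance": True,  "did_buy_insurance": True},
    {"will_buy_insurance": False, "did_buy_insurance": None},  # actual still pending
    {"will_buy_insurance": True,  "did_buy_insurance": False},
]

# Compute accuracy only over rows whose ground truth has arrived.
labeled = [r for r in rows if r["did_buy_insurance"] is not None]
accuracy = sum(
    r["will_buy_insurance"] == r["did_buy_insurance"] for r in labeled
) / len(labeled)
print(accuracy)  # 0.5
```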

Connecting Training / Test Sets

To connect your model version to training or test sets, you can use the connect_training and connect_testing APIs.

For example:

# Training set
apr_model.connect_training(
  data_source=training_set_data_source,
  id_column="id",
  timestamp_column="timestamp",
)

# Test set
apr_model.connect_testing(
  data_source=test_set_data_source,
  id_column="id",
  timestamp_column="timestamp",
)

Advanced Mapping

Any column that has the same name as a raw input, feature, or prediction in the model schema is mapped to the corresponding raw input, feature, or prediction.

However, you can override this mapping using the raw_inputs, features, predictions, and labels arguments to the connect_serving / connect_training / connect_testing APIs. Example:

apr_model.connect_serving(
  data_source=aporia.GlueDataSource(
    database="datalake",
    query="""
      SELECT
        my_id,
        full_name,
        age,
        my_gender_col,
        decision,
        was_decision_correct,
        occurred_at
      FROM predictions
    """,
  ),

  id_column="my_id",
  timestamp_column="occurred_at",
  raw_inputs={
    "fullname": "full_name",
  },
  features={
    "age": "age",
    "gender": "my_gender_col",
  },
  predictions={
    "will_buy_insurance": "decision",
  },
  labels={
    "will_buy_insurance": "was_decision_correct"
  }
)
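The mapping rule above can be sketched as follows (illustrative only, not the SDK's actual code): each schema field resolves to the column of the same name unless an explicit override is provided.

```python
def resolve_column(field: str, overrides: dict) -> str:
    """Return the column a schema field maps to: the override if one
    exists, otherwise the field's own name (the default mapping)."""
    return overrides.get(field, field)

features_overrides = {"gender": "my_gender_col"}  # from the example above
assert resolve_column("age", features_overrides) == "age"               # default
assert resolve_column("gender", features_overrides) == "my_gender_col"  # override
```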