Adding new models

Overview

To add a new model to Aporia using the Python SDK, you'll need to:

  1. Define a serving dataset - This will include the SQL query or path to your serving / inference data.

  2. Define a training dataset (optional) - This will include the SQL query or path to your model's training set.

  3. Define a model resource - The model resource will include the display name and type of the model in Aporia, as well as the link to different versions and their serving / training datasets.

Initialization

Start by creating a new Python file with the following initialization code:

import datetime
import os

from aporia import Aporia, MetricDataset, MetricParameters, TimeRange
import aporia.as_code as aporia

aporia_token = os.environ["APORIA_TOKEN"]
aporia_account = os.environ["APORIA_ACCOUNT"]
aporia_workspace = os.environ["APORIA_WORKSPACE"]

stack = aporia.Stack(
    host="https://platform.aporia.com",  # or "https://platform-eu.aporia.com"
    token=aporia_token,
    account=aporia_account,
    workspace=aporia_workspace,
)

# <Your model definition code goes here>


stack.apply(yes=True, rollback=False, config_path="config.json")
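
The initialization code reads three environment variables with os.environ[...], which raises a KeyError on the first one that is missing. As a minimal sketch, a pre-flight check can report every missing variable at once. The helper name missing_env is hypothetical and is not part of the Aporia SDK.

```python
import os
from typing import Iterable, List, Mapping

# Environment variables read by the initialization code above.
REQUIRED_VARS = ("APORIA_TOKEN", "APORIA_ACCOUNT", "APORIA_WORKSPACE")


def missing_env(
    names: Iterable[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in `environ` (hypothetical helper)."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Running the check before creating the stack gives one clear error message instead of a KeyError per run.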

Defining Datasets

To add a new model to Aporia, start by defining a dataset. Datasets can be used to specify the SQL query or file path for model monitoring.

There are currently two types of datasets in Aporia:

  • Serving dataset - Includes the features and predictions of your model in production, as well as any other metadata you'd like to add for observability.

    • The serving dataset can also include delayed labels / actuals, and Aporia will make sure to refresh this data when it's updated. This is used to calculate performance metrics such as AUC ROC, nDCG@k, and so on.

  • Training dataset (optional) - Includes the features, predictions, and labels from your model's training set.

serving_dataset = aporia.Dataset(
  "my-model-serving",
  
  # Dataset type - can be "serving" or "training"
  type="serving",
  
  # Data source name from the "Integrations" page in Aporia
  # If you prefer to define the data source as code, use the aporia.DataSource(...) API.
  data_source_name="MY_SNOWFLAKE",
  
  # SQL query or S3 path
  connection_data={
    "query": "SELECT * FROM model_predictions"
  },
  
  # Column to be used as a unique prediction ID
  id_column="prediction_id",
  
  # Column to be used as the prediction timestamp
  timestamp_column="prediction_timestamp",
  
  # Raw inputs are used to represent any metadata about the prediction.
  # Optional
  raw_inputs={
    "prediction_latency": "numeric",
    "raw_text": "text",
  },
  
  # Features
  features={
    "age": "numeric",
    "gender": "categorical",
    "text_embedding": "embedding",
    "image_embedding": "embedding",
  },
  
  # Predictions
  predictions={
    "score": "numeric",
  },
  
  # Delayed labels
  actuals={
    "purchased": "boolean",
  },
  actual_mapping={"purchased": "score"},
)

While a dataset represents a specific query or file that is relevant to a specific model, a data source holds the connection details (for example, user and role).

A data source can be shared across many different datasets. A data source is often created once, while datasets are added every time a new model is added.

The data source name must match the name of a data source that exists on the Integrations page in Aporia.

If you're using an SQL-based data source such as Databricks, Snowflake, Postgres, Glue Data Catalog, Athena, Redshift, or BigQuery, then the format of connection_data should be a dict with a query key as shown in the code example above.

If you're using a file data source like S3, Azure Blob Storage, or Google Cloud Storage, the connection_data dictionary should look like this:

aporia.Dataset(
  ...,
  
  connection_data={
    # Files to read
    "regex": "my-model/v1/*.parquet",
    
    # Format of the file
    "object_format": "parquet",  # Can also be "csv" / "delta" / "json"
    
    # For CSV: read the first line of the file as column names? (optional)
    # "header": True
  }
)
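
The two connection_data shapes described above can be sketched as small builder functions. This is an illustration only: sql_connection_data and file_connection_data are hypothetical helpers, not part of the Aporia SDK, and the list of accepted formats is taken from the comment in the example above.

```python
from typing import Any, Dict, Optional

# File formats mentioned in the example above.
FILE_FORMATS = {"parquet", "csv", "delta", "json"}


def sql_connection_data(query: str) -> Dict[str, Any]:
    """Build connection_data for an SQL-based data source (hypothetical helper)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return {"query": query}


def file_connection_data(
    regex: str,
    object_format: str = "parquet",
    header: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build connection_data for a file-based data source (hypothetical helper)."""
    if object_format not in FILE_FORMATS:
        raise ValueError(f"unsupported object_format: {object_format!r}")
    data: Dict[str, Any] = {"regex": regex, "object_format": object_format}
    # "header" is optional, so it is included only when explicitly set.
    if header is not None:
        data["header"] = header
    return data
```

The returned dictionaries can then be passed as the connection_data argument of aporia.Dataset(...).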

Column Mapping

Aporia uses a simple dictionary format to map column names to features, predictions, raw inputs, and actuals.

The following table describes the different types of field groups that exist within Aporia:

| Group       | Description                                                                   | Required |
|-------------|-------------------------------------------------------------------------------|----------|
| Features    | Inputs to the model                                                           | Yes      |
| Predictions | Outputs from the model                                                        | Yes      |
| Raw Inputs  | Any metadata about the prediction. Examples: prediction latency, raw text, or gender (which might not be a feature of the model, but which you still want to monitor for bias & fairness, so it is a good fit for raw inputs) | No |
| Actuals     | Delayed feedback after the prediction, used to calculate performance metrics. | No       |

You can specify each of these field groups as a Python dictionary in the aporia.Dataset(...) parameters. The key represents the column name from the file / SQL query, and the value represents the data type:

aporia.Dataset(
  ...,
  features={
    # columnName -> dataType
    "age": "numeric",
  }
)

Data Types

Each column can be one of the following data types:

| Data Type   | Description                                               | Value examples                 |
|-------------|-----------------------------------------------------------|--------------------------------|
| numeric     | Any continuous variable (e.g., score, age, and so on).    | 53.4, 0.05, 20                 |
| categorical | Any discrete variable (e.g., gender, country, state).     | "US", "California", "5"        |
| boolean     | Any boolean value                                         | true, false, 0, 1              |
| datetime    | Any datetime value                                        | timestamp objects              |
| text        | Raw text                                                  | "Hello, how are you?"          |
| array       | List of discrete categories                               | ["flight1911", "flight2020"]   |
| embedding   | Numeric vectors                                           | [0.58201, 0.293948, ...]       |
| image_url   | Image URLs                                                | https://my-website.com/img.png |
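
Because each field group is a plain dictionary, a typo in a data type name is easy to miss. As a minimal sketch, the mapping can be checked locally against the data types listed above before the dataset is defined. The helper invalid_columns is hypothetical and is not part of the Aporia SDK.

```python
from typing import Dict, List

# Data types listed above.
DATA_TYPES = {
    "numeric", "categorical", "boolean", "datetime",
    "text", "array", "embedding", "image_url",
}


def invalid_columns(mapping: Dict[str, str]) -> List[str]:
    """Return, sorted, the column names whose data type is not in DATA_TYPES."""
    return sorted(col for col, dtype in mapping.items() if dtype not in DATA_TYPES)


features = {
    "age": "numeric",
    "gender": "categorical",
    "text_embedding": "embedding",
}
# All three types are in DATA_TYPES, so nothing is reported.
assert invalid_columns(features) == []
```

A mapping such as {"age": "number"} would be reported, since "number" is not one of the listed types.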

Actuals / Delayed Labels

To calculate performance metrics in Aporia, you can add actuals (or delayed labels) to the prediction data.

This data is usually not stored in the same table as the prediction data, but you can use a SQL JOIN query to merge the feature / prediction data with the actuals. Aporia will take care of refreshing the data when it is updated. If you don't have actuals for a prediction yet, the value of the actual column should be NULL, so it is common to use a LEFT JOIN query like this:

SELECT * FROM model_predictions
LEFT JOIN model_actuals USING (prediction_id)
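
To illustrate why a LEFT JOIN yields NULL for predictions that have no actuals yet, here is a small self-contained sketch using Python's built-in sqlite3 module. The table contents and the score and purchased columns are made up for the example; your own data source and schema will differ. The join column is written as USING (prediction_id), the parenthesized form that SQLite expects.

```python
import sqlite3
from typing import List, Tuple


def join_demo() -> List[Tuple]:
    """Join predictions with actuals in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE model_predictions (prediction_id TEXT PRIMARY KEY, score REAL);
        CREATE TABLE model_actuals (prediction_id TEXT PRIMARY KEY, purchased INTEGER);
        INSERT INTO model_predictions VALUES ('p1', 0.91), ('p2', 0.12);
        -- Only p1 has received its delayed label so far.
        INSERT INTO model_actuals VALUES ('p1', 1);
        """
    )
    rows = conn.execute(
        """
        SELECT prediction_id, score, purchased
        FROM model_predictions
        LEFT JOIN model_actuals USING (prediction_id)
        ORDER BY prediction_id
        """
    ).fetchall()
    conn.close()
    return rows


# p2 is kept in the result, with None (SQL NULL) as its actual.
print(join_demo())
```

An INNER JOIN would drop p2 entirely, so its features and prediction would not be available until the label arrived.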

Then, you can use the actuals and actual_mapping parameters when creating a dataset:

serving_dataset = aporia.Dataset(
  ...,
  predictions={
    "recommended_items": "array",
  },
  actuals={
    "relevant_items": "array",
  },
  actual_mapping={
    # Actual name -> Prediction name
    "relevant_items": "recommended_items"
  },
)
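
Each key of actual_mapping names a column in actuals, and each value names a column in predictions. As a minimal sketch, that consistency can be checked locally before the dataset is defined. The helper mapping_errors is hypothetical and is not part of the Aporia SDK.

```python
from typing import Dict, List


def mapping_errors(
    predictions: Dict[str, str],
    actuals: Dict[str, str],
    actual_mapping: Dict[str, str],
) -> List[str]:
    """Return a message for every name in actual_mapping that is not defined."""
    errors: List[str] = []
    for actual, prediction in actual_mapping.items():
        if actual not in actuals:
            errors.append(f"unknown actual: {actual}")
        if prediction not in predictions:
            errors.append(f"unknown prediction: {prediction}")
    return errors


predictions = {"recommended_items": "array"}
actuals = {"relevant_items": "array"}
actual_mapping = {"relevant_items": "recommended_items"}
# Both names exist, so no errors are reported.
assert mapping_errors(predictions, actuals, actual_mapping) == []
```

The check compares names only. It does not compare data types, since an actual and its prediction may use different types (the earlier example maps a boolean actual to a numeric score).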

Defining models

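Next, to define a model, create an aporia.Version that links to the relevant datasets, wrap it in an aporia.Model object, and add the model to your stack. Here, training_dataset is assumed to be an aporia.Dataset defined like the serving dataset above, but with type="training":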

model_version = aporia.Version(
    "model_version_v1.0.0",
    serving=serving_dataset,
    training=training_dataset,
    name="v1.0.0",
)

model = aporia.Model(
    "My Model",
    type=aporia.ModelType.RANKING,
    versions=[model_version],
)

stack.add(model)
stack.apply(yes=True, rollback=False, config_path="model.json")
