Adding new models
To add a new model to Aporia using the Python SDK, you'll need to:
1. Define a serving dataset - This includes the SQL query or file path to your serving / inference data.
2. Define a training dataset (optional) - This includes the SQL query or file path to your model's training set.
3. Define a model resource - The model resource includes the display name and type of the model in Aporia, as well as links to its different versions and their serving / training datasets.
Start by creating a new Python file with the following initialization code:
```python
import datetime
import os

from aporia import Aporia, MetricDataset, MetricParameters, TimeRange
import aporia.as_code as aporia

aporia_token = os.environ["APORIA_TOKEN"]
aporia_account = os.environ["APORIA_ACCOUNT"]
aporia_workspace = os.environ["APORIA_WORKSPACE"]

stack = aporia.Stack(
    host="https://platform.aporia.com",  # or "https://platform-eu.aporia.com"
    token=aporia_token,
    account=aporia_account,
    workspace=aporia_workspace,
)

# <Your model definition code goes here>

stack.apply(yes=True, rollback=False, config_path="config.json")
```
Once the stack is initialized, start by defining a dataset. A dataset specifies the SQL query or file path that Aporia should monitor for your model.
There are currently two types of datasets in Aporia:
- Serving dataset - Includes the features and predictions of your model in production, as well as any other metadata you'd like to add for observability.
  - The serving dataset can also include delayed labels / actuals, and Aporia will make sure to refresh this data when it's updated. Actuals are used to calculate performance metrics such as AUC ROC, nDCG@k, and so on.
- Training dataset (optional) - Includes the features, predictions, and labels of your model's training set.
```python
serving_dataset = aporia.Dataset(
    "my-model-serving",
    # Dataset type - can be "serving" or "training"
    type="serving",
    # Data source name from the "Integrations" page in Aporia.
    # If you prefer to define the data source as code, use the aporia.DataSource(...) API.
    data_source_name="MY_SNOWFLAKE",
    # SQL query or S3 path
    connection_data={
        "query": "SELECT * FROM model_predictions"
    },
    # Column to be used as a unique prediction ID
    id_column="prediction_id",
    # Column to be used as the prediction timestamp
    timestamp_column="prediction_timestamp",
    # Raw inputs represent any metadata about the prediction (optional)
    raw_inputs={
        "prediction_latency": "numeric",
        "raw_text": "text",
    },
    # Features
    features={
        "age": "numeric",
        "gender": "categorical",
        "text_embedding": "embedding",
        "image_embedding": "embedding",
    },
    # Predictions
    predictions={
        "score": "numeric",
    },
    # Delayed labels
    actuals={
        "purchased": "boolean",
    },
    # Actual name -> prediction name
    actual_mapping={"purchased": "score"},
)
```
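The model definition later in this guide also links a training dataset. As a minimal sketch, assuming training datasets accept the same parameters as serving datasets (the query, table, and column names below are illustrative), it could look like this:

```python
# A minimal sketch of a training dataset. The query and column names are
# illustrative - adjust them to match your own training data.
training_dataset = aporia.Dataset(
    "my-model-training",
    type="training",
    data_source_name="MY_SNOWFLAKE",
    connection_data={
        "query": "SELECT * FROM model_training_data"
    },
    id_column="prediction_id",
    timestamp_column="prediction_timestamp",
    features={
        "age": "numeric",
        "gender": "categorical",
    },
    predictions={
        "score": "numeric",
    },
)
```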
While a dataset represents a specific query or file that's relevant to a specific model, a data source holds the connection details (e.g. user, role, etc.).
A data source can be shared across many different datasets: it is typically created once, while datasets are added every time a new model is added.
The name of the data source must be identical to a data source that exists in the Integrations page in Aporia.
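If you prefer to define the data source as code rather than through the Integrations page, the SDK exposes an `aporia.DataSource(...)` API (referenced in the dataset example above). The skeleton below is only a hypothetical sketch: the positional name argument mirrors the `data_source_name` used earlier, and everything else (connection parameters, adding it to the stack) is an assumption that depends on your data source type.

```python
# Hypothetical skeleton - the aporia.DataSource(...) API is referenced above,
# but its exact parameters depend on the data source type and are elided here.
data_source = aporia.DataSource(
    "MY_SNOWFLAKE",
    ...,  # connection details (user, role, etc.) for your data source type
)
stack.add(data_source)  # assuming data sources are added to the stack like models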

If you're using an SQL-based data source such as Databricks, Snowflake, Postgres, Glue Data Catalog, Athena, Redshift, or BigQuery, the format of `connection_data` should be a dict with a `query` key, as shown in the code example above. If you're using a file data source such as S3, Azure Blob Storage, or Google Cloud Storage, the `connection_data` dictionary should look like this:

```python
aporia.Dataset(
    ...,
    connection_data={
        # Files to read
        "regex": "my-model/v1/*.parquet",
        # Format of the files - can also be "csv" / "delta" / "json"
        "object_format": "parquet",
        # For CSV: read the first line of the file as column names? (optional)
        # "header": True,
    }
)
```
Aporia uses a simple dictionary format to map column names to features, predictions, raw inputs, and actuals.
Here's a table describing the different field groups that exist within Aporia:

| Group | Description | Required |
|---|---|---|
| Features | Inputs to the model | Yes |
| Predictions | Outputs from the model | Yes |
| Raw Inputs | Any metadata about the prediction, e.g. prediction latency, raw text, or an attribute like gender that might not be a feature of the model but that you still want to monitor for bias & fairness | No |
| Actuals | Delayed feedback after the prediction, used to calculate performance metrics | No |
You can specify each of these field groups as a Python dictionary in the `aporia.Dataset(...)` parameters. The key represents the column name from the file / SQL query, and the value represents the data type:

```python
aporia.Dataset(
    ...,
    features={
        # columnName -> dataType
        "age": "numeric",
    }
)
```
Each column can be one of the following data types:
| Data Type | Description | Value examples |
|---|---|---|
| numeric | Any continuous variable (e.g. score, age) | 53.4, 0.05, 20 |
| categorical | Any discrete variable (e.g. gender, country, state) | "US", "California", "5" |
| boolean | Any boolean value | true, false, 0, 1 |
| datetime | Any datetime value | timestamp objects |
| text | Raw text | "Hello, how are you?" |
| array | List of discrete categories | ["flight1911", "flight2020"] |
| embedding | Numeric vector | [0.58201, 0.293948, ...] |
| image_url | Image URL | "https://my-website.com/img.png" |
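As an illustration, a dataset mixing several of these data types might look like the sketch below (all column names here are hypothetical):

```python
# Hypothetical column names, shown only to illustrate the data types above.
aporia.Dataset(
    ...,
    raw_inputs={
        "request_time": "datetime",
        "user_review": "text",
    },
    features={
        "age": "numeric",
        "country": "categorical",
        "viewed_items": "array",
        "profile_embedding": "embedding",
        "avatar_image": "image_url",
    },
    predictions={
        "is_churned": "boolean",
    },
)
```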
To calculate performance metrics in Aporia, you can add actuals (or delayed labels) to the prediction data.
This data usually isn't in the same table as the predictions, so you can use a SQL `JOIN` query to merge the feature / prediction data with the actuals. Aporia will take care of refreshing the data when it is updated. If you don't have actuals for a prediction yet, the actual columns should be NULL, which is why it's common to use a `LEFT JOIN` query like this:

```sql
SELECT * FROM model_predictions
LEFT JOIN model_actuals USING (prediction_id)
```
Then, you can use the `actuals` and `actual_mapping` parameters when creating a dataset:

```python
serving_dataset = aporia.Dataset(
    ...,
    predictions={
        "recommended_items": "array",
    },
    actuals={
        "relevant_items": "array",
    },
    actual_mapping={
        # Actual name -> prediction name
        "relevant_items": "recommended_items"
    },
)
```
Next, to define a model, create an `aporia.Model` object with links to the relevant datasets, and add it to your stack:

```python
model_version = aporia.Version(
    "model_version_v1.0.0",
    serving=serving_dataset,
    training=training_dataset,  # optional - the training dataset from step 2
    name="v1.0.0",
)

model = aporia.Model(
    "My Model",
    type=aporia.ModelType.RANKING,
    versions=[model_version],
)

stack.add(model)
stack.apply(yes=True, rollback=False, config_path="model.json")
```
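Running this file applies the stack and creates the model, its version, and the linked datasets in Aporia. The script reads `APORIA_TOKEN`, `APORIA_ACCOUNT`, and `APORIA_WORKSPACE` from the environment, so make sure these are set before running it.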