Adding new models
Overview
To add a new model to Aporia using the Python SDK, you'll need to:
Define a serving dataset - This will include the SQL query or path to your serving / inference data.
Define a training dataset (optional) - This will include the SQL query or path to your model's training set.
Define a model resource - The model resource will include the display name and type of the model in Aporia, as well as the link to different versions and their serving / training datasets.
Initialization
Start by creating a new Python file with the following initialization code:
Defining Datasets
To add a new model to Aporia, start by defining a dataset. Datasets can be used to specify the SQL query or file path for model monitoring.
There are currently two types of datasets in Aporia:
Serving dataset - Includes the features and predictions of your model in production, as well as any other metadata you'd like to add for observability.
The serving dataset can also include delayed labels / actuals, and Aporia will make sure to refresh this data when it's updated. This is used to calculate performance metrics such as AUC ROC, nDCG@k, and so on.
Training dataset (optional) - Includes the features, predictions, and labels of your model from the training set.
While a dataset represents a specific query or file that's relevant to a specific model, a data source holds the connection details (such as user and role).
A data source can be shared across many different datasets. A data source is often created once, while datasets are added every time a new model is added.
The name of the data source should be identical to a data source that exists in the Integrations page in Aporia.
If you're using an SQL-based data source such as Databricks, Snowflake, Postgres, Glue Data Catalog, Athena, Redshift, or BigQuery, then connection_data should be a dict with a query key, as shown in the code example above.
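As a minimal sketch of the SQL case, the snippet below builds a connection_data dictionary with a single query key. The helper name sql_connection_data and the table name model_predictions are hypothetical and not part of the Aporia SDK; only the query key comes from the description above.

```python
def sql_connection_data(query: str) -> dict:
    """Return a connection_data dict for an SQL-based data source.

    Hypothetical helper (not part of the Aporia SDK): it only wraps the
    query string under the "query" key described in the text.
    """
    query = query.strip()
    if not query:
        raise ValueError("query must be a non-empty SQL string")
    return {"query": query}


# Hypothetical table name, for illustration only.
connection_data = sql_connection_data("SELECT * FROM model_predictions")
print(connection_data)
```

The resulting dictionary is what you would pass as the connection_data value when defining a dataset on an SQL-based data source.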
If you're using a file data source like S3, Azure Blob Storage, or Google Cloud Storage, the connection_data dictionary should look like this:
Column Mapping
Aporia uses a simple dictionary format to map column names to features, predictions, raw inputs, and actuals.
Here's a table describing the different types of field groups that exist within Aporia:
| Field group | Description | Required |
| --- | --- | --- |
| Features | Inputs to the model | Yes |
| Predictions | Outputs from the model | Yes |
| Raw Inputs | Any metadata about the prediction. Examples: prediction latency; raw text; gender (might not be a feature of the model, but you still want to monitor it for bias & fairness, so it is a good fit for raw inputs) | No |
| Actuals | Delayed feedback after the prediction, used to calculate performance metrics | No |
You can specify each of these field groups as a Python dictionary in the aporia.Dataset(...) parameters. The key represents the column name from the file / SQL query, and the value represents the data type:
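The dictionary shape described above can be sketched with plain Python. All column names here are hypothetical, and the outer keys are just labels for the four field groups rather than confirmed SDK parameter names. The small check assumes that features and predictions are the two groups that must be present.

```python
# Hypothetical column names. Each key is a column from the file / SQL query,
# and each value is that column's data type.
field_groups = {
    "features": {"age": "numeric", "country": "categorical"},
    "predictions": {"will_buy": "boolean", "score": "numeric"},
    "raw_inputs": {"review_text": "text"},
    "actuals": {},  # optional group, left empty in this sketch
}

# Assumption: features and predictions are the two required groups.
REQUIRED_GROUPS = ("features", "predictions")


def missing_required_groups(groups: dict) -> list:
    """Return the names of required groups that are absent or empty."""
    return [name for name in REQUIRED_GROUPS if not groups.get(name)]


missing = missing_required_groups(field_groups)
print(missing)
```

Running a check like this before creating the dataset catches an empty or forgotten group early, instead of at apply time.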
Data Types
Each column can be one of the following data types:
| Data type | Description | Example values |
| --- | --- | --- |
| numeric | Any continuous variable (e.g. score, age, and so on) | 53.4, 0.05, 20 |
| categorical | Any discrete variable (e.g. gender, country, state, etc.) | "US", "California", "5" |
| boolean | Any boolean value | true, false, 0, 1 |
| datetime | Any datetime value | timestamp objects |
| text | Raw text | "Hello, how are you?" |
| array | List of discrete categories | ["flight1911", "flight2020"] |
| embedding | Numeric vectors | [0.58201, 0.293948, ...] |
| image_url | Image URLs | https://my-website.com/img.png |
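Because the data type is given as a plain string, a typo such as "string" instead of "text" is easy to make. The following is a small local check, not part of the Aporia SDK, that compares a column mapping against the eight type names listed above. The function name and the sample columns are hypothetical.

```python
# The eight data type names listed in this section.
SUPPORTED_TYPES = {
    "numeric", "categorical", "boolean", "datetime",
    "text", "array", "embedding", "image_url",
}


def unsupported_columns(mapping: dict) -> dict:
    """Return the columns whose declared type is not a supported type name."""
    return {
        column: dtype
        for column, dtype in mapping.items()
        if dtype not in SUPPORTED_TYPES
    }


# Hypothetical feature mapping with one misspelled type.
features = {"age": "numeric", "country": "categorical", "review": "string"}
bad = unsupported_columns(features)
print(bad)
```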
Actuals / Delayed Labels
To calculate performance metrics in Aporia, you can add actuals (or delayed labels) to the prediction data.
While this data is usually not in the same table as the prediction data, you can use a SQL JOIN query to merge the feature / prediction data with the actuals. Aporia will take care of refreshing the data when it is updated. If you don't have actuals for a prediction yet, the value for the actual should be NULL. It is therefore common to use a LEFT JOIN query like this:
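To see why a LEFT JOIN produces NULL for predictions that have no actuals yet, here is a minimal sketch that runs locally against an in-memory SQLite database. The table names (predictions, labels) and column names are hypothetical. In practice the same SELECT would run against your own warehouse tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE predictions (id INTEGER PRIMARY KEY, score REAL, will_buy INTEGER);
    CREATE TABLE labels (prediction_id INTEGER, will_buy_actual INTEGER);
    INSERT INTO predictions VALUES (1, 0.91, 1), (2, 0.12, 0), (3, 0.55, 1);
    -- Prediction 3 has no label yet.
    INSERT INTO labels VALUES (1, 1), (2, 0);
    """
)

# LEFT JOIN keeps every prediction row; missing labels come back as NULL.
query = """
    SELECT p.id, p.score, p.will_buy, l.will_buy_actual
    FROM predictions AS p
    LEFT JOIN labels AS l ON l.prediction_id = p.id
    ORDER BY p.id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

An INNER JOIN would drop prediction 3 entirely, which would hide unlabeled predictions from monitoring until their labels arrive.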
Then, you can use the actuals and actual_mapping parameters when creating a dataset:
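As a rough sketch of how these two parameters relate, the snippet below uses plain dictionaries. It assumes that actuals maps each actual column to its data type (like the other field groups) and that actual_mapping maps each actual column to the prediction column it corresponds to. That direction is an assumption here, so check it against the SDK reference. The column names and the helper function are hypothetical.

```python
# Hypothetical column names.
predictions = {"will_buy": "boolean"}
actuals = {"will_buy_actual": "boolean"}
# Assumed direction: actual column -> prediction column it provides feedback for.
actual_mapping = {"will_buy_actual": "will_buy"}


def check_actual_mapping(predictions: dict, actuals: dict, mapping: dict) -> list:
    """Return a list of problems found in the mapping (empty if consistent)."""
    problems = []
    for actual_col, prediction_col in mapping.items():
        if actual_col not in actuals:
            problems.append(f"unknown actual column: {actual_col}")
        if prediction_col not in predictions:
            problems.append(f"unknown prediction column: {prediction_col}")
    return problems


problems = check_actual_mapping(predictions, actuals, actual_mapping)
print(problems)
```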
Defining models
Next, to define a model, create an aporia.Model object with links to the relevant datasets, and add it to your stack: