Aporia Documentation

Overview


Last updated 2 years ago

Aporia monitors your models by connecting directly to your data. If you don't store your predictions yet, see our guide on Storing Your Predictions (recommended), or just log them directly to Aporia.

Aporia currently supports the following data sources:

  • Amazon S3

  • BigQuery

  • Redshift

  • Athena

  • Snowflake

  • PostgreSQL

  • Delta Lake

  • Glue Data Catalog

If your storage or database is not listed here, please contact your Aporia account manager for further assistance.

Configure Data Source

Connecting to a data source begins with configuring its connection details. For example, to connect to a Postgres database, we can create the following data source object:

data_source = PostgresJDBCDataSource(
  url="jdbc:postgresql://<POSTGRES_HOSTNAME>/<DBNAME>",
  query="SELECT * FROM model_predictions",
  user="<DB_USER>",
  password="<DB_PASSWORD>"
)

Please refer to the documentation page of the relevant data source for a complete list of supported parameters and configuration options.

Connect Serving Data

After creating a data source, we can create a model version and connect it to the data source. For example:

apr_model = aporia.create_model_version(
  model_id="<MODEL_ID>",
  model_version="v1",
  model_type="binary",

  raw_inputs={
    "raw_text": "text",
  },

  features={
    "amount": "numeric",
    "owner": "string",
    "is_new": "boolean",
    "embeddings": {"type": "tensor", "dimensions": [768]},
  },

  predictions={
    "will_buy_insurance": "boolean",
    "proba": "numeric",
  },
)

apr_model.connect_serving(
  data_source=data_source,

  # Names of the prediction ID and prediction timestamp columns
  id_column="prediction_id",
  timestamp_column="prediction_timestamp",
)

By default, each raw input, feature, and prediction is mapped to the column with the same name in the query results.

As part of the connect_serving API, you must specify the following two additional columns:

  • id_column - A unique ID to represent this prediction.

  • timestamp_column - A column representing when the prediction occurred.
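For illustration, a row returned by the serving query in the Postgres example above might look like the following (all values here are made up):

```python
# Hypothetical row from the "model_predictions" query above; column names
# match the schema fields plus the two required ID/timestamp columns.
row = {
    "prediction_id": "3fa85f64-0000-4000-8000-000000000001",  # id_column
    "prediction_timestamp": "2023-01-15T09:30:00Z",           # timestamp_column
    "raw_text": "customer asked about family coverage",       # raw input
    "amount": 1250.0,                                          # feature (numeric)
    "owner": "alice",                                          # feature (string)
    "is_new": True,                                            # feature (boolean)
    "will_buy_insurance": True,                                # prediction
    "proba": 0.87,                                             # prediction
}

# Both required columns must be present in every row.
assert {"prediction_id", "prediction_timestamp"} <= row.keys()
```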

Integrating Delayed Actuals

Delayed actuals can be integrated using the labels argument of the connect_serving API, which maps each prediction to a column containing its actual (ground truth) value.

For example, let's assume we have two columns - will_buy_insurance (the model prediction) and did_buy_insurance (the ground truth). To integrate them into Aporia:

apr_model = aporia.create_model_version(
  ...
  predictions={
    "will_buy_insurance": "boolean"
  }
)

apr_model.connect_serving(
  data_source=data_source,

  id_column="prediction_id",
  timestamp_column="prediction_timestamp",

  labels={
    # Prediction name -> Column name representing its actual value
    "will_buy_insurance": "did_buy_insurance"
  }
)

The ground truth column can be NULL until the actual value becomes available, and that's okay.
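To sketch why NULL actuals are fine (illustrative Python, not Aporia's implementation): performance metrics are only meaningful over predictions whose actuals have already arrived, so pending rows can simply be excluded.

```python
# Predictions joined with their (possibly not-yet-available) actuals.
rows = [
    {"will_buy_insurance": True,  "did_buy_insurance": True},
    {"will_buy_insurance": False, "did_buy_insurance": None},  # actual still pending
    {"will_buy_insurance": True,  "did_buy_insurance": False},
]

# Compute accuracy only over rows whose ground truth has arrived.
labeled = [r for r in rows if r["did_buy_insurance"] is not None]
accuracy = sum(
    r["will_buy_insurance"] == r["did_buy_insurance"] for r in labeled
) / len(labeled)
print(accuracy)  # 0.5
```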

Connecting Training / Test Sets

To connect your model version to training or test sets, you can use the connect_training and connect_testing APIs.

For example:

# Training set
apr_model.connect_training(
  data_source=training_set_data_source,
  id_column="id",
  timestamp_column="timestamp",
)

# Test set
apr_model.connect_testing(
  data_source=test_set_data_source,
  id_column="id",
  timestamp_column="timestamp",
)

Advanced Mapping

Any column that has the same name as a raw input, feature, or prediction in the model schema is mapped to the corresponding raw input, feature, or prediction.

However, you can override this mapping using the raw_inputs, features, predictions, and labels arguments to the connect_serving / connect_training / connect_testing APIs. Example:

apr_model.connect_serving(
  data_source=aporia.GlueDataSource(
    database="datalake",
    query="""
      SELECT
        my_id,
        full_name,
        age,
        my_gender_col,
        decision,
        was_decision_correct,
        occurred_at
      FROM predictions
    """,
  ),

  id_column="my_id",
  timestamp_column="occurred_at",
  raw_inputs={
    "fullname": "full_name",
  },
  features={
    "age": "age",
    "gender": "my_gender_col",
  },
  predictions={
    "will_buy_insurance": "decision",
  },
  labels={
    "will_buy_insurance": "was_decision_correct"
  }
)
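The mapping rule above can be sketched as follows (illustrative only, not the SDK's actual code): each schema field resolves to the column of the same name unless an explicit override is provided.

```python
def resolve_column(field: str, overrides: dict) -> str:
    """Return the column a schema field maps to: the override if one
    exists, otherwise the field's own name (the default mapping)."""
    return overrides.get(field, field)

features_overrides = {"gender": "my_gender_col"}  # from the example above
assert resolve_column("age", features_overrides) == "age"               # default
assert resolve_column("gender", features_overrides) == "my_gender_col"  # override
```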