Concepts

Model Pipeline

Models

In Aporia, a model is any system that can make predictions and can be improved through the use of data.

We use this broad definition in order to support a large number of use cases:

  • A simple PyTorch model is a valid Aporia model.
  • An ensemble of 15 XGBoost models, 37 LightGBM models and a few deterministic algorithms is also a valid Aporia model.
  • An evolutionary algorithm can also be a valid Aporia model, though we have focused less on this kind of use case.

Aporia models usually serve specific business use cases: Fraud Detection, Credit Risk, Patient Diagnosis, Churn Prediction, LTV, etc.

Model Version

Each model in Aporia contains one or more model versions. Whenever you train or retrain your model, you should create a new model version in Aporia.

When creating a new model version in Aporia, you can specify its schema, which defines the inputs and outputs of the model. This schema is usually inferred automatically.

Model Types

Models can solve different kinds of problems:

  • Binary classification models predict a binary outcome (one of two possible classes)
    • "Is this email spam or not spam?"
  • Multiclass classification models predict one of more than two possible classes
    • "Is this product a book, movie, or clothing?"
  • Regression models predict a numeric value
    • "What will the temperature be in NYC tomorrow?"

We call these model types. We currently support three model types: binary, multiclass and regression.

Training and Test Sets

Before training a model, the available data is often split into several distinct data sets:

  • A training set, which is used to train/fit the model
  • A test set, which is used to provide an unbiased evaluation of the final model after it has been fit on the training set
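
As a rough illustration of such a split, here is a minimal sketch in plain Python. The `train_test_split` helper, the 80/20 ratio and the fixed seed are assumptions made for this example; they are not part of Aporia.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for real data rows
train_set, test_set = train_test_split(rows)
```

The two sets never overlap, which is what makes the evaluation on the test set unbiased.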

Features

In Aporia, the direct inputs to models are called features. For example, a patient diagnosis model might have features such as age, gender, U.S. state (maybe one-hot encoded), etc.

Predictions

The outputs of your model, and any other values you may produce from them, are referred to as the predictions made by the model.

If there's a value generated directly by your model, or by some transformation performed later, you can log that value as a prediction to Aporia to monitor it.

Raw Inputs

Features don't appear out of thin air - they are usually constructed by applying various transformations to raw data that was either part of a dataset or received at runtime from a user.

We refer to that raw data as the raw_inputs of a model.

For example, your dataset might contain a column with state names - you will then have some preprocessing code that converts that raw input into features using one-hot encoding.
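
The relationship between a raw input and the features derived from it can be sketched as follows. The `one_hot` helper, the `state_` prefix and the three-state category list are hypothetical and exist only for illustration.

```python
def one_hot(value, categories):
    """Return one 0/1 feature per known category."""
    return {f"state_{c}": int(value == c) for c in categories}

STATES = ["CA", "NY", "TX"]  # assumed category list for this example

raw_input = {"state": "NY", "age": 42}

# The model never sees "NY" directly; it sees the encoded features.
features = {"age": raw_input["age"], **one_hot(raw_input["state"], STATES)}
# features == {"age": 42, "state_CA": 0, "state_NY": 1, "state_TX": 0}
```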

Actuals

In many systems, a model will try to predict an outcome which can later be verified - for example, whether or not a user will buy an insurance policy.

In such cases, we refer to the real-world outcome as the actual value of the prediction.
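
Because actuals typically arrive later than predictions, the two usually have to be matched up before they can be compared. A minimal sketch, assuming each prediction carries a unique id (the ids and values below are made up):

```python
# Predictions logged at inference time, keyed by a hypothetical prediction id.
predictions = {"p1": True, "p2": False, "p3": True}

# Actuals reported later; the outcome for "p2" is not known yet.
actuals = {"p1": True, "p3": False}

# Only predictions that already have an actual can be evaluated.
matched = [(predictions[i], actuals[i]) for i in predictions if i in actuals]
accuracy = sum(p == a for p, a in matched) / len(matched)
```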

Data Segment

A data segment is a subgroup of your data, defined by a filter on one or more features.

For example, let's consider a group of people, 20% of whom enjoy eating chips. If we look at a sub-group (data segment) of the entire group, to whom the filter "people who enjoy ice cream" applies, we might discover that 60% of the people in that sub-group enjoy chips.

This example shows us that examining your data using various data segments might change your perspective, and lead to new insights.
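
The chips example can be reproduced with a small made-up dataset. The 20 records below are invented so that the numbers match the example; the field names are hypothetical.

```python
# 20 people: 4 enjoy chips overall, and 3 of the 5 ice cream fans enjoy chips.
people = (
    [{"ice_cream": True, "chips": True}] * 3
    + [{"ice_cream": True, "chips": False}] * 2
    + [{"ice_cream": False, "chips": True}] * 1
    + [{"ice_cream": False, "chips": False}] * 14
)

def chips_rate(group):
    """Fraction of the group that enjoys chips."""
    return sum(p["chips"] for p in group) / len(group)

# The data segment is simply the rows that pass the filter.
ice_cream_segment = [p for p in people if p["ice_cream"]]

overall_rate = chips_rate(people)
segment_rate = chips_rate(ice_cream_segment)
```

Here `overall_rate` is 20% while `segment_rate` is 60%, even though both are computed from the same data.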

Environment

Models are deployed to an environment, where they are used to make predictions.

The most commonly used environments are staging and production, but users can specify any environment they desire.

Field Types

Each field (feature, prediction value, raw input or actual value) has a field type that must be explicitly defined during model version creation.

The available field types are:

  • numeric - valid examples: 1, 2.87, 0.53, 300.13
  • boolean - valid examples: True, False
  • categorical - valid examples: 1, 2, 3. The categories must be numbers.
  • string - a categorical field with string values, e.g. U.S. state.
  • datetime - this can contain either Python datetime objects, or an ISO-8601 timestamp string
  • vector - currently not supported in predictions
  • text - freeform text
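
To make the field types more concrete, here is a rough sketch of a check that a value fits a declared field type. This is not Aporia's SDK: the `matches_field_type` function, the schema dictionary and the treatment of a vector as a list of numbers are all assumptions made for illustration.

```python
from datetime import datetime

def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def matches_field_type(value, field_type):
    """Check a value against one of the field types listed above."""
    if field_type in ("numeric", "categorical"):
        return _is_number(value)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type in ("string", "text"):
        return isinstance(value, str)
    if field_type == "datetime":
        if isinstance(value, datetime):
            return True
        if isinstance(value, str):
            try:
                # Accepts common ISO-8601 forms; coverage varies by Python version.
                datetime.fromisoformat(value)
                return True
            except ValueError:
                return False
        return False
    if field_type == "vector":
        # Assumption: a vector is a list or tuple of numbers.
        return isinstance(value, (list, tuple)) and all(_is_number(x) for x in value)
    raise ValueError(f"unknown field type: {field_type}")

# Hypothetical schema for a few features of a patient diagnosis model.
schema = {"age": "numeric", "is_smoker": "boolean",
          "state": "string", "admitted_at": "datetime"}
record = {"age": 42, "is_smoker": False,
          "state": "NY", "admitted_at": "2023-01-15T10:30:00"}

invalid_fields = [name for name, field_type in schema.items()
                  if not matches_field_type(record[name], field_type)]
```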

Other terms

Field

In Aporia, a field refers to any single feature, prediction, raw input, or actual.

Segment Group

A Data Segment Group is a group of different data segments that are defined using different filters on the same field.

For example, for an age field we might define a data segment group that contains the following data segments:

  • age < 10
  • 10 <= age < 50
  • age >= 50
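
A data segment group like this one can be sketched as a set of named filters over the same field. The structure below is hypothetical; the last segment is written as `age >= 50` here so that every age falls into exactly one segment.

```python
from collections import Counter

# One named filter per data segment, all defined on the same "age" field.
segment_group = {
    "age < 10": lambda age: age < 10,
    "10 <= age < 50": lambda age: 10 <= age < 50,
    "age >= 50": lambda age: age >= 50,
}

def segment_of(age):
    """Return the name of the first segment whose filter matches."""
    for name, matches in segment_group.items():
        if matches(age):
            return name
    return None

ages = [4, 10, 35, 50, 72]  # made-up values
counts = Counter(segment_of(age) for age in ages)
```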