Aporia Documentation

Example: Question Answering


Last updated 1 year ago

Question answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document.

Throughout the guide, we will use a simple question answering model based on 🤗 HuggingFace:

>>> from transformers import pipeline

>>> qa_model = pipeline("question-answering")

This downloads a default pretrained model and tokenizer for question answering. Now you can use qa_model on your target question and context:

qa_model(
    question="Where are the best cookies?",
    context="The best cookies are in Aporia's office."
)

# ==> {'score': 0.8362494111061096,
#      'start': 24,
#      'end': 39,
#      'answer': "Aporia's office"}
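The start and end values in the result are character offsets into the context string. A minimal sketch of how to check this, reusing the output shown above as a plain dict so that no model needs to be loaded (the variable names result and span are illustrative only):

```python
# Output copied from the example above; no model is loaded here.
context = "The best cookies are in Aporia's office."
result = {
    "score": 0.8362494111061096,
    "start": 24,
    "end": 39,
    "answer": "Aporia's office",
}

# 'start' and 'end' are character offsets into the context string,
# so slicing the context recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)  # Aporia's office
```

This can be handy as a sanity check when you store start and end alongside the answer.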

Extract Embeddings

To extract embeddings from the model, we'll first need to do two things:

  1. Pass output_hidden_states=True to our model params.

  2. When we call pipeline(...), it does a lot of things for us: preprocessing, inference, and postprocessing. We'll need to break these steps apart so we can step in between them and get the embeddings.

In other words:

from transformers import QuestionAnsweringPipeline, pipeline

qa_model = pipeline("question-answering", model_kwargs={"output_hidden_states": True})

# Preprocess
model_inputs = next(qa_model.preprocess(QuestionAnsweringPipeline.create_sample(
    question="Where are the best cookies?", 
    context="The best cookies are in Aporia's office."
)))

# Inference
model_output = qa_model.model(input_ids=model_inputs["input_ids"])

# Postprocessing
start, end = model_output[:2]
qa_model.postprocess([{"start": start, "end": end, **model_inputs}])
  # ==> {'score': 0.8362494111061096, 'start': 24, 'end': 39, 'answer': "Aporia's office"}

And finally, to extract embeddings for this prediction:

import torch

# Average the last hidden layer over the token dimension to get one vector per input
embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze()
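To make the pooling step concrete, here is a dependency-free sketch of what the mean over dim=1 followed by squeeze() computes for a single input. It assumes a batch size of 1 and uses nested lists in place of a tensor of shape (1, seq_len, hidden_size). The helper mean_pool and the toy values are hypothetical and for illustration only; in practice you would keep using torch as shown above.

```python
def mean_pool(hidden_states):
    """Average token vectors for a batch of size 1.

    Mirrors torch.mean(x, dim=1).squeeze() for x of shape
    (1, seq_len, hidden_size), using plain nested lists.
    """
    tokens = hidden_states[0]  # drop the batch dimension (batch size 1)
    seq_len = len(tokens)
    hidden_size = len(tokens[0])
    # Average each hidden dimension across all tokens.
    return [sum(tok[i] for tok in tokens) / seq_len for i in range(hidden_size)]


# Toy "last hidden state": 1 input, 3 tokens, hidden size 2.
toy_hidden_state = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
pooled = mean_pool(toy_hidden_state)
print(pooled)  # [3.0, 4.0]
```

The result is a single vector of length hidden_size, regardless of how many tokens the question and context contain.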

Storing your Predictions

For example, you could store your predictions in a Parquet file on S3 or in a Postgres table that looks like this:

| id | question (text) | context (text) | embeddings (embedding) | answer (text) | score (numeric) | timestamp (datetime) |
|----|-----------------|----------------|------------------------|---------------|-----------------|----------------------|
| 1 | Where are the best cookies? | The best cookies are in... | [0.77, 0.87, 0.94, ...] | Aporia's Office | 0.982 | 2021-11-20 13:41:00 |
| 2 | Where is the best hummus? | The best hummus is in... | [0.97, 0.82, 0.13, ...] | Another Place | 0.881 | 2021-11-20 13:45:00 |
| 3 | Where is the best burger? | The best burger is in... | [0.14, 0.55, 0.66, ...] | Blablabla | 0.925 | 2021-11-20 13:49:00 |
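As a rough sketch of how one such row could be assembled in code before it is written to your data store: the PredictionRecord dataclass and the build_record helper below are hypothetical names introduced for illustration, not part of Aporia or HuggingFace. The field names simply follow the example columns above, and the prediction dict is assumed to have the same shape as the pipeline output shown earlier.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PredictionRecord:
    """One row, with fields named after the example columns above (hypothetical)."""
    id: int
    question: str
    context: str
    embeddings: List[float]
    answer: str
    score: float
    timestamp: datetime


def build_record(record_id: int, question: str, context: str, prediction: dict,
                 embeddings, timestamp: Optional[datetime] = None) -> PredictionRecord:
    """Combine the model inputs, the prediction dict, and the embedding into one row."""
    return PredictionRecord(
        id=record_id,
        question=question,
        context=context,
        # Store plain floats; for a torch tensor, embeddings.tolist() also works.
        embeddings=[float(x) for x in embeddings],
        answer=prediction["answer"],
        score=float(prediction["score"]),
        timestamp=timestamp or datetime.now(timezone.utc),
    )


record = build_record(
    record_id=1,
    question="Where are the best cookies?",
    context="The best cookies are in Aporia's office.",
    prediction={"score": 0.8362494111061096, "start": 24, "end": 39,
                "answer": "Aporia's office"},
    embeddings=[0.77, 0.87, 0.94],  # truncated toy vector, for illustration only
    timestamp=datetime(2021, 11, 20, 13, 41, 0),
)
row = asdict(record)  # plain dict, ready to hand to whichever writer you use
```

Keeping the row as a plain dict leaves the choice of storage (Parquet, Postgres, or something else) open.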

Schema mapping

Aporia has two special types to help you integrate your NLP model: text and embedding.

The text type should be used with your raw_text column. Note that by default, every string column is automatically marked as categorical in the UI, but you'll have the option to change it to text for NLP use cases.

The embedding type, as the name suggests, should be used with your embedding column. Note that by default, every array column is automatically marked as array in the UI, but you'll have the option to change it to embedding for NLP use cases.

Next steps

  • Create a custom dashboard for your model in Aporia - Drag & drop widgets to show different performance metrics, top drifted features, etc.

  • Visualize NLP drift using Aporia's Embeddings Projector - Use the Embedding Projector widget within the investigation room to view drift between different datasets in production, using UMAP for dimensionality reduction.

  • Set up monitors to get notified for ML issues - Including data integrity issues, model performance degradation, and model drift. For example:

    • Make sure the distribution of the different entity labels doesn't drift across time

    • Make sure the distribution of the embedding vector doesn't drift across time

The next step would be to store your predictions in a data store, including the embeddings themselves. For more information on storing your predictions, please check out the Storing Your Predictions section.

To integrate this type of model, follow our Quickstart.

Check out the data sources section for more information about how to connect from different data sources.

This type of model is a multiclass model, with a raw text input and an embedding feature.