# Intro to NLP Monitoring

This guide will walk you through the core concepts of NLP model monitoring, including drift detection and model performance. 🚀

Throughout the guide, we will use a simple sentiment analysis model based on 🤗 [HuggingFace](https://huggingface.co/):

```python
>>> from transformers import pipeline

>>> classifier = pipeline("sentiment-analysis")
```

This downloads a default pre-trained model and tokenizer for sentiment analysis. You can now use the `classifier` on your target text:

```python
>>> classifier("I love cookies and Aporia")
[{'label': 'POSITIVE', 'score': 0.9997883439064026}]
```

## Extract Embeddings

To effectively detect drift in NLP models, we use *embeddings*.

{% hint style="info" %}
**But... what are embeddings?**

Textual data is complex, high-dimensional, and free-form. Embeddings represent text as *low-dimensional vectors*.

Various language models, such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) and transformer-based models like [BERT](https://en.wikipedia.org/wiki/BERT_\(language_model\)), are used to obtain embeddings for NLP models. In the case of BERT, embeddings are usually vectors of size 768.
{% endhint %}
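To make the link between embeddings and drift concrete, here is a minimal sketch of one simple approach: compare the centroid (mean vector) of a reference set of embeddings with the centroid of a newer set, using cosine distance. This is an illustration only, not necessarily how Aporia computes drift; the helper names (`mean_vector`, `cosine_distance`, `centroid_drift`) and the toy 3-dimensional vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean (centroid) of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: near 0 for similar directions, larger otherwise."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid_drift(reference, current) -> float:
    """Hypothetical drift score: distance between the two centroids."""
    return cosine_distance(mean_vector(reference), mean_vector(current))


# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
reference = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
similar = [[0.8, 0.2, 0.0], [1.0, 0.0, 0.0]]
shifted = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

drift_similar = centroid_drift(reference, similar)  # close to 0
drift_shifted = centroid_drift(reference, shifted)  # much larger
```

A score near zero suggests the new texts occupy roughly the same region of embedding space as the reference texts, while a larger score hints that the input distribution has moved.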

To get embeddings for our HuggingFace model, we'll need to do two things:

1. Pass `output_hidden_states=True` to our model params.
2. Calling `pipeline(...)` does a lot for us: preprocessing, inference, and postprocessing. **We need to run each of these steps separately**, so we can extract the embeddings.

In other words:

```python
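from transformers import pipeline

classifier = pipeline(
    task="sentiment-analysis",
    model_kwargs={"output_hidden_states": True},
)

# Preprocessing: tokenize the raw text into model-ready tensors
model_input = classifier.preprocess("I love cookies and Aporia")

# Inference: run the model; the output now includes the hidden states
model_output = classifier.forward(model_input)

# Postprocessing: convert the raw model output into a label and a score
classifier.postprocess(model_output)
# ==> {'label': 'POSITIVE', 'score': 0.9998340606689453}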
```

And finally, to extract embeddings for this prediction, average the last hidden layer over the token dimension:

```python
import torch
embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze()
```
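The one-liner above averages over every token position, which works for a single sentence. If you run batches of sentences padded to a common length, the padding positions would be averaged in as well. The sketch below shows mask-aware mean pooling under that assumption. `masked_mean_pool` is a hypothetical helper (not part of `transformers`), and the toy tensors stand in for `model_output.hidden_states[-1]` and the tokenizer's attention mask, which we assume is available as `model_input["attention_mask"]`.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden_size)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Toy batch: two sequences of length 3 with hidden size 2.
# The last position of the first sequence is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # shape: (2, 2)
```

For a single unpadded input the result matches the plain `torch.mean(..., dim=1)` used above, so the helper only matters once padding is involved.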

## Storing your Predictions

The next step would be to store your predictions in a data store, including the embeddings themselves. For more information on storing your predictions, please check out the [Storing Your Predictions](/storing-your-predictions/overview.md) section.

For example, you could use a Parquet file on S3 or a Postgres table that looks like this:

<table><thead><tr><th width="88.33333333333331">id</th><th width="292">raw_text (text)</th><th width="263">embeddings (embedding)</th><th width="162.66666666666674">prediction (boolean)</th><th width="186">score (numeric)</th><th width="207">timestamp (datetime)</th></tr></thead><tbody><tr><td>1</td><td>I love cookies and Aporia</td><td><code>[0.77, 0.87, 0.94, ...]</code></td><td><code>True</code></td><td>0.98</td><td>2021-11-20 13:41:00</td></tr><tr><td>2</td><td>This restaurant was really bad</td><td><code>[0.97, 0.82, 0.13, ...]</code></td><td><code>False</code></td><td>0.88</td><td>2021-11-20 13:45:00</td></tr><tr><td>3</td><td>Hummus is the tastiest thing ever</td><td><code>[0.14, 0.55, 0.66, ...]</code></td><td><code>True</code></td><td>0.92</td><td>2021-11-20 13:49:00</td></tr></tbody></table>

* Note that in the `prediction` column, `True` represents positive sentiment and `False` represents negative sentiment.
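As a rough illustration of writing rows in this shape, the sketch below uses Python's built-in `sqlite3` as a local stand-in for Postgres or Parquet. The table name `predictions`, the helper `store_prediction`, and the choice to serialize embeddings as JSON are all assumptions made for the example, not Aporia requirements.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real data store
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY,
        raw_text TEXT,
        embeddings TEXT,      -- JSON-encoded list of floats
        prediction BOOLEAN,   -- True = positive sentiment
        score REAL,
        timestamp TEXT
    )
    """
)


def store_prediction(conn, raw_text, embedding, result, timestamp=None):
    """Insert one row; `result` is a dict like {'label': 'POSITIVE', 'score': 0.98}."""
    timestamp = timestamp or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO predictions (raw_text, embeddings, prediction, score, timestamp) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            raw_text,
            json.dumps(list(embedding)),
            result["label"] == "POSITIVE",  # map the label to a boolean
            float(result["score"]),
            timestamp.isoformat(sep=" ", timespec="seconds"),
        ),
    )


store_prediction(
    conn,
    "I love cookies and Aporia",
    [0.77, 0.87, 0.94],  # truncated toy embedding
    {"label": "POSITIVE", "score": 0.98},
)
row = conn.execute(
    "SELECT raw_text, embeddings, prediction, score FROM predictions"
).fetchone()
```

When the embedding is a PyTorch tensor, convert it with `embeddings.tolist()` before passing it in, so that it can be serialized as JSON.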

To integrate this type of model, follow our [Quickstart](/introduction/quickstart.md).

Check out the [data sources section](/data-sources/overview.md) for more information about how to connect from different data sources.

### Schema mapping

Aporia provides two types to help you integrate your NLP model: `text` and `embedding`.

The `text` type should be used with your `raw_text` column. Note that by default, the UI automatically marks every string column as `categorical`, but you can change it to `text` for NLP use cases.

As the name suggests, the `embedding` type should be used with your embeddings column. Note that by default, the UI automatically marks every array column as `array`, but you can change it to `embedding` for NLP use cases.

## Next steps

* **Create a custom dashboard for your model in Aporia** - Drag & drop widgets to show different performance metrics, top drifted features, etc.
* **Visualize NLP drift using Aporia's Embeddings Projector** - Use the Embeddings Projector widget within the investigation room to view drift between different datasets in production, using UMAP for dimensionality reduction.
* **Set up monitors to get notified about ML issues** - Including data integrity issues, model performance degradation, and model drift.

