# Intro to NLP Monitoring

This guide will walk you through the core concepts of NLP model monitoring, including drift detection and model performance. 🚀

Throughout the guide, we will use a simple sentiment analysis model based on 🤗 [HuggingFace](https://huggingface.co/):

```python
>>> from transformers import pipeline

>>> classifier = pipeline("sentiment-analysis")
```

This downloads a default pre-trained model and tokenizer for Sentiment Analysis. Now you can use the `classifier` on your target text:

```python
>>> classifier("I love cookies and Aporia")
[{'label': 'POSITIVE', 'score': 0.9997883439064026}]
```

## Extract Embeddings

To effectively detect drift in NLP models, we use *embeddings*.

{% hint style="info" %}
**But... what are embeddings?**

Textual data is complex, high-dimensional, and free-form. Embeddings represent text as *low-dimensional vectors*.

Various language models, such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) and transformer-based models like [BERT](https://en.wikipedia.org/wiki/BERT_\(language_model\)), are used to obtain embeddings for NLP models. In the case of BERT, embeddings are usually vectors of size 768.
{% endhint %}
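To build intuition for why embeddings enable drift detection, here is a minimal, illustrative sketch (not Aporia's detection method) that compares the centroids of two embedding sets using cosine distance. It assumes NumPy is available; the `embedding_drift_score` function and the synthetic data are hypothetical, for demonstration only:

```python
import numpy as np

def embedding_drift_score(reference: np.ndarray, production: np.ndarray) -> float:
    """Cosine distance between the centroids of two embedding sets.

    0.0 means identical centroids; larger values suggest drift.
    """
    ref_mean = reference.mean(axis=0)
    prod_mean = production.mean(axis=0)
    cos_sim = ref_mean @ prod_mean / (
        np.linalg.norm(ref_mean) * np.linalg.norm(prod_mean)
    )
    return float(1.0 - cos_sim)

rng = np.random.default_rng(0)
reference = rng.normal(1.0, 0.1, size=(200, 768))   # e.g. BERT-sized vectors
similar = rng.normal(1.0, 0.1, size=(200, 768))     # same distribution
drifted = rng.normal(-1.0, 0.1, size=(200, 768))    # clearly shifted

print(embedding_drift_score(reference, similar))    # close to 0
print(embedding_drift_score(reference, drifted))    # much larger
```

Real-world drift detection typically uses more robust statistics than a single centroid distance, but the core idea is the same: compare the embedding distributions of two datasets.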

To get embeddings for our HuggingFace model, we'll need to do two things:

1. Pass `output_hidden_states=True` to our model params.
2. Break down the call to `pipeline(...)`, which normally handles preprocessing, inference, and postprocessing for us, **into its individual steps**, so we can extract the embeddings.

In other words:

```python
classifier = pipeline(
    task="sentiment-analysis",
    model_kwargs={"output_hidden_states": True},
)

# Preprocessing
model_input = classifier.preprocess("I love cookies and Aporia")

# Inference
model_output = classifier.forward(model_input)

# Postprocessing
classifier.postprocess(model_output)
# ==> {'label': 'POSITIVE', 'score': 0.9998340606689453}
```

And finally, to extract embeddings for this prediction:

```python
import torch

# Mean-pool the last hidden state across tokens to get a single vector
embeddings = torch.mean(model_output.hidden_states[-1], dim=1).squeeze()
```

## Storing your Predictions

The next step would be to store your predictions in a data store, including the embeddings themselves. For more information on storing your predictions, please check out the [Storing Your Predictions](https://docs.aporia.com/storing-your-predictions) section.

For example, you could use a Parquet file on S3 or a Postgres table that looks like this:

<table><thead><tr><th width="88.33333333333331">id</th><th width="292">raw_text (text)</th><th width="263">embeddings (embedding)</th><th width="162.66666666666674">prediction (boolean)</th><th width="186">score (numeric)</th><th width="207">timestamp (datetime)</th></tr></thead><tbody><tr><td>1</td><td>I love cookies and Aporia</td><td><code>[0.77, 0.87, 0.94, ...]</code></td><td><code>True</code></td><td>0.98</td><td>2021-11-20 13:41:00</td></tr><tr><td>2</td><td>This restaurant was really bad</td><td><code>[0.97, 0.82, 0.13, ...]</code></td><td><code>False</code></td><td>0.88</td><td>2021-11-20 13:45:00</td></tr><tr><td>3</td><td>Hummus is the tastiest thing ever</td><td><code>[0.14, 0.55, 0.66, ...]</code></td><td><code>True</code></td><td>0.92</td><td>2021-11-20 13:49:00</td></tr></tbody></table>

* Note that in the `prediction` column, `True` represents positive sentiment and `False` represents negative sentiment.
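As a rough sketch of what producing such a table might look like, here is a hypothetical example using pandas. The records, the bucket name, and the file path are made up for illustration; writing Parquet to S3 would additionally require `pyarrow` (or `fastparquet`) and `s3fs`:

```python
import pandas as pd

# Hypothetical prediction records, matching the table above. In practice,
# these would come from the classifier and the embedding-extraction step.
records = [
    {
        "id": 1,
        "raw_text": "I love cookies and Aporia",
        "embeddings": [0.77, 0.87, 0.94],
        "prediction": True,
        "score": 0.98,
        "timestamp": "2021-11-20 13:41:00",
    },
    {
        "id": 2,
        "raw_text": "This restaurant was really bad",
        "embeddings": [0.97, 0.82, 0.13],
        "prediction": False,
        "score": 0.88,
        "timestamp": "2021-11-20 13:45:00",
    },
]

df = pd.DataFrame(records)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Hypothetical destination; requires pyarrow/fastparquet and s3fs:
# df.to_parquet("s3://my-bucket/predictions.parquet")
```

The same DataFrame could just as easily be written to a Postgres table instead of Parquet; what matters is that the raw text, embeddings, prediction, score, and timestamp are all stored together per prediction.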

To integrate this type of model, follow our [Quickstart](https://docs.aporia.com/introduction/quickstart).

Check out the [data sources section](https://docs.aporia.com/data-sources) for more information about how to connect from different data sources.

### Schema mapping

There are two unique types in Aporia to help you integrate your NLP model: `text` and `embedding`.

The `text` type should be used with your `raw_text` column. Note that by default, in the UI every string column is automatically marked as `categorical`, but you'll have the option to change it to `text` for NLP use cases.

The `embedding` type, as the name suggests, should be used with your embeddings column. Note that by default, in the UI every array column is automatically marked as `array`, but you'll have the option to change it to `embedding` for NLP use cases.

## Next steps

* **Create a custom dashboard for your model in Aporia** - Drag & drop widgets to show different performance metrics, top drifted features, etc.
* **Visualize NLP drift using Aporia's Embeddings Projector** - Use the Embedding Projector widget within the investigation room, to view drift between different datasets in production, using UMAP for dimension reduction.
* **Set up monitors to get notified for ML issues** - Including data integrity issues, model performance degradation, and model drift.
