# Amazon S3

This guide describes how to connect Aporia to an S3 data source in order to monitor a new ML model in production.

We will assume that your model inputs, outputs and optionally delayed actuals are stored in a file in S3. Currently, the following file formats are supported:

* `parquet`
* `json`
* `csv`
* `delta`

This data source may also be used to connect to your model's training/test set to be used as a baseline for model monitoring.

### Create an IAM role for S3 access

To give Aporia access to S3, create an IAM role with the necessary API permissions.

#### Step 1: Create Role

1. Log into your AWS Console and go to the **IAM** console.
2. Click the **Roles** tab in the sidebar.
3. Click **Create role**.
4. In **Select type of trusted entity**, click the **Web Identity** tile.

   <figure><img src="/files/ufvkU5Cc0gXzy8ehZGs2" alt=""><figcaption></figcaption></figure>
5. Under **Identity Provider**, click on **Create New**.
6. Under **Provider Type**, click the **OpenID Connect** tile.
7. In the **Provider URL** field, enter the Aporia cluster OIDC URL.
8. In the **Audience** field, enter `sts.amazonaws.com`.
9. Click the **Add provider** button.
10. Close the new tab.
11. Refresh the **Identity Provider** list.
12. Select the newly created identity provider.
13. In the **Audience** field, select `sts.amazonaws.com`.
14. Click the **Next** button.
15. Click the **Next** button.
16. In the **Role name** field, enter a role name.

    <figure><img src="/files/DAfhdTvMVVa0ydSMIWdq" alt=""><figcaption></figcaption></figure>
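
The steps above result in a role whose trust policy lets the OIDC provider assume it through `sts:AssumeRoleWithWebIdentity`. As a rough sketch of what that trust policy usually looks like, the helper below builds one in Python. The function name `build_trust_policy`, the account ID, and the example OIDC URL are placeholders introduced for illustration, not values from Aporia; compare the output with the role's **Trust relationships** tab rather than treating it as authoritative.

```python
import json


def build_trust_policy(account_id, oidc_url, audience="sts.amazonaws.com"):
    """Return a web-identity trust policy as a dict (illustrative helper)."""
    # IAM refers to an OIDC provider by its URL without the scheme.
    provider = oidc_url
    if provider.startswith("https://"):
        provider = provider[len("https://"):]
    provider = provider.rstrip("/")

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only tokens issued for this audience may assume the role.
                "Condition": {"StringEquals": {f"{provider}:aud": audience}},
            }
        ],
    }


# Placeholder values; substitute your AWS account ID and the Aporia cluster OIDC URL.
trust_policy = build_trust_policy("123456789012", "https://oidc.example.com/id/ABC")
print(json.dumps(trust_policy, indent=4))
```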

#### Step 2: Create an access policy

1. In the list of roles, click the role you created.
2. Add an inline policy.
3. On the **Permissions** tab, click **Add permissions**, then click **Create inline policy**.

   <figure><img src="/files/xp09FEAfd0OWwDM1lzKL" alt=""><figcaption></figcaption></figure>
4. In the policy editor, click the **JSON** tab.

   <figure><img src="/files/ukF2WXi0nPJvXSOHezTO" alt=""><figcaption></figcaption></figure>
5. Copy the following access policy, replacing `<BUCKET_NAME>` with the name of your bucket.

   ```json
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:Get*",
                   "s3:List*"
               ],
               "Resource": [
                   "arn:aws:s3:::<BUCKET_NAME>",
                   "arn:aws:s3:::<BUCKET_NAME>/*"
               ]
           }
       ]
   }
   ```
6. Click **Review Policy**.
7. In the **Name** field, enter a policy name.
8. Click **Create policy**.
9. If you use Service Control Policies to deny certain actions at the AWS account level, ensure that `sts:AssumeRoleWithWebIdentity` is allowlisted so Aporia can assume the cross-account role.
10. In the role summary, copy the **Role ARN**.

Next, please provide your Aporia account manager with the Role ARN for the role you've just created.
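
If you manage several buckets, filling in the policy by hand is error-prone. The sketch below generates the same read-only policy shown in Step 2 for a given bucket name. The function name `build_s3_read_policy` is hypothetical and only wraps the JSON from this page; it does not call AWS.

```python
import json


def build_s3_read_policy(bucket_name):
    """Return the read-only access policy from Step 2 as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": [
                    # The bucket itself (needed for listing) ...
                    f"arn:aws:s3:::{bucket_name}",
                    # ... and every object inside it (needed for reading).
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=4)


# "my-bucket" is a placeholder bucket name.
print(build_s3_read_policy("my-bucket"))
```

The printed JSON can be pasted into the **JSON** tab of the policy editor.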

### Creating an S3 data source in Aporia

To create a new model to be monitored in Aporia, you can call the `aporia.create_model(...)` API:

```python
aporia.create_model("<MODEL_ID>", "<MODEL_NAME>")
```

Each model in Aporia contains different **Model Versions**. When you (re)train your model, you should create a new model version in Aporia.

```python
apr_model = aporia.create_model_version(
  model_id="<MODEL_ID>",
  model_version="v1",
  model_type="binary",

  raw_inputs={
    "raw_text": "text",
  },

  features={
    "amount": "numeric",
    "owner": "string",
    "is_new": "boolean",
    "embeddings": {"type": "tensor", "dimensions": [768]},
  },

  predictions={
    "will_buy_insurance": "boolean",
    "proba": "numeric",
  },
)
```

Each raw input, feature or prediction is mapped by default to the column of the same name in the S3 file.

By creating a feature named `amount` or a prediction named `proba`, for example, the S3 data source will expect a column in the file named `amount` or `proba`, respectively.
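
Because the default mapping is by name, a quick local check of the file's columns against the schema can catch mismatches before connecting the data source. The following is a minimal sketch for a CSV file using only the standard library. The helper `missing_columns` and the inline sample header are illustrative and not part of the Aporia SDK.

```python
import csv
import io


def missing_columns(header, raw_inputs, features, predictions):
    """Return the schema field names that have no same-named column."""
    expected = list(raw_inputs) + list(features) + list(predictions)
    return [name for name in expected if name not in header]


# Stand-in for the first line of a CSV file; note that "proba" is absent.
sample = io.StringIO(
    "raw_text,amount,owner,is_new,embeddings,will_buy_insurance\n"
)
header = next(csv.reader(sample))

result = missing_columns(
    header,
    raw_inputs={"raw_text": "text"},
    features={
        "amount": "numeric",
        "owner": "string",
        "is_new": "boolean",
        "embeddings": {"type": "tensor", "dimensions": [768]},
    },
    predictions={"will_buy_insurance": "boolean", "proba": "numeric"},
)
print(result)  # ['proba']
```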

Next, create an instance of `S3DataSource` and pass it to `apr_model.connect_serving(...)` or `apr_model.connect_training(...)`:

```python
data_source = S3DataSource(
  object_path="s3://my-bucket/my-file.parquet",
  object_format="parquet",  # other options: csv, json, delta

  # Optional - use the select_expr param to apply additional Spark SQL expressions
  select_expr=["<SPARK_SQL>", ...],

  # Optional - use the read_options param to apply any Spark configuration
  # (e.g. custom Spark resources necessary for this model)
  read_options={...}
)

apr_model.connect_serving(
  data_source=data_source,

  # Names of the prediction ID and prediction timestamp columns
  id_column="prediction_id",
  timestamp_column="prediction_timestamp",
)
```

Note that as part of the `connect_serving` API, you are required to specify two additional columns:

* `id_column` - A unique ID that identifies the prediction.
* `timestamp_column` - A column representing when the prediction occurred.
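
Since the ID must be unique and the timestamp must be a valid time, it can help to sanity-check these two columns locally before connecting. The sketch below does this for CSV rows. It assumes timestamps are ISO 8601 strings, which this page does not specify, and `check_serving_rows` is a hypothetical helper rather than an Aporia API.

```python
import csv
import io
from datetime import datetime


def check_serving_rows(rows, id_column="prediction_id",
                       timestamp_column="prediction_timestamp"):
    """Return a list of problems found in the ID and timestamp columns."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        prediction_id = row[id_column]
        if prediction_id in seen_ids:
            problems.append(f"row {index}: duplicate {id_column} {prediction_id!r}")
        seen_ids.add(prediction_id)
        try:
            # Assumption: timestamps are stored as ISO 8601 strings.
            datetime.fromisoformat(row[timestamp_column])
        except ValueError:
            problems.append(
                f"row {index}: unparseable {timestamp_column} "
                f"{row[timestamp_column]!r}"
            )
    return problems


# Inline sample standing in for a CSV file downloaded from S3.
sample = io.StringIO(
    "prediction_id,prediction_timestamp,amount\n"
    "a1,2024-01-01T12:00:00,10\n"
    "a1,2024-01-01T12:05:00,20\n"
    "a2,not-a-date,30\n"
)
problems = check_serving_rows(csv.DictReader(sample))
for problem in problems:
    print(problem)
```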

### What's Next

For more information on:

* Advanced feature / prediction <-> column mapping
* How to integrate delayed actuals
* How to integrate training / test sets

Please see the [Data Sources Overview](/v1/data-sources/overview.md) page.

