Batch Models

If your model runs on a schedule, say every X days, we refer to it as a batch model (as opposed to a real-time model).

Storing the predictions of a batch model is typically straightforward. The code examples that follow are naive illustrations of how to do so.

Example: Pandas to Parquet on S3

If you use Pandas, you can append any DataFrame to a Parquet file on S3 (or other cloud storage) using the fastparquet library:

import fastparquet

# Preprocess the raw inputs and compute predictions
X = preprocess(...)
y = model.predict(X)

# Concatenate features, predictions and any other metadata
df = ...

# Append the predictions to the Parquet file on S3
# (with fastparquet, append=True requires the file to already exist
#  and the incoming schema to match it)
fastparquet.write(
    filename=f"s3://my-models/{MODEL_ID}/{MODEL_VERSION}/serving.parquet",
    data=df,
    append=True,
)
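
For monitoring or later analysis, the accumulated predictions can be read back with Pandas. A minimal sketch, assuming the same bucket layout as above and that the s3fs package is installed so Pandas can resolve s3:// paths:

import pandas as pd

# Read all stored predictions back from S3
predictions = pd.read_parquet(
    f"s3://my-models/{MODEL_ID}/{MODEL_VERSION}/serving.parquet",
    engine="fastparquet",
)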

Example: PySpark to Delta Lake

This example is especially useful on Databricks, but you can also use it with Delta Lake and the Spark on Kubernetes operator, for example:

# Predict with the SparkML model (transform() returns a DataFrame
# with a prediction column appended to the input features)
y = model.transform(X)

# Concatenate features, predictions and any other metadata
df = ...

# Append to a Delta table
df.write.format("delta").mode("append").saveAsTable("my_model_serving")
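
Because Delta tables are versioned, you can also query an earlier snapshot of the serving table, which is handy when debugging a past batch run. A minimal sketch, assuming a Delta Lake version that supports SQL time travel (the version number is a placeholder):

# Read the serving table back for analysis or monitoring
served = spark.read.table("my_model_serving")

# Time travel to the table as it was at an earlier version
served_v0 = spark.sql("SELECT * FROM my_model_serving VERSION AS OF 0")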
