Introducing Cog and Containerizing Machine Learning Models

May 23, 2023


Deploying and Scaling Containerized Machine Learning Models – Part 1

This four-part series focuses on leveraging Cog and Acorn frameworks to package, deploy, and scale machine learning models in cloud native environments. Part 1 introduces Cog as the framework for containerizing ML models, while part 2 focuses on integrating Cog with Acorn to target Kubernetes clusters for deployment. Part 3 discusses leveraging GPUs to accelerate the inference of computer vision models, and finally, part 4 covers deploying transformer models that deal with natural language processing (NLP). 

Machine learning models have two lives – training and inference. We need infrastructure powered by high-end CPUs and GPUs to train complex models, which is typically done in the cloud or an enterprise data center. Once a model is trained, it is packaged and deployed as an API for applications to consume. The inference API is exposed as a REST or gRPC endpoint. With Kubernetes gaining ground, the inference API is increasingly deployed as a workload running within the cluster. As with deploying software binaries, ML engineers and DevOps teams collaborate to package and deploy the model artifacts.

The workflow of packaging and deploying ML artifacts involves assembling the model and the inference code into a container image, packaging that as a Kubernetes deployment, and exposing it as a service. DevOps teams are responsible for gathering all the dependencies needed by the inference code along with the appropriate frameworks and runtimes, such as TensorFlow and CUDA. Any mismatch between the inference code, its dependencies, the framework versions, and the runtime environment leads to a laborious and expensive debugging process.

Cog is an open source project from Replicate.ai, a startup offering an ML model hosting service. It simplifies and automates the process of containerizing an ML model, along with all of its dependencies, into an OCI image.

Like Acorn, Cog takes a declarative approach to building container images for inference. It automates the generation of a Dockerfile along with all the required dependencies, frameworks, and runtime requirements. The container image has a built-in FastAPI HTTP server to expose REST endpoints. Refer to Cog's documentation for more details.

To appreciate the power of Cog, let’s go through an end-to-end example of training a model and using it for inference. We will then use Cog to package the model and inference code. 

Training a simple machine learning model 

Since the objective is to understand the workflow and not to train complex models, we will build a simple Scikit-learn model that predicts salary based on the number of years of experience. It's a linear regression model trained on a tiny toy dataset.

For this tutorial, you need an x86 machine with Python 3.8 or above installed. You also need Docker to build the container image. 

Start by cloning the GitHub repository with the dataset, code to train the model, and the inference API based on Flask. 

git clone https://github.com/janakiramm/Salary.git && cd Salary

Create a Python virtual environment, activate it, and install the dependencies. 

python3 -m virtualenv salenv

source salenv/bin/activate

pip3 install -r requirements.txt

The CSV file under the data directory acts as the dataset. It has two columns representing years of experience and salary. 

cat data/sal.csv
x,y
0,103100
1,104900
2,106800
3,108700
4,110400
5,112300
6,114200
7,116100
8,117800
9,119700
10,121600
11,123300
12,125200
13,127100
14,128900
15,130700
16,132600
17,134400
18,136300
19,138000
20,139900

The file train.py under the train directory has the code to train and generate the model.

import numpy as np
import pandas as pd
import os
import joblib
from argparse import ArgumentParser
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


# Parse the locations of the input dataset and the output model directory
parser = ArgumentParser()
parser.add_argument("-i", "--in", dest="input",
                    help="location of input dataset")
parser.add_argument("-o", "--out", dest="output",
                    help="location of model")

args = parser.parse_args()
dataset = args.input
model_dir = args.output

# Load the dataset: column x is years of experience, column y is salary
sal = pd.read_csv(dataset, header=0, index_col=None)
X = sal[['x']]
y = sal['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

# Train a simple linear regression model
lm = LinearRegression()
lm.fit(X_train.values, y_train.values)

print('Intercept :', round(lm.intercept_, 2))
print('Slope :', round(lm.coef_[0], 2))

# Evaluate the model on the held-out test set
y_predict = lm.predict(X_test.values)
mse = mean_squared_error(y_test, y_predict)
print('MSE :', round(mse, 2))

# Serialize the trained model to the output directory
if not os.path.exists(model_dir):
    os.makedirs(model_dir)
filename = model_dir + '/model.pkl'

joblib.dump(lm, filename, compress=1)

The code accepts the dataset's location and writes the serialized model to the target directory. Let's train and generate the model from within the train directory.

cd train
python3 train.py -i ../data/sal.csv --out ../model

The serialized model is saved to the model directory. 
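You can quickly confirm that the artifact was written; train.py names the file model.pkl.

ls ../model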

Performing inference on the trained model 

Let’s now write code to load the model and expose it as a REST endpoint. You can find the inference code under the deploy directory.

from flask import Flask, jsonify
from flask import request
import joblib
from argparse import ArgumentParser

# Accept the location of the serialized model as a command-line argument
parser = ArgumentParser()
parser.add_argument("-m", "--model", dest="model",
                    help="location of the pickle file")

filename = parser.parse_args().model

app = Flask(__name__)

@app.route('/')
def index():
    return "Stackoverflow Salary Predictor"

@app.route('/predict', methods=['POST'])
def predict():
    # The model is deserialized on every request; Cog's setup() hook avoids this later
    loaded_model = joblib.load(filename)
    x = int(request.json["exp"])
    y = loaded_model.predict([[x]])[0]
    return jsonify({'salary': round(y, 2)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

The program accepts the location of the model as an argument and exposes a REST endpoint to perform predictions.
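Start the Flask server from the deploy directory, pointing it to the serialized model. The commands below assume the script is saved as inference.py (the name referenced later in this post); adjust the filename if your copy of the repository uses a different one.

cd ../deploy
python3 inference.py -m ../model/model.pkl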

Let’s invoke the API through a cURL request.

curl -X POST -H "Content-type: application/json" -d "{\"exp\":\"25\"}" http://localhost:8080/predict
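The endpoint returns a JSON document with a single salary field. The value below is only illustrative; the exact figure depends on the coefficients produced by your training run.

{"salary": 149000.0}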

The response confirms that our inference code is working as expected. For an individual with 25 years of experience, the predicted salary is approximately 149,000.

Focus on the predict function, which is the essence of the inference code. It loads and deserializes the model artifact, parses the input payload to extract the years of experience, and passes that value to the model. The prediction is returned as a JSON response to the HTTP request.

Though this is a trivial example, the workflow remains the same. The next step is to define a Dockerfile and build the image. You can find many examples and best practices for building Python-based container images. 

This tutorial will not cover building the container image by hand. Instead, it shows how the Cog framework simplifies the process of packaging the inference code into a container image.

Using Cog to build the container image for inference 

Cog needs three files – cog.yaml, predict.py, and the model artifact. The cog.yaml file declares the Python version, the core dependencies, GPU requirements, and a pointer to the function responsible for invoking the model. That function lives in a separate Python file containing the preprocessing logic and the actual invocation of the model.

Let’s start by installing the Cog CLI tool. 

brew install cog

Check the version of Cog to verify the installation. 
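A quick way to do that is to print the version from the CLI (the exact output format varies across releases):

cog --version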

Now it's time to create cog.yaml. Create a directory called inference to hold the files related to Cog.
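Assuming you are back at the root of the cloned repository, the steps look like this:

mkdir inference && cd inference
cog init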

Running cog init inside the new directory generates a template cog.yaml that acts as the starting point. Update it so that it looks like this:

build:
  python_version: "3.8"
  python_packages:
    - numpy==1.21.6
    - scipy==1.7.3
    - scikit-learn==1.0.2
    - pandas==1.3.5
    - joblib==1.2.0
predict: "predict.py:Predictor"

The build section declares the Python version and lists the dependencies that would typically go into a requirements.txt file. Finally, the predict key points to the Python file and the class responsible for the prediction.

Let's create predict.py, which contains the prediction logic referenced by cog.yaml.

from typing import Any
from cog import BasePredictor, Input, Path
import joblib

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.model = joblib.load("model.pkl")

    # The arguments and types the model takes as input
    def predict(self,
          exp: int = Input(description="Experience in years")) -> Any:
        """Run a single prediction on the model"""
        sal = self.model.predict([[exp]])
        return round(sal[0], 2)

Pay attention to the predict function. It's similar to the code we wrote in inference.py. To avoid loading the model on every request, we load and deserialize it once, inside the setup function.

Before building the image, let’s copy the serialized model into the inference directory. 

cp ../model/model.pkl .

We now have the three files Cog needs to build the image: cog.yaml, predict.py, and model.pkl.

We can now perform inference with a single command: 

cog predict -i exp=25

When you run this command, Cog builds the container image on the fly and then passes the input to the model to perform the prediction.

Now that the code is working, let's kick off the image-building process explicitly, passing a tag that makes it easier to push the image to a registry later.

export DOCKER_HUB_USERNAME=janakiramm
cog build -t $DOCKER_HUB_USERNAME/salpred

This builds a Docker image with the specified tag.
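You can confirm that the image exists locally with the Docker CLI:

docker images $DOCKER_HUB_USERNAME/salpred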

We can run the image with the Docker CLI to test the REST endpoint.

docker run --rm -d --name salpred -p 5000:5000 janakiramm/salpred
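Before calling the prediction endpoint, you can check the container logs to confirm that the HTTP server has started:

docker logs salpred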

The Cog-generated server exposes a /predictions endpoint. To extract just the prediction from its JSON response, we can use the jq utility.

curl -X POST -H "Content-type: application/json" -d '{"input": {"exp": "25"}}'  --silent http://localhost:5000/predictions | jq '.output'

If you are curious about the Dockerfile that Cog generates under the hood, you can inspect it with the commands below.
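Depending on the version of Cog you have installed, the cog debug command prints the generated Dockerfile; alternatively, docker history shows the layers that went into the image.

cog debug
docker history $DOCKER_HUB_USERNAME/salpred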

Cog is a simple yet powerful tool to automate the build process of ML inference code. 

In the next part of this series, we explore the integration of Cog and Acorn to deploy the inference service in Kubernetes. Stay tuned.  

Janakiram is a practicing architect, analyst, and advisor focusing on emerging infrastructure technologies. He provides strategic advisory to hyperscalers, technology platform companies, startups, ISVs, and enterprises. As a practitioner working with a diverse enterprise customer base across cloud native, machine learning, IoT, and edge domains, Janakiram gains insight into the enterprise challenges, pitfalls, and opportunities involved in emerging technology adoption. Janakiram is an Amazon, Microsoft, and Google certified cloud architect, as well as a CNCF Ambassador and Microsoft Regional Director. He is an active contributor at Gigaom Research, Forbes, The New Stack, and InfoWorld. You can follow him on Twitter.

Header Photo by Jeffrey Betts on Unsplash

