Deploying and Scaling a PyTorch Chatbot Model with Kubernetes, Cog and Acorn

by | May 31, 2023

Deploying and Scaling Containerized Machine Learning Models – Part 4

This four-part series focuses on leveraging Kubernetes, Cog and Acorn frameworks to package, deploy, and scale machine learning models in cloud native environments. Part 1 introduces Cog as the framework for containerizing ML models, while part 2 focuses on integrating Cog with Acorn to target Kubernetes clusters for deployment. Part 3 discusses leveraging GPUs to accelerate the inference of computer vision models, and finally, this tutorial covers deploying a Chatbot on Kubernetes using a transformer model that deals with natural language processing (NLP).

In this final part of the ML tutorial series, we will deploy a natural language processing (NLP) model based on Google’s BERT to build a chatbot and run it on Kubernetes. The model uses the GPU to accelerate inference. To complete this walkthrough, you need a Kubernetes cluster running on hosts with a GPU, with the NVIDIA GPU Operator installed.

Step 1 – Creating the Cog artifacts to build the Chatbot container image

The first step is to download the pre-trained MobileBERT model from the Hugging Face Hub, which we will embed in the container.

Create a new directory, and run the command below:

pip install transformers==3.4.0

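Let’s create a Python script to download the model and tokenizer artifacts from Hugging Face.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Assumed output directory; it matches the "./model" path the predictor loads from.
MODEL_DIR = "./model"

def get_model(model):
    try:
        model = AutoModelForQuestionAnswering.from_pretrained(model, use_cdn=True)
        model.save_pretrained(MODEL_DIR)
    except Exception as e:
        raise Exception(f"Failed to download the model: {e}")

def get_tokenizer(tokenizer):
    try:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer)
        tokenizer.save_pretrained(MODEL_DIR)
    except Exception as e:
        raise Exception(f"Failed to download the tokenizer: {e}")

# Call get_model() and get_tokenizer() with the Hugging Face ID of the MobileBERT model to embed.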

Running the above script will populate the model directory with the artifacts.

The next step is to create the file that makes the prediction, with the contents below:

from cog import BasePredictor, Input
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

class Predictor(BasePredictor):

    def encode(self, tokenizer, question, context):
        encoded = tokenizer.encode_plus(question, context)
        return encoded["input_ids"], encoded["attention_mask"]

    def decode(self, tokenizer, token):
        answer_tokens = tokenizer.convert_ids_to_tokens(
            token, skip_special_tokens=True)
        return tokenizer.convert_tokens_to_string(answer_tokens)

    def setup(self):
        self.tokenizer = AutoTokenizer.from_pretrained("./model")
        self.model = AutoModelForQuestionAnswering.from_pretrained("./model")

    def predict(
        self,
        context: str = Input(description="context"),
        question: str = Input(description="question"),
    ) -> str:
        # Tokenize the question/context pair into token IDs and an attention mask.
        input_ids, attention_mask = self.encode(self.tokenizer, question, context)
        # With transformers 3.x, the model returns a tuple of start and end scores.
        start_scores, end_scores = self.model(
            torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))
        # The answer runs from the most likely start token to the most likely end token.
        ans_tokens = input_ids[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
        return self.decode(self.tokenizer, ans_tokens)

The setup method loads the pre-trained model with weights and the tokenizer from the file system and caches them in memory. 

The next method, predict, accepts two parameters: the context and a question. It encodes the pair, runs the model to find the most likely answer span within the context, decodes that span back into text, and returns it.
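
To make the span selection in predict concrete, here is a minimal sketch in plain Python. It is not part of the tutorial's predictor: the helper names, tokens, and scores are made up for illustration, and it uses lists instead of tensors so it runs without PyTorch.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def select_answer_span(tokens: List[str],
                       start_scores: Sequence[float],
                       end_scores: Sequence[float]) -> List[str]:
    """Pick the tokens between the best start and best end positions."""
    start = argmax(start_scores)
    end = argmax(end_scores)
    # Same slice as in predict(): the end index is inclusive, hence the +1.
    return tokens[start:end + 1]


# Hypothetical tokens and scores, purely for illustration.
tokens = ["[CLS]", "who", "played", "ant", "[SEP]", "paul", "rudd", "stars", "[SEP]"]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.0, 4.5, 0.3, 0.2, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.0, 0.4, 5.1, 0.3, 0.0]

answer = select_answer_span(tokens, start_scores, end_scores)
print(" ".join(answer))  # prints "paul rudd"
```

Note that if the best end position falls before the best start position, the slice is empty and the answer is an empty string, so a production predictor may want to handle that case explicitly.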

With the model and the inference code in place, it’s time to define cog.yaml, which brings everything together to build the image. 

build:
  gpu: true
  python_version: "3.8"
  python_packages:
    - torch==1.5.0
    - transformers==3.4.0
    - protobuf==3.20.1
predict: "

Notice that we set the gpu key to true, which tells Cog to use a CUDA-enabled base image. Cog picks a CUDA version that is compatible with the Python packages listed in the file.

If you run this on a cluster without a GPU, set it to false.

We then add the Python modules needed by the inference code. Finally, the predict key associates the Predictor class with the prediction code file.

Go ahead and build the Docker image and push it to Docker Hub.

export DOCKER_HUB_USERNAME=janakiramm
cog build -t $DOCKER_HUB_USERNAME/bert-qa

docker push $DOCKER_HUB_USERNAME/bert-qa

Step 2 – Deploying the model using Acorn to Kubernetes

With the image pushed to Docker Hub, we are ready to package and deploy our Chatbot as an Acorn application to Kubernetes.  If you aren’t familiar with Acorn, it is a portable platform that simplifies deploying and running applications on Kubernetes.

Define the Acornfile as shown below:

containers: {
    "chatbot": {
        image: "janakiramm/bert-qa"
        memory: 1Gi
        ports: publish: "80:5000/http"
        scale: 1
    }
}

Since the model is memory intensive, we set the memory to 1GiB. By increasing the value of the scale parameter, you can increase the number of replicas.

This is one of the simplest Acornfiles you can write: it deploys an app based on the model and exposes it as an HTTP endpoint.

Run the Acorn application with the below command:

Step 3 – Performing inference on the Chatbot application

Since Cog accepts the request as a JSON payload, we must prepare the input file appropriately. 

Create a file called input.dat with the following contents:

{"input": {"context": "Ant-Man and the Wasp: Quantumania is a 2023 American superhero film based on Marvel Comics featuring the characters Scott Lang as Ant-Man and Hope Pym as Wasp. Produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures, it is the sequel to Ant-Man (2015) and Ant-Man and the Wasp (2018) and the 31st film of the Marvel Cinematic Universe (MCU). The film was directed by Peyton Reed, written by Jeff Loveness, and stars Paul Rudd as Scott Lang and Evangeline Lilly as Hope van Dyne" , "question" : "Who played the role of Ant-man?"}}
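
If you prefer to generate the payload rather than type it by hand, a short script along these lines could produce the same structure. This is only a sketch: build_payload is a hypothetical helper, and the context string is shortened for illustration. Using json.dumps also takes care of escaping any quotes inside the text.

```python
import json


def build_payload(context: str, question: str) -> str:
    """Wrap the two predict() arguments in the {"input": {...}} envelope."""
    return json.dumps({"input": {"context": context, "question": question}})


# Shortened, illustrative context; the tutorial uses a longer passage.
context = "The film stars Paul Rudd as Scott Lang and Evangeline Lilly as Hope van Dyne."
payload = build_payload(context, "Who played the role of Ant-man?")

# Write the request body to the same file name used in this step.
with open("input.dat", "w", encoding="utf-8") as f:
    f.write(payload)
```

The keys inside "input" must match the parameter names of the predict method, which are context and question here.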

We can now call the REST API of the model through cURL.

Passing the output through the jq utility makes it easier to read.
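
The same call can also be made from Python. The sketch below uses only the standard library and assumes two things not shown in this tutorial: that the model is reachable at a base URL you supply (use the endpoint Acorn reports for your app), and that the response carries the answer under an "output" key. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Create a POST request for Cog's /predictions endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predictions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(response_body: str) -> str:
    """Pull the answer out of the JSON response (assumes an "output" key)."""
    return json.loads(response_body)["output"]


def ask(base_url: str, context: str, question: str) -> str:
    """Send one question to the deployed model and return the answer."""
    payload = {"input": {"context": context, "question": question}}
    with urllib.request.urlopen(build_request(base_url, payload)) as response:
        return extract_answer(response.read().decode("utf-8"))


# Offline demonstration with a hand-written response body (not real model output).
sample_response = '{"status": "succeeded", "output": "paul rudd"}'
print(extract_answer(sample_response))
```

Calling ask in a loop that reads questions from the user is one simple way to turn this into an interactive session over a fixed context.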

Try changing the question in the input to see different values. We can easily extend this to build an interactive chatbot to answer questions based on the context. 

In this tutorial, we have seen how to use Cog and Acorn together to deploy NLP-based models to build chatbot-style applications for Kubernetes. This is the last of our tutorials on containerizing ML workloads and running them on Kubernetes with Cog and Acorn.

Janakiram is a practicing architect, analyst, and advisor focusing on emerging infrastructure technologies. He provides strategic advisory to hyperscalers, technology platform companies, startups, ISVs, and enterprises. As a practitioner working with a diverse Enterprise customer base across cloud native, machine learning, IoT, and edge domains, Janakiram gains insight into the enterprise challenges, pitfalls, and opportunities involved in emerging technology adoption. Janakiram is an Amazon, Microsoft, and Google certified cloud architect, as well as a CNCF Ambassador and Microsoft Regional Director. He is an active contributor at Gigaom Research, Forbes, The New Stack, and InfoWorld. You can follow him on Twitter.

Header Photo by Jeffrey Betts on Unsplash
