Deploying and Scaling TensorFlow Vision AI Models on Kubernetes

May 25, 2023 by Janakiram MSV

Deploying and Scaling Containerized Machine Learning Models – Part 3

This four-part series focuses on using the Cog and Acorn frameworks to package, deploy, and scale machine learning models in cloud native environments. Part 1 introduces Cog as the framework for containerizing ML models, while part 2 focuses on integrating Cog with Acorn to target Kubernetes clusters for deployment. Part 3 discusses running TensorFlow on Kubernetes with GPUs for the inference of computer vision models, and finally, part 4 covers deploying transformer models that deal with natural language processing (NLP).

In this, the third part of our ML tutorial series, we will deploy a deep learning model based on Google’s MobileNet SSD to perform image classification. It is a good example of how to run TensorFlow on Kubernetes while taking advantage of GPUs to accelerate the inference. To complete this walkthrough, you need a Kubernetes cluster running on hosts with a GPU. The Kubernetes cluster should also have the NVIDIA GPU operator installed.

Creating the Cog artifacts to build the container image

The first step is to download the pre-trained MobileNet SSD model from TensorFlow Hub, which we will embed in the container.

Create a new directory, and run the below command:


The next step is to create the file that makes the prediction. Create a Python file with the below contents:

from typing import Any
from cog import BasePredictor, Input, Path
import tensorflow as tf
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
import numpy as np


class Predictor(BasePredictor):
    def setup(self):
        # Load MobileNetV2 once, using the weights file bundled in the image.
        self.model = tf.keras.applications.mobilenet_v2.MobileNetV2(
            weights="mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5"
        )

    def predict(self, image: Path = Input(description="Image to classify")) -> Any:
        # Resize to 224x224 and add a batch dimension before preprocessing.
        img = keras_image.load_img(image, target_size=(224, 224))
        x = keras_image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)
        preds = self.model.predict(x)
        # Return the top three predictions for the single input image.
        return decode_predictions(preds, top=3)[0]

The setup method loads the pre-trained model with weights from the file system and keeps it in memory, so the cost of loading is paid once rather than on every request.

The next method, predict, accepts an image, preprocesses it with the Keras library, and passes the result to the model.predict() method. The top three predictions and their probability scores are returned as the output.
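
To make the two halves of predict more concrete, here is a dependency-free sketch of what happens numerically around the model call. It assumes the MobileNetV2 convention of scaling 8-bit pixel values into the range [-1, 1], and it mimics the kind of top-k selection that decode_predictions performs. The helper names scale_pixels and top_k, as well as the labels and scores, are hypothetical and exist only for illustration; they are not part of Cog or Keras.

```python
from typing import Dict, List, Tuple


def scale_pixels(pixels: List[float]) -> List[float]:
    """Map 8-bit pixel values (0..255) into the range [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def top_k(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Illustrative values only; a real ImageNet classifier returns 1,000 scores.
example_scores = {"label_a": 0.10, "label_b": 0.60, "label_c": 0.05, "label_d": 0.25}

print(scale_pixels([0, 127.5, 255]))
print(top_k(example_scores))
```

The real predictor delegates both steps to Keras; the sketch is only meant to show the shape of the data going in and coming out.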

With the model and the inference code in place, it’s time to define cog.yaml, which brings everything together to build the image.

build:
  gpu: true
  python_version: "3.8"
  python_packages:
    - pillow==9.1.0
    - tensorflow==2.8.0
predict: ""

Notice that we set the gpu key to true, which tells Cog to use an appropriate CUDA base image. This is a powerful mechanism: Cog selects the CUDA image version based on the packages included, so we can run our TensorFlow model on Kubernetes with GPUs.

We then add the Python modules needed by the inference code. Finally, the predict key associates the prediction code file with the predictor class defined in it.

Go ahead and build the Docker image and push it to Docker Hub.

export DOCKER_HUB_USERNAME=janakiramm
cog build -t $DOCKER_HUB_USERNAME/mobilenet-gpu
docker push $DOCKER_HUB_USERNAME/mobilenet-gpu

If you want to test the container before deploying your machine learning model to Kubernetes, you need to install the NVIDIA Container Toolkit. Refer to the NVIDIA documentation for the steps.

Running our TensorFlow model on Kubernetes with Acorn

With the image pushed to Docker Hub, we are ready to package and deploy it as an Acorn application.

Define the Acornfile as shown below:

containers: {
  "mobilenet-gpu": {
    image: "janakiramm/mobilenet-gpu:latest"
    ports: publish: "80:5000/http"
    scale: 1
  }
}

This is one of the simplest Acornfiles possible: it deploys an app based on the model image and exposes it as an HTTP endpoint.

Run the Acorn application with the below command:


Performing inference on the Acorn application

Since Cog accepts an image encoded in base64 and wrapped in a JSON payload, we must prepare the input file appropriately.

I have an image of a tiger named image.jpg. Let’s create a Bash script that encodes the image and generates the required payload.


function img-data() {
  TYPE=$(file --mime-type -b "$1")
  ENC=$(base64 -i "$1")
  echo "{\"input\": {\"image\":\"data:$TYPE;base64,$ENC\"}}"
}

source; img-data image.jpg > input_img.dat
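
For readers who would rather avoid shell quoting, the same payload can be sketched in Python using only the standard library. This is an illustrative alternative, not the script used in this walkthrough: the function name build_payload is hypothetical, and the MIME type is guessed from the file extension rather than from the file contents as the file utility does.

```python
import base64
import json
import mimetypes
from pathlib import Path


def build_payload(image_path: str) -> str:
    """Return a JSON string shaped like {"input": {"image": "data:<mime>;base64,<data>"}}."""
    path = Path(image_path)
    # Guess the MIME type from the extension; fall back to a generic type.
    mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return json.dumps({"input": {"image": f"data:{mime_type};base64,{encoded}"}})


# Write the payload only if the sample image from the walkthrough is present.
if Path("image.jpg").exists():
    Path("input_img.dat").write_text(build_payload("image.jpg"))
```

Because json.dumps handles the escaping, the output is valid JSON regardless of the file name or platform.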

We can now call the REST API of the model through curl.
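
As a rough Python counterpart to a curl call, the sketch below posts the payload to the /predictions route that Cog's HTTP server exposes. The ENDPOINT value is a placeholder assumption; replace it with the URL that Acorn reports for your application. The helper names make_request and predict are hypothetical.

```python
import json
import urllib.request

# Placeholder endpoint (assumption); use the URL Acorn prints for your app.
ENDPOINT = "http://localhost:5000"


def make_request(endpoint: str, payload: str) -> urllib.request.Request:
    """Build a POST request for the /predictions route with a JSON body."""
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/predictions",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(endpoint: str, payload: str, timeout: float = 30.0) -> dict:
    """Send the payload and return the decoded JSON response."""
    request = make_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires a running deployment):
# with open("input_img.dat") as f:
#     print(predict(ENDPOINT, f.read()))
```

Splitting request construction from sending keeps the network call in one place, which makes the request easy to inspect before anything is sent.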


Passing the output through the jq utility gives us readable output.


As we can see, the top result for the image is a tiger with a probability score of 73%.
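
If you prefer to post-process the response in Python rather than jq, a small formatter along these lines may help. It assumes the response carries the predictions under an "output" key as [class_id, label, score] triples, which is how the tuples returned by decode_predictions would serialize to JSON. The sample response is made up for illustration and is not the output of the model in this article; summarize is a hypothetical helper name.

```python
from typing import Any, Dict, List


def summarize(response: Dict[str, Any]) -> List[str]:
    """Turn [class_id, label, score] triples into 'label: NN.N%' lines."""
    lines = []
    for _class_id, label, score in response.get("output", []):
        lines.append(f"{label}: {float(score) * 100:.1f}%")
    return lines


# Made-up response for illustration only (assumed shape, placeholder values).
sample = {
    "status": "succeeded",
    "output": [["id_0", "label_a", 0.5], ["id_1", "label_b", 0.25]],
}

print("\n".join(summarize(sample)))
```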

Analyzing the Cog and Acorn TensorFlow environment

Let’s start by looking at the Dockerfile generated by Cog.

cog debug docker


Since we defined gpu: true in the cog.yaml file, Cog has determined that the base image must be nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04. Cog has also included the Debian packages that CUDA needs to work.

This is one of the biggest advantages of using Cog. Instead of handcrafting the Dockerfile, we let Cog generate one with a base image matched to the packages we declared.

Next, let’s check if our model exploits the GPU. For that, we need to access the shell of the Acorn app and run a few commands.

acorn exec mnet-ssd

Once you are inside the container’s shell, check for the GPU with the nvidia-smi command.


My single-node Kubernetes cluster is powered by an NVIDIA GeForce RTX 3090 with 24GB of memory and 10,496 CUDA cores. This is confirmed by the nvidia-smi output.

Next, let’s see if TensorFlow is able to access the GPU.


This confirms that TensorFlow can access the GPU, which significantly accelerates inference.
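
One way to run such a check from a Python prompt inside the container is sketched below. It uses tf.config.list_physical_devices, which is available in TensorFlow 2.x; the wrapper function visible_gpus is a hypothetical helper, and it returns an empty list when TensorFlow is not installed so the snippet can be run anywhere. This is not necessarily the exact command used in this walkthrough.

```python
from typing import List


def visible_gpus() -> List[str]:
    """Return the names of GPUs TensorFlow can see, or [] if there are none."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


print(visible_gpus())
```

An empty list inside the container usually points to a missing GPU resource on the pod or a CUDA version mismatch rather than a problem with the model itself.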

In this tutorial, we have seen how to use Cog and Acorn together to run TensorFlow on Kubernetes and serve Vision AI models accelerated by GPU hosts. At this point, we have built and deployed containerized machine learning models and scaled them on Kubernetes using Cog and Acorn. In the final part of this series, we will deploy an NLP transformer model that acts like a chatbot. Stay tuned.