Deploying and Scaling TensorFlow Vision AI Models on Kubernetes

by | May 25, 2023

Deploying and Scaling Containerized Machine Learning Models – Part 3

This four-part series focuses on using the Cog and Acorn frameworks to package, deploy, and scale machine learning models in cloud native environments. Part 1 introduces Cog as the framework for containerizing ML models, while part 2 focuses on integrating Cog with Acorn to target Kubernetes clusters for deployment. Part 3 discusses running TensorFlow on Kubernetes with GPUs for the inference of computer vision models, and part 4 covers deploying transformer models for natural language processing (NLP).

In this, the third part of our ML tutorial series, we will deploy a deep learning model based on Google’s MobileNet SSD to perform image classification. It is a good example of how to run TensorFlow on Kubernetes while taking advantage of GPUs to accelerate inference. To complete this walkthrough, you need a Kubernetes cluster running on hosts with a GPU. The cluster should also have the NVIDIA GPU Operator installed.

Creating the Cog artifacts to build the container image

The first step is to download the pre-trained MobileNet SSD model from TensorFlow Hub, which we will embed in the container.

Create a new directory, and run the below command:


The next step is to create the file that makes the prediction. Create a file with the contents below:

from typing import Any
from cog import BasePredictor, Input, Path

import tensorflow as tf
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions

import numpy as np

class Predictor(BasePredictor):
    def setup(self):
        # Load the model once at startup from the weights file bundled in the image.
        self.model = tf.keras.applications.mobilenet_v2.MobileNetV2(weights="mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5")

    def predict(self, image: Path = Input(description="Image to classify")) -> Any:
        # Resize the image to the 224x224 input the model expects.
        img = keras_image.load_img(image, target_size=(224, 224))
        x = keras_image.img_to_array(img)
        # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3).
        x = np.expand_dims(x, axis=0)
        # Scale pixel values the same way as the training data.
        x = preprocess_input(x)
        preds = self.model.predict(x)
        # Return the three highest-scoring classes for the single input image.
        return decode_predictions(preds, top=3)[0]

The setup method loads the pre-trained model with its weights from the file system and keeps it in memory, so the weights are not reloaded on every request.

The next method, predict, accepts an image, preprocesses it with the Keras utilities, and passes the result to the model.predict() method. The top 3 predictions and their probability scores are returned as the output.
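To make the "top 3" step concrete, here is a minimal sketch of how the highest-scoring classes can be picked from a vector of probabilities. It is an illustration only: the top_k helper, the toy labels, and the toy scores are hypothetical and are not part of the tutorial's code. In the predictor, decode_predictions does the equivalent work against the ImageNet class list and returns (class ID, class name, score) tuples.

```python
from typing import List, Sequence, Tuple


def top_k(probs: Sequence[float], labels: Sequence[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    # Sort label/score pairs by score, highest first, and keep the first k.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for a four-class toy problem (not real model output).
toy_labels = ["tiger", "tabby", "zebra", "lynx"]
toy_probs = [0.70, 0.15, 0.05, 0.10]
print(top_k(toy_probs, toy_labels))
```

A real classifier such as the one above produces 1,000 scores per image, but the selection logic is the same.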

With the model and the inference code in place, it’s time to define cog.yaml, which brings everything together to build the image. 

build:
  gpu: true
  python_version: "3.8"
  python_packages:
    - pillow==9.1.0
    - tensorflow==2.8.0
predict: ""

Notice that we set the gpu key to true, which tells Cog to use an appropriate CUDA base image. This is a powerful mechanism: Cog picks a CUDA image that is compatible with the packages we list, so we can run our TensorFlow model on Kubernetes with GPUs.

We then list the Python packages needed by the inference code. Finally, the predict key associates the predictor with the prediction code file.

Go ahead and build the Docker image and push it to Docker Hub.

export DOCKER_HUB_USERNAME=janakiramm
cog build -t $DOCKER_HUB_USERNAME/mobilenet-gpu

docker push $DOCKER_HUB_USERNAME/mobilenet-gpu

If you want to test the container locally before deploying your machine learning model to Kubernetes, you need to install the NVIDIA Container Toolkit. Refer to the NVIDIA documentation for the steps.

Running our TensorFlow model on Kubernetes with Acorn

With the image pushed to Docker Hub, we are ready to package and deploy it as an Acorn application. 

Define the Acornfile as shown below:

containers: {
    "mobilenet-gpu": {
        image: "janakiramm/mobilenet-gpu:latest"
        ports: publish: "80:5000/http"
        scale: 1
    }
}

This is one of the simplest possible Acornfiles. It deploys an app based on the model image and exposes it as an HTTP endpoint, publishing the container's port 5000 on port 80.

Run the Acorn application with the below command:

Performing inference on the Acorn application 

Since Cog accepts an image encoded in base64 and wrapped in a JSON payload, we must prepare the input file appropriately. 

I have an image of a tiger named image.jpg. Let’s create a Bash script that encodes the image and generates the required payload.

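function img-data() {
  # Detect the MIME type and base64-encode the file; strip newlines so the JSON stays valid.
  TYPE=$(file --mime-type -b "$1")
  ENC=$(base64 -i "$1" | tr -d '\n')
  echo "{\"input\": {\"image\":\"data:$TYPE;base64,$ENC\"}}"
}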

source; img-data image.jpg > input_img.dat
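The same payload can also be built in Python. The sketch below is an illustration, not part of the original script: build_payload is a hypothetical helper, and it guesses the MIME type from the file extension with the standard mimetypes module, whereas the file utility inspects the file contents.

```python
import base64
import json
import mimetypes
from pathlib import Path


def build_payload(image_path: str) -> str:
    """Return a JSON request body with the image embedded as a base64 data URI."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Fallback when the extension is unknown.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return json.dumps({"input": {"image": f"data:{mime_type};base64,{encoded}"}})


# Example usage, assuming image.jpg exists in the current directory:
# Path("input_img.dat").write_text(build_payload("image.jpg"))
```

Because base64.b64encode never inserts line breaks, the resulting string can be embedded in JSON without further cleanup.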

We can now call the REST API of the model through curl. 

Passing the response through the jq utility makes the output readable.

As we can see, the top result for the image is a tiger with a probability score of 73%.
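For readers who prefer Python to curl, the request can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: build_request and predict are hypothetical helper names, the endpoint shown is a placeholder for whatever URL Acorn reports for your app, and the image string is truncated for illustration. Cog's HTTP server accepts a POST to /predictions and returns JSON with the prediction under the "output" key.

```python
import json
import urllib.request


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for the /predictions route of a Cog server."""
    url = endpoint.rstrip("/") + "/predictions"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(endpoint: str, payload: dict, timeout: float = 60.0):
    """Send the request and return the "output" field of the JSON response."""
    with urllib.request.urlopen(build_request(endpoint, payload), timeout=timeout) as resp:
        return json.load(resp)["output"]


# Placeholder endpoint and truncated image data; substitute your own values.
req = build_request("http://localhost:5000", {"input": {"image": "data:image/jpeg;base64,..."}})
print(req.get_method(), req.full_url)
```

Calling predict with the real endpoint and the payload generated earlier should return the same list of predictions that curl prints.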

Analyzing the Cog and Acorn TensorFlow environment

Let’s start by looking at the Dockerfile generated by Cog. 

cog debug docker

Since we set gpu: true in the cog.yaml file, Cog has determined that the base image must be nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04. Cog has also included the Debian packages that CUDA needs to work.

This is one of the biggest advantages of using Cog. Instead of handcrafting the Dockerfile, we let Cog generate one that matches our framework and CUDA requirements.

Next, let’s check if our model exploits the GPU. For that, we need to access the shell of the Acorn app and run a few commands. 

acorn exec mnet-ssd

Once you are inside the container’s shell, check for the GPU with the nvidia-smi command. 

My single-node Kubernetes cluster is powered by an NVIDIA GeForce RTX 3090 with 24 GB of memory and 10,496 CUDA cores. The output of nvidia-smi confirms this.

Next, let’s see if TensorFlow is able to access the GPU. 

This confirms that TensorFlow can access the GPU, which significantly speeds up inference.
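One common way to run this kind of check from Python is sketched below. It is an illustration and not necessarily the exact command used in this walkthrough: visible_gpus is a hypothetical helper, while tf.config.list_physical_devices is the TensorFlow call that lists the devices the runtime can see.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow is installed but found no GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


print(visible_gpus())
```

Inside the container built by Cog, a non-empty list indicates that the CUDA libraries in the image and the driver on the host are working together.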

In this tutorial, we have seen how to use Cog and Acorn together to run TensorFlow on Kubernetes, and how to serve Vision AI models accelerated by GPU hosts. At this point, we have built and deployed containerized machine learning models and scaled them on Kubernetes using Cog and Acorn. In the final part of this series, we will deploy an NLP transformer model that acts like a chatbot. Stay tuned.

Janakiram is a practicing architect, analyst, and advisor focusing on emerging infrastructure technologies. He provides strategic advisory to hyperscalers, technology platform companies, startups, ISVs, and enterprises. As a practitioner working with a diverse enterprise customer base across cloud native, machine learning, IoT, and edge domains, Janakiram gains insight into the enterprise challenges, pitfalls, and opportunities involved in emerging technology adoption. Janakiram is an Amazon, Microsoft, and Google certified cloud architect, as well as a CNCF Ambassador and Microsoft Regional Director. He is an active contributor at Gigaom Research, Forbes, The New Stack, and InfoWorld. You can follow him on Twitter.

Header Photo by Jeffrey Betts on Unsplash
