I signed a contract with the Hanyang University AI Lab and worked on deploying an AI service on EC2.
The task was quite simple:
Deploy an eye-tracker model and serve it through the backend server.
Since I was comfortable with EC2 and the usual production process, I thought it would be the same as deploying a web service. But along the way, I found a few key points that make an AI service different from a typical web service.
GPU servers are expensive!
Even a modest NVIDIA T4 instance costs about 4x more than a comparable CPU instance. For this reason, I first tried CPU inference. But the inference time was too long, and the client needed to show a demo at CES 2025, so I eventually migrated to a GPU server and ran inference with CUDA.
3 min CPU inference -> 5 sec GPU inference
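For a rough sense of how such a CPU-vs-GPU comparison can be measured, here is a minimal sketch using a generic torchvision model and a dummy input in place of the real eye tracker (not the actual benchmark code). The one subtlety is that CUDA kernels run asynchronously, so the GPU timing needs torch.cuda.synchronize() to be honest.

import time
import torch
import torchvision

def time_inference(model: torch.nn.Module, x: torch.Tensor, device: str) -> float:
    """Run a single forward pass on `device` and return the elapsed seconds."""
    model = model.to(device).eval()
    x = x.to(device)
    with torch.inference_mode():
        start = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the async CUDA kernels to finish
        return time.perf_counter() - start

# stand-ins for the real eye-tracker model and its input
model = torchvision.models.resnet50(weights=None)
x = torch.randn(8, 3, 224, 224)

print("cpu :", time_inference(model, x, "cpu"))
if torch.cuda.is_available():
    print("cuda:", time_inference(model, x, "cuda"))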
The model should be loaded onto the GPU before requests arrive
AI model checkpoints are stored on disk and loaded into GPU memory when we use them. We have to be careful not to load the model into GPU memory every time a user requests inference.
In the following FastAPI code, I stored the model in a global ml_models dictionary and initialized it during startup. The object stays alive until the FastAPI server shuts down, which means the server keeps reusing the model (already sitting in GPU memory) for inference.
from contextlib import asynccontextmanager
import time
import os

import torch
from fastapi import FastAPI
from sam2.build_sam import build_sam2_video_predictor

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # pick the best available device
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    print(f"device: {device}")

    if device.type == "cuda":
        # use bfloat16 for all CUDA ops in this process
        torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
        # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
        if torch.cuda.get_device_properties(0).major >= 8:
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True

    # model_cfg = "sam2_hiera_b+.yaml"
    model_cfg = "sam2_hiera_t.yaml"
    # sam2_ckpt = "src/ckpt/sam2_hiera_base_plus.pt"
    sam2_ckpt = "src/ckpt/sam2_hiera_tiny.pt"

    # load the checkpoint into GPU memory once, at startup
    ml_models["sam2_predictor"] = build_sam2_video_predictor(model_cfg, sam2_ckpt, device=device)

    yield

    # release the model when the server shuts down
    ml_models.clear()

app = FastAPI(lifespan=lifespan)
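For context, this is roughly how a request handler reuses that preloaded predictor. It is a minimal sketch continuing the snippet above, not the actual service code: the endpoint path and upload handling are illustrative, and the real prompt/propagation logic is omitted.

from fastapi import UploadFile

@app.post("/inference")
async def run_inference(video: UploadFile):
    # reuse the predictor that lifespan() already loaded into GPU memory;
    # nothing is read from disk or rebuilt on the request path
    predictor = ml_models["sam2_predictor"]

    # save the uploaded video to a temporary path (file uploads need python-multipart)
    video_path = f"/tmp/{video.filename}"
    with open(video_path, "wb") as f:
        f.write(await video.read())

    with torch.inference_mode():
        state = predictor.init_state(video_path=video_path)
        # ... add point/box prompts and propagate masks through the video here ...

    return {"status": "ok"}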
CD using Docker
I think dockerizing the service is the most important part of the production process. With Docker, we can deploy or patch the service with a single command. This was also my first time using CUDA inside a container.
I built a Dockerfile and docker-compose.yml as follows:
I used the official nvidia/cuda image as the base and set up the Python environment solely with uv.
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
WORKDIR /app
# Instead of git clone, we copy the whole content into the docker container
COPY . /app
RUN uv venv --python 3.11.10
# put the venv on PATH; sourcing activate in its own RUN layer would not persist
ENV VIRTUAL_ENV=/app/.venv
ENV PATH="/app/.venv/bin:$PATH"
# install dependencies for cv2
RUN apt-get update && apt-get install make wget ffmpeg libsm6 libxext6 -y
# install torch and torchvision
RUN uv pip install -r requirements.txt
# install sam2
WORKDIR /app/sam2
RUN uv pip install -e .
# Also enable to run sam2 in jupyter notebook for testing
# RUN uv pip install --system -e ".[notebooks]"
# download sam2 model checkpoint
WORKDIR /app
RUN make download-checkpoint
CMD ["uv", "run", "fastapi", "run", "src/app.py", "--port", "80"]
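The Dockerfile only builds the image; to reach the GPU from inside the container, the EC2 host also needs the NVIDIA Container Toolkit, and the compose file has to reserve the GPU. Here is a minimal sketch of such a docker-compose.yml (the service name and port mapping are illustrative, not my exact file):

services:
  eye-tracker:
    build: .
    ports:
      - "80:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

With this in place, deploying or patching the service on the EC2 host really is a single command: docker compose up -d --build.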