I signed a contract with the Hanyang University AI Lab and worked on deploying an AI service on EC2.
The task was quite simple:
Deploy an eye-tracker model and serve it through the backend server.
Since I was comfortable with EC2 and the usual production process, I thought it would be the same as deploying a web service. But along the way, I found a few key points that make an AI service different from a typical web service.
GPU servers are expensive!
Even a modest NVIDIA T4 instance costs about 4x more than a comparable CPU instance. For this reason, I first tried CPU inference. But the inference time was too long, and the client needed to show a demo at CES 2025, so I eventually migrated to a GPU server and ran inference with CUDA.
3 min CPU inference -> 5 sec GPU inference
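For a rough sense of how such a CPU-vs-GPU comparison can be measured, here is a minimal sketch using a generic torchvision model and a dummy input in place of the real eye tracker (not the actual benchmark code). The one subtlety is that CUDA kernels run asynchronously, so the GPU timing needs torch.cuda.synchronize() to be honest.

import time
import torch
import torchvision

def time_inference(model: torch.nn.Module, x: torch.Tensor, device: str) -> float:
    """Run a single forward pass on `device` and return the elapsed seconds."""
    model = model.to(device).eval()
    x = x.to(device)
    with torch.inference_mode():
        start = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the async CUDA kernels to finish
        return time.perf_counter() - start

# stand-ins for the real eye-tracker model and its input
model = torchvision.models.resnet50(weights=None)
x = torch.randn(8, 3, 224, 224)

print("cpu :", time_inference(model, x, "cpu"))
if torch.cuda.is_available():
    print("cuda:", time_inference(model, x, "cuda"))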
The model should be loaded onto the GPU before requests arrive
AI model checkpoints are stored on disk and loaded into GPU memory when we use them. We have to be careful not to load the model into GPU memory every time a user requests inference.
In the following FastAPI code, I stored the model in a global ml_models dictionary and initialized it during startup. The object stays alive until the FastAPI server shuts down, which means the server keeps reusing the model (already sitting in GPU memory) for inference.
from contextlib import asynccontextmanager
import time
import os

import torch
from fastapi import FastAPI
from sam2.build_sam import build_sam2_video_predictor

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # pick the best available device
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    print(f"device: {device}")

    if device.type == "cuda":
        # use bfloat16 for all CUDA ops in this process
        torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
        # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
        if torch.cuda.get_device_properties(0).major >= 8:
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True

    # model_cfg = "sam2_hiera_b+.yaml"
    model_cfg = "sam2_hiera_t.yaml"
    # sam2_ckpt = "src/ckpt/sam2_hiera_base_plus.pt"
    sam2_ckpt = "src/ckpt/sam2_hiera_tiny.pt"

    # load the checkpoint into GPU memory once, at startup
    ml_models["sam2_predictor"] = build_sam2_video_predictor(model_cfg, sam2_ckpt, device=device)

    yield

    # release the model when the server shuts down
    ml_models.clear()

app = FastAPI(lifespan=lifespan)
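For context, this is roughly how a request handler reuses that preloaded predictor. It is a minimal sketch continuing the snippet above, not the actual service code: the endpoint path and upload handling are illustrative, and the real prompt/propagation logic is omitted.

from fastapi import UploadFile

@app.post("/inference")
async def run_inference(video: UploadFile):
    # reuse the predictor that lifespan() already loaded into GPU memory;
    # nothing is read from disk or rebuilt on the request path
    predictor = ml_models["sam2_predictor"]

    # save the uploaded video to a temporary path (file uploads need python-multipart)
    video_path = f"/tmp/{video.filename}"
    with open(video_path, "wb") as f:
        f.write(await video.read())

    with torch.inference_mode():
        state = predictor.init_state(video_path=video_path)
        # ... add point/box prompts and propagate masks through the video here ...

    return {"status": "ok"}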
CD using Docker
I think dockerizing the service is the most important part of the production process. With Docker, we can deploy or patch the service with a single command. This was also my first time using CUDA inside a container.
I built a Dockerfile and docker-compose.yml as follows:
I used the official nvidia/cuda image as the base and set up the Python environment solely with uv.
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
WORKDIR /app
# Instead of git clone, we copy the whole content into the docker container
COPY . /app
RUN uv venv --python 3.11.10
# put the venv on PATH; sourcing activate in its own RUN layer would not persist
ENV VIRTUAL_ENV=/app/.venv
ENV PATH="/app/.venv/bin:$PATH"
# install dependencies for cv2
RUN apt-get update && apt-get install make wget ffmpeg libsm6 libxext6 -y
# install torch and torchvision
RUN uv pip install -r requirements.txt
# install sam2
WORKDIR /app/sam2
RUN uv pip install -e .
# Also enable to run sam2 in jupyter notebook for testing
# RUN uv pip install --system -e ".[notebooks]"
# download sam2 model checkpoint
WORKDIR /app
RUN make download-checkpoint
CMD ["uv", "run", "fastapi", "run", "src/app.py", "--port", "80"]
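The Dockerfile only builds the image; to reach the GPU from inside the container, the EC2 host also needs the NVIDIA Container Toolkit, and the compose file has to reserve the GPU. Here is a minimal sketch of such a docker-compose.yml (the service name and port mapping are illustrative, not my exact file):

services:
  eye-tracker:
    build: .
    ports:
      - "80:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

With this in place, deploying or patching the service on the EC2 host really is a single command: docker compose up -d --build.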