Deploying AI Service using EC2

Deploying an AI service

I made a contract with the Hanyang University AI Lab and worked on deploying an AI service using EC2.

The task was quite simple: deploy an eye-tracker model and serve it from a backend server.

Since I was already comfortable with EC2 and the usual production workflow, I expected it to be the same as deploying a web service. But during the work, I found a few key points that make an AI service different from a normal web service.

GPU servers are Expensive!

GPU servers are expensive: even a modest Nvidia T4 instance costs roughly 4x more than a comparable CPU instance. For this reason, I first tried CPU inference. But the inference time was too long, and the client needed to show a demo at CES 2025, so I eventually migrated to a GPU server and used CUDA: about 3 minutes of CPU inference dropped to about 5 seconds on the GPU.

Model should be loaded to GPU before getting the request

AI model checkpoints are stored on disk and loaded into GPU memory when we use them. We have to be careful not to load the model into GPU memory every time a user requests inference.

In the following FastAPI code, I stored the model in a global ml_models dict and initialized it during startup. The ml_models object stays alive until the FastAPI server shuts down, so the server keeps reusing the model (already resident in GPU memory) for inference.

from contextlib import asynccontextmanager
import time
import os
from fastapi import FastAPI
from sam2.build_sam import build_sam2_video_predictor
import torch

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    print(f"device: {device}")
    if device.type == "cuda":
        # use bfloat16 for the entire notebook
        torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
        # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
        if torch.cuda.get_device_properties(0).major >= 8:
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True
    # model_cfg = "sam2_hiera_b+.yaml"
    model_cfg = "sam2_hiera_t.yaml"
    # sam2_ckpt = "src/ckpt/sam2_hiera_base_plus.pt"
    sam2_ckpt = "src/ckpt/sam2_hiera_tiny.pt"
    ml_models["sam2_predictor"] = build_sam2_video_predictor(model_cfg, sam2_ckpt, device=device)
    yield 
    ml_models.clear()

app = FastAPI(lifespan=lifespan)
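
Any endpoint can then reuse the predictor without touching the disk or the GPU loading path again. Below is a minimal sketch; the route name and response shape are hypothetical, and the actual SAM2 tracking calls are omitted:

@app.post("/inference")
async def run_inference():
    # Reuse the predictor loaded during startup; nothing is read from disk
    # or copied to GPU memory inside the request handler.
    predictor = ml_models["sam2_predictor"]
    # ... run the eye-tracking inference with `predictor` here ...
    return {"model_loaded": predictor is not None}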

CD using Docker

I think dockerizing is the most important step in production. With Docker, we can deploy or patch the service with a single command. This was also my first time using CUDA inside a container.

I built a Dockerfile and docker-compose.yml as follows:

I used the official nvidia/cuda image as the base, and used uv exclusively to set up the Python environment.

FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

WORKDIR /app

# Instead of git clone, we copy the whole content into the docker container
COPY . /app

# create the virtualenv; uv pip / uv run discover .venv on their own, and
# activating it in a separate RUN layer would not persist anyway
RUN uv venv --python 3.11.10

# install dependencies for cv2
RUN apt-get update && apt-get install make wget ffmpeg libsm6 libxext6  -y

# install torch and torchvision
RUN uv pip install -r requirements.txt

# install sam2 
WORKDIR /app/sam2

RUN uv pip install -e .

# Also enable to run sam2 in jupyter notebook for testing
# RUN uv pip install --system -e ".[notebooks]"

# download sam2 model checkpoint

WORKDIR /app

RUN make download-checkpoint

CMD ["uv", "run", "fastapi", "run", "src/app.py", "--port", "80"]
And the docker-compose.yml:

services:
  eye-tracker-server:
    container_name: eye-tracker-server
    restart: always
    build:
      context: ..
      dockerfile: .docker/Dockerfile
    ports:
      - "80:80"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
                - utility
                - compute
                - video
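
With this in place, deploying or updating the service on the EC2 host comes down to a single command (assuming the compose file lives in .docker/ and the NVIDIA Container Toolkit is installed on the host):

docker compose -f .docker/docker-compose.yml up -d --build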

References


[1] https://hub.docker.com/r/nvidia/cuda
[2] https://fastapi.tiangolo.com/advanced/events/
[3] https://medium.com/@albertqueralto/enabling-cuda-capabilities-in-docker-containers-51a3566ad014
[4] https://velog.io/@whattsup_kim/GPU-%EA%B0%9C%EB%B0%9C%ED%99%98%EA%B2%BD-%EA%B5%AC%EC%B6%95%ED%95%98%EA%B8%B0-docker%EB%A5%BC-%ED%99%9C%EC%9A%A9%ED%95%98%EC%97%AC-%EA%B0%9C%EB%B0%9C%ED%99%98%EA%B2%BD-%ED%95%9C-%EB%B2%88%EC%97%90-%EA%B5%AC%EC%B6%95%ED%95%98%EA%B8%B0