Segment Anything

Last updated 3 months ago


What does SAM (Segment Anything Model) do?

You can try a demo at the official site. SAM can perform flexible segmentation from prompts (text, points in the image, bounding boxes, etc.).

What makes this paper special?

  1. Made zero-shot segmentation possible through promptable segmentation

  2. Built the Segment Anything Model (SAM)

  3. Built the SA-1B dataset using a data engine based on SAM itself

In this post, I am going to focus on the model structure. If you are curious about how the SAM team built the SA-1B dataset, please read the paper.

Segment Anything Model

Segment Anything is similar to other visual multi-modal models: it takes an image and a prompt, and outputs a segmentation mask.

Let's take a look at each component.
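Before going component by component, here is a rough sketch of the overall dataflow. Everything below is an illustrative stand-in, not the real architecture: the actual image encoder is a ViT, the prompt encoder handles several prompt types, and the decoder fuses via cross-attention rather than addition.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image):
    """Stand-in for the ViT backbone: image -> dense feature grid."""
    h, w, _ = image.shape
    return rng.normal(size=(h // 16, w // 16, 32))   # (H/16, W/16, C) embedding

def prompt_encoder(point):
    """Stand-in for the prompt encoder: a click (x, y) -> one embedding vector."""
    return np.concatenate([np.sin(point), np.cos(point)] * 8)  # (32,)

def mask_decoder(features, prompt_emb):
    """Stand-in for the mask decoder: fuse features with the prompt -> mask logits."""
    return (features + prompt_emb).sum(axis=-1)      # (H/16, W/16) logits

image = rng.normal(size=(64, 64, 3))
features = image_encoder(image)          # computed once per image
mask = mask_decoder(features, prompt_encoder(np.array([0.2, 0.7])))
print(mask.shape)  # (4, 4)
```

One design point this structure enables: the heavy image embedding is computed once per image, so trying many different prompts against the same image is cheap.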

Image Encoder

Prompt Encoder

Mask Decoder

The mask decoder has a more complicated structure.

Output tokens are passed to an MLP, which generates multiple masks. The decoder also predicts an IoU score for each output token.
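Sketched in NumPy (shapes and numbers are illustrative only): the decoder emits several candidate masks plus a predicted IoU per mask, and at inference we can keep the one the model trusts most.

```python
import numpy as np

rng = np.random.default_rng(0)

num_masks, h, w = 3, 4, 4
# stand-ins for the decoder outputs: 3 candidate mask logits + 3 predicted IoU scores
mask_logits = rng.normal(size=(num_masks, h, w))
iou_scores = np.array([0.55, 0.91, 0.73])

best = int(np.argmax(iou_scores))     # pick the mask with the highest predicted IoU
best_mask = mask_logits[best] > 0     # threshold logits into a binary mask
print(best, best_mask.shape)  # 1 (4, 4)
```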

Training Procedure

Since SAM outputs multiple masks for a single prompt, there is ambiguity. For example, if I pick a point on the torso of a person wearing clothes, the model may output a segmentation mask for the shirt, for the shirt plus pants, or for the whole person. To resolve this ambiguity during back-propagation, we pick the 3 masks with the highest IoU scores.

For these 3 masks, we calculate the loss as follows:

criterion = focal loss + dice loss

focal loss: a cross-entropy loss that focuses on mis-classified pixels

dice loss: measures the similarity between the ground-truth mask and the generated mask
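A minimal NumPy sketch of this criterion (the helper names and the unweighted sum are my simplification; the paper trains on logits and weights the two terms):

```python
import numpy as np

def focal_loss(pred, target, gamma=2.0, eps=1e-6):
    """Focal loss: cross-entropy down-weighted on well-classified pixels.
    pred: predicted foreground probabilities in (0, 1); target: {0, 1} mask."""
    pred = np.clip(pred, eps, 1 - eps)
    p_t = np.where(target == 1, pred, 1 - pred)   # probability of the true class
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))

def dice_loss(pred, target, eps=1e-6):
    """Dice loss: 1 - overlap similarity between predicted and ground-truth masks."""
    inter = np.sum(pred * target)
    return float(1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def criterion(pred, target):
    return focal_loss(pred, target) + dice_loss(pred, target)

pred = np.array([[0.9, 0.1], [0.8, 0.2]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])
print(criterion(pred, target) < criterion(1 - pred, target))  # True: a good mask scores lower
```

The `(1 - p_t) ** gamma` factor is what makes the loss "focal": confidently correct pixels contribute almost nothing, so training effort concentrates on the hard, mis-classified pixels.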

References

It uses ViT-MAE as the image encoder. ViT-MAE is a ViT trained with a special technique: it masks some patches of an image and trains an encoder and decoder to recover the original image.
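A toy NumPy illustration of the MAE idea (all shapes and the "decoder" are illustrative; the real model uses a ViT encoder and a small transformer decoder): mask a large fraction of patches, encode only the visible ones, and score the reconstruction on the masked patches only.

```python
import numpy as np

rng = np.random.default_rng(42)

num_patches, patch_dim = 16, 8
patches = rng.normal(size=(num_patches, patch_dim))   # flattened image patches

# MAE masks a large fraction of the patches (75% in the MAE paper)
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)
perm = rng.permutation(num_patches)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

# the encoder only ever sees the visible patches
visible = patches[visible_idx]

# stand-in "decoder": predict the mean visible patch for every masked position
recon = np.tile(visible.mean(axis=0), (num_masked, 1))

# reconstruction loss is computed on the masked patches only
loss = float(np.mean((recon - patches[masked_idx]) ** 2))
print(visible.shape, num_masked)  # (4, 8) 12
```

Because the encoder processes only the visible 25% of patches, MAE pre-training is also much cheaper per image than training on the full patch sequence.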

It uses the off-the-shelf text encoder from the CLIP model.

Output tokens are similar to the class token in ViT image classification. They carry the output information of the orange box in the mask-decoder figure, which is applied twice sequentially.

[1] Segment Anything: https://arxiv.org/abs/2304.02643

[2] https://medium.com/@utkarsh135/segment-anything-model-sam-explained-2900743cb61e

[3] https://developers-shack.tistory.com/13

[4] https://jordano-jackson.tistory.com/121

[5] https://velog.io/@heomollang/DeiT-%EA%B4%80%EB%A0%A8-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-03-AN-IMAGE-IS-WORTH-16X16-WORDSTRANSFORMERS-FOR-IMAGE-RECOGNITION-AT-SCALEViT

[6] https://medium.com/@sanjay_dutta/flower-image-classification-using-vision-transformer-vit-50b71694cda3