Segment Anything
You can try a demo on the official Segment Anything site. The SAM model can do flexible segmentation using prompts (text, points in the image, bounding boxes, etc.).
Made zero-shot segmentation possible through promptable segmentation
Built the Segment Anything Model (SAM)
Built the SA-1B dataset with a data engine based on SAM itself
In this post, I am going to focus on the model structure. If you are curious how the SAM team built the SA-1B dataset, please read the paper.
Segment Anything is similar to other visual multi-modal models. It takes an image and a prompt, and outputs a segmentation mask.
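To make this input/output interface concrete, here is a minimal usage sketch with the official `segment_anything` package; the checkpoint path, image path, and point coordinates are placeholders, not values from the paper.

```python
# Minimal point-prompt example with the official `segment_anything` package.
# The checkpoint path, image path, and point coordinates below are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the heavy image encoder runs once per image

# Prompt: a single foreground point (x, y); label 1 = foreground, 0 = background.
masks, iou_scores, low_res_logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return 3 candidate masks to handle ambiguity
)
best_mask = masks[iou_scores.argmax()]  # keep the mask the IoU head ranks highest
```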
Let's take a look at each component.
The mask decoder has a complicated structure.
Output tokens are passed through an MLP to generate multiple masks, and the model also predicts an IoU score for each output token.
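To make the output-head step concrete, here is a simplified PyTorch sketch (not the released implementation): each mask token goes through a small MLP whose result is dotted with the upscaled image embedding to produce a mask, while a separate MLP on the IoU token predicts one quality score per mask. Dimensions and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskHeads(nn.Module):
    """Sketch of the decoder's output heads (simplified; dims are illustrative)."""
    def __init__(self, dim=256, num_masks=3):
        super().__init__()
        # One small MLP per mask token: maps the token to per-channel weights
        # that are dotted with the upscaled image embedding.
        self.mask_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim // 8))
             for _ in range(num_masks)]
        )
        # MLP on the IoU token: predicts one quality score per mask.
        self.iou_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_masks))

    def forward(self, mask_tokens, iou_token, upscaled_embedding):
        # mask_tokens: (B, num_masks, dim), iou_token: (B, dim)
        # upscaled_embedding: (B, dim // 8, H, W)
        B, C, H, W = upscaled_embedding.shape
        weights = torch.stack(
            [mlp(mask_tokens[:, i]) for i, mlp in enumerate(self.mask_mlps)], dim=1
        )  # (B, num_masks, dim // 8)
        masks = (weights @ upscaled_embedding.flatten(2)).view(B, -1, H, W)  # mask logits
        iou_pred = self.iou_head(iou_token)  # (B, num_masks) predicted IoU scores
        return masks, iou_pred
```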
Since SAM can output many valid masks for a single prompt, there is ambiguity. For example, if I pick a point on the torso of a person wearing clothes, a reasonable mask could cover the shirt, the shirt plus pants, or the whole person. To resolve this ambiguity, SAM predicts 3 candidate masks and an IoU score for each; at inference the highest-scoring mask can be chosen, and during training only the candidate with the lowest loss is back-propagated.
For the 3 candidate masks, the loss is computed as a combination of two terms (see the sketch after this list):
focal loss: a cross-entropy variant that focuses on hard, misclassified pixels
dice loss: measures the overlap between the ground-truth mask and the generated mask
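Below is a hedged sketch of how these two losses could be combined, assuming 3 candidate mask logits and a binary ground-truth mask. The 20:1 focal-to-dice weighting follows the paper, and only the lowest-loss candidate is back-propagated; this is an illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Per-example sigmoid focal loss; down-weights easy pixels. Returns shape (B,)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).flatten(1).mean(-1)

def dice_loss(logits, target, eps=1.0):
    """Per-example dice loss: 1 minus the overlap between prediction and ground truth."""
    prob = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(-1)
    return 1 - (2 * inter + eps) / (prob.sum(-1) + target.sum(-1) + eps)

def sam_style_loss(mask_logits, gt_mask, focal_weight=20.0, dice_weight=1.0):
    """mask_logits: (B, 3, H, W) candidate masks; gt_mask: (B, H, W) binary mask."""
    per_mask = torch.stack([
        focal_weight * focal_loss(mask_logits[:, i], gt_mask)
        + dice_weight * dice_loss(mask_logits[:, i], gt_mask)
        for i in range(mask_logits.shape[1])
    ], dim=1)                                  # (B, 3) loss per candidate
    return per_mask.min(dim=1).values.mean()   # back-propagate only the best candidate
```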
The image encoder uses a ViT pre-trained with MAE. ViT-MAE is a ViT trained with a special technique: it masks a large portion of the image patches and trains an encoder and decoder to reconstruct the original image.
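As a toy illustration of the MAE idea (not the actual MAE code), the sketch below drops a random 75% of the patch tokens; only the visible patches are encoded, and a lightweight decoder is later trained to reconstruct the masked ones.

```python
import torch

def random_mask_patches(patch_tokens, mask_ratio=0.75):
    """Illustrative MAE-style masking: keep a random 25% of patch tokens.

    patch_tokens: (B, N, D) patch embeddings. Returns the visible subset and the
    indices of the kept patches, needed to place reconstructions back in order.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patch_tokens.device)   # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]           # patches the encoder will see
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

# The MAE encoder runs only on `visible`; a small decoder takes the encoded visible
# tokens plus learned mask tokens and is trained to reconstruct the pixels of the
# masked patches (typically with an MSE loss on those patches only).
```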
For text prompts, it uses an off-the-shelf text encoder (the text encoder from the CLIP model).
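The released SAM code does not ship the text-prompt path, but producing such a text embedding with the openai `clip` package would look roughly like this (model variant and prompt text are just examples):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # any CLIP variant works for illustration

tokens = clip.tokenize(["a photo of a cat"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # text embedding used as the prompt
```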
Output tokens are learned embeddings, similar to the [class] token in ViT. They are appended to the prompt tokens and collect the output information inside the decoder block (the orange box in the figure), which is applied two times sequentially.
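A rough sketch of that flow, with an assumed `decoder_block` callable standing in for SAM's two-way attention block (the token counts and dimensions are illustrative):

```python
import torch
import torch.nn as nn

dim = 256
iou_token = nn.Parameter(torch.zeros(1, 1, dim))    # 1 learned IoU token
mask_tokens = nn.Parameter(torch.zeros(1, 3, dim))  # 3 learned mask output tokens

def decode(prompt_tokens, image_embedding, decoder_block):
    """prompt_tokens: (B, N, dim); image_embedding: (B, HW, dim).

    `decoder_block` is a hypothetical callable standing in for SAM's two-way
    attention block; it updates both the tokens and the image embedding.
    """
    B = prompt_tokens.shape[0]
    tokens = torch.cat(
        [iou_token.expand(B, -1, -1), mask_tokens.expand(B, -1, -1), prompt_tokens],
        dim=1,
    )
    for _ in range(2):  # the decoder block is applied two times sequentially
        tokens, image_embedding = decoder_block(tokens, image_embedding)
    # The updated output tokens are what the output heads read from.
    return tokens[:, 0], tokens[:, 1:4], image_embedding
```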