Segment Anything
You can try a demo on the official Segment Anything site. The SAM model can do flexible segmentation using prompts (text, points in the image, bounding boxes, etc.).
Made zero-shot segmentation possible through promptable segmentation
Built the Segment Anything Model (SAM)
Built the SA-1B dataset with a data engine based on SAM itself
In this post, I am going to focus on the model structure. If you are curious how the SAM team built the SA-1B dataset, please read the paper.
Segment Anything is similar to other visual multi-modal models. It takes an image and a prompt, and outputs a segmentation mask.
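To make this input/output interface concrete, here is a minimal usage sketch with the official `segment_anything` package; the checkpoint path, image path, and point coordinates are placeholders, not values from the paper.

```python
# Minimal point-prompt example with the official `segment_anything` package.
# The checkpoint path, image path, and point coordinates below are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the heavy image encoder runs once per image

# Prompt: a single foreground point (x, y); label 1 = foreground, 0 = background.
masks, iou_scores, low_res_logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return 3 candidate masks to handle ambiguity
)
best_mask = masks[iou_scores.argmax()]  # keep the mask the IoU head ranks highest
```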
Let's take a look at each component.
The mask decoder has a complicated structure.
Output tokens are passed through an MLP to generate multiple masks, and the model also predicts an IoU score for each output token.
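To make the output-head step concrete, here is a simplified PyTorch sketch (not the released implementation): each mask token goes through a small MLP whose result is dotted with the upscaled image embedding to produce a mask, while a separate MLP on the IoU token predicts one quality score per mask. Dimensions and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskHeads(nn.Module):
    """Sketch of the decoder's output heads (simplified; dims are illustrative)."""
    def __init__(self, dim=256, num_masks=3):
        super().__init__()
        # One small MLP per mask token: maps the token to per-channel weights
        # that are dotted with the upscaled image embedding.
        self.mask_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim // 8))
             for _ in range(num_masks)]
        )
        # MLP on the IoU token: predicts one quality score per mask.
        self.iou_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_masks))

    def forward(self, mask_tokens, iou_token, upscaled_embedding):
        # mask_tokens: (B, num_masks, dim), iou_token: (B, dim)
        # upscaled_embedding: (B, dim // 8, H, W)
        B, C, H, W = upscaled_embedding.shape
        weights = torch.stack(
            [mlp(mask_tokens[:, i]) for i, mlp in enumerate(self.mask_mlps)], dim=1
        )  # (B, num_masks, dim // 8)
        masks = (weights @ upscaled_embedding.flatten(2)).view(B, -1, H, W)  # mask logits
        iou_pred = self.iou_head(iou_token)  # (B, num_masks) predicted IoU scores
        return masks, iou_pred
```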
Since SAM can output many valid masks for a single prompt, there is ambiguity. For example, if I pick a point on the torso of a person wearing clothes, a reasonable mask could cover the shirt, the shirt plus pants, or the whole person. To resolve this ambiguity, SAM predicts 3 candidate masks and an IoU score for each; at inference the highest-scoring mask can be chosen, and during training only the candidate with the lowest loss is back-propagated.
For the 3 candidate masks, the loss is computed as a combination of two terms (see the sketch after this list):
focal loss: a cross-entropy variant that focuses on hard, misclassified pixels
dice loss: measures the overlap between the ground-truth mask and the generated mask
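Below is a hedged sketch of how these two losses could be combined, assuming 3 candidate mask logits and a binary ground-truth mask. The 20:1 focal-to-dice weighting follows the paper, and only the lowest-loss candidate is back-propagated; this is an illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Per-example sigmoid focal loss; down-weights easy pixels. Returns shape (B,)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).flatten(1).mean(-1)

def dice_loss(logits, target, eps=1.0):
    """Per-example dice loss: 1 minus the overlap between prediction and ground truth."""
    prob = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(-1)
    return 1 - (2 * inter + eps) / (prob.sum(-1) + target.sum(-1) + eps)

def sam_style_loss(mask_logits, gt_mask, focal_weight=20.0, dice_weight=1.0):
    """mask_logits: (B, 3, H, W) candidate masks; gt_mask: (B, H, W) binary mask."""
    per_mask = torch.stack([
        focal_weight * focal_loss(mask_logits[:, i], gt_mask)
        + dice_weight * dice_loss(mask_logits[:, i], gt_mask)
        for i in range(mask_logits.shape[1])
    ], dim=1)                                  # (B, 3) loss per candidate
    return per_mask.min(dim=1).values.mean()   # back-propagate only the best candidate
```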
The image encoder uses a ViT pre-trained with MAE. ViT-MAE is a ViT trained with a special technique: it masks a large portion of the image patches and trains an encoder and decoder to reconstruct the original image.
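As a toy illustration of the MAE idea (not the actual MAE code), the sketch below drops a random 75% of the patch tokens; only the visible patches are encoded, and a lightweight decoder is later trained to reconstruct the masked ones.

```python
import torch

def random_mask_patches(patch_tokens, mask_ratio=0.75):
    """Illustrative MAE-style masking: keep a random 25% of patch tokens.

    patch_tokens: (B, N, D) patch embeddings. Returns the visible subset and the
    indices of the kept patches, needed to place reconstructions back in order.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patch_tokens.device)   # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]           # patches the encoder will see
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

# The MAE encoder runs only on `visible`; a small decoder takes the encoded visible
# tokens plus learned mask tokens and is trained to reconstruct the pixels of the
# masked patches (typically with an MSE loss on those patches only).
```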
For text prompts, it uses an off-the-shelf text encoder (the text encoder from the CLIP model).
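The released SAM code does not ship the text-prompt path, but producing such a text embedding with the openai `clip` package would look roughly like this (model variant and prompt text are just examples):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # any CLIP variant works for illustration

tokens = clip.tokenize(["a photo of a cat"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # text embedding used as the prompt
```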
Output tokens are learned embeddings, similar to the [class] token in ViT. They are appended to the prompt tokens and collect the output information inside the decoder block (the orange box in the figure), which is applied two times sequentially.
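A rough sketch of that flow, with an assumed `decoder_block` callable standing in for SAM's two-way attention block (the token counts and dimensions are illustrative):

```python
import torch
import torch.nn as nn

dim = 256
iou_token = nn.Parameter(torch.zeros(1, 1, dim))    # 1 learned IoU token
mask_tokens = nn.Parameter(torch.zeros(1, 3, dim))  # 3 learned mask output tokens

def decode(prompt_tokens, image_embedding, decoder_block):
    """prompt_tokens: (B, N, dim); image_embedding: (B, HW, dim).

    `decoder_block` is a hypothetical callable standing in for SAM's two-way
    attention block; it updates both the tokens and the image embedding.
    """
    B = prompt_tokens.shape[0]
    tokens = torch.cat(
        [iou_token.expand(B, -1, -1), mask_tokens.expand(B, -1, -1), prompt_tokens],
        dim=1,
    )
    for _ in range(2):  # the decoder block is applied two times sequentially
        tokens, image_embedding = decoder_block(tokens, image_embedding)
    # The updated output tokens are what the output heads read from.
    return tokens[:, 0], tokens[:, 1:4], image_embedding
```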