Fast Segment Anything
The paper proposes using a CNN-based detector instead of a Transformer architecture, reporting a roughly 50x faster run-time on the segmentation task.
The authors have replaced the Transformer (ViT) architecture with a YOLOv8 model. The task is also reformulated into two sequential stages: (1) producing segmentation masks for all instances in the image using a CNN-based architecture; and (2) selecting the region of interest that corresponds to the prompt.
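To make the second stage concrete, here is a minimal sketch of prompt-guided selection for a box prompt, assuming stage one has already produced a stack of boolean masks. It picks the mask whose tight bounding box has the highest IoU with the prompt box. The function names (`mask_to_box`, `box_iou`, `select_by_box`) and the toy masks are hypothetical and are not taken from the FastSAM codebase.

```python
import numpy as np

def mask_to_box(mask):
    """Tight (x0, y0, x1, y1) box around a non-empty boolean mask (x1, y1 exclusive)."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_by_box(masks, prompt_box):
    """Stage 2 (sketch): index of the mask that best matches the prompt box."""
    ious = [box_iou(mask_to_box(m), prompt_box) for m in masks]
    return int(np.argmax(ious))

# Stand-in for stage 1 output: two toy masks on an 8x8 image.
masks = np.zeros((2, 8, 8), dtype=bool)
masks[0, 1:4, 1:4] = True   # object in the top-left
masks[1, 4:8, 3:8] = True   # object in the bottom-right

selected = select_by_box(masks, prompt_box=(3, 4, 8, 8))
print(selected)  # 1
```

Because stage one runs once per image, different prompts can reuse the same masks and only this cheap selection step has to be repeated.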
This opens up several potential industrial-grade applications in building extraction from EOS imagery, salient object detection and anomaly detection.
The FastSAM method appears to have some weaknesses, primarily: (1) some small segmentation masks are of low quality yet receive high confidence scores. This is because the confidence score comes from the YOLOv8 model and is not strongly correlated with mask quality; (2) masks of some tiny objects tend to be nearly square, while masks of larger objects tend to have artifacts at the borders of their bounding boxes.
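One rough way to surface both symptoms in practice is to flag small, high-confidence masks that almost completely fill their own bounding box. The sketch below is only an illustrative diagnostic and is not part of the paper; the helper names and the thresholds (`max_area`, `min_fill`, `min_score`) are assumptions chosen for the toy example.

```python
import numpy as np

def box_fill_ratio(mask):
    """Fraction of the mask's tight bounding box covered by the mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0.0
    height = int(ys.max() - ys.min()) + 1
    width = int(xs.max() - xs.min()) + 1
    return float(mask.sum()) / float(height * width)

def flag_suspect_masks(masks, scores, max_area=64, min_fill=0.95, min_score=0.8):
    """Indices of small, confident masks that look box-shaped (hypothetical thresholds)."""
    flagged = []
    for i, (mask, score) in enumerate(zip(masks, scores)):
        is_small = mask.sum() <= max_area
        is_boxy = box_fill_ratio(mask) >= min_fill
        if score >= min_score and is_small and is_boxy:
            flagged.append(i)
    return flagged

# Toy data: a box-shaped mask and a round blob, both small and high-confidence.
square = np.zeros((16, 16), dtype=bool)
square[2:6, 2:6] = True
yy, xx = np.ogrid[:16, :16]
blob = (yy - 8) ** 2 + (xx - 8) ** 2 <= 9

print(flag_suspect_masks([square, blob], scores=[0.9, 0.9]))  # [0]
```

A check like this does not measure mask quality directly; it only highlights candidates worth inspecting, since a genuinely square object would also be flagged.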
Dataset💽: SA-1B Dataset