On top of the Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and achieves remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim to make SAM 2 much more efficient, so that it even runs on mobile devices while maintaining comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries.
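To see why compressing the memories matters, consider a back-of-the-envelope estimate of the cross-attention cost inside a memory attention block. This is a minimal sketch: the feature resolution, memory-bank size, query count, and channel dimension below are illustrative assumptions, not the exact SAM 2 or EdgeTAM configuration.

```python
# Rough cost model: the two matrix multiplications in cross-attention
# (QK^T and attn @ V) each take ~q_tokens * kv_tokens * dim multiply-adds.
def cross_attention_macs(q_tokens: int, kv_tokens: int, dim: int) -> int:
    return 2 * q_tokens * kv_tokens * dim

cur_tokens = 64 * 64                 # current-frame feature map (assumed 64x64)
mem_frames = 7                       # frames kept in the memory bank (assumed)
dense_kv = mem_frames * 64 * 64      # densely stored frame-level memories
compressed_kv = mem_frames * 320     # e.g., a fixed set of 320 queries per frame

for name, kv in [("dense", dense_kv), ("compressed", compressed_kv)]:
    macs = cross_attention_macs(cur_tokens, kv, dim=256)
    print(f"{name}: {macs / 1e9:.2f} GMACs per block")
```

Shrinking the key/value side from thousands of tokens per frame to a fixed set of learned queries reduces the cross-attention cost proportionally, which is the effect the 2D Spatial Perceiver targets.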
Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. EdgeTAM achieves comparable performance on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max. SAM 2 extends SAM to handle both image and video inputs with a memory bank mechanism, and is trained on a new large-scale, multi-grained video tracking dataset (SA-V). Despite achieving astonishing performance compared to previous video object segmentation (VOS) models and allowing more diverse user prompts, SAM 2, as a server-side foundation model, is not efficient for on-device inference, even on a mobile CPU and NPU. (Throughout the paper, we use iPhone and iPhone 15 Pro Max interchangeably for simplicity.) Previous works that optimize SAM for better efficiency only consider squeezing its image encoder, since the mask decoder is extremely lightweight; however, this is not sufficient for SAM 2. Specifically, SAM 2 encodes past frames with a memory encoder, and these frame-level memories, together with object-level pointers (obtained from the mask decoder), serve as the memory bank.
These are then fused with the features of the current frame via memory attention blocks. As these memories are densely encoded, this results in an enormous matrix multiplication during the cross-attention between current-frame features and memory features. Therefore, despite containing relatively fewer parameters than the image encoder, the memory attention has a computational cost that is not affordable for on-device inference. This hypothesis is further supported by Fig. 2, where reducing the number of memory attention blocks almost linearly cuts down the overall decoding latency, and within each memory attention block, removing the cross-attention yields the largest speed-up. To make such a video-based tracking model run on device, EdgeTAM exploits the redundancy in videos. In practice, we propose to compress the raw frame-level memories before performing memory attention. We start with naïve spatial pooling and observe a significant performance degradation, especially when using low-capacity backbones.
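As a concrete reference point, the naïve pooling baseline described above can be written in a couple of lines; the tensor shapes here are assumptions for illustration, not the model's actual dimensions.

```python
import torch
import torch.nn.functional as F

# Naive baseline: average-pool each frame-level memory map before memory
# attention. Shapes (7 frames, 64 channels, 64x64 maps) are assumed.
mem = torch.randn(7, 64, 64, 64)            # (frames, C, H, W)
pooled = F.avg_pool2d(mem, kernel_size=4)   # (7, 64, 16, 16): 16x fewer tokens
kv = pooled.flatten(2).transpose(1, 2)      # (7, 256, 64) tokens for cross-attention
```

This cuts the token count, but as noted above it costs accuracy, which motivates a learned, structure-preserving compression instead.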
However, naïvely incorporating a Perceiver also leads to a severe drop in performance. We hypothesize that, as a dense prediction task, video segmentation requires preserving the spatial structure of the memory bank, which a naïve Perceiver discards. Given these observations, we propose a novel lightweight module, named 2D Spatial Perceiver, that compresses frame-level memory feature maps while preserving their 2D spatial structure. Specifically, we split the learnable queries into two groups. One group functions like the original Perceiver: each query performs global attention over the input features and outputs a single vector as a frame-level summarization. The queries in the other group have 2D priors, i.e., each query is responsible only for compressing a non-overlapping local patch, so the output maintains the spatial structure while reducing the total number of tokens. Along with the architectural improvement, we further propose a distillation pipeline that transfers the knowledge of the powerful SAM 2 teacher to our student model, improving accuracy at no cost in inference overhead.
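A minimal PyTorch sketch of this two-group query design follows. It is an illustration under assumed choices (a 64x64 memory map, 64 channels, 4x4 windows, a single shared attention layer); the actual EdgeTAM module may differ in depth, dimensions, and parameterization.

```python
import torch
import torch.nn as nn

class SpatialPerceiver2D(nn.Module):
    """Sketch: global queries summarize the frame; patch queries keep 2D layout."""
    def __init__(self, dim: int = 64, n_global: int = 64, window: int = 4):
        super().__init__()
        self.global_q = nn.Parameter(torch.randn(n_global, dim))  # global group
        self.patch_q = nn.Parameter(torch.randn(1, dim))          # one query per window
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.window = window

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        B, C, H, W = mem.shape                    # frame-level memory map
        w = self.window
        flat = mem.flatten(2).transpose(1, 2)     # (B, H*W, C)
        # Global group: each query attends over the whole map -> summary vectors.
        g, _ = self.attn(self.global_q.expand(B, -1, -1), flat, flat)
        # Patch group: one query per non-overlapping w x w window, so the
        # outputs stay in raster order and preserve the 2D structure.
        win = mem.unfold(2, w, w).unfold(3, w, w)         # (B, C, H/w, W/w, w, w)
        win = win.permute(0, 2, 3, 1, 4, 5)               # (B, H/w, W/w, C, w, w)
        win = win.reshape(-1, C, w * w).transpose(1, 2)   # (B*Hw*Ww, w*w, C)
        p, _ = self.attn(self.patch_q.expand(win.shape[0], -1, -1), win, win)
        p = p.reshape(B, (H // w) * (W // w), C)          # (B, Hw*Ww, C)
        return torch.cat([g, p], dim=1)           # compressed memory tokens

# 4096 dense tokens per frame -> 64 global + 256 patch tokens.
out = SpatialPerceiver2D()(torch.randn(2, 64, 64, 64))
print(out.shape)  # torch.Size([2, 320, 64])
```

Because each patch query only summarizes its own window, the patch-group outputs can be read back as a downsampled 2D grid, which is what lets the dense mask decoder still exploit spatial locality.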
We find that in both stages, aligning the features from the image encoders of the original SAM 2 and our efficient variant benefits performance. Besides, in the second stage we further align the output features of the memory attention between the SAM 2 teacher and our student model, so that the memory-related modules, in addition to the image encoder, also receive supervision signals from the SAM 2 teacher. This improves performance on SA-V val and test by 1.3 and 3.3, respectively. Putting it all together, we propose EdgeTAM (Track Anything Model for Edge devices), which adopts a 2D Spatial Perceiver for efficiency and knowledge distillation for accuracy. Through a comprehensive benchmark, we show that the latency bottleneck lies in the memory attention module. Given this latency analysis, we propose a 2D Spatial Perceiver that significantly cuts down the computational cost of memory attention with comparable performance, and which can be integrated with any SAM 2 variant. We experiment with a distillation pipeline that performs feature-wise alignment with the original SAM 2 in both the image and video segmentation stages, and observe performance improvements without any additional inference cost.
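The distillation objective described above can be summarized as in the sketch below. The loss form (MSE), the absence of loss weights, and the helper names `image_encoder` and `memory_features` are hypothetical, for illustration only; they are not the paper's confirmed interfaces or recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, frames, stage: int) -> torch.Tensor:
    """Hedged sketch: image-encoder feature alignment in both stages, plus
    memory attention output alignment in stage 2. Interfaces are assumed."""
    with torch.no_grad():                      # teacher (original SAM 2) is frozen
        t_img = teacher.image_encoder(frames)
    s_img = student.image_encoder(frames)
    loss = F.mse_loss(s_img, t_img)            # image-encoder alignment (stages 1 & 2)
    if stage == 2:                             # video segmentation stage only
        with torch.no_grad():
            t_mem = teacher.memory_features(frames)   # output of memory attention
        s_mem = student.memory_features(frames)
        loss = loss + F.mse_loss(s_mem, t_mem)        # memory attention alignment
    return loss
```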