Video Instance Segmentation via Multi-Scale Spatio-Temporal Split Attention Transformer
-
- Thawakar, Omkar (author)
- MBZUAI, U Arab Emirates
-
- Narayan, Sanath (author)
- IIAI, U Arab Emirates
-
- Cao, Jiale (author)
- Tianjin Univ, Peoples R China
-
- Cholakkal, Hisham (author)
- MBZUAI, U Arab Emirates
-
- Anwer, Rao Muhammad (author)
- MBZUAI, U Arab Emirates
-
- Khan, Muhammad Haris (author)
- MBZUAI, U Arab Emirates
-
- Khan, Salman (author)
- MBZUAI, U Arab Emirates
-
- Felsberg, Michael (author)
- Linköping University, Computer Vision, Faculty of Science and Engineering
-
- Khan, Fahad (author)
- Linköping University, Computer Vision, Faculty of Science and Engineering; MBZUAI, U Arab Emirates
-
- 2022-10-22
- 2022
- English.
-
In: COMPUTER VISION, ECCV 2022, PT XXIX. - Cham : SPRINGER INTERNATIONAL PUBLISHING AG. - 9783031198175 - 9783031198182 ; , s. 666-681
- Related links:
-
https://urn.kb.se/re...
-
https://doi.org/10.1...
Abstract
- State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during attention computation. We argue that such attention computation ignores the multi-scale spatio-temporal feature relationships that are crucial for tackling target appearance deformations in videos. To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. The proposed MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across the frames of a video. We further introduce an attention block in the decoder to enhance the temporal consistency of the detected instances across the frames of a video. Moreover, an auxiliary discriminator is introduced during training to ensure better foreground-background separability within the multi-scale spatio-temporal feature space. We conduct extensive experiments on two benchmarks, YouTube-VIS 2019 and 2021, and our MS-STS VIS achieves state-of-the-art performance on both. With a ResNet-50 backbone, MS-STS VIS achieves a mask AP of 50.1% on the YouTube-VIS 2019 val. set, outperforming the best previously reported results in the literature by 2.7%, and by 4.8% at the higher overlap threshold of AP75, while being comparable in model size and speed. With a Swin Transformer backbone, MS-STS VIS achieves a mask AP of 61.0% on the YouTube-VIS 2019 val. set.
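The core idea the abstract describes, attending over spatio-temporal features in split fashion (spatial attention within each frame, then temporal attention across frames) at every scale of a feature pyramid, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the single-head scaled dot-product attention, and plain NumPy in place of a trained transformer are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention over the token axis
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def spatio_temporal_split_attention(feats):
    """feats: (T, N, C) array — T frames, N spatial tokens per frame,
    C channels. Split attention: first each frame attends over its own
    spatial tokens, then each spatial position attends across frames."""
    spatial = attention(feats, feats, feats)        # (T, N, C), per-frame
    tokens = np.swapaxes(spatial, 0, 1)             # (N, T, C)
    temporal = attention(tokens, tokens, tokens)    # (N, T, C), per-position
    return np.swapaxes(temporal, 0, 1)              # back to (T, N, C)

def multi_scale_sts_attention(pyramid):
    """pyramid: list of (T, N_s, C) feature maps, one per scale.
    Applies split spatio-temporal attention independently at each scale."""
    return [spatio_temporal_split_attention(f) for f in pyramid]

# usage: a two-scale pyramid of random features for 3 frames
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((3, n, 8)) for n in (16, 4)]
out = multi_scale_sts_attention(pyramid)
```

The split keeps the cost at O(T·N²) + O(N·T²) per scale instead of the O((T·N)²) of joint spatio-temporal attention, which is the usual motivation for factorized designs; how the actual MS-STS module fuses scales and integrates with the deformable-attention encoder is detailed in the paper itself.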
Subject terms
- NATURAL SCIENCES -- Computer and Information Sciences -- Computer Vision and Robotics (hsv//eng)
Publication and content type
- ref (subject category)
- kon (subject category)