TBN-ViT: Temporal Bilateral Network with Vision Transformer for Video Scene Parsing

12/02/2021
by Bo Yan, et al.

Video scene parsing in the wild, across diverse scenarios, is a challenging task of great significance, especially given the rapid development of autonomous driving. The Video Scene Parsing in the Wild (VSPW) dataset contains well-trimmed, long-temporal, high-resolution clips with dense annotations. Based on VSPW, we design a Temporal Bilateral Network with Vision Transformer (TBN-ViT). We first design a spatial path with convolutions to generate low-level features that preserve spatial information. Meanwhile, a context path with a vision transformer is employed to obtain sufficient context information. Furthermore, a temporal context module is designed to harness inter-frame contextual information. Finally, the proposed method achieves a mean intersection over union (mIoU) of 49.85% on the VSPW2021 Challenge test dataset.
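
For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU is commonly computed for semantic segmentation. It is not the challenge's official evaluation code: the function name `mean_iou`, the flat-list input format, and the `ignore_index=255` convention are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection over union for flat sequences of class labels.

    pred, target: equal-length sequences of integer class ids (one per pixel).
    ignore_index: label skipped during evaluation (assumed convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel, not scored
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the true class and,
            # if valid, of the predicted class.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1

    # Average only over classes that appear in prediction or ground truth.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    print(mean_iou(pred, target, num_classes=3))
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. In practice the counts are usually accumulated over the whole test set before dividing, rather than averaged per image.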
