VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

▶ Huazhong University of Science and Technology ▶ South China University of Technology ▶ Zhejiang University


🔥 VimTS is a unified video and image text spotter designed to improve cross-domain generalization. It outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks, including TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, it surpasses the previous end-to-end video text spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% in MOTA, using only image-level data.

Video

Framework

Overall framework of our method.

Overall framework of the CoDeF-based synthesis method.

VTD-368k

We manually collect and filter text-free, open-source, and unrestricted videos from NExT-QA, Charades-Ego, Breakfast, A2D, MPI-Cooking, ActorShift, and Hollywood. Built on CoDeF, our synthesis method propagates text flow realistically and stably, significantly reducing distortions.

Benchmark Experiments

For image-level cross-domain text spotting, we evaluate VimTS on six cross-domain scenarios. For video-level cross-domain text spotting, we evaluate it on two popular video text spotting benchmarks. The results are presented below.
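For readers unfamiliar with MOTA, the tracking metric cited for the video benchmarks, the following is a minimal sketch of its standard definition, MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames. The function name `mota` and the example counts are hypothetical and for illustration only; they are not taken from the paper or its evaluation code.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Standard MOTA: 1 - (FN + FP + IDSW) / GT.

    All counts are totals accumulated over every frame of the sequence.
    The score can be negative when errors outnumber ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Hypothetical counts for illustration only (not results from the paper).
score = mota(false_negatives=120, false_positives=60, id_switches=20, num_gt=1000)
print(f"MOTA = {score:.3f}")
```

Because misses, false alarms, and identity switches all enter the numerator, a gain in MOTA can come from better detection, better tracking, or both.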

Conclusion: Our method shows that knowledge learned from still text images can transfer well to video text. Since still images require significantly less annotation effort than video frames, methods that bridge this domain gap are highly valuable. Furthermore, we show that current Large Multimodal Models still face limitations in cross-domain text spotting. Improving their generalization in text spotting with fewer parameters and less data is worth further exploration.

Visualizations

Comparison with MLLMs

BibTeX

@misc{liuvimts,
  author={Liu, Yuliang and Huang, Mingxin and Yan, Hao and Deng, Linger and Wu, Weijia and Lu, Hao and Shen, Chunhua and Jin, Lianwen and Bai, Xiang},
  title={VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization},
  publisher={arXiv preprint arXiv:2404.19652},
  year={2024},
}