The official repo for the technical report "Scalable Mask Annotation for Video Text Spotting"

ViTAE-Transformer/SAMText

Scalable Mask Annotation for Video Text Spotting

This is the official repository of the paper Scalable Mask Annotation for Video Text Spotting.

Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, Dacheng Tao

News | Abstract | Method | Usage | Results | Statement

News

02/05/2023

Abstract

Video text spotting refers to localizing, recognizing, and tracking textual elements such as captions, logos, license plates, signs, and other forms of text within consecutive video frames. However, current datasets available for this task rely on quadrilateral ground truth annotations, which may include excessive background content and yield inaccurate text boundaries. Furthermore, methods trained on these datasets often produce predictions in the form of quadrilateral boxes, which limits their ability to handle complex scenarios such as dense or curved text. To address these issues, we propose a scalable mask annotation pipeline called SAMText for video text spotting. SAMText leverages the SAM model to generate mask annotations for scene text images or video frames at scale. Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced from existing datasets and over 9 million mask annotations. We have also conducted a thorough statistical analysis of the generated masks and their quality, identifying several research topics that could be further explored based on this dataset.

Method

Figure 1: Overview of the SAMText pipeline that builds upon the SAM approach to generate mask annotations for scene text images or video frames at scale. The input bounding box may be sourced from existing annotations or derived from a scene text detection model.
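Since the code has not been released yet, the following is only a minimal sketch of the box-prompted step that Figure 1 describes, not the authors' implementation. It assumes that each quadrilateral annotation is reduced to an axis-aligned box before prompting, and that `predictor` is an object exposing `set_image(image)` and `predict(box=..., multimask_output=...)` in the style of the `SamPredictor` class from the `segment_anything` package. The helper names `quad_to_box` and `annotate_frame` are hypothetical.

```python
import numpy as np


def quad_to_box(quad):
    """Reduce a quadrilateral (four x, y points) to an axis-aligned box [x0, y0, x1, y1]."""
    pts = np.asarray(quad, dtype=np.float32).reshape(-1, 2)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return np.array([x0, y0, x1, y1], dtype=np.float32)


def annotate_frame(predictor, image, quads):
    """Return one boolean H x W mask per quadrilateral annotation in a frame.

    `predictor` is assumed to follow the SamPredictor interface:
    set_image(image) followed by predict(box=..., multimask_output=...).
    """
    predictor.set_image(image)  # compute the image embedding once per frame
    masks = []
    for quad in quads:
        box = quad_to_box(quad)
        # With multimask_output=False a single mask of shape (1, H, W) is expected.
        mask, _, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(np.asarray(mask[0], dtype=bool))
    return masks
```

Passing the predictor in as an argument keeps the sketch independent of any particular SAM checkpoint; boxes produced by a scene text detector could be fed to `predict` directly, skipping `quad_to_box`.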

Usage

The code and dataset will be released soon.

Results

The Quality of Generated Masks

Figure 3: The distribution of IoU between the generated masks and ground truth masks in the COCO-Text training dataset (COCO_Text V2).

To evaluate the performance of SAMText, we select the COCO-Text training dataset [25] as it provides ground truth mask annotations for text instances. Specifically, we randomly sample 10% of the training data and calculate the IoU between the masks generated by SAMText and their corresponding ground truth masks. Our findings show that SAMText has high accuracy, with an average IoU of 0.70. Figure 3 presents the histogram of IoU scores. Notably, the majority of IoU scores are centered around 0.75, suggesting that SAMText performs well.
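The evaluation above boils down to a per-instance mask IoU followed by a mean and a histogram. The sketch below shows one way to compute these with NumPy; the function names, the bin count, and the convention that two empty masks score 0 are assumptions made for illustration, not details taken from the report.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # assumption: a pair of empty masks counts as IoU 0
    return float(np.logical_and(pred, gt).sum() / union)


def iou_summary(pairs, bins=20):
    """Mean IoU and histogram over (generated_mask, ground_truth_mask) pairs."""
    scores = np.array([mask_iou(p, g) for p, g in pairs])
    hist, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return float(scores.mean()), hist, edges
```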

Visualization of Generated Masks

Figure 2: Some visualization results of the generated masks in five datasets using the SAMText pipeline. The top row shows the scene text frames while the bottom row shows the generated masks.

As Figure 2 shows, the generated masks contain fewer background components and align more precisely with the text boundaries than the bounding boxes. As a result, the generated mask annotations facilitate more comprehensive research on this dataset, e.g., video text segmentation and video text spotting using mask annotations.

Dataset Statistics and Analysis

The size distribution.

Figure 4: (a) The mask size distributions of the ICDAR15, RoadText-1k, LSVDT, and DSText datasets. Masks exceeding 10,000 pixels are excluded from the statistics. (b) The mask size distribution of the BOVText dataset. Masks exceeding 80,000 pixels are excluded from the statistics.

The IoU and CoV distribution.

Figure 5: (a) The distribution of IoU between the generated masks and ground truth bounding boxes in each dataset. (b) The CoV distribution of mask size changes for the same text instance in consecutive frames in all five datasets, excluding CoV scores exceeding 1.0 from the statistics.
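The two statistics in Figure 5 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the bounding box is taken to be axis-aligned and rasterized at pixel resolution, and CoV is taken to be the coefficient of variation (population standard deviation divided by mean) of the mask areas along one tracked instance. The function names are hypothetical.

```python
import numpy as np


def mask_box_iou(mask, box):
    """IoU between a binary mask and an axis-aligned box [x0, y0, x1, y1]."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    # Clip the box to the frame so that slicing stays well defined.
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    box_mask = np.zeros_like(mask)
    box_mask[y0:y1, x0:x1] = True
    union = np.logical_or(mask, box_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask, box_mask).sum() / union)


def size_cov(areas):
    """Coefficient of variation of the mask areas of one instance across frames."""
    areas = np.asarray(areas, dtype=np.float64)
    mean = areas.mean()
    if mean <= 0:
        return 0.0  # assumption: an all-empty track has no variation
    return float(areas.std() / mean)  # population standard deviation (ddof=0)
```

A low CoV means the mask area of an instance stays stable from frame to frame; instances with a CoV above 1.0 would be dropped to match the filtering described in the caption.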

The spatial distribution.

Figure 6: Visualization of the heatmaps that depict the spatial distribution of the generated masks in the five video text spotting datasets employed to establish SAMText-9M.
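A spatial heatmap like the one in Figure 6 can be built by accumulating masks on a common grid. The sketch below is one possible approach, not the procedure used in the report: it assumes frames of different resolutions are mapped to a fixed grid by nearest-neighbor sampling and that the result is normalized by its maximum. The function name and the default grid size are hypothetical.

```python
import numpy as np


def accumulate_heatmap(masks, grid=(64, 64)):
    """Sum binary masks of arbitrary resolution on a fixed grid, scaled to [0, 1]."""
    gh, gw = grid
    heat = np.zeros((gh, gw), dtype=np.float64)
    for mask in masks:
        mask = np.asarray(mask, dtype=bool)
        h, w = mask.shape
        # Nearest-neighbor sampling: pick one source pixel per grid cell.
        rows = np.arange(gh) * h // gh
        cols = np.arange(gw) * w // gw
        heat += mask[np.ix_(rows, cols)]
    peak = heat.max()
    return heat / peak if peak > 0 else heat
```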

Statement

This project is for research purposes only. For any other questions, please contact haibinhe@whu.edu.cn.

Citation

If you find SAMText helpful, please consider giving this repo a star :star: and citing:

@article{SAMText,
  title={Scalable Mask Annotation for Video Text Spotting},
  author={He, Haibin and Zhang, Jing and Xu, Mengyang and Liu, Juhua and Du, Bo and Tao, Dacheng},
  journal={arXiv preprint arXiv:2305.01443},
  year={2023}
}
