[Paper List-5] Add 10 textrecog papers #1684

Open · wants to merge 2 commits into base: dev-1.x
@@ -0,0 +1,74 @@
Title: 'ASTER: An Attentional Scene Text Recognizer with Flexible Rectification'
Abbreviation: ASTER
Tasks:
- TextRecog
Venue: TPAMI
Year: 2018
Lab/Company:
- Huazhong University of Science and Technology, Wuhan, China
URL:
Venue: 'https://ieeexplore.ieee.org/abstract/document/8395027/'
Arxiv: N/A
Paper Reading URL: N/A
Code: 'https://github.com/ayumiymk/aster.pytorch'
Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/dev-1.x/configs/textrecog/aster'
PaperType:
- Algorithm
Abstract: 'A challenging aspect of scene text recognition is to handle text with
distortions or irregular layout. In particular, perspective text and curved
text are common in natural scenes and are difficult to recognize. In this work,
we introduce ASTER, an end-to-end neural network model that comprises a
rectification network and a recognition network. The rectification network
adaptively transforms an input image into a new one, rectifying the text in it.
It is powered by a flexible Thin-Plate Spline transformation which handles a
variety of text irregularities and is trained without human annotations. The
recognition network is an attentional sequence-to-sequence model that predicts
a character sequence directly from the rectified image. The whole model is
trained end to end, requiring only images and their groundtruth text. Through
extensive experiments, we verify the effectiveness of the rectification and
demonstrate the state-of-the-art recognition performance of ASTER. Furthermore,
we demonstrate that ASTER is a powerful component in end-to-end recognition
systems, for its ability to enhance the detector.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213168893-7c600e03-c1f0-464a-8236-40ae26fbff89.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 86.0
IIIT5K:
WAICS: 93.4
SVT:
WAICS: 93.6
IC13:
WAICS: 94.5
IC15:
WAICS: 76.1
SVTP:
WAICS: 78.5
CUTE:
WAICS: 79.5
Bibtex: '@article{shi2018aster,
title={Aster: An attentional scene text recognizer with flexible rectification},
author={Shi, Baoguang and Yang, Mingkun and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={41},
number={9},
pages={2035--2048},
year={2018},
publisher={IEEE}
}'
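
The record above is a natural place for a concrete picture of the two-stage design the abstract describes: a rectification network that resamples the input image, followed by an attentional sequence-to-sequence recognizer. Below is a minimal PyTorch sketch of the rectification stage only; it substitutes an affine grid for ASTER's Thin-Plate Spline transformation to keep the code short, and every layer size is an assumption for illustration, not the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Rectifier(nn.Module):
    # Predicts a sampling grid from the raw image and resamples the image
    # with it. ASTER drives a Thin-Plate Spline grid from predicted control
    # points; the affine grid below is a simplified stand-in, not the
    # paper's method.
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6))
        # start from the identity transform, as is standard for STN-style nets
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                              # x: (B, 1, H, W)
        theta = self.loc(x).view(-1, 2, 3)             # per-image transform
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

The rectified output feeds the recognition network unchanged, which is what lets the whole model train end to end from images and ground-truth text alone.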
@@ -0,0 +1,68 @@
Title: 'Aggregation Cross-Entropy for Sequence Recognition'
Abbreviation: ACE
Tasks:
- TextRecog
Venue: CVPR
Year: 2019
Lab/Company:
- South China University of Technology
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2019/html/Xie_Aggregation_Cross-Entropy_for_Sequence_Recognition_CVPR_2019_paper.html'
Arxiv: 'https://arxiv.org/abs/1904.08364'
Paper Reading URL: N/A
Code: 'https://github.com/summerlvsong/Aggregation-CrossEntropy'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'In this paper, we propose a novel method, aggregation cross-entropy
(ACE), for sequence recognition from a brand new perspective. The ACE loss
function exhibits competitive performance to CTC and the attention mechanism,
with much quicker implementation (as it involves only four fundamental
formulas), faster inference/back-propagation (approximately O(1) in parallel),
less storage requirement (no parameter and negligible runtime memory), and
convenient employment (by replacing CTC with ACE). Furthermore, the proposed
ACE loss function exhibits two noteworthy properties: (1) it can be directly
applied for 2D prediction by flattening the 2D prediction into 1D prediction
as the input and (2) it requires only characters and their numbers in the
sequence annotation for supervision, which allows it to advance beyond sequence
recognition, e.g., counting problem.'
MODELS:
Architecture:
- CTC
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213173571-fdf09df3-9769-4d52-bf44-6f58c9b5453d.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 79.4
IIIT5K:
WAICS: 82.3
SVT:
WAICS: 82.6
IC13:
WAICS: 89.7
IC15:
WAICS: 68.9
SVTP:
WAICS: 70.1
CUTE:
WAICS: 82.6
Bibtex: '@inproceedings{xie2019aggregation,
title={Aggregation cross-entropy for sequence recognition},
author={Xie, Zecheng and Huang, Yaoxiong and Zhu, Yuanzhi and Jin, Lianwen and Liu, Yuliang and Xie, Lele},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={6538--6547},
year={2019}
}'
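
Since the abstract advertises the ACE loss as involving "only four fundamental formulas", a sketch makes the claim concrete: sum the per-timestep class probabilities over time, normalize by the sequence length T, normalize the label's character counts the same way (treating unused timesteps as blanks), and take a cross-entropy between the two distributions. The PyTorch sketch below follows that reading; the tensor shapes and the blank-at-index-0 convention are my assumptions, not the paper's code.

import torch

def ace_loss(probs, char_counts):
    # probs: (B, T, C) per-timestep class probabilities, class 0 = blank.
    # char_counts: (B, C) occurrences of each class in the label; index 0
    # is ignored on input and overwritten with the implied blank count.
    B, T, C = probs.shape
    counts = char_counts.float().clone()
    counts[:, 0] = T - counts[:, 1:].sum(dim=1)   # leftover timesteps are blanks
    agg = probs.sum(dim=1) / T                    # aggregate, then normalize
    return -(counts / T * torch.log(agg + 1e-10)).sum(dim=1).mean()

Because only the counts are compared, the same call works for 2D predictions after flattening them to (B, H*W, C), which is exactly the property the abstract highlights.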
@@ -0,0 +1,76 @@
Title: 'An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition'
Abbreviation: CRNN
Tasks:
- TextRecog
Venue: TPAMI
Year: 2016
Lab/Company:
- School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China
URL:
Venue: 'https://ieeexplore.ieee.org/abstract/document/7801919/'
Arxiv: 'https://arxiv.org/abs/1507.05717'
Paper Reading URL: N/A
Code: 'https://github.com/bgshih/crnn'
Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/crnn'
PaperType:
- Algorithm
Abstract: 'Image-based sequence recognition has been a longstanding research
topic in computer vision. In this paper, we investigate the problem of scene
text recognition, which is among the most important and challenging tasks in
image-based sequence recognition. A novel neural network architecture, which
integrates feature extraction, sequence modeling and transcription into a unified
framework, is proposed. Compared with previous systems for scene text recognition,
the proposed architecture possesses four distinctive properties: (1) It is
end-to-end trainable, in contrast to most of the existing algorithms whose
components are separately trained and tuned. (2) It naturally handles sequences
in arbitrary lengths, involving no character segmentation or horizontal scale
normalization. (3) It is not confined to any predefined lexicon and achieves
remarkable performances in both lexicon-free and lexicon-based scene text
recognition tasks. (4) It generates an effective yet much smaller model,
which is more practical for real-world application scenarios. The experiments
on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR
datasets, demonstrate the superiority of the proposed algorithm over the prior
arts. Moreover, the proposed algorithm performs well in the task of image-based
music score recognition, which evidently verifies the generality of it.'
MODELS:
Architecture:
- CTC
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213174579-e89dbd14-8ace-4f16-9cb6-4b882dbd4e27.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: 8.3M
Experiment:
Training DataSets:
- MJ
Test DataSets:
Avg.: 81.9
IIIT5K:
WAICS: 78.2
SVT:
WAICS: 80.8
IC13:
WAICS: 86.7
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@article{shi2016end,
title={An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition},
author={Shi, Baoguang and Bai, Xiang and Yao, Cong},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={39},
number={11},
pages={2298--2304},
year={2016},
publisher={IEEE}
}'
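
The "unified framework" in the abstract has three stages: convolutional feature extraction, recurrent sequence modeling over the resulting frame sequence, and CTC transcription. A minimal PyTorch sketch of that pipeline follows; the layer sizes are illustrative and much shallower than the paper's actual configuration.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, height=32):
        super().__init__()
        self.cnn = nn.Sequential(                      # feature extraction
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2))
        self.rnn = nn.LSTM(128 * (height // 4), 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)          # includes the CTC blank

    def forward(self, x):                              # x: (B, 1, height, W)
        f = self.cnn(x)                                # (B, 128, height/4, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)           # width becomes the sequence axis
        h, _ = self.rnn(f)                             # sequence modeling
        return self.fc(h).log_softmax(-1)              # per-frame log-probs

Training pairs this output with torch.nn.CTCLoss (after transposing to time-major order), which is what makes the model end-to-end trainable without character-level segmentation.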
@@ -0,0 +1,74 @@
Title: 'Attention after Attention: Reading Text in the Wild with Cross Attention'
Abbreviation: Huang et al.
Tasks:
- TextRecog
Venue: ICDAR
Year: 2019
Lab/Company:
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China
URL:
Venue: 'https://ieeexplore.ieee.org/abstract/document/8977967/'
Arxiv: N/A
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recent methods mostly regarded scene text recognition as a
sequence-to-sequence problem. These methods roughly transform the image into a
feature sequence and use the algorithms for sequence-to-sequence problem like
CTC or attention to decode the characters. However, text in images is distributed
in a two-dimensional (2D) space and roughly converting the features of text
into a feature sequence may introduce extra noise, especially if the text is
irregular. In this paper, we propose a novel framework named cross attention
network, which learns to attend to local features of a 2D feature map
corresponding to individual characters. The network contains two 1D attention
networks, which operate harmoniously in two directions. Thus, one of the
attention modules vertically attends to the features corresponding to the whole
text of 2D features and the other horizontal module selects the local features
to decode individual characters. Extensive experiments are performed on various
regular benchmarks, including SVT, ICDAR2003, ICDAR2013, and IIIT5K-Words,
which demonstrate that the proposed model either outperforms or is comparable
to all previous methods. Moreover, the model is evaluated on irregular benchmarks
including SVTPerspective, CUTE80 and ICDAR 2015. The performance on irregular
benchmarks shows the robustness of our model.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213172663-4c3c5ea1-84b8-40e5-8453-e389c7ee5595.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 86.4
IIIT5K:
WAICS: 94.5
SVT:
WAICS: 90.0
IC13:
WAICS: 94.2
IC15:
WAICS: 75.3
SVTP:
WAICS: 79.8
CUTE:
WAICS: 84.7
Bibtex: '@inproceedings{huang2019attention,
title={Attention after attention: Reading text in the wild with cross attention},
booktitle={Proceedings of the International Conference on Document Analysis and Recognition (ICDAR)},
year={2019}
}'
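
The abstract's core idea is that two 1D attention modules working in different directions can read a 2D feature map without flattening it up front: one attends vertically to pick out the text region, the other attends horizontally to select features for individual characters. Below is a minimal PyTorch sketch of one such decoding step; since no code is linked above, all names and sizes are my own illustration rather than the authors' design.

import torch
import torch.nn as nn

class CrossAttention1D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.v_score = nn.Conv2d(channels, 1, 1)       # vertical attention scores
        self.h_score = nn.Conv1d(channels, 1, 1)       # horizontal attention scores

    def forward(self, fmap):                           # fmap: (B, C, H, W)
        a_v = self.v_score(fmap).softmax(dim=2)        # attend over the height axis
        seq = (fmap * a_v).sum(dim=2)                  # (B, C, W): text collapsed to 1D
        a_h = self.h_score(seq).softmax(dim=2)         # attend over the width axis
        return (seq * a_h).sum(dim=2)                  # (B, C): one character's glimpse

A real decoder would condition the horizontal scores on its hidden state so each step selects a different character; the sketch only shows how the two 1D attentions compose.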
@@ -0,0 +1,75 @@
Title: 'Decoupled Attention Network for Text Recognition'
Abbreviation: DAN
Tasks:
- TextRecog
Venue: AAAI
Year: 2020
Lab/Company:
- School of Electronic and Information Engineering, South China University of Technology
- Lenovo Research
URL:
Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6903'
Arxiv: 'https://arxiv.org/abs/1912.10205'
Paper Reading URL: N/A
Code: 'https://github.com/Wang-Tianwei/Decoupled-attention-network'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Text recognition has attracted considerable research interests because
of its various applications. The cutting-edge text recognition methods are
based on attention mechanisms. However, most attention methods suffer from a
serious alignment problem due to their recurrent alignment operation, where
the alignment relies on historical decoding results. To remedy this
issue, we propose a decoupled attention network (DAN), which decouples the
alignment operation from using historical decoding results. DAN is an effective,
flexible and robust end-to-end text recognizer, which consists of three
components: 1) a feature encoder that extracts visual features from the input
image; 2) a convolutional alignment module that performs the alignment
operation based on visual features from the encoder; and 3) a decoupled text
decoder that makes final prediction by jointly using the feature map and
attention maps. Experimental results show that DAN achieves state-of-the-art
performance on multiple text recognition tasks, including offline handwritten
text recognition and regular/irregular scene text recognition. Codes will be
released.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213171943-35e9c57c-fdce-4866-91c4-a47dad9a7b3b.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 86.0
IIIT5K:
WAICS: 94.3
SVT:
WAICS: 89.2
IC13:
WAICS: 93.9
IC15:
WAICS: 74.5
SVTP:
WAICS: 80.0
CUTE:
WAICS: 84.4
Bibtex: '@inproceedings{wang2020decoupled,
title={Decoupled attention network for text recognition},
author={Wang, Tianwei and Zhu, Yuanzhi and Jin, Lianwen and Luo, Canjie and Chen, Xiaoxue and Wu, Yaqiang and Wang, Qianying and Cai, Mingxiang},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={34},
number={07},
pages={12216--12224},
year={2020}
}'
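
DAN's contribution, per the abstract, is the decoupling itself: attention maps come from a convolutional alignment module driven by visual features alone, so alignment never depends on historical decoding results. The sketch below illustrates that structure in PyTorch; the layer choices and the max_len convention are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class ConvAlignment(nn.Module):
    # Predicts one attention map per decoding step directly from the
    # encoder's feature map, with no feedback from previous predictions.
    def __init__(self, channels, max_len):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, max_len, 1))

    def forward(self, fmap):                           # fmap: (B, C, H, W)
        maps = self.conv(fmap)                         # (B, max_len, H, W)
        B, T, H, W = maps.shape
        return maps.view(B, T, -1).softmax(-1).view(B, T, H, W)

def decoupled_decode(fmap, attn_maps, classifier):
    # one glimpse per step, computed jointly from features and attention maps
    glimpses = torch.einsum('bchw,bthw->btc', fmap, attn_maps)
    return classifier(glimpses)                        # (B, max_len, num_classes)

Because the alignment is fixed before decoding starts, a misread character cannot corrupt the attention for later steps, which is the failure mode the abstract calls the alignment problem.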