diff --git a/paper_zoo/textrecog/ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.yaml b/paper_zoo/textrecog/ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.yaml
new file mode 100644
index 000000000..75ab7e20f
--- /dev/null
+++ b/paper_zoo/textrecog/ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.yaml
@@ -0,0 +1,74 @@
+Title: 'ASTER: An Attentional Scene Text Recognizer with Flexible Rectification'
+Abbreviation: ASTER
+Tasks:
+  - TextRecog
+Venue: TPAMI
+Year: 2018
+Lab/Company:
+  - Huazhong University of Science and Technology, Wuhan, China
+URL:
+  Venue: 'https://ieeexplore.ieee.org/abstract/document/8395027/'
+  Arxiv: N/A
+Paper Reading URL: N/A
+Code: 'https://github.com/ayumiymk/aster.pytorch'
+Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/dev-1.x/configs/textrecog/aster'
+PaperType:
+  - Algorithm
+Abstract: 'A challenging aspect of scene text recognition is to handle text with
+distortions or irregular layout. In particular, perspective text and curved
+text are common in natural scenes and are difficult to recognize. In this work,
+we introduce ASTER, an end-to-end neural network model that comprises a
+rectification network and a recognition network. The rectification network
+adaptively transforms an input image into a new one, rectifying the text in it.
+It is powered by a flexible Thin-Plate Spline transformation which handles a
+variety of text irregularities and is trained without human annotations. The
+recognition network is an attentional sequence-to-sequence model that predicts
+a character sequence directly from the rectified image. The whole model is
+trained end to end, requiring only images and their groundtruth text. Through
+extensive experiments, we verify the effectiveness of the rectification and
+demonstrate the state-of-the-art recognition performance of ASTER. Furthermore,
+we demonstrate that ASTER is a powerful component in end-to-end recognition
+systems, for its ability to enhance the detector.'
+MODELS:
+  Architecture:
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213168893-7c600e03-c1f0-464a-8236-40ae26fbff89.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 86.0
+      IIIT5K:
+        WAICS: 93.4
+      SVT:
+        WAICS: 93.6
+      IC13:
+        WAICS: 94.5
+      IC15:
+        WAICS: 76.1
+      SVTP:
+        WAICS: 78.5
+      CUTE:
+        WAICS: 79.5
+Bibtex: '@article{shi2018aster,
+  title={Aster: An attentional scene text recognizer with flexible rectification},
+  author={Shi, Baoguang and Yang, Mingkun and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
+  journal={IEEE transactions on pattern analysis and machine intelligence},
+  volume={41},
+  number={9},
+  pages={2035--2048},
+  year={2018},
+  publisher={IEEE}
+}'
diff --git a/paper_zoo/textrecog/Aggregation Cross-Entropy for Sequence Recognition.yaml b/paper_zoo/textrecog/Aggregation Cross-Entropy for Sequence Recognition.yaml
new file mode 100644
index 000000000..d392b46c2
--- /dev/null
+++ b/paper_zoo/textrecog/Aggregation Cross-Entropy for Sequence Recognition.yaml
@@ -0,0 +1,68 @@
+Title: 'Aggregation Cross-Entropy for Sequence Recognition'
+Abbreviation: ACE
+Tasks:
+  - TextRecog
+Venue: CVPR
+Year: 2019
+Lab/Company:
+  - South China University of Technology
+URL:
+  Venue: 'http://openaccess.thecvf.com/content_CVPR_2019/html/Xie_Aggregation_Cross-Entropy_for_Sequence_Recognition_CVPR_2019_paper.html'
+  Arxiv: 'https://arxiv.org/abs/1904.08364'
+Paper Reading URL: N/A
+Code: 'https://github.com/summerlvsong/Aggregation-CrossEntropy'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'In this paper, we propose a novel method, aggregation cross-entropy
+(ACE), for sequence recognition from a brand new perspective. The ACE loss
+function exhibits competitive performance to CTC and the attention mechanism,
+with much quicker implementation (as it involves only four fundamental
+formulas), faster inference/back-propagation (approximately O(1) in parallel),
+less storage requirement (no parameter and negligible runtime memory), and
+convenient employment (by replacing CTC with ACE). Furthermore, the proposed
+ACE loss function exhibits two noteworthy properties: (1) it can be directly
+applied for 2D prediction by flattening the 2D prediction into 1D prediction
+as the input and (2) it requires only characters and their numbers in the
+sequence annotation for supervision, which allows it to advance beyond sequence
+recognition, e.g., counting problem.'
+MODELS:
+  Architecture:
+    - CTC
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213173571-fdf09df3-9769-4d52-bf44-6f58c9b5453d.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 79.4
+      IIIT5K:
+        WAICS: 82.3
+      SVT:
+        WAICS: 82.6
+      IC13:
+        WAICS: 89.7
+      IC15:
+        WAICS: 68.9
+      SVTP:
+        WAICS: 70.1
+      CUTE:
+        WAICS: 82.6
+Bibtex: '@inproceedings{xie2019aggregation,
+  title={Aggregation cross-entropy for sequence recognition},
+  author={Xie, Zecheng and Huang, Yaoxiong and Zhu, Yuanzhi and Jin, Lianwen and Liu, Yuliang and Xie, Lele},
+  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
+  pages={6538--6547},
+  year={2019}
+}'
diff --git a/paper_zoo/textrecog/An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.yaml b/paper_zoo/textrecog/An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.yaml
new file mode 100644
index 000000000..26bb3e0c9
--- /dev/null
+++ b/paper_zoo/textrecog/An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition.yaml
@@ -0,0 +1,76 @@
+Title: 'An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition'
+Abbreviation: CRNN
+Tasks:
+  - TextRecog
+Venue: TPAMI
+Year: 2016
+Lab/Company:
+  - School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China
+URL:
+  Venue: 'https://ieeexplore.ieee.org/abstract/document/7801919/'
+  Arxiv: 'https://arxiv.org/abs/1507.05717'
+Paper Reading URL: N/A
+Code: 'https://github.com/bgshih/crnn'
+Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/crnn'
+PaperType:
+  - Algorithm
+Abstract: 'Image-based sequence recognition has been a longstanding research
+topic in computer vision. In this paper, we investigate the problem of scene
+text recognition, which is among the most important and challenging tasks in
+image-based sequence recognition. A novel neural network architecture, which
+integrates feature extraction, sequence modeling and transcription into a unified
+framework, is proposed. Compared with previous systems for scene text recognition,
+the proposed architecture possesses four distinctive properties: (1) It is
+end-to-end trainable, in contrast to most of the existing algorithms whose
+components are separately trained and tuned. (2) It naturally handles sequences
+in arbitrary lengths, involving no character segmentation or horizontal scale
+normalization. (3) It is not confined to any predefined lexicon and achieves
+remarkable performances in both lexicon-free and lexicon-based scene text
+recognition tasks. (4) It generates an effective yet much smaller model,
+which is more practical for real-world application scenarios. The experiments
+on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR
+datasets, demonstrate the superiority of the proposed algorithm over the prior
+arts. Moreover, the proposed algorithm performs well in the task of image-based
+music score recognition, which evidently verifies the generality of it.'
+MODELS:
+  Architecture:
+    - CTC
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213174579-e89dbd14-8ace-4f16-9cb6-4b882dbd4e27.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: 8.3M
+  Experiment:
+    Training DataSets:
+      - MJ
+    Test DataSets:
+      Avg.: 81.9
+      IIIT5K:
+        WAICS: 78.2
+      SVT:
+        WAICS: 80.8
+      IC13:
+        WAICS: 86.7
+      IC15:
+        WAICS: N/A
+      SVTP:
+        WAICS: N/A
+      CUTE:
+        WAICS: N/A
+Bibtex: '@article{shi2016end,
+  title={An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition},
+  author={Shi, Baoguang and Bai, Xiang and Yao, Cong},
+  journal={IEEE transactions on pattern analysis and machine intelligence},
+  volume={39},
+  number={11},
+  pages={2298--2304},
+  year={2016},
+  publisher={IEEE}
+}'
diff --git a/paper_zoo/textrecog/Attention after Attention: Reading Text in the Wild with Cross Attention.yaml b/paper_zoo/textrecog/Attention after Attention: Reading Text in the Wild with Cross Attention.yaml
new file mode 100644
index 000000000..3741cc57b
--- /dev/null
+++ b/paper_zoo/textrecog/Attention after Attention: Reading Text in the Wild with Cross Attention.yaml
@@ -0,0 +1,74 @@
+Title: 'Attention after Attention: Reading Text in the Wild with Cross Attention'
+Abbreviation: Huang et al.
+Tasks:
+  - TextRecog
+Venue: ICDAR
+Year: 2019
+Lab/Company:
+  - School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China
+URL:
+  Venue: 'https://ieeexplore.ieee.org/abstract/document/8977967/'
+  Arxiv: N/A
+Paper Reading URL: N/A
+Code: N/A
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Recent methods mostly regarded scene text recognition as a
+sequence-to-sequence problem. These methods roughly transform the image into a
+feature sequence and use the algorithms for sequence-to-sequence problem like
+CTC or attention to decode the characters. However, text in images is distributed
+in a two-dimensional (2D) space and roughly converting the features of text
+into a feature sequence may introduce extra noise, especially if the text is
+irregular. In this paper, we propose a novel framework named cross attention
+network, which learns to attend to local features of a 2D feature map
+corresponding to individual characters. The network contains two 1D attention
+networks, which operates harmoniously in two directions. Thus, one of the
+attention modules vertically attends to the features corresponding to the whole
+text of 2D features and the other horizontal module selects the local features
+to decode individual characters. Extensive experiments are performed on various
+regular benchmarks, including SVT, ICDAR2003, ICDAR2013, and IIIT5K-Words,
+which demonstrate that the proposed model either outperforms or is comparable
+to all previous methods. Moreover, the model is evaluated on irregular benchmarks
+including SVTPerspective, CUTE80 and ICDAR 2015. The performance on irregular
+benchmarks shows the robustness of our model.'
+MODELS:
+  Architecture:
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213172663-4c3c5ea1-84b8-40e5-8453-e389c7ee5595.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 86.4
+      IIIT5K:
+        WAICS: 94.5
+      SVT:
+        WAICS: 90.0
+      IC13:
+        WAICS: 94.2
+      IC15:
+        WAICS: 75.3
+      SVTP:
+        WAICS: 79.8
+      CUTE:
+        WAICS: 84.7
+Bibtex: '@inproceedings{huang2019attention,
+  title={Attention after attention: Reading text in the wild with cross attention},
+  booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
+  year={2019},
+  organization={IEEE}
+}'
diff --git a/paper_zoo/textrecog/Decoupled Attention Network for Text Recognition.yaml b/paper_zoo/textrecog/Decoupled Attention Network for Text Recognition.yaml
new file mode 100644
index 000000000..6a44aaab0
--- /dev/null
+++ b/paper_zoo/textrecog/Decoupled Attention Network for Text Recognition.yaml
@@ -0,0 +1,75 @@
+Title: 'Decoupled Attention Network for Text Recognition'
+Abbreviation: DAN
+Tasks:
+  - TextRecog
+Venue: AAAI
+Year: 2020
+Lab/Company:
+  - School of Electronic and Information Engineering, South China University of Technology
+  - Lenovo Research
+URL:
+  Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6903'
+  Arxiv: 'https://arxiv.org/abs/1912.10205'
+Paper Reading URL: N/A
+Code: 'https://github.com/Wang-Tianwei/Decoupled-attention-network'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Text recognition has attracted considerable research interests because
+of its various applications. The cutting-edge text recognition methods are
+based on attention mechanisms. However, most of attention methods usually
+suffer from serious alignment problem due to its recurrency alignment operation,
+where the alignment relies on historical decoding results. To remedy this
+issue, we propose a decoupled attention network (DAN), which decouples the
+alignment operation from using historical decoding results. DAN is an effective,
+flexible and robust end-to-end text recognizer, which consists of three
+components: 1) a feature encoder that extracts visual features from the input
+image; 2) a convolutional alignment module that performs the alignment
+operation based on visual features from the encoder; and 3) a decoupled text
+decoder that makes final prediction by jointly using the feature map and
+attention maps. Experimental results show that DAN achieves state-of-the-art
+performance on multiple text recognition tasks, including offline handwritten
+text recognition and regular/irregular scene text recognition. Codes will be
+released.'
+MODELS:
+  Architecture:
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213171943-35e9c57c-fdce-4866-91c4-a47dad9a7b3b.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 86.0
+      IIIT5K:
+        WAICS: 94.3
+      SVT:
+        WAICS: 89.2
+      IC13:
+        WAICS: 93.9
+      IC15:
+        WAICS: 74.5
+      SVTP:
+        WAICS: 80.0
+      CUTE:
+        WAICS: 84.4
+Bibtex: '@inproceedings{wang2020decoupled,
+  title={Decoupled attention network for text recognition},
+  author={Wang, Tianwei and Zhu, Yuanzhi and Jin, Lianwen and Luo, Canjie and Chen, Xiaoxue and Wu, Yaqiang and Wang, Qianying and Cai, Mingxiang},
+  booktitle={Proceedings of the AAAI conference on artificial intelligence},
+  volume={34},
+  number={07},
+  pages={12216--12224},
+  year={2020}
+}'
diff --git a/paper_zoo/textrecog/Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition.yaml b/paper_zoo/textrecog/Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition.yaml
new file mode 100644
index 000000000..009c99b5a
--- /dev/null
+++ b/paper_zoo/textrecog/Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition.yaml
@@ -0,0 +1,76 @@
+Title: 'Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition'
+Abbreviation: ABINet
+Tasks:
+  - TextRecog
+Venue: CVPR
+Year: 2021
+Lab/Company:
+  - University of Science and Technology of China
+URL:
+  Venue: 'http://openaccess.thecvf.com/content/CVPR2021/html/Fang_Read_Like_Humans_Autonomous_Bidirectional_and_Iterative_Language_Modeling_for_CVPR_2021_paper.html'
+  Arxiv: 'https://arxiv.org/abs/2103.06495'
+Paper Reading URL: 'https://mp.weixin.qq.com/s/blBkim58-sUBR0EOxDvtvA'
+Code: 'https://github.com/FangShancheng/ABINet'
+Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/dev-1.x/configs/textrecog/abinet'
+PaperType:
+  - Algorithm
+Abstract: 'Linguistic knowledge is of great benefit to scene text recognition.
+However, how to effectively model linguistic rules in end-to-end deep
+networks remains a research challenge. In this paper, we argue that the
+limited capacity of language models comes from: 1) implicitly language
+modeling; 2) unidirectional feature representation; and 3) language model
+with noise input. Correspondingly, we propose an autonomous, bidirectional
+and iterative ABINet for scene text recognition. Firstly, the autonomous
+suggests to block gradient flow between vision and language models to enforce
+explicitly language modeling. Secondly, a novel bidirectional cloze network
+(BCN) as the language model is proposed based on bidirectional feature
+representation. Thirdly, we propose an execution manner of iterative correction
+for language model which can effectively alleviate the impact of noise input.
+Additionally, based on the ensemble of iterative predictions, we propose a
+self-training method which can learn from unlabeled images effectively.
+Extensive experiments indicate that ABINet has superiority on low-quality images
+and achieves state-of-the-art results on several mainstream benchmarks.
+Besides, the ABINet trained with ensemble self-training shows promising
+improvement in realizing human-level recognition. Code is available at
+https://github.com/FangShancheng/ABINet.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+    - Semi-Supervised
+  Language Modality:
+    - Explicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213165915-d719091f-febe-4a57-b51f-a71b26afb543.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 93.6
+      IIIT5K:
+        WAICS: 97.2
+      SVT:
+        WAICS: 95.5
+      IC13:
+        WAICS: 97.7
+      IC15:
+        WAICS: 86.9
+      SVTP:
+        WAICS: 89.9
+      CUTE:
+        WAICS: 94.1
+Bibtex: '@inproceedings{fang2021read,
+  title={Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition},
+  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
+  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+  pages={7098--7107},
+  year={2021}
+}'
diff --git a/paper_zoo/textrecog/Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition.yaml b/paper_zoo/textrecog/Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition.yaml
new file mode 100644
index 000000000..325a0531f
--- /dev/null
+++ b/paper_zoo/textrecog/Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition.yaml
@@ -0,0 +1,72 @@
+Title: 'Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition'
+Abbreviation: SAR
+Tasks:
+  - TextRecog
+Venue: AAAI
+Year: 2019
+Lab/Company:
+  - Australian Centre for Robotic Vision, The University of Adelaide, Australia
+  - School of Computer Science, Northwestern Polytechnical University, China
+URL:
+  Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/4881'
+  Arxiv: 'https://arxiv.org/abs/1811.00751'
+Paper Reading URL: N/A
+Code: 'https://github.com/wangpengnorman/SAR-Strong-Baseline-for-Text-Recognition'
+Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/dev-1.x/configs/textrecog/sar'
+PaperType:
+  - Algorithm
+Abstract: 'Recognizing irregular text in natural scene images is challenging due
+to the large variance in text appearance, such as curvature, orientation and
+distortion. Most existing approaches rely heavily on sophisticated model designs
+and/or extra fine-grained annotations, which, to some extent, increase the
+difficulty in algorithm implementation and data collection. In this work, we
+propose an easy-to-implement strong baseline for irregular scene text
+recognition, using off-the-shelf neural network components and only word-level
+annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder
+framework and a 2-dimensional attention module. Despite its simplicity, the
+proposed method is robust. It achieves state-of-the-art performance on irregular
+text recognition benchmarks and comparable results on regular text datasets.
+Code is available at: https://tinyurl.com/ShowAttendRead'
+MODELS:
+  Architecture:
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213175678-efcb4452-b80b-4e0b-8d36-57fbfca2668b.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+      - Real
+    Test DataSets:
+      Avg.: 89.2
+      IIIT5K:
+        WAICS: 95.0
+      SVT:
+        WAICS: 91.2
+      IC13:
+        WAICS: 94.0
+      IC15:
+        WAICS: 78.8
+      SVTP:
+        WAICS: 86.4
+      CUTE:
+        WAICS: 89.6
+Bibtex: '@inproceedings{li2019show,
+  title={Show, attend and read: A simple and strong baseline for irregular text recognition},
+  author={Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu},
+  booktitle={Proceedings of the AAAI conference on artificial intelligence},
+  volume={33},
+  number={01},
+  pages={8610--8617},
+  year={2019}
+}'
diff --git a/paper_zoo/textrecog/Text Recognition in the Wild: A Survey.yaml b/paper_zoo/textrecog/Text Recognition in the Wild: A Survey.yaml
new file mode 100644
index 000000000..63f824512
--- /dev/null
+++ b/paper_zoo/textrecog/Text Recognition in the Wild: A Survey.yaml
@@ -0,0 +1,41 @@
+Title: 'Text Recognition in the Wild: A Survey'
+Abbreviation: Chen et al.
+Tasks:
+  - TextRecog
+Venue: Others
+Year: 2021
+Lab/Company:
+  - College of Electronic and Information Engineering, South China University of Technology, China
+URL:
+  Venue: 'https://dl.acm.org/doi/abs/10.1145/3440756'
+  Arxiv: 'https://arxiv.org/abs/2005.03492'
+Paper Reading URL: N/A
+Code: 'https://github.com/HCIILAB/Scene-Text-Recognition'
+Supported In MMOCR: N/S
+PaperType:
+  - Survey
+Abstract: 'The history of text can be traced back over thousands of years. Rich
+and precise semantic information carried by text is important in a wide range of
+vision-based application scenarios. Therefore, text recognition in natural
+scenes has been an active research field in computer vision and pattern
+recognition. In recent years, with the rise and development of deep learning,
+numerous methods have shown promising in terms of innovation, practicality,
+and efficiency. This paper aims to (1) summarize the fundamental problems and
+the state-of-the-art associated with scene text recognition; (2) introduce new
+insights and ideas; (3) provide a comprehensive review of publicly available
+resources; (4) point out directions for future work. In summary, this
+literature review attempts to present the entire picture of the field of
+scene text recognition. It provides a comprehensive reference for people
+entering this field, and could be helpful to inspire future research.
+Related resources are available at our Github
+repository: https://github.com/HCIILAB/Scene-Text-Recognition.'
+Bibtex: '@article{chen2021text,
+  title={Text recognition in the wild: A survey},
+  author={Chen, Xiaoxue and Jin, Lianwen and Zhu, Yuanzhi and Luo, Canjie and Wang, Tianwei},
+  journal={ACM Computing Surveys (CSUR)},
+  volume={54},
+  number={2},
+  pages={1--35},
+  year={2021},
+  publisher={ACM New York, NY, USA}
+}'
diff --git a/paper_zoo/textrecog/Towards Accurate Scene Text Recognition with Semantic Reasoning Networks.yaml b/paper_zoo/textrecog/Towards Accurate Scene Text Recognition with Semantic Reasoning Networks.yaml
new file mode 100644
index 000000000..efd6da4bf
--- /dev/null
+++ b/paper_zoo/textrecog/Towards Accurate Scene Text Recognition with Semantic Reasoning Networks.yaml
@@ -0,0 +1,74 @@
+Title: 'Towards Accurate Scene Text Recognition with Semantic Reasoning Networks'
+Abbreviation: SRN
+Tasks:
+  - TextRecog
+Venue: CVPR
+Year: 2020
+Lab/Company:
+  - School of Artificial Intelligence, University of Chinese Academy of Sciences
+  - Department of Computer Vision Technology (VIS), Baidu Inc.
+URL:
+  Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Yu_Towards_Accurate_Scene_Text_Recognition_With_Semantic_Reasoning_Networks_CVPR_2020_paper.html'
+  Arxiv: 'https://arxiv.org/abs/2003.12294'
+Paper Reading URL: N/A
+Code: 'https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/configs/rec/rec_r50_fpn_srn.yml'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Scene text image contains two levels of contents: visual texture and
+semantic information. Although the previous scene text recognition methods have
+made great progress over the past few years, the research on mining semantic
+information to assist text recognition attracts less attention, only RNN-like
+structures are explored to implicitly model semantic information. However, we
+observe that RNN based methods have some obvious shortcomings, such as
+time-dependent decoding manner and one-way serial transmission of semantic
+context, which greatly limit the help of semantic information and the
+computation efficiency. To mitigate these limitations, we propose a novel
+end-to-end trainable framework named semantic reasoning network (SRN) for
+accurate scene text recognition, where a global semantic reasoning module (GSRM)
+is introduced to capture global semantic context through multi-way parallel
+transmission. The state-of-the-art results on 7 public benchmarks, including
+regular text, irregular text and non-Latin long text, verify the effectiveness
+and robustness of the proposed method. In addition, the speed of SRN has
+significant advantages over the RNN based methods, demonstrating its value
+in practical use.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Explicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213168893-7c600e03-c1f0-464a-8236-40ae26fbff89.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 89.6
+      IIIT5K:
+        WAICS: 94.8
+      SVT:
+        WAICS: 91.5
+      IC13:
+        WAICS: 95.5
+      IC15:
+        WAICS: 82.7
+      SVTP:
+        WAICS: 85.1
+      CUTE:
+        WAICS: 87.8
+Bibtex: '@inproceedings{yu2020towards,
+  title={Towards accurate scene text recognition with semantic reasoning networks},
+  author={Yu, Deli and Li, Xuan and Zhang, Chengquan and Liu, Tao and Han, Junyu and Liu, Jingtuo and Ding, Errui},
+  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+  pages={12113--12122},
+  year={2020}
+}'
diff --git a/paper_zoo/textrecog/What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis.yaml b/paper_zoo/textrecog/What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis.yaml
new file mode 100644
index 000000000..9318bc678
--- /dev/null
+++ b/paper_zoo/textrecog/What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis.yaml
@@ -0,0 +1,71 @@
+Title: 'What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis'
+Abbreviation: Baek et al.
+Tasks:
+  - TextRecog
+Venue: ICCV
+Year: 2019
+Lab/Company:
+  - Clova AI Research, NAVER/LINE Corp.
+URL:
+  Venue: 'http://openaccess.thecvf.com/content_ICCV_2019/html/Baek_What_Is_Wrong_With_Scene_Text_Recognition_Model_Comparisons_Dataset_ICCV_2019_paper.html'
+  Arxiv: 'https://arxiv.org/abs/1904.01906'
+Paper Reading URL: N/A
+Code: 'https://github.com/clovaai/deep-text-recognition-benchmark'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Many new proposals for scene text recognition (STR) models have been
+introduced in recent years. While each claim to have pushed the boundary of the
+technology, a holistic and fair comparison has been largely missing in the field
+due to the inconsistent choices of training and evaluation datasets. This paper
+addresses this difficulty with three major contributions. First, we examine the
+inconsistencies of training and evaluation datasets, and the performance gap
+results from inconsistencies. Second, we introduce a unified four-stage STR
+framework that most existing STR models fit into. Using this framework allows
+for the extensive evaluation of previously proposed STR modules and the
+discovery of previously unexplored module combinations. Third, we analyze
+the module-wise contributions to performance in terms of accuracy, speed,
+and memory demand, under one consistent set of training and evaluation datasets.
+Such analyses clean up the hindrance on the current comparisons to understand
+the performance gain of the existing modules. Our code is publicly available.'
+MODELS:
+  Architecture:
+    - Attention
+    - CTC
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/213169752-33203ec5-5602-44f0-8524-4ce77091dda8.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: 35.3
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: 49.6M
+  Experiment:
+    Training DataSets:
+      - MJ
+      - ST
+    Test DataSets:
+      Avg.: 82.1
+      IIIT5K:
+        WAICS: 87.9
+      SVT:
+        WAICS: 87.5
+      IC13:
+        WAICS: 92.3
+      IC15:
+        WAICS: 71.8
+      SVTP:
+        WAICS: 79.2
+      CUTE:
+        WAICS: 74.0
+Bibtex: '@inproceedings{baek2019wrong,
+  title={What is wrong with scene text recognition model comparisons? dataset and model analysis},
+  author={Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk},
+  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
+  pages={4715--4723},
+  year={2019}
+}'