[Paper List-5] Add 10 textrecog papers #1684

Open · wants to merge 2 commits into base: dev-1.x
@@ -0,0 +1,74 @@
Title: 'ASTER: An Attentional Scene Text Recognizer with Flexible Rectification'
Abbreviation: ASTER
Tasks:
- TextRecog
Venue: TPAMI
Year: 2018
Lab/Company:
- Huazhong University of Science and Technology, Wuhan, China
URL:
Venue: 'https://ieeexplore.ieee.org/abstract/document/8395027/'
Arxiv: N/A
Paper Reading URL: N/A
Code: 'https://github.com/ayumiymk/aster.pytorch'
Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/dev-1.x/configs/textrecog/aster'
PaperType:
- Algorithm
Abstract: 'A challenging aspect of scene text recognition is to handle text with
distortions or irregular layout. In particular, perspective text and curved
text are common in natural scenes and are difficult to recognize. In this work,
we introduce ASTER, an end-to-end neural network model that comprises a
rectification network and a recognition network. The rectification network
adaptively transforms an input image into a new one, rectifying the text in it.
It is powered by a flexible Thin-Plate Spline transformation which handles a
variety of text irregularities and is trained without human annotations. The
recognition network is an attentional sequence-to-sequence model that predicts
a character sequence directly from the rectified image. The whole model is
trained end to end, requiring only images and their groundtruth text. Through
extensive experiments, we verify the effectiveness of the rectification and
demonstrate the state-of-the-art recognition performance of ASTER. Furthermore,
we demonstrate that ASTER is a powerful component in end-to-end recognition
systems, for its ability to enhance the detector.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213168893-7c600e03-c1f0-464a-8236-40ae26fbff89.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 86.0
IIIT5K:
WAICS: 93.4
SVT:
WAICS: 93.6
IC13:
WAICS: 94.5
IC15:
WAICS: 76.1
SVTP:
WAICS: 78.5
CUTE:
WAICS: 79.5
Bibtex: '@article{shi2018aster,
title={Aster: An attentional scene text recognizer with flexible rectification},
author={Shi, Baoguang and Yang, Mingkun and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={41},
number={9},
pages={2035--2048},
year={2018},
publisher={IEEE}
}'
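
The record above is a natural place for a concrete picture of the two-stage design the abstract describes: a rectification network that resamples the input image, followed by an attentional sequence-to-sequence recognizer. Below is a minimal PyTorch sketch of the rectification stage only; it substitutes an affine grid for ASTER's Thin-Plate Spline transformation to keep the code short, and every layer size is an assumption for illustration, not the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Rectifier(nn.Module):
    # Predicts a sampling grid from the raw image and resamples the image
    # with it. ASTER drives a Thin-Plate Spline grid from predicted control
    # points; the affine grid below is a simplified stand-in, not the
    # paper's method.
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6))
        # start from the identity transform, as is standard for STN-style nets
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                              # x: (B, 1, H, W)
        theta = self.loc(x).view(-1, 2, 3)             # per-image transform
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

The rectified output feeds the recognition network unchanged, which is what lets the whole model train end to end from images and ground-truth text alone.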
@@ -0,0 +1,68 @@
Title: 'Aggregation Cross-Entropy for Sequence Recognition'
Abbreviation: ACE
Tasks:
- TextRecog
Venue: CVPR
Year: 2019
Lab/Company:
- South China University of Technology
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2019/html/Xie_Aggregation_Cross-Entropy_for_Sequence_Recognition_CVPR_2019_paper.html'
Arxiv: 'https://arxiv.org/abs/1904.08364'
Paper Reading URL: N/A
Code: 'https://github.com/summerlvsong/Aggregation-CrossEntropy'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'In this paper, we propose a novel method, aggregation cross-entropy
(ACE), for sequence recognition from a brand new perspective. The ACE loss
function exhibits competitive performance to CTC and the attention mechanism,
with much quicker implementation (as it involves only four fundamental
formulas), faster inference/back-propagation (approximately O(1) in parallel),
less storage requirement (no parameter and negligible runtime memory), and
convenient employment (by replacing CTC with ACE). Furthermore, the proposed
ACE loss function exhibits two noteworthy properties: (1) it can be directly
applied for 2D prediction by flattening the 2D prediction into 1D prediction
as the input and (2) it requires only characters and their numbers in the
sequence annotation for supervision, which allows it to advance beyond sequence
recognition, e.g., counting problem.'
MODELS:
Architecture:
- CTC
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213173571-fdf09df3-9769-4d52-bf44-6f58c9b5453d.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 79.4
IIIT5K:
WAICS: 82.3
SVT:
WAICS: 82.6
IC13:
WAICS: 89.7
IC15:
WAICS: 68.9
SVTP:
WAICS: 70.1
CUTE:
WAICS: 82.6
Bibtex: '@inproceedings{xie2019aggregation,
title={Aggregation cross-entropy for sequence recognition},
author={Xie, Zecheng and Huang, Yaoxiong and Zhu, Yuanzhi and Jin, Lianwen and Liu, Yuliang and Xie, Lele},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={6538--6547},
year={2019}
}'
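
Since the abstract advertises the ACE loss as involving "only four fundamental formulas", a sketch makes the claim concrete: sum the per-timestep class probabilities over time, normalize by the sequence length T, normalize the label's character counts the same way (treating unused timesteps as blanks), and take a cross-entropy between the two distributions. The PyTorch sketch below follows that reading; the tensor shapes and the blank-at-index-0 convention are my assumptions, not the paper's code.

import torch

def ace_loss(probs, char_counts):
    # probs: (B, T, C) per-timestep class probabilities, class 0 = blank.
    # char_counts: (B, C) occurrences of each class in the label; index 0
    # is ignored on input and overwritten with the implied blank count.
    B, T, C = probs.shape
    counts = char_counts.float().clone()
    counts[:, 0] = T - counts[:, 1:].sum(dim=1)   # leftover timesteps are blanks
    agg = probs.sum(dim=1) / T                    # aggregate, then normalize
    return -(counts / T * torch.log(agg + 1e-10)).sum(dim=1).mean()

Because only the counts are compared, the same call works for 2D predictions after flattening them to (B, H*W, C), which is exactly the property the abstract highlights.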
@@ -0,0 +1,76 @@
Title: 'An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition'
Abbreviation: CRNN
Tasks:
- TextRecog
Venue: TPAMI
Year: 2016
Lab/Company:
- School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China
URL:
Venue: 'https://ieeexplore.ieee.org/abstract/document/7801919/'
Arxiv: 'https://arxiv.org/abs/1507.05717'
Paper Reading URL: N/A
Code: 'https://github.com/bgshih/crnn'
Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/crnn'
PaperType:
- Algorithm
Abstract: 'Image-based sequence recognition has been a longstanding research
topic in computer vision. In this paper, we investigate the problem of scene
text recognition, which is among the most important and challenging tasks in
image-based sequence recognition. A novel neural network architecture, which
integrates feature extraction, sequence modeling and transcription into a unified
framework, is proposed. Compared with previous systems for scene text recognition,
the proposed architecture possesses four distinctive properties: (1) It is
end-to-end trainable, in contrast to most of the existing algorithms whose
components are separately trained and tuned. (2) It naturally handles sequences
in arbitrary lengths, involving no character segmentation or horizontal scale
normalization. (3) It is not confined to any predefined lexicon and achieves
remarkable performances in both lexicon-free and lexicon-based scene text
recognition tasks. (4) It generates an effective yet much smaller model,
which is more practical for real-world application scenarios. The experiments
on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR
datasets, demonstrate the superiority of the proposed algorithm over the prior
arts. Moreover, the proposed algorithm performs well in the task of image-based
music score recognition, which evidently verifies the generality of it.'
MODELS:
Architecture:
- CTC
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213174579-e89dbd14-8ace-4f16-9cb6-4b882dbd4e27.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: 8.3M
Experiment:
Training DataSets:
- MJ
Test DataSets:
Avg.: 81.9
IIIT5K:
WAICS: 78.2
SVT:
WAICS: 80.8
IC13:
WAICS: 86.7
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@article{shi2016end,
title={An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition},
author={Shi, Baoguang and Bai, Xiang and Yao, Cong},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={39},
number={11},
pages={2298--2304},
year={2016},
publisher={IEEE}
}'
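
The "unified framework" in the abstract has three stages: convolutional feature extraction, recurrent sequence modeling over the resulting frame sequence, and CTC transcription. A minimal PyTorch sketch of that pipeline follows; the layer sizes are illustrative and much shallower than the paper's actual configuration.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, height=32):
        super().__init__()
        self.cnn = nn.Sequential(                      # feature extraction
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2))
        self.rnn = nn.LSTM(128 * (height // 4), 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)          # includes the CTC blank

    def forward(self, x):                              # x: (B, 1, height, W)
        f = self.cnn(x)                                # (B, 128, height/4, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)           # width becomes the sequence axis
        h, _ = self.rnn(f)                             # sequence modeling
        return self.fc(h).log_softmax(-1)              # per-frame log-probs

Training pairs this output with torch.nn.CTCLoss (after transposing to time-major order), which is what makes the model end-to-end trainable without character-level segmentation.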
@@ -0,0 +1,74 @@
Title: 'Attention after Attention: Reading Text in the Wild with Cross Attention'
Abbreviation: Huang et al.
Tasks:
- TextRecog
Venue: ICDAR
Year: 2019
Lab/Company:
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China
URL:
Venue: 'https://ieeexplore.ieee.org/abstract/document/8977967/'
Arxiv: N/A
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recent methods mostly regarded scene text recognition as a
sequence-to-sequence problem. These methods roughly transform the image into a
feature sequence and use the algorithms for sequence-to-sequence problem like
CTC or attention to decode the characters. However, text in images is distributed
in a two-dimensional (2D) space and roughly converting the features of text
into a feature sequence may introduce extra noise, especially if the text is
irregular. In this paper, we propose a novel framework named cross attention
network, which learns to attend to local features of a 2D feature map
corresponding to individual characters. The network contains two 1D attention
networks, which operate harmoniously in two directions. Thus, one of the
attention modules vertically attends to the features corresponding to the whole
text of 2D features and the other horizontal module selects the local features
to decode individual characters. Extensive experiments are performed on various
regular benchmarks, including SVT, ICDAR2003, ICDAR2013, and IIIT5K-Words,
which demonstrate that the proposed model either outperforms or is comparable
to all previous methods. Moreover, the model is evaluated on irregular benchmarks
including SVTPerspective, CUTE80 and ICDAR 2015. The performance on irregular
benchmarks shows the robustness of our model.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213172663-4c3c5ea1-84b8-40e5-8453-e389c7ee5595.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 86.4
IIIT5K:
WAICS: 94.5
SVT:
WAICS: 90.0
IC13:
WAICS: 94.2
IC15:
WAICS: 75.3
SVTP:
WAICS: 79.8
CUTE:
WAICS: 84.7
Bibtex: '@inproceedings{huang2019attention,
title={Attention after attention: Reading text in the wild with cross attention},
booktitle={Proceedings of the International Conference on Document Analysis and Recognition (ICDAR)},
year={2019}
}'
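
The abstract's core idea is that two 1D attention modules working in different directions can read a 2D feature map without flattening it up front: one attends vertically to pick out the text region, the other attends horizontally to select features for individual characters. Below is a minimal PyTorch sketch of one such decoding step; since no code is linked above, all names and sizes are my own illustration rather than the authors' design.

import torch
import torch.nn as nn

class CrossAttention1D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.v_score = nn.Conv2d(channels, 1, 1)       # vertical attention scores
        self.h_score = nn.Conv1d(channels, 1, 1)       # horizontal attention scores

    def forward(self, fmap):                           # fmap: (B, C, H, W)
        a_v = self.v_score(fmap).softmax(dim=2)        # attend over the height axis
        seq = (fmap * a_v).sum(dim=2)                  # (B, C, W): text collapsed to 1D
        a_h = self.h_score(seq).softmax(dim=2)         # attend over the width axis
        return (seq * a_h).sum(dim=2)                  # (B, C): one character's glimpse

A real decoder would condition the horizontal scores on its hidden state so each step selects a different character; the sketch only shows how the two 1D attentions compose.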
@@ -0,0 +1,75 @@
Title: 'Decoupled Attention Network for Text Recognition'
Abbreviation: DAN
Tasks:
- TextRecog
Venue: AAAI
Year: 2020
Lab/Company:
- School of Electronic and Information Engineering, South China University of Technology
- Lenovo Research
URL:
Venue: 'https://ojs.aaai.org/index.php/AAAI/article/view/6903'
Arxiv: 'https://arxiv.org/abs/1912.10205'
Paper Reading URL: N/A
Code: 'https://github.com/Wang-Tianwei/Decoupled-attention-network'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Text recognition has attracted considerable research interests because
of its various applications. The cutting-edge text recognition methods are
based on attention mechanisms. However, most attention methods suffer from a
serious alignment problem due to their recurrent alignment operation, where
the alignment relies on historical decoding results. To remedy this
issue, we propose a decoupled attention network (DAN), which decouples the
alignment operation from using historical decoding results. DAN is an effective,
flexible and robust end-to-end text recognizer, which consists of three
components: 1) a feature encoder that extracts visual features from the input
image; 2) a convolutional alignment module that performs the alignment
operation based on visual features from the encoder; and 3) a decoupled text
decoder that makes final prediction by jointly using the feature map and
attention maps. Experimental results show that DAN achieves state-of-the-art
performance on multiple text recognition tasks, including offline handwritten
text recognition and regular/irregular scene text recognition. Codes will be
released.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/213171943-35e9c57c-fdce-4866-91c4-a47dad9a7b3b.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- MJ
- ST
Test DataSets:
Avg.: 86.0
IIIT5K:
WAICS: 94.3
SVT:
WAICS: 89.2
IC13:
WAICS: 93.9
IC15:
WAICS: 74.5
SVTP:
WAICS: 80.0
CUTE:
WAICS: 84.4
Bibtex: '@inproceedings{wang2020decoupled,
title={Decoupled attention network for text recognition},
author={Wang, Tianwei and Zhu, Yuanzhi and Jin, Lianwen and Luo, Canjie and Chen, Xiaoxue and Wu, Yaqiang and Wang, Qianying and Cai, Mingxiang},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
volume={34},
number={07},
pages={12216--12224},
year={2020}
}'
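
DAN's contribution, per the abstract, is the decoupling itself: attention maps come from a convolutional alignment module driven by visual features alone, so alignment never depends on historical decoding results. The sketch below illustrates that structure in PyTorch; the layer choices and the max_len convention are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class ConvAlignment(nn.Module):
    # Predicts one attention map per decoding step directly from the
    # encoder's feature map, with no feedback from previous predictions.
    def __init__(self, channels, max_len):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, max_len, 1))

    def forward(self, fmap):                           # fmap: (B, C, H, W)
        maps = self.conv(fmap)                         # (B, max_len, H, W)
        B, T, H, W = maps.shape
        return maps.view(B, T, -1).softmax(-1).view(B, T, H, W)

def decoupled_decode(fmap, attn_maps, classifier):
    # one glimpse per step, computed jointly from features and attention maps
    glimpses = torch.einsum('bchw,bthw->btc', fmap, attn_maps)
    return classifier(glimpses)                        # (B, max_len, num_classes)

Because the alignment is fixed before decoding starts, a misread character cannot corrupt the attention for later steps, which is the failure mode the abstract calls the alignment problem.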