The official training/validation/test dataset repository for the SOTA? task, SimpleText Task 4 @ CLEF 2024.

What does this repository contain?

This repository hosts the official corpus for SimpleText Task 4 @ CLEF 2024, i.e., the SOTA? Tracking the State-of-the-Art in Scholarly Publications task. The full corpus is released in this repository, organized as follows:

[dataset]/
     |--- [train]/
     |        |--- [article-id-folder]/
     |        |        |--- [article-id].tex
     |        |        |--- annotations.json
     |        |___ ...
     |--- [validation]/
     |        |--- [article-id-folder]/
     |        |        |--- [article-id].tex
     |        |        |--- annotations.json
     |        |___ ...
     |--- [test1-few-shot-papers]/
     |        |--- [article-counter-folder]/
     |        |        |--- [article-id].tei.xml
     |        |___ ...
     |--- [test1-few-shot-annotations]/        # hidden during the competition
     |        |--- [article-counter-folder]/
     |        |        |--- annotations.txt
     |        |        |--- code-link.txt      # optional
     |        |___ ...
     |--- [test2-zero-shot-papers]/
     |        |--- [article-counter-folder]/
     |        |        |--- [article-id].tei.xml
     |        |___ ...
     |--- [test2-zero-shot-annotations]/       # hidden during the competition
              |--- [article-counter-folder]/
              |        |--- annotations.txt
              |        |--- code-link.txt      # optional
              |___ ...

The dataset dump originates from paperswithcode.com.

Each folder in the dump corresponds to a scholarly article originally downloaded in LaTeX format from arXiv.

There are 12,288 and 100 papers in the train and validation sets, respectively. Each annotations.json file contains (task, dataset, metric, score) annotations for papers that report model scores; otherwise it contains the value "unanswerable", marking papers that report no model scores and from which, therefore, no leaderboard can be populated. Models trained on our dataset should first distinguish papers with leaderboards from those without, and then, for the former set, extract the leaderboard tuples as annotations. The train set has 7,936 papers with leaderboard annotations; the remaining 4,352 papers have none and are annotated as "unanswerable". The validation set has 51 papers with and 49 papers without leaderboard annotations.
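
For concreteness, here is a minimal Python sketch of how the train/validation splits might be loaded and partitioned by this criterion. The root path `dataset/` and the handling of the annotations.json payload are assumptions; verify both against the released files.

```python
import json
from pathlib import Path

DATASET_ROOT = Path("dataset")  # assumed unpack location; adjust as needed


def split_papers(split: str, root: Path = DATASET_ROOT):
    """Partition a split ("train" or "validation") into papers with
    leaderboard annotations and papers marked "unanswerable".

    Assumes each annotations.json holds either the string "unanswerable"
    or a JSON structure of (task, dataset, metric, score) tuples; check
    the exact schema against the released files.
    """
    with_leaderboard, without_leaderboard = [], []
    for article_dir in sorted((root / split).iterdir()):
        ann_path = article_dir / "annotations.json"
        if not ann_path.is_file():
            continue  # skip anything that is not an article folder
        annotations = json.loads(ann_path.read_text(encoding="utf-8"))
        if annotations == "unanswerable":
            without_leaderboard.append(article_dir.name)
        else:
            with_leaderboard.append((article_dir.name, annotations))
    return with_leaderboard, without_leaderboard


# Expected counts per the description above:
# train: 7,936 with leaderboards, 4,352 without; validation: 51 and 49.
```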

Below are some detailed statistics on the leaderboard annotations in our dataset, offering a glimpse into the corpus.

Dataset statistics

| Parameter | Train + Validation (counts) |
| --- | --- |
| Unique Tasks | 1,372 |
| Unique Datasets | 4,795 |
| Unique Metrics | 2,782 |
| Unique (Task, Dataset, Metric) triples | 11,977 |
| Avg. (Task, Dataset, Metric) triple occurrences per paper | 6.93 |

Ten most common Tasks, Datasets, and Metrics in the Train+Validation set:

| # | Most Common Task | Frequency | Most Common Dataset | Frequency | Most Common Metric | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | image classification | 2273 | imagenet | 1603 | accuracy | 4383 |
| 2 | atari games | 1448 | coco test-dev | 792 | score | 1515 |
| 3 | node classification | 1113 | human3.6m | 624 | f1 | 1384 |
| 4 | object detection | 1001 | cifar-10 | 585 | psnr | 1144 |
| 5 | video retrieval | 997 | coco minival | 310 | map | 1068 |
| 6 | link prediction | 941 | youtube-vos 2018 | 295 | miou | 862 |
| 7 | semantic segmentation | 901 | cifar-100 | 252 | ssim | 799 |
| 8 | semi-supervised video object segmentation | 890 | msr-vtt-1ka | 247 | top 1 accuracy | 789 |
| 9 | 3d human pose estimation | 889 | fb15k-237 | 244 | 1:1 accuracy | 787 |
| 10 | question answering | 866 | msu super-resolution for video compression | 225 | number of params | 759 |

Ten most common (Task, Dataset, Metric) triples in Train+Validation Set:

| (Task, Dataset, Metric) | Count |
| --- | --- |
| (image classification, imagenet, top 1 accuracy) | 524 |
| (image classification, imagenet, number of params) | 313 |
| (image classification, imagenet, gflops) | 256 |
| (3d human pose estimation, human3.6m, average mpj...) | 197 |
| (image classification, cifar-10, percentage correct) | 128 |
| (action classification, kinetics-400, acc@1) | 108 |
| (object detection, coco test-dev, box map) | 106 |
| (image classification, cifar-100, percentage correct) | 105 |
| (semantic segmentation, ade20k, validation miou) | 92 |
| (neural architecture search, imagenet, top-1 erro...) | 83 |
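
Statistics like those above could be reproduced along the following lines, assuming annotations have been loaded as (task, dataset, metric, score) tuples per paper (e.g., with the `split_papers` sketch earlier); that each annotation unpacks as a 4-tuple is an assumption to verify against the data.

```python
from collections import Counter


def leaderboard_statistics(papers):
    """Compute corpus-level counts from (article_id, annotations) pairs,
    where each annotation is assumed to unpack as (task, dataset, metric,
    score). Mirrors the tables above."""
    tasks, datasets, metrics, triples = Counter(), Counter(), Counter(), Counter()
    triple_counts_per_paper = []
    for _, annotations in papers:
        count = 0
        for task, dataset, metric, _score in annotations:
            tasks[task] += 1
            datasets[dataset] += 1
            metrics[metric] += 1
            triples[(task, dataset, metric)] += 1
            count += 1
        triple_counts_per_paper.append(count)
    print("Unique tasks:   ", len(tasks))
    print("Unique datasets:", len(datasets))
    print("Unique metrics: ", len(metrics))
    print("Unique triples: ", len(triples))
    print("Avg. triple occurrences per paper:",
          sum(triple_counts_per_paper) / len(triple_counts_per_paper))
    for triple, count in triples.most_common(10):
        print(triple, count)
```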

Since each paper is accompanied by an annotations file, this section concludes with statistics on what proportion of the annotation labels, for each of the four tuple elements, can actually be found in the accompanying full text.

- For Tasks, 60.24% of the annotation labels can be found in the accompanying paper full text.
- For Datasets, 45.48% of the annotation labels can be found in the accompanying paper full text.
- For Metrics, 42.69% of the annotation labels can be found in the accompanying paper full text.
- For Scores, 58.86% of the annotations can be found in the accompanying paper full text.
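
One plausible way to compute such coverage figures is a case-insensitive substring match of each label against the paper's LaTeX source, as sketched below. The exact matching rule behind the percentages above is not specified here, and the assumption that the article folder name matches the article id should be checked, so treat this as an approximation.

```python
from pathlib import Path

# Position of each element in an assumed (task, dataset, metric, score) tuple.
FIELDS = {"task": 0, "dataset": 1, "metric": 2, "score": 3}


def fulltext_coverage(papers, split: str, field: str,
                      root: Path = Path("dataset")):
    """Fraction of annotation labels of the given field that occur verbatim
    (case-insensitively) in the paper's LaTeX full text. `papers` is a list
    of (article_id, annotations) pairs as in the earlier sketch."""
    idx = FIELDS[field]
    found = total = 0
    for article_id, annotations in papers:
        # Assumes the folder name equals the article id, per the layout above.
        tex_path = root / split / article_id / f"{article_id}.tex"
        fulltext = tex_path.read_text(encoding="utf-8", errors="ignore").lower()
        for tup in annotations:
            total += 1
            found += str(tup[idx]).lower() in fulltext
    return found / total if total else 0.0
```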

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
