This repository hosts the official corpus for SimpleText Task 4 @ CLEF'24, i.e. the SOTA? Tracking the State-of-the-Art in Scholarly Publications task. The full corpus is released in the dataset
repository, organized as follows:
```
[dataset]/
|--- [train]/
|    |--- [article-id-folder]/
|    |    |--- [article-id].tex
|    |    |--- annotations.json
|    |___ ...
|--- [validation]/
|    |--- [article-id-folder]/
|    |    |--- [article-id].tex
|    |    |--- annotations.json
|    |___ ...
|--- [test1-few-shot-papers]/
|    |--- [article-counter-folder]/
|    |    |--- [article-id].tei.xml
|    |___ ...
|--- [test1-few-shot-annotations]/      # hidden during the competition
|    |--- [article-counter-folder]/
|    |    |--- annotations.txt
|    |    |--- code-link.txt            # optional
|    |___ ...
|--- [test2-zero-shot-papers]/
|    |--- [article-counter-folder]/
|    |    |--- [article-id].tei.xml
|    |___ ...
|--- [test2-zero-shot-annotations]/     # hidden during the competition
|    |--- [article-counter-folder]/
|    |    |--- annotations.txt
|    |    |--- code-link.txt            # optional
|    |___ ...
```
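For orientation, here is a minimal Python sketch of walking one of the train/validation splits laid out above. The root path `dataset` and the assumption of a single `.tex` file per article folder are illustrative, not guaranteed by the corpus:

```python
from pathlib import Path

# Assumed local path to the released dump; adjust to where you unpacked it.
DATASET_ROOT = Path("dataset")

def iter_split(split: str, root: Path = DATASET_ROOT):
    """Yield (article_id, tex_path, annotations_path) for each paper in a split."""
    for article_dir in sorted((root / split).iterdir()):
        if not article_dir.is_dir():
            continue
        tex_files = sorted(article_dir.glob("*.tex"))
        annotations = article_dir / "annotations.json"
        if tex_files and annotations.exists():
            yield article_dir.name, tex_files[0], annotations

# Example: show the first few train papers.
for article_id, tex_path, ann_path in list(iter_split("train"))[:3]:
    print(article_id, tex_path.name, ann_path.name)
```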
The dataset dump originates from paperswithcode.com.
Each folder in the respective dump corresponds to a scholarly article originally downloaded in LaTeX format from arXiv.
There are 12,288 papers in the train set and 100 papers in the validation set. Each annotations.json file contains (task, dataset, metric, score) annotations if the paper reports model scores; otherwise it contains the value "unanswerable", marking papers that report no model scores and from which, therefore, no leaderboard can be populated. Models trained on our dataset should, in a first step, distinguish papers with leaderboards from those without, and then, for the former set of papers, extract their leaderboard tuples as annotations. The train set has 7,936 papers with leaderboard annotations and 4,352 papers without, annotated as "unanswerable"; the validation set has 51 papers with and 49 papers without leaderboard annotations.
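The sketch below illustrates this two-way split, assuming annotations.json holds either the string "unanswerable" or a JSON structure of (task, dataset, metric, score) tuples; the exact schema should be checked against the released files:

```python
import json
from pathlib import Path

def load_annotations(ann_path: Path):
    """Return None for 'unanswerable' papers, otherwise the parsed annotations.

    Assumption: annotations.json holds either the string "unanswerable" or a
    JSON structure of (task, dataset, metric, score) tuples.
    """
    data = json.loads(ann_path.read_text(encoding="utf-8"))
    if data == "unanswerable" or data == ["unanswerable"]:
        return None
    return data

# Example: split the train papers into the two groups described above.
with_leaderboard, unanswerable = [], []
for article_dir in sorted(Path("dataset/train").iterdir()):  # "dataset" path is an assumption
    ann_path = article_dir / "annotations.json"
    if not article_dir.is_dir() or not ann_path.exists():
        continue
    target = with_leaderboard if load_annotations(ann_path) is not None else unanswerable
    target.append(article_dir.name)
print(len(with_leaderboard), "papers with leaderboards,", len(unanswerable), "without")
```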
Below are some detailed statistics on the leaderboard annotations in our dataset, offering a glimpse into the corpus.
| Parameter | train+validation (counts) |
|---|---|
| Unique Tasks | 1,372 |
| Unique Datasets | 4,795 |
| Unique Metrics | 2,782 |
| Unique (Task, Dataset, Metric) triples | 11,977 |
| Avg. (Task, Dataset, Metric) triple occurrences per paper | 6.93 |
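For reference, a minimal sketch of how counts of this kind could be recomputed from the annotation files, under the assumption (to be verified against the released files) that each entry in annotations.json is a (task, dataset, metric, score) sequence:

```python
import json
from pathlib import Path

tasks, datasets, metrics, triples = set(), set(), set(), set()
triple_occurrences, leaderboard_papers = 0, 0

for split in ("train", "validation"):
    for article_dir in sorted(Path("dataset", split).iterdir()):  # "dataset" path is an assumption
        ann_path = article_dir / "annotations.json"
        if not article_dir.is_dir() or not ann_path.exists():
            continue
        data = json.loads(ann_path.read_text(encoding="utf-8"))
        if data == "unanswerable" or data == ["unanswerable"]:
            continue  # papers without leaderboards contribute no tuples
        leaderboard_papers += 1
        for task, dataset, metric, _score in data:  # assumed per-entry layout
            tasks.add(task)
            datasets.add(dataset)
            metrics.add(metric)
            triples.add((task, dataset, metric))
            triple_occurrences += 1

print("Unique tasks:", len(tasks))
print("Unique datasets:", len(datasets))
print("Unique metrics:", len(metrics))
print("Unique (task, dataset, metric) triples:", len(triples))
print("Avg. triple occurrences per paper:", triple_occurrences / max(leaderboard_papers, 1))
```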
Ten most common Tasks, Datasets, and Metrics in the Train+Validation set:
| # | Most Common Task | Frequency | Most Common Dataset | Frequency | Most Common Metric | Frequency |
|---|---|---|---|---|---|---|
| 1 | image classification | 2273 | imagenet | 1603 | accuracy | 4383 |
| 2 | atari games | 1448 | coco test-dev | 792 | score | 1515 |
| 3 | node classification | 1113 | human3.6m | 624 | f1 | 1384 |
| 4 | object detection | 1001 | cifar-10 | 585 | psnr | 1144 |
| 5 | video retrieval | 997 | coco minival | 310 | map | 1068 |
| 6 | link prediction | 941 | youtube-vos 2018 | 295 | miou | 862 |
| 7 | semantic segmentation | 901 | cifar-100 | 252 | ssim | 799 |
| 8 | semi-supervised video object segmentation | 890 | msr-vtt-1ka | 247 | top 1 accuracy | 789 |
| 9 | 3d human pose estimation | 889 | fb15k-237 | 244 | 1:1 accuracy | 787 |
| 10 | question answering | 866 | msu super-resolution for video compression | 225 | number of params | 759 |
Ten most common (Task, Dataset, Metric) triples in Train+Validation Set:
| (Task, Dataset, Metric) | Count |
|---|---|
| (image classification, imagenet, top 1 accuracy) | 524 |
| (image classification, imagenet, number of params) | 313 |
| (image classification, imagenet, gflops) | 256 |
| (3d human pose estimation, human3.6m, average mpj...) | 197 |
| (image classification, cifar-10, percentage correct) | 128 |
| (action classification, kinetics-400, acc@1) | 108 |
| (object detection, coco test-dev, box map) | 106 |
| (image classification, cifar-100, percentage correct) | 105 |
| (semantic segmentation, ade20k, validation miou) | 92 |
| (neural architecture search, imagenet, top-1 erro...) | 83 |
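Rankings like the two tables above can be derived with a frequency count over the annotated tuples; the sketch below uses `collections.Counter` on hypothetical toy data, with the per-entry (task, dataset, metric, score) layout assumed as before:

```python
from collections import Counter

def most_common_triples(annotations, k=10):
    """Rank (task, dataset, metric) triples by how often they are annotated."""
    counts = Counter((task, dataset, metric) for task, dataset, metric, _score in annotations)
    return counts.most_common(k)

# Example with toy data (not real corpus entries):
toy = [
    ("image classification", "imagenet", "top 1 accuracy", "81.3"),
    ("image classification", "imagenet", "top 1 accuracy", "84.2"),
    ("object detection", "coco test-dev", "box map", "52.1"),
]
print(most_common_triples(toy, k=2))
```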
Since each paper is accompanied by an annotations file, this section concludes with, for each of the four elements in the tuple, the proportion of annotated labels that can actually be found in the accompanying full-text (a rough matching sketch follows the list below):
- for Tasks, 60.24% of the annotation labels can be found in the accompanying paper full-text.
- for Datasets, 45.48% of the annotation labels can be found in the accompanying paper full-text.
- for Metrics, 42.69% of the annotation labels can be found in the accompanying paper full-text.
- for Scores, 58.86% of the annotations can be found in the accompanying paper full-text.
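As an illustration of how such proportions can be measured, the sketch below performs a case-insensitive substring check of each annotated label against a paper's LaTeX source; this is an assumed approximation, not necessarily the exact matching procedure behind the numbers above:

```python
from pathlib import Path

def label_in_fulltext(label: str, fulltext: str) -> bool:
    """Case-insensitive substring test of an annotated label against the paper text."""
    return label.lower() in fulltext.lower()

def proportion_found(labels, tex_path: Path) -> float:
    """Fraction of the given annotated labels that appear verbatim in a paper's LaTeX source."""
    text = tex_path.read_text(encoding="utf-8", errors="ignore")
    return sum(label_in_fulltext(label, text) for label in labels) / max(len(labels), 1)

# Example usage (paths and labels are placeholders):
# print(proportion_found(["image classification"], Path("dataset/train/1234/1234.tex")))
```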
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.