The dataset is based on Russian Web Tables (RWT), a corpus of Russian-language tables from Wikipedia.
Only relational tables whose headers match a selected set of 170 DBpedia semantic types were chosen from RWT.
The dataset contains 1 441 349 columns and has a fixed train / test split.

Split | Columns | Tables | Avg. columns per table |
---|---|---|---|
Test | 115 448 | 55 080 | 2.096 |
Train | 1 325 901 | 633 426 | 2.093 |
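
As a quick consistency check, the figures in the table can be recomputed from the raw counts. The sketch below uses only the numbers stated above; the variable names are illustrative and not part of the repository.

```python
# Counts copied from the split table above.
splits = {
    "test": {"columns": 115_448, "tables": 55_080},
    "train": {"columns": 1_325_901, "tables": 633_426},
}

# The per-split column counts should add up to the stated dataset size.
total_columns = sum(s["columns"] for s in splits.values())

# Average number of columns per table, rounded as in the table.
avg_columns = {
    name: round(s["columns"] / s["tables"], 3) for name, s in splits.items()
}

# Fraction of all columns that belongs to the test split.
test_share = splits["test"]["columns"] / total_columns

print(total_columns)
print(avg_columns)
print(f"test share: {test_share:.1%}")
```

The total matches the 1 441 349 columns stated above, and the test split holds roughly 8% of all columns.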

Train split, 10 most frequent column sizes:

Column size | Occurrences |
---|---|
1 | 257890 |
2 | 172414 |
3 | 124635 |
4 | 54886 |
5 | 18532 |
6 | 3404 |
7 | 733 |
8 | 254 |
9 | 234 |
18 | 221 |

Train split, 10 least frequent column sizes:

Column size | Occurrences |
---|---|
19 | 6 |
40 | 6 |
16 | 5 |
38 | 5 |
29 | 4 |
20 | 4 |
21 | 4 |
37 | 2 |
39 | 2 |
17 | 2 |

Train split, 10 most frequent labels:

Label | Occurrences |
---|---|
год | 230016 |
название | 170812 |
место | 103986 |
дата | 97228 |
команда | 75032 |
результат | 52730 |
примечание | 48635 |
актер | 38959 |
страна | 36754 |
турнир | 33175 |

Train split, 10 least frequent labels:

Label | Occurrences |
---|---|
континент | 92 |
роман | 89 |
закон | 89 |
борец | 88 |
колледж | 87 |
музей | 86 |
фирма | 85 |
дорога | 83 |
префектура | 83 |
цитата | 76 |
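
The label counts point to a strongly imbalanced label distribution. The sketch below quantifies this from the figures above. It assumes that the two preceding label tables describe the train split, that every column carries exactly one label, and that `цитата` is the rarest label; none of these is stated explicitly in the source.

```python
# Train column count from the split table.
TRAIN_COLUMNS = 1_325_901

# Ten most frequent labels, copied from the table above.
top_labels = {
    "год": 230016,
    "название": 170812,
    "место": 103986,
    "дата": 97228,
    "команда": 75032,
    "результат": 52730,
    "примечание": 48635,
    "актер": 38959,
    "страна": 36754,
    "турнир": 33175,
}

# Assumption: the last row of the least-frequent table is the rarest label.
rarest_label_count = 76  # "цитата"

# Share of train columns covered by the ten most frequent labels.
top10_share = sum(top_labels.values()) / TRAIN_COLUMNS

# Ratio between the most frequent and the rarest label.
imbalance_ratio = max(top_labels.values()) / rarest_label_count

print(f"top-10 labels cover {top10_share:.1%} of train columns")
print(f"most frequent / rarest label: {imbalance_ratio:.0f}x")
```

Under these assumptions, the ten most frequent of the 170 labels cover about two thirds of the train columns, so metrics that weight all labels equally (for example macro-averaged F1) will behave very differently from accuracy.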

Test split, 10 most frequent column sizes:

Column size | Occurrences |
---|---|
1 | 22491 |
2 | 14923 |
3 | 10798 |
4 | 4801 |
5 | 1614 |
6 | 299 |
7 | 69 |
18 | 21 |
8 | 19 |
9 | 18 |

Test split, 10 least frequent column sizes:

Column size | Occurrences |
---|---|
13 | 3 |
36 | 2 |
20 | 1 |
16 | 1 |
21 | 1 |
14 | 1 |
39 | 1 |
37 | 1 |
38 | 1 |
11 | 1 |

Test split, 10 most frequent labels:

Label | Occurrences |
---|---|
год | 19854 |
название | 14748 |
место | 9004 |
дата | 8408 |
команда | 6653 |
результат | 4653 |
примечание | 4203 |
актер | 3435 |
страна | 3217 |
турнир | 2911 |

Test split, 10 least frequent labels:

Label | Occurrences |
---|---|
цитата | 7 |
дорога | 6 |
статья | 6 |
фирма | 6 |
сообщество | 5 |
колледж | 5 |
борец | 5 |
музей | 4 |
банк | 4 |
камера | 4 |
Make sure your PC satisfies these requirements:
- Download and decompress ru-wiki-tables-datset into `./dataset/` directory.
- Run `make` command from `./dataset/collecting/` directory to compile collecting files.
- Run `./dataset/collecting/collect_columns_from_dataset` to collect column headers from dataset. Output will be in `./dataset/collecting/columns_headers/`.
- Run all cells in `./dataset/research/research.ipynb`.
- Run all cells in `./dataset/labelling/labelling.ipynb`.
- Run `./dataset/collecting/collect_columns_data` to collect column data from dataset. Output will be in `./dataset/collecting/columns_data/`.
- Run all cells in `./dataset/cta_dataset/create_cta_dataset.ipynb`. Output train/test splits will be in `./dataset/cta_dataset/train[test]` directories.
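
The steps above, apart from the manual download, could be chained in a small driver script. The following is a minimal sketch and not part of the repository. The helper names `build_pipeline_commands` and `run_pipeline` are hypothetical. It assumes a POSIX shell environment, that GNU `make` and Jupyter are installed, that the script is started from the repository root, and that the notebooks can be executed non-interactively with `jupyter nbconvert --execute`.

```python
import subprocess
from pathlib import Path


def build_pipeline_commands(root="."):
    """Return the build commands in the order given in the list above."""
    dataset = Path(root) / "dataset"
    collecting = dataset / "collecting"

    def run_notebook(path):
        # Assumption: executing the notebook in place is equivalent to
        # "Run all cells" in the Jupyter UI.
        return ["jupyter", "nbconvert", "--to", "notebook",
                "--execute", "--inplace", str(path)]

    return [
        # Compile the collecting tools (same as running make in that directory).
        ["make", "-C", str(collecting)],
        # Collect column headers.
        [str(collecting / "collect_columns_from_dataset")],
        run_notebook(dataset / "research" / "research.ipynb"),
        run_notebook(dataset / "labelling" / "labelling.ipynb"),
        # Collect column data.
        [str(collecting / "collect_columns_data")],
        run_notebook(dataset / "cta_dataset" / "create_cta_dataset.ipynb"),
    ]


def run_pipeline(root="."):
    """Run every step in order and stop at the first failing command."""
    for cmd in build_pipeline_commands(root):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Call run_pipeline() after the dataset has been decompressed into ./dataset/.
```

Keeping the command list separate from its execution makes it easy to print or inspect the steps before running them, which is useful because some steps may take a long time on the full corpus.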