Dataset #9
Comments
Could you drop me an email and tell me what problems you are having with the download?
Thanks a lot~ My email is 644845521@qq.com. Actually, I have not run into any problems yet, but I am pressed to do some experiments, so having the dataset directly would be more helpful. I will try to build the dataset myself when I have more time.
I also have a question about some parameters in your model. As I understand it, some parameters such as t_d (the topic distribution of document D) are obtained from a pre-trained LDA model. I am curious whether this vector is updated during training. In other words, is the vector from the pre-trained LDA just an initial value for training, or a fixed value that will not be changed? Thanks~
@xiyan524 Have you started training on Chinese data? I am stuck and do not know where to begin. As I understand it, the author said that fastText or BERT would work even better, which should refer to the word embeddings. Where is the part of the code where the author feeds in the word embeddings?
@chenbaicheng Sorry, I have not used the model proposed in the paper; I was just interested in the XSum dataset.
@xiyan524 Thanks
"vector from pre-trained LDA is just a initial value for training or a fixed value which will not be changed? ..." Yes, the pre-trained LDA vectors are fixed during training. They vary across documents, and across the words within each document.
@shashiongithub Got it, thanks.
Hello @shashiongithub, I am also having trouble downloading the dataset. After rerunning the script more than 75 times, I still have 11 articles that cannot be downloaded. I would like to make a fair comparison with your results using exactly the same train/test split. To facilitate further research, experimentation, and development with this dataset, could you make it available directly?
Here is the dataset: http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz Please use the train, development, and test ids from GitHub to split it into subsets.
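A minimal sketch of how the extracted archive could be partitioned using those ids. The filenames below are assumptions: it assumes the split file is a JSON object mapping split names to lists of BBC article ids, and that each article sits in its own `<bbcid>.summary` file — adjust both to match what you actually downloaded.

```python
import json
import shutil
from pathlib import Path

# Assumed names -- check them against the repository and your extracted archive.
SPLIT_JSON = "XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json"
SUMMARY_DIR = "xsum-extracted"  # one <bbcid>.summary file per article

def partition(split_map, available_ids):
    """Group the available article ids by split, skipping any id whose
    file failed to download or extract."""
    available = set(available_ids)
    return {split: [i for i in ids if i in available]
            for split, ids in split_map.items()}

def copy_splits(summary_dir, split_json, out_root):
    """Copy each article file into a per-split subdirectory; return counts."""
    split_map = json.loads(Path(split_json).read_text())
    files = {p.stem: p for p in Path(summary_dir).glob("*.summary")}
    grouped = partition(split_map, files)
    for split, ids in grouped.items():
        dest = Path(out_root) / split
        dest.mkdir(parents=True, exist_ok=True)
        for i in ids:
            shutil.copy(files[i], dest / files[i].name)
    return {split: len(ids) for split, ids in grouped.items()}
```

Comparing the returned counts against the official split sizes is a quick way to spot articles that went missing during download.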
I downloaded the tar file above, and it is in a different format than the script expects.
For anyone trying to format the data from the link above, this is what I did to get it into the right format. First, I used a quick script to reformat the data.
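The original script did not survive in this thread; below is a sketch of one way such a reformatting step might look. It assumes the extracted files use XSum's `[XSUM]URL[XSUM]` / `[XSUM]INTRODUCTION[XSUM]` / `[XSUM]RESTBODY[XSUM]` section markers, and the output layout (summary first, then body, in a `.data` file) is an assumption to adapt to whatever the preprocessing scripts expect.

```python
import re
from pathlib import Path

# Assumed input: each file contains sections delimited by [XSUM]URL[XSUM],
# [XSUM]INTRODUCTION[XSUM] (the one-line summary), and [XSUM]RESTBODY[XSUM]
# (the article body). The output layout below is an assumption.
MARKER = re.compile(r"\[XSUM\](URL|INTRODUCTION|RESTBODY)\[XSUM\]")

def parse_summary_file(text):
    """Split raw file text into a {section_name: content} dict."""
    parts = MARKER.split(text)
    # parts[1:] alternates section names and their contents.
    it = iter(parts[1:])
    return {name: content.strip() for name, content in zip(it, it)}

def reformat(src_dir, out_dir):
    """Rewrite each <bbcid>.summary file as a plain-text <bbcid>.data file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.summary"):
        sections = parse_summary_file(path.read_text(encoding="utf-8"))
        (out / f"{path.stem}.data").write_text(
            sections.get("INTRODUCTION", "") + "\n\n" +
            sections.get("RESTBODY", ""), encoding="utf-8")
```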
From here, you can follow the instructions in the dataset README from the relevant section onward. As a side note, I am using a different version of the Stanford CoreNLP Toolkit.
Hi! Thanks for providing the code. I'm wondering which decoder you used for the text files? When I run the same code you provided, I get the following error:
Hi, I can't access the link.
Shay suggested trying this: http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
Hello, thanks a lot for sharing the data.
Hi, I just have one question: what is the total number of instances? I got 237,002 after preprocessing the files downloaded from the bollin.inf.ed.ac.uk link above. Is that the same in your case? The Hugging Face website reports around 226,000 instances.
Hello, I used process-corenlp-xml-data.py to process the bbcid.data.xml files, but I got an error saying some information is missing: /stanfordOutput/bbcid.data.xml. It would be great if you could help me with this issue. Thanks.
If anyone still has problems with:
Thanks for your excellent work.
Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset we are familiar with? I believe it would save time and be more convenient for experiments.
I'd appreciate any help you could give. Thanks~