Dataset #9
Comments
Could you drop me an email and tell me what problems you are having with the download?
Thanks a lot~ My email is 644845521@qq.com. Actually, I have not run into any problems yet, but I am pressed to do some experiments, so having the dataset directly would be more helpful. I will try to build the dataset myself when I have more time.
I also have a question about some parameters in your model. As I understand it, some parameters such as t_d (the topic distribution of document D) are obtained from a pre-trained LDA model. I am curious whether this vector is updated during training. In other words, is the vector from the pre-trained LDA just an initial value for training, or a fixed value that will not be changed? Thanks~
@xiyan524 Have you started training on Chinese data? I am stuck and do not know where to begin. As I understand it, the author said that fastText or BERT would work even better, which should refer to the word embeddings. Where is the part of the code where the author feeds in the word embeddings?
@chenbaicheng Sorry, I have not used the model proposed in the paper; I was just interested in the XSum dataset.
@xiyan524 Thanks
"vector from pre-trained LDA is just a initial value for training or a fixed value which will not be changed? ..." Yes, the pre-trained LDA vectors are fixed during training. They vary across documents, and across the words within each document.
@shashiongithub Got it, thanks.
Hello @shashiongithub, I am also having trouble downloading the dataset. After rerunning the script more than 75 times, I still have 11 articles that cannot be downloaded. I would like to make a fair comparison with your results using exactly the same train/test split. To facilitate further research, experimentation, and development with this dataset, could you make it available directly?
Here is the dataset: http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz Please use the train, development, and test ids from GitHub to split it into subsets.
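A minimal sketch of how the extracted archive could be partitioned using those ids. The filenames below are assumptions: it assumes the split file is a JSON object mapping split names to lists of BBC article ids, and that each article sits in its own `<bbcid>.summary` file — adjust both to match what you actually downloaded.

```python
import json
import shutil
from pathlib import Path

# Assumed names -- check them against the repository and your extracted archive.
SPLIT_JSON = "XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json"
SUMMARY_DIR = "xsum-extracted"  # one <bbcid>.summary file per article

def partition(split_map, available_ids):
    """Group the available article ids by split, skipping any id whose
    file failed to download or extract."""
    available = set(available_ids)
    return {split: [i for i in ids if i in available]
            for split, ids in split_map.items()}

def copy_splits(summary_dir, split_json, out_root):
    """Copy each article file into a per-split subdirectory; return counts."""
    split_map = json.loads(Path(split_json).read_text())
    files = {p.stem: p for p in Path(summary_dir).glob("*.summary")}
    grouped = partition(split_map, files)
    for split, ids in grouped.items():
        dest = Path(out_root) / split
        dest.mkdir(parents=True, exist_ok=True)
        for i in ids:
            shutil.copy(files[i], dest / files[i].name)
    return {split: len(ids) for split, ids in grouped.items()}
```

Comparing the returned counts against the official split sizes is a quick way to spot articles that went missing during download.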
I downloaded the tar file above, and it is in a different format than the script expects.
For anyone trying to format the data from the link above, this is what I did to get it into the right format. First, I used a quick script to reformat the data.
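The original script did not survive in this thread; below is a sketch of one way such a reformatting step might look. It assumes the extracted files use XSum's `[XSUM]URL[XSUM]` / `[XSUM]INTRODUCTION[XSUM]` / `[XSUM]RESTBODY[XSUM]` section markers, and the output layout (summary first, then body, in a `.data` file) is an assumption to adapt to whatever the preprocessing scripts expect.

```python
import re
from pathlib import Path

# Assumed input: each file contains sections delimited by [XSUM]URL[XSUM],
# [XSUM]INTRODUCTION[XSUM] (the one-line summary), and [XSUM]RESTBODY[XSUM]
# (the article body). The output layout below is an assumption.
MARKER = re.compile(r"\[XSUM\](URL|INTRODUCTION|RESTBODY)\[XSUM\]")

def parse_summary_file(text):
    """Split raw file text into a {section_name: content} dict."""
    parts = MARKER.split(text)
    # parts[1:] alternates section names and their contents.
    it = iter(parts[1:])
    return {name: content.strip() for name, content in zip(it, it)}

def reformat(src_dir, out_dir):
    """Rewrite each <bbcid>.summary file as a plain-text <bbcid>.data file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.summary"):
        sections = parse_summary_file(path.read_text(encoding="utf-8"))
        (out / f"{path.stem}.data").write_text(
            sections.get("INTRODUCTION", "") + "\n\n" +
            sections.get("RESTBODY", ""), encoding="utf-8")
```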
From here, you can follow the instructions in the dataset README from the relevant section onward. As a side note, I am using a different version of the Stanford CoreNLP Toolkit.
Hi! Thanks for providing the code. I'm wondering which decoder you used for the text files? When I run the same code you provided, I get the following error:
Hi, I can't access the link.
Shay suggested trying this: http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
Hello, thanks a lot for sharing the data.
Hi, I just have one question: what is the total number of instances? I got 237,002 after preprocessing the files downloaded from the bollin.inf.ed.ac.uk link above. Is that the same in your case? The Hugging Face website reports around 226,000 instances.
Hello, I used process-corenlp-xml-data.py to process the bbcid.data.xml files, but I got an error saying some information is missing: /stanfordOutput/bbcid.data.xml. It would be great if you could help me with this issue. Thanks.
If anyone still has problems with:
Thanks for your excellent work.
Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset we are familiar with? I believe it would save time and be more convenient for experiments.
I'd appreciate any help you could give. Thanks~