
Dataset Availability #1

Open
GuillemGSubies opened this issue Jul 20, 2021 · 9 comments

Comments

@GuillemGSubies

Hi!

Thanks for the work, I love it! I was wondering if the final clean corpus is available somewhere.

@asier-gutierrez
Contributor

Dear Guillem,

Thanks for your interest in the corpus. We are working to improve the current models and create new ones. Once we finish these tasks, we plan to upload the corpus to Zenodo.

I will leave this issue open so that I can reply again once the corpus is available; that way you will be notified.

Kind regards,
Asier

@alfem

alfem commented Jul 30, 2021

Awesome work! I am also interested in the dataset and/or the originally collected sentences.

I think this would be a huge help to the Mozilla Common Voice project (https://commonvoice.mozilla.org/es), but they need CC0-licensed sentences for their voice collection tool.

@boriel

boriel commented Aug 2, 2021

I wanted to try the scripts too (at least the examples provided in the README.md). Is this possible?

@asier-gutierrez
Contributor

Dear @alfem, thanks for your interest.

I think this would be a huge help to the Mozilla Common Voice project (https://commonvoice.mozilla.org/es), but they need CC0-licensed sentences for their voice collection tool.

We will do our best with the license. However, I think that 570 GB of clean text is too much for Common Voice. Maybe you could try Wikipedia 🙂

Regards,
Asier

@asier-gutierrez
Contributor

asier-gutierrez commented Aug 3, 2021

Dear @boriel,

I wanted to try the scripts too (at least the examples provided in the README.md). Is this possible?

The scripts from the README.md can be run on your PC:

  1. Install Anaconda.
  2. Create an environment: conda create -n <your_environment_name>.
  3. Activate your environment: conda activate <your_environment_name>.
  4. Install the transformers library in your environment: conda install -c conda-forge transformers.
  5. Start an interactive Python console to check that everything works: python.
  6. Copy the following code line by line:
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Download the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()  # inference mode (disables dropout)

# Build a fill-mask pipeline and predict the masked token.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
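
As a side note (a minimal sketch, not part of the original comment): recent versions of transformers also expose a one-line pipeline() factory that loads the tokenizer and the model for you, equivalent to the longer snippet above; it assumes the same model name used there.

from pprint import pprint

from transformers import pipeline

# The factory downloads the tokenizer and model behind the scenes.
unmasker = pipeline('fill-mask', model='PlanTL-GOB-ES/roberta-base-bne')

# top_k controls how many candidate fillings are returned (default 5).
res = unmasker("¡Hola <mask>!", top_k=5)
pprint([r['token_str'] for r in res])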

Just to clarify: fine-tuning scripts will be released very soon.

Regards,
Asier

@alfem

alfem commented Aug 5, 2021

We will do our best with the license. However, I think that 570 GB of clean text is too much for Common Voice. Maybe you could try Wikipedia 🙂

Common Voice is already using Wikipedia, but they are allowed to grab only a couple of sentences from every page. And they need many more!

@asier-gutierrez
Contributor

Dear @alfem,

Common Voice is already using Wikipedia, but they are allowed to grab only a couple of sentences from every page. And they need many more!

We will take this into consideration when licensing the corpus, but we cannot promise anything. Thanks for the information.

@asier-gutierrez
Contributor

asier-gutierrez commented Aug 5, 2021

Just to clarify: fine-tuning scripts will be released very soon.

Fine-tuning scripts have been released. However, we are aware that some of the datasets have limited availability.

@PeterKrejzl

Thanks for sharing the models! Could you share a bit more information about the corpus? In particular, I'm looking for models trained on the Spanish used in the Americas (e.g. Mexico, ...). Thanks!
