
Dataset Availability #1

Open
GuillemGSubies opened this issue Jul 20, 2021 · 9 comments

Comments

@GuillemGSubies

Hi!

Thanks for the work, I love it! I was wondering if the final clean corpus is available somewhere.

@asier-gutierrez
Contributor

Dear Guillem,

Thanks for your interest in the corpus. We are working to improve the current models and create new ones. Once we finish these tasks, we plan to upload the corpus to Zenodo.

I will leave this issue open so that I can reply again once the corpus is available; that way you will be notified.

Kind regards,
Asier

@alfem

alfem commented Jul 30, 2021

Awesome work! I am also interested in the dataset and/or the originally collected sentences.

I think this would be a huge help to the Mozilla Common Voice project (https://commonvoice.mozilla.org/es), but they need CC0-licensed sentences for their voice collection tool.

@boriel

boriel commented Aug 2, 2021

I wanted to try the scripts too (at least the examples provided in the README.md). Is this possible?

@asier-gutierrez
Contributor

Dear @alfem, thanks for your interest.

I think this would be a huge help to the Mozilla Common Voice project (https://commonvoice.mozilla.org/es), but they need CC0-licensed sentences for their voice collection tool.

We will do our best with the license. However, I think that 570 GB of clean text is too much for Common Voice. Maybe you could try Wikipedia 🙂

Regards,
Asier

@asier-gutierrez
Contributor

asier-gutierrez commented Aug 3, 2021

Dear @boriel,

I wanted to try the scripts too (at least the examples provided in the README.md). Is this possible?

The scripts from the README.md can be run on your PC:

  1. Install Anaconda.
  2. Create an environment: conda create -n <your_environment_name>.
  3. Activate your environment: conda activate <your_environment_name>.
  4. Install the transformers library in your environment: conda install -c conda-forge transformers.
  5. Start an interactive Python console to check that everything works: python.
  6. Copy the following code line by line:
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Download the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()  # inference mode (disables dropout)

# Build a fill-mask pipeline and predict the masked token.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
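
As a side note (a minimal sketch, not part of the original comment): recent versions of transformers also expose a one-line pipeline() factory that loads the tokenizer and the model for you, equivalent to the longer snippet above; it assumes the same model name used there.

from pprint import pprint

from transformers import pipeline

# The factory downloads the tokenizer and model behind the scenes.
unmasker = pipeline('fill-mask', model='PlanTL-GOB-ES/roberta-base-bne')

# top_k controls how many candidate fillings are returned (default 5).
res = unmasker("¡Hola <mask>!", top_k=5)
pprint([r['token_str'] for r in res])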

Just to clarify: fine-tuning scripts will be released very soon.

Regards,
Asier

@alfem

alfem commented Aug 5, 2021

We will do our best with the license. However, I think that 570 GB of clean text is too much for Common Voice. Maybe you could try Wikipedia 🙂

Common Voice is already using Wikipedia, but they are allowed to grab only a couple of sentences from every page. And they need many more!

@asier-gutierrez
Contributor

Dear @alfem,

Common Voice is already using Wikipedia, but they are allowed to grab only a couple of sentences from every page. And they need many more!

We will take this into consideration when licensing the corpus, but we cannot promise anything. Thanks for the information.

@asier-gutierrez
Contributor

asier-gutierrez commented Aug 5, 2021

Just to clarify: fine-tuning scripts will be released very soon.

Fine-tuning scripts have been released. However, we are aware that some of the datasets have limited availability.

@PeterKrejzl

Thanks for sharing the models! Could you share a bit more information about the corpus? In particular, I'm looking for models trained on the Spanish used in the Americas (e.g. Mexico, ...). Thanks!
