Dataset Availability #1
Dear Guillem,
Thanks for your interest in the corpus. We are working to improve the current models and create new ones. Once we finish these tasks, we plan to upload the corpus to Zenodo. I will leave this issue open so that I can reply to it again once the corpus is available; this way you will be notified. Kind regards,
Awesome work! I am also interested in the dataset and/or the original collected sentences. I think this would be a huge help to the Mozilla Common Voice project (https://commonvoice.mozilla.org/es), but they need CC0-licensed sentences for their voice collection tool.
I wanted to try the scripts too (at least the examples provided in the README.md). Is this possible?
Dear @alfem, thanks for your interest.
We will do our best with the license. However, I think that 570 GB of clean text is too much for Common Voice. Maybe you could try Wikipedia 🙂 Regards,
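If it helps, here is a rough sketch of what pulling candidate sentences from Wikipedia could look like. It is not part of this project: the wikimedia/wikipedia dump name and the naive regex splitter are just assumptions for illustration.

import re
from datasets import load_dataset

# Stream the Spanish Wikipedia dump (assumed dump name) instead of downloading it all
wiki = load_dataset('wikimedia/wikipedia', '20231101.es', split='train', streaming=True)

for article in wiki.take(3):
    # Naive sentence split; a proper segmenter would do better
    sentences = re.split(r'(?<=[.!?])\s+', article['text'])
    for sentence in sentences[:2]:  # keep only a couple of sentences per page
        print(sentence)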
Dear @boriel
The scripts from the README.md can be run on your PC:

from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()

# Build a fill-mask pipeline and print the top predictions for the masked token
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

Just to clarify: fine-tuning scripts will be released very soon. Regards,
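For anyone who just wants a quick check, the same thing can be done with the high-level pipeline helper from transformers. This is only a minimal sketch; the model identifier is the one from the snippet above.

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='PlanTL-GOB-ES/roberta-base-bne')
for prediction in fill_mask('¡Hola <mask>!'):
    # Each prediction carries the filled token and its score
    print(prediction['token_str'], prediction['score'])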
Common Voice is already using Wikipedia, but they are allowed to grab only a couple of sentences from every page. And they need many more!
Dear @alfem
We will take this into consideration when licensing the corpus, but we cannot promise anything. Thanks for the information.
Fine-tuning scripts have been released. However, we are aware of the limited availability of some of the datasets.
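For readers who want to try fine-tuning before digging into the released scripts, here is a minimal sketch (not the official scripts) that adapts the checkpoint to a toy text-classification task with the transformers Trainer API; the tiny in-memory dataset is invented purely for illustration.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = 'PlanTL-GOB-ES/roberta-base-bne'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labelled data, made up for the example; replace with a real Spanish dataset
train = Dataset.from_dict({
    'text': ['¡Qué buena película!', 'No me gustó nada.'],
    'label': [1, 0],
})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=64)

train = train.map(tokenize, batched=True)

args = TrainingArguments(output_dir='out', num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=train)
trainer.train()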
Thanks for sharing the models! Could you share a bit more information about the corpus? In particular, I'm looking for models trained on the Spanish used in the Americas (e.g. Mexico, ...). Thanks!
Hi!
Thanks for the work, I love it. I was wondering if the final clean corpus is available somewhere.