Skip to content

Latest commit

 

History

History
51 lines (28 loc) · 2.06 KB

README.md

File metadata and controls

51 lines (28 loc) · 2.06 KB

We're in the process of releasing BERT models as well. Get the first one here: https://github.com/mollerhoj/danish_bert

Scandinavian ULMFiT

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.

This repository contains the weights for the embedding layer of a UMLFiT language model that can be used as the first step in fine-tuning any Natural Language Processing task.

The weights were trained on 90% of all text in the corresponding language wikipedia as per 3. July 2018. The remaining 10% was used for validation.

Supported Languages:

  • Danish

Trained on 78,373,122 tokens, and validated on 7,837,310 tokens. We achieve a perplexity of 30.9. Download files: Link

  • Norwegian

Trained on 80,284,231 tokens, and validated on 8,920,387 tokens. We achieve a perplexity of 26.31. Download files: Link

  • Finnish

Trained on 68,775,370 tokens, and validated on 7,641,571 tokens. We achieve a perplexity of 27.66

Training even higher performance models is possible, but require more (costly) training time. If you need a model with higher performance, feel free to contact us. Download files: Link

Our servers crashed when training the Swedish model, but if you're in need of it, contact us and we can train it for you.

Paper

See Universal Language Model Fine-tuning for Text Classification, Jeremy Howard, Sebastian Ruder, https://arxiv.org/abs/1801.06146

File descriptions

  • enc.h5 Contains the weights in 'Hierarchical Data Format'

  • enc.pth Contains the weights in 'Pytorch model format'

  • itos.pkl (Integers to Strings) contains the vocabulary mapping from ids (0 - 30000) to strings

Sponsor

This work was sponsored by Danish chatbot company BotXO http://www.botxo.co/

Thanks

Thanks to Tobias Lindberg from Damvad Analytics for converting the vectors to pth-format.