Russian fastText embeddings trained on Araneum web corpus #27

akutuzov · 2018-05-07T17:58:07Z

Name: fasttext-ru_araneum-300
Link: http://rusvectores.org/static/models/rusvectores4/fasttext/araneum_none_fasttextcbow_300_5_2018.tgz
Description: fastText vectors trained on Araneum Russicum Maximum corpus (about 10 billion words). The model contains 196K words and 403K 3-4-5-grams.
License: CC-BY (http://rusvectores.org/en/about/)
Related papers: https://arxiv.org/abs/1801.06407, https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized with Mystem.
Parameters: vector size 300, window size 5
Code example:

$ tar xzf araneum_none_fasttextcbow_300_5_2018.tgz
$ python3
>>> import gensim
>>> model = gensim.models.KeyedVectors.load('araneum_none_fasttextcbow_300_5_2018.model')
>>> for word, similarity in model.most_similar(positive=['уточка']):
...     print(word, round(similarity, 3))
...
чуточка 0.754
досочка 0.726
пинеточка 0.724
деточка 0.704
улиточка 0.693
нямочка 0.693
белочка 0.69
квочка 0.69
выточка 0.689
козочка 0.683
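
The "3-4-5-grams" in the description are character n-grams. As a rough illustration of why the neighbours above share the suffix "-очка", here is a minimal sketch of how fastText-style n-grams can be extracted from a word. The function name `char_ngrams` is hypothetical, and the boundary markers `<` and `>` and the 3 to 5 length range are assumptions based on the standard fastText scheme and the description above, not something read out of this model file.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Return character n-grams of `word` wrapped in boundary markers.

    Assumption: n-gram lengths 3..5, matching the "3-4-5-grams" in the
    model description; '<' and '>' mark the start and end of the word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


if __name__ == "__main__":
    # Words that share many n-grams tend to get similar vectors.
    shared = set(char_ngrams("уточка")) & set(char_ngrams("белочка"))
    print(sorted(shared))
```

The real model stores vectors for such n-grams and combines them with the word vector, so this sketch only shows the decomposition step, not the lookup.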


akutuzov · 2018-05-07T18:05:22Z

Since the model was trained on a lemmatized corpus, Russian texts should be lemmatized before querying it, for example with pymystem3:

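from pymystem3 import Mystem

m = Mystem()  # create the analyzer once instead of on every call

def tag(word):
    # Use the first analysis Mystem returns for the first token.
    processed = m.analyze(word)[0]
    lemma = processed["analysis"][0]["lex"].lower().strip()
    return lemma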

tag('стульев')
стул
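
The function above assumes a single known word. For whole sentences, `analyze` returns one dictionary per token, and tokens such as spaces, punctuation or words unknown to Mystem have no usable `"analysis"` entry. The following is a minimal sketch of extracting lemmas from such output; the helper name `lemmas_from_analysis` is hypothetical, and the sample list is hand-written to mimic the shape of the output rather than copied from a real Mystem run.

```python
def lemmas_from_analysis(analysis):
    """Extract lowercase lemmas from the list returned by Mystem().analyze().

    Tokens without an analysis keep their surface form; whitespace and
    punctuation tokens are dropped.
    """
    lemmas = []
    for token in analysis:
        text = token.get("text", "").strip()
        if not any(ch.isalnum() for ch in text):
            continue  # skip spaces, newlines and punctuation
        readings = token.get("analysis")
        if readings:
            lemmas.append(readings[0]["lex"].lower())
        else:
            # Unknown word: fall back to the token itself.
            lemmas.append(text.lower())
    return lemmas


if __name__ == "__main__":
    # Hand-written sample in the assumed output shape (not real Mystem output).
    sample = [
        {"analysis": [{"lex": "стул"}], "text": "стульев"},
        {"text": " "},
        {"analysis": [], "text": "fastText"},
        {"text": "\n"},
    ]
    print(lemmas_from_analysis(sample))
```

With pymystem3 installed, the same helper could be fed `Mystem().analyze(text)` directly, and the resulting lemmas passed to `model.most_similar`.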

andrei-q · 2019-02-11T12:16:43Z

I got the following error:

>>> model = gensim.models.fasttext.FastText.load('araneum_none_fasttextcbow_300_5_2018.model')
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/fasttext.py", line 936, in load
    model = super(FastText, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/base_any2vec.py", line 1247, in load
    if not hasattr(model.vocabulary, 'ns_exponent'):
AttributeError: 'FastTextKeyedVectors' object has no attribute 'vocabulary'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/fasttext.py", line 945, in load
    return load_old_fasttext(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/fasttext.py", line 53, in load_old_fasttext
    old_model = FastText.load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/word2vec.py", line 1618, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
    obj = unpickle(fname)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/old_saveload.py", line 380, in unpickle
    return _pickle.loads(file_bytes, encoding='latin1')
AttributeError: Can't get attribute 'FastTextKeyedVectors' on <module 'gensim.models.deprecated.keyedvectors' from '/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/keyedvectors.py'>

akutuzov · 2019-02-11T13:08:45Z

@andrei-q The Gensim fastText code has been refactored since this issue was created.
In recent versions of Gensim, you should use gensim.models.KeyedVectors.load() to load this model.
I've changed the code snippet above accordingly.
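
A minimal sketch of a loader built on that advice follows. The wrapper name `load_araneum_vectors` and the early file check are assumptions added for illustration; only the `gensim.models.KeyedVectors.load()` call comes from this thread. Whether the call succeeds still depends on the installed Gensim version being compatible with the saved model.

```python
import os


def load_araneum_vectors(model_path):
    """Load the extracted .model file with KeyedVectors.load().

    Fails early with a clear message if the archive has not been
    extracted to the expected location.
    """
    if not os.path.exists(model_path):
        raise FileNotFoundError(
            f"{model_path} not found; extract the .tgz archive first"
        )
    # Imported here so the path check works even without Gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load(model_path)
```

Usage would then be `model = load_araneum_vectors('araneum_none_fasttextcbow_300_5_2018.model')`, followed by the `most_similar` query shown earlier.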

andrei-q · 2019-02-11T13:51:12Z

Thanks. It works
