I implemented a classic word2vec model by hand in pure Python 3: a skip-gram model trained with negative sampling, using the TED-Talks-Dataset as the training corpus. To test the quality of the final embedding vectors, I measured their accuracy on the TOEFL Synonym Questions dataset.
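As a rough illustration of how such an evaluation can work (the question loader and all names below are hypothetical, not the actual code of this repo), each TOEFL item gives a probe word and several candidate answers, and the embedding is scored by picking the candidate whose vector is most cosine-similar to the probe:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_question(embeddings, probe, candidates):
    # Choose the candidate most similar to the probe; skip out-of-vocabulary words.
    if probe not in embeddings:
        return None
    scored = [(cosine(embeddings[probe], embeddings[c]), c)
              for c in candidates if c in embeddings]
    return max(scored)[1] if scored else None

def toefl_accuracy(embeddings, questions):
    # `questions` is assumed to be a list of (probe, candidates, correct_answer) tuples;
    # how they are parsed depends on the release of the TOEFL dataset you use.
    hits = sum(1 for probe, cands, correct in questions
               if answer_question(embeddings, probe, cands) == correct)
    return hits / len(questions)
```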
The skip-gram model with negative sampling is built to achieve the following:
Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. The network then tells us, for every word in our vocabulary, the probability of it being the "nearby word" that we chose.
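The sketch below shows what one stochastic-gradient step of skip-gram with negative sampling looks like for a single (center, context) pair, following the objective described in "Distributed Representations of Words and Phrases and their Compositionality" (Paper 01 below). It is a minimal pure-Python illustration under assumed names (`W_in`, `W_out`, `lr`, `k`), not the exact code of this repo, and it draws negatives uniformly rather than from the unigram^0.75 distribution used in the paper:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_update(W_in, W_out, center, context, vocab, k=5, lr=0.025):
    """One SGNS step: push the context word's output vector toward the center
    word's input vector, and k randomly sampled 'negative' words away from it.

    W_in, W_out: dicts mapping word -> list[float] (input/output embeddings).
    vocab: list of words to draw negative samples from (a real implementation
           would sample from the unigram^0.75 distribution and avoid drawing
           the true context word)."""
    v_c = W_in[center]
    dim = len(v_c)
    grad_center = [0.0] * dim

    # Label 1 for the true context word, 0 for each negative sample.
    targets = [(context, 1.0)] + [(random.choice(vocab), 0.0) for _ in range(k)]
    for word, label in targets:
        u_o = W_out[word]
        score = sigmoid(sum(a * b for a, b in zip(v_c, u_o)))
        g = lr * (label - score)              # gradient of the log-sigmoid loss
        for i in range(dim):
            grad_center[i] += g * u_o[i]      # accumulate the update for v_c
            u_o[i] += g * v_c[i]              # update the output vector in place

    for i in range(dim):
        v_c[i] += grad_center[i]
```

In a full training loop, this update would be applied to every (center, context) pair produced by sliding a context window over the TED-Talks corpus.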
Blog:
01.Word2Vec Tutorial - The Skip-Gram Model
02.Word2Vec Tutorial - Negative Sampling
03.Deep Learning in Practice: word2vec (Deep Learning实战之word2vec)
04.Word2Vec and FastText Word Embedding with Gensim
05.A Gentle Introduction to the Bag-of-Words Model
06.Python implementation of Word2Vec
Paper:
01.Distributed Representations of Words and Phrases and their Compositionality
02.Efficient Estimation of Word Representations in Vector Space
03.Word2vec Parameter Learning Explained
04.Linguistic Regularities in Continuous Space Word Representations
05.Evaluation methods for unsupervised word embeddings
06.Word and Phrase Translation with word2vec
07.word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
08.How to Generate a Good Word Embedding?
Video:
01.Negative Sampling-Coursera Deeplearning
Code:
01.word2vec_commented_in_C
02.word2vec code in Python
Dataset:
01.TED-Talks-Dataset
02.TOEFL Synonym Questions
Other datasets:
WordSim353, SNLI, NER, SQuAD, Coref, SRL, SST-5, Parsing