This repo contains code for implementing word embeddings from scratch in Python using two methods:
- Frequency-based embedding - a co-occurrence matrix method to obtain embeddings for the words occurring in a given corpus.
- Prediction-based embedding - the Word2vec method for training word representations, implemented here with the CBOW architecture.
Dependencies:
- numpy
- collections
- re
- sklearn
- gensim
The models were trained on the following data: LINK
- To run model 1, which uses the co-occurrence matrix and SVD: `python3 part1.py`
- To run model 2, which uses the Word2vec CBOW model: `python3 part2.py`
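Since gensim is listed as a dependency above, model 2 presumably trains its CBOW embeddings with gensim's `Word2Vec`. A minimal sketch of such a setup (the tiny corpus and hyperparameter values here are illustrative, not the repo's actual settings):

```python
from gensim.models import Word2Vec

# Tokenized corpus: a list of token lists. sg=0 selects CBOW
# (sg=1 would select skip-gram). Hyperparameters are illustrative.
sentences = [["I", "enjoy", "flying"], ["I", "like", "NLP"],
             ["I", "like", "deep", "learning"]]
model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                 min_count=1, sg=0)
print(model.wv.most_similar("flying", topn=3))
```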
Link for embeddings - https://drive.google.com/drive/folders/1cK0aUM3likmKcisz2nK9yQlyPqBIioHi?usp=sharing
```python
for review in splitreviews:
    for i in range(len(review) - 1):
        # count each adjacent word pair symmetrically (window size 1)
        matrix[counts[review[i]]][counts[review[i + 1]]] += 1
        matrix[counts[review[i + 1]]][counts[review[i]]] += 1
```
Here `matrix` is a |V| x |V| array of zeros (|V| is the vocabulary size), `counts` maps each word to its row/column index, and `splitreviews` is the list of tokenized reviews.
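The snippet above assumes the corpus has already been tokenized and indexed. A minimal sketch of that setup (the variable names match the snippet; the tokenization itself is illustrative, not the repo's exact preprocessing):

```python
import re
import numpy as np

reviews = ["I enjoy flying.", "I like NLP.", "I like deep learning."]
# keep words and the sentence-final period as tokens
splitreviews = [re.findall(r"\w+|\.", review) for review in reviews]

vocabulary = sorted({word for review in splitreviews for word in review})
counts = {word: i for i, word in enumerate(vocabulary)}  # word -> matrix index
matrix = np.zeros((len(vocabulary), len(vocabulary)))
```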
An example of a co-occurrence matrix is shown below, built from three sentences:
- I enjoy flying.
- I like NLP.
- I like deep learning.
The co-occurrence matrix for these sentences, using the window-1 counting above, is:

|          | I | enjoy | flying | like | NLP | deep | learning | . |
|----------|---|-------|--------|------|-----|------|----------|---|
| I        | 0 | 1     | 0      | 2    | 0   | 0    | 0        | 0 |
| enjoy    | 1 | 0     | 1      | 0    | 0   | 0    | 0        | 0 |
| flying   | 0 | 1     | 0      | 0    | 0   | 0    | 0        | 1 |
| like     | 2 | 0     | 0      | 0    | 1   | 1    | 0        | 0 |
| NLP      | 0 | 0     | 0      | 1    | 0   | 0    | 0        | 1 |
| deep     | 0 | 0     | 0      | 1    | 0   | 0    | 1        | 0 |
| learning | 0 | 0     | 0      | 0    | 0   | 1    | 0        | 1 |
| .        | 0 | 0     | 1      | 0    | 1   | 0    | 1        | 0 |
```python
from scipy.linalg import svd

U, D, VT = svd(matrix, full_matrices=False)

# keep the first K components of each word's row of U as its embedding
word_embeddings = {}
for index, word in enumerate(vocabulary):
    word_embeddings[word] = U[index][:K]
```
`word_embeddings` is a dictionary where the keys are the words and the values are their embeddings.
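For instance, a single embedding can be inspected directly (a sketch; "camera" is illustrative and `K` is whatever truncation dimension was chosen above):

```python
vec = word_embeddings["camera"]  # "camera" must be in the vocabulary
print(vec.shape)                 # -> (K,), the chosen embedding dimension
```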
To find the top 10 most similar words for a given word, use the function `find_word_embeddings`:
```python
from numpy import dot
from numpy.linalg import norm

def find_word_embeddings(searchword):
    # top holds [similarity, word] pairs, kept sorted in descending order
    top = [[0, " "] for _ in range(10)]
    a = word_embeddings[searchword]
    for word in vocabulary:
        if word == searchword:
            continue
        b = word_embeddings[word]
        cos_sim = dot(a, b) / (norm(a) * norm(b))
        # insert the word at its rank and drop the entry pushed past 10th place
        for index, item in enumerate(top):
            if cos_sim > item[0]:
                top.insert(index, [cos_sim, word])
                top.pop(10)
                break
    return top
```
```python
# Example
top = find_word_embeddings("camera")
print(top)
```
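As a side note, since sklearn is already listed as a dependency, the same cosine similarity can be computed with its built-in helper. A minimal sketch (the words "camera" and "product" are illustrative and must be in the vocabulary):

```python
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2D arrays, hence the reshape
a = word_embeddings["camera"].reshape(1, -1)
b = word_embeddings["product"].reshape(1, -1)
print(cosine_similarity(a, b)[0][0])
```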
t-SNE plots for Model 1 (co-occurrence matrix) for the words 'camera', 'product', 'good', 'strong', and 'look'.

t-SNE plots for Model 2 (Word2vec CBOW) for the words 'camera', 'product', 'good', 'strong', and 'look'.
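A minimal sketch of how plots like these can be produced with sklearn's `TSNE` (matplotlib is an assumption here, as it is not in the dependency list, and the repo's actual plotting code may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["camera", "product", "good", "strong", "look"]
vectors = np.array([word_embeddings[w] for w in words])

# perplexity must be smaller than the number of points being embedded
coords = TSNE(n_components=2, perplexity=2).fit_transform(vectors)

for (x, y), word in zip(coords, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```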