This repo contains code for implementing word embeddings from scratch in Python using two methods:
- Frequency-based embedding - a co-occurrence matrix method to obtain embeddings for the words occurring in a given corpus.
- Prediction-based embedding - the Word2vec method for training word representations, implemented here with the CBOW architecture.
Dependencies:
- numpy
- collections
- re
- sklearn
- gensim
The models were trained on the following data: LINK
- To run model 1, which uses the co-occurrence matrix and SVD: `python3 part1.py`
- To run model 2, which uses the Word2vec CBOW model: `python3 part2.py`
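Since gensim is listed as a dependency above, model 2 presumably trains its CBOW embeddings with gensim's `Word2Vec`. A minimal sketch of such a setup (the tiny corpus and hyperparameter values here are illustrative, not the repo's actual settings):

```python
from gensim.models import Word2Vec

# Tokenized corpus: a list of token lists. sg=0 selects CBOW
# (sg=1 would select skip-gram). Hyperparameters are illustrative.
sentences = [["I", "enjoy", "flying"], ["I", "like", "NLP"],
             ["I", "like", "deep", "learning"]]
model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                 min_count=1, sg=0)
print(model.wv.most_similar("flying", topn=3))
```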
Link for embeddings - https://drive.google.com/drive/folders/1cK0aUM3likmKcisz2nK9yQlyPqBIioHi?usp=sharing
```python
for review in splitreviews:
    for i in range(len(review) - 1):
        # count each adjacent word pair symmetrically (window size 1)
        matrix[counts[review[i]]][counts[review[i + 1]]] += 1
        matrix[counts[review[i + 1]]][counts[review[i]]] += 1
```
Here `matrix` is a |V| x |V| array of zeros (|V| is the vocabulary size), `counts` maps each word to its row/column index, and `splitreviews` is the list of tokenized reviews.
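The snippet above assumes the corpus has already been tokenized and indexed. A minimal sketch of that setup (the variable names match the snippet; the tokenization itself is illustrative, not the repo's exact preprocessing):

```python
import re
import numpy as np

reviews = ["I enjoy flying.", "I like NLP.", "I like deep learning."]
# keep words and the sentence-final period as tokens
splitreviews = [re.findall(r"\w+|\.", review) for review in reviews]

vocabulary = sorted({word for review in splitreviews for word in review})
counts = {word: i for i, word in enumerate(vocabulary)}  # word -> matrix index
matrix = np.zeros((len(vocabulary), len(vocabulary)))
```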
An example of a co-occurrence matrix is shown below, built from three sentences:
- I enjoy flying.
- I like NLP.
- I like deep learning.
The co-occurrence matrix for these sentences, using the window-1 counting above, is:

|          | I | enjoy | flying | like | NLP | deep | learning | . |
|----------|---|-------|--------|------|-----|------|----------|---|
| I        | 0 | 1     | 0      | 2    | 0   | 0    | 0        | 0 |
| enjoy    | 1 | 0     | 1      | 0    | 0   | 0    | 0        | 0 |
| flying   | 0 | 1     | 0      | 0    | 0   | 0    | 0        | 1 |
| like     | 2 | 0     | 0      | 0    | 1   | 1    | 0        | 0 |
| NLP      | 0 | 0     | 0      | 1    | 0   | 0    | 0        | 1 |
| deep     | 0 | 0     | 0      | 1    | 0   | 0    | 1        | 0 |
| learning | 0 | 0     | 0      | 0    | 0   | 1    | 0        | 1 |
| .        | 0 | 0     | 1      | 0    | 1   | 0    | 1        | 0 |
```python
from scipy.linalg import svd

U, D, VT = svd(matrix, full_matrices=False)

# keep the first K components of each word's row of U as its embedding
word_embeddings = {}
for index, word in enumerate(vocabulary):
    word_embeddings[word] = U[index][:K]
```
`word_embeddings` is a dictionary where the keys are the words and the values are their embeddings.
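For instance, a single embedding can be inspected directly (a sketch; "camera" is illustrative and `K` is whatever truncation dimension was chosen above):

```python
vec = word_embeddings["camera"]  # "camera" must be in the vocabulary
print(vec.shape)                 # -> (K,), the chosen embedding dimension
```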
To find the top 10 most similar words for a given word, use the function `find_word_embeddings`:
```python
from numpy import dot
from numpy.linalg import norm

def find_word_embeddings(searchword):
    # top holds [similarity, word] pairs, kept sorted in descending order
    top = [[0, " "] for _ in range(10)]
    a = word_embeddings[searchword]
    for word in vocabulary:
        if word == searchword:
            continue
        b = word_embeddings[word]
        cos_sim = dot(a, b) / (norm(a) * norm(b))
        # insert the word at its rank and drop the entry pushed past 10th place
        for index, item in enumerate(top):
            if cos_sim > item[0]:
                top.insert(index, [cos_sim, word])
                top.pop(10)
                break
    return top
```
```python
# Example
top = find_word_embeddings("camera")
print(top)
```
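As a side note, since sklearn is already listed as a dependency, the same cosine similarity can be computed with its built-in helper. A minimal sketch (the words "camera" and "product" are illustrative and must be in the vocabulary):

```python
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2D arrays, hence the reshape
a = word_embeddings["camera"].reshape(1, -1)
b = word_embeddings["product"].reshape(1, -1)
print(cosine_similarity(a, b)[0][0])
```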
t-SNE plots for Model 1 (co-occurrence matrix) for the words 'camera', 'product', 'good', 'strong', and 'look'.

t-SNE plots for Model 2 (Word2vec CBOW) for the words 'camera', 'product', 'good', 'strong', and 'look'.
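A minimal sketch of how plots like these can be produced with sklearn's `TSNE` (matplotlib is an assumption here, as it is not in the dependency list, and the repo's actual plotting code may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["camera", "product", "good", "strong", "look"]
vectors = np.array([word_embeddings[w] for w in words])

# perplexity must be smaller than the number of points being embedded
coords = TSNE(n_components=2, perplexity=2).fit_transform(vectors)

for (x, y), word in zip(coords, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```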