ELMo Embeddings for Clustering

Clusters text datasets by topic using ELMo sentence embeddings. Benchmarking with NVIDIA RAPIDS cuML/pyGDF for performance is currently in progress.

Requirements:

Python3 (>=3.6 for AllenNLP)
AllenNLP
TensorFlow
NumPy
SKLearn (for clustering)
Torch (for AllenNLP)
SciPy (for SKLearn stop words)

Install

To install the necessary dependencies:

apt-get install python3-pip
pip3 install allennlp
pip3 install tensorflow-gpu
pip3 install numpy
pip3 install scikit-learn
pip3 install torch
# pip3 install scipy

Usage

To generate sentence embeddings, make sure that the sentences.txt file contains one sentence or transcription per line:

<sentence/transcription>
<sentence/transcription>
# and so on...
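As a rough illustration of the expected input, a file in this layout can be read as below. This is only a sketch: `read_sentences` is a hypothetical helper and not part of the repository, and skipping blank lines is an assumption.

```python
from pathlib import Path


def read_sentences(path):
    """Return the non-empty lines of a one-sentence-per-line text file."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: surrounding whitespace and blank lines carry no meaning.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `read_sentences("sentences.txt")` would return a list with one string per sentence, in file order.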

Run python3 main.py with the following options:

--mode embed to embed the sentence file
--mode sif to enhance sentence embeddings with SIF
--mode cluster to cluster embeddings
--mode project to reduce dimensionality for visualization
--mode metadata to write metadata file
--mode tensorboard to create TensorBoard files

Other runtime flags can be adjusted in main.py
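For readers unfamiliar with this style of command line, the sketch below shows one way a `--mode` switch like this can be parsed with `argparse`. It is not the actual main.py: `build_parser` and `run` are hypothetical names, and the real script may define additional flags or defaults.

```python
import argparse

# Mode names are taken from the option list above.
MODES = ["embed", "sif", "cluster", "project", "metadata", "tensorboard"]


def build_parser():
    parser = argparse.ArgumentParser(description="ELMo embeddings for clustering")
    parser.add_argument("--mode", choices=MODES, required=True,
                        help="pipeline stage to run")
    return parser


def run(argv=None):
    """Parse arguments and return the selected mode."""
    args = build_parser().parse_args(argv)
    # A real entry point would dispatch to the matching pipeline stage here.
    return args.mode
```

With this layout, an unknown mode is rejected by `argparse` before any work starts.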

Auxiliary scripts:

Run sh clean.sh to convert transcriptions to lowercase and remove stop words
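The cleaning step amounts to lowercasing each line and filtering out stop words. A minimal sketch follows, assuming whitespace tokenization and a tiny illustrative stop-word set; the `clean` function and `STOP_WORDS` are hypothetical, and the list clean.sh actually uses may differ.

```python
# A tiny illustrative stop-word set; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def clean(sentence, stop_words=STOP_WORDS):
    """Lowercase a sentence and drop stop words, keeping word order."""
    tokens = sentence.lower().split()  # assumption: split on whitespace only
    return " ".join(t for t in tokens if t not in stop_words)
```

Punctuation is left attached to words here, so a token such as "the," would not be treated as a stop word.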

Run

To run inside a Docker container:

docker build -t elmo-embeddings .
docker run -it elmo-embeddings /bin/bash
python3.6 main.py

Outputs

An output folder will be created in the current directory containing:

embeddings.npy: a NumPy array of sentence embeddings, saved in binary .npy format
embeddings_sif.npy: a NumPy array of sentence embeddings after SIF
embeddings_pc.npy: a NumPy array of sentence embeddings after PCA
embeddings_ts.npy: a NumPy array of sentence embeddings after t-SNE
km_labels.json: a list of cluster labels generated by KMeans
metadata.tsv: metadata of sentence labels for visualization
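To show how the label and metadata files relate, the sketch below pairs each sentence with its cluster label and writes a tab-separated file. The column layout and the `write_metadata` helper are assumptions for illustration; the repository's metadata.tsv may be laid out differently.

```python
import csv
import json


def write_metadata(sentences, labels_path, out_path):
    """Write one TSV row per sentence: the sentence and its cluster label."""
    with open(labels_path, encoding="utf-8") as f:
        labels = json.load(f)  # assumed to be a flat list, one label per sentence
    if len(labels) != len(sentences):
        raise ValueError("expected one label per sentence")
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # The TensorBoard projector expects a header row for multi-column metadata.
        writer.writerow(["sentence", "cluster"])
        writer.writerows(zip(sentences, labels))
    return len(labels)
```

Row order must match the row order of the embedding array so that each point in the visualization gets the right label.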

Other nested output folders:

tensorboard: for TensorBoard output logs
kmeans: for clustered sentences with kmeans
trimmed: for another copy of embeddings/sentences with specified clusters removed
hierarchy: for clustered sentences with hierarchical kmeans

GPU-acceleration:

Allow ELMo to use the GPU for embedding (update: speedup of up to 4x)
Use NVIDIA RAPIDS cuML to GPU-accelerate clustering (roughly 10x speedup)

TO-DO

Preprocess transcriptions
Embed each sentence with ELMo
Enhance embeddings with SIF
Cluster using SKLearn KMeans (optional: hierarchically)
Find optimal k using elbow method and silhouette scores (optional)
Reduce dimensionality for visualization (PCA, t-SNE)
Run in TensorBoard
Conclusions
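
Two of the steps above, the SIF enhancement and the choice of k, can be sketched as follows. This is an illustration under stated assumptions, not the code in this repository: `remove_common_component` covers only the common-component removal part of SIF (the full method also weights word vectors by frequency), `choose_k` picks k by silhouette score alone, and the data is a synthetic stand-in for real sentence embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def remove_common_component(embeddings):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    first = vt[0]  # dominant direction shared across all rows
    return embeddings - np.outer(embeddings @ first, first)


def choose_k(embeddings, candidates, seed=0):
    """Return (best_k, scores), where scores maps k to (inertia, silhouette)."""
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
        scores[k] = (model.inertia_, silhouette_score(embeddings, model.labels_))
    # Highest silhouette wins; the inertia values can be plotted for an elbow check.
    best_k = max(scores, key=lambda k: scores[k][1])
    return best_k, scores


# Synthetic stand-in for sentence embeddings: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 10.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
points = np.vstack([c + 0.1 * rng.standard_normal((30, 3)) for c in centers])

cleaned = remove_common_component(points)
best_k, scores = choose_k(cleaned, candidates=range(2, 6))
```

On real embeddings the silhouette curve is usually much flatter than on this toy data, so it is worth inspecting the scores alongside the inertia values instead of trusting a single maximum.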

vliu15/elmo-kmeans

Folders and files

Latest commit

History

Repository files navigation

ELMo Embeddings for Clustering

Requirements:

Install

Usage

Run

Outputs

GPU-acceleration:

TO-DO

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages