Converts text datasets into clusters based on topic. Benchmarking against NVIDIA RAPIDS cuML/pyGDF for performance is currently in progress.
- Python3 (>=3.6 for AllenNLP)
- AllenNLP
- TensorFlow
- NumPy
- SKLearn (for clustering)
- Torch (for AllenNLP)
- SciPy (for SKLearn stop words)
To install the necessary dependencies:
apt-get install python3-pip
pip3 install allennlp
pip3 install tensorflow-gpu
pip3 install numpy
pip3 install scikit-learn
pip3 install torch
# pip3 install scipy
To generate sentence embeddings, make sure that the sentences.txt file is formatted as follows, with one sentence or transcription per line:
<sentence/transcription>
<sentence/transcription>
# and so on...
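The one-sentence-per-line format can be read with a few lines of Python. This is a minimal sketch, not the loader used in main.py: the function name `load_sentences` is hypothetical, and skipping blank lines is an assumption.

```python
from pathlib import Path
from typing import List


def load_sentences(path: str) -> List[str]:
    """Read one sentence/transcription per line.

    Assumption: blank lines carry no sentence and are skipped.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and drop empty lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `load_sentences("sentences.txt")` would return a list with one string per non-empty line.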
Run python3 main.py with the following options:
- --mode embed: to embed the sentence file
- --mode sif: to enhance sentence embeddings with SIF
- --mode cluster: to cluster embeddings
- --mode project: to reduce dimensionality for visualization
- --mode metadata: to write metadata file
- --mode tensorboard: to create TensorBoard files
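To run every stage in one go, the modes can be chained from a small driver script. This is a sketch under two assumptions that the text above does not confirm: that each mode is a separate invocation of main.py, and that the stages are meant to run in the order listed. The names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
from typing import List, Sequence

# Modes in the order they are listed above (assumed to be the run order).
MODES = ("embed", "sif", "cluster", "project", "metadata", "tensorboard")


def build_commands(modes: Sequence[str] = MODES) -> List[List[str]]:
    """Build one `python3 main.py --mode <mode>` command per stage."""
    return [["python3", "main.py", "--mode", mode] for mode in modes]


def run_pipeline(modes: Sequence[str] = MODES) -> None:
    """Run each stage in turn, stopping at the first failure."""
    for cmd in build_commands(modes):
        subprocess.run(cmd, check=True)
```

`run_pipeline(["embed", "sif"])` would run only the first two stages, which can help when re-running part of the workflow.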
Adjust runtime flags in main.py
A couple of auxiliary files:
- Run sh clean.sh to convert transcriptions to lower case and remove stop words
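For readers who prefer to preprocess in Python, the same two operations (lower-casing and stop-word removal) can be sketched as below. This does not reproduce clean.sh, whose contents are not shown here; the function name `clean_sentence` and the tiny `STOP_WORDS` set are illustrative assumptions, and a real run would use a full stop-word list.

```python
from typing import FrozenSet

# Illustrative stop-word set only; substitute a complete list in practice.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}
)


def clean_sentence(sentence: str, stop_words: FrozenSet[str] = STOP_WORDS) -> str:
    """Lower-case a sentence and drop whitespace-separated stop words."""
    tokens = sentence.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)
```

Tokens are split on whitespace only, so punctuation attached to a word (for example "the,") would not match a stop word in this sketch.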
To run inside a Docker container:
docker build -t elmo-embeddings .
docker run -it elmo-embeddings /bin/bash
python3.6 main.py
An output folder will be created in the current directory containing:
- embeddings.npy: a NumPy array of sentence embeddings (NumPy arrays) in binary format
- embeddings_sif.npy: a NumPy array of sentence embeddings after SIF
- embeddings_pc.npy: a NumPy array of sentence embeddings after PCA
- embeddings_ts.npy: a NumPy array of sentence embeddings after t-SNE
- km_labels.json: a list of cluster labels generated by KMeans
- metadata.tsv: metadata of sentence labels for visualization
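The label file can be combined with the original sentences to inspect each cluster. The sketch below assumes that km_labels.json holds a flat JSON list with one integer label per sentence, in the same order as sentences.txt; the function names `load_labels` and `group_by_cluster` are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Sequence


def load_labels(path: str) -> List[int]:
    """Load cluster labels (assumed: a flat JSON list of integers)."""
    with open(path, encoding="utf-8") as f:
        return [int(label) for label in json.load(f)]


def group_by_cluster(
    sentences: Sequence[str], labels: Sequence[int]
) -> Dict[int, List[str]]:
    """Map each cluster label to the sentences assigned to it."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    clusters: Dict[int, List[str]] = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        clusters[label].append(sentence)
    return dict(clusters)
```

The length check guards against pairing labels with a sentence file that has since been edited or re-cleaned.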
Other nested output folders:
- tensorboard: for TensorBoard output logs
- kmeans: for clustered sentences with kmeans
- trimmed: for another copy of embeddings/sentences with specified clusters removed
- hierarchy: for clustered sentences with hierarchical kmeans
- Allow ELMo to use GPU for embedding (UPDATE: GPU speedup by as much as 4x)
- Utilize NVIDIA Rapids cuML to GPU-accelerate clustering (by ~10x)
- Preprocess transcriptions
- Embed each sentence with ELMo
- Enhance embeddings with SIF
- Cluster using SKLearn KMeans (optional: hierarchically)
- Find optimal k using elbow method and silhouette scores (optional)
- Reduce dimensionality for visualization (PCA, t-SNE)
- Run in TensorBoard
- Conclusions
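The KMeans clustering and k-selection steps listed above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in main.py: the function names `scan_k` and `best_k_by_silhouette` are hypothetical, and the candidate range of k is chosen by the caller.

```python
from typing import Dict, Iterable, Tuple

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def scan_k(
    embeddings: np.ndarray, k_values: Iterable[int], seed: int = 0
) -> Dict[int, Tuple[float, float]]:
    """Fit KMeans for each k and record (inertia, silhouette score).

    Inertia is the quantity plotted for the elbow method; the silhouette
    score is higher when clusters are compact and well separated.
    """
    results: Dict[int, Tuple[float, float]] = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = km.fit_predict(embeddings)
        results[k] = (float(km.inertia_), float(silhouette_score(embeddings, labels)))
    return results


def best_k_by_silhouette(results: Dict[int, Tuple[float, float]]) -> int:
    """Return the k with the highest silhouette score."""
    return max(results, key=lambda k: results[k][1])
```

With embeddings loaded via `np.load` from one of the .npy outputs, `scan_k(embeddings, range(2, 11))` would give the values needed for both an elbow plot and a silhouette comparison. Each k must be at least 2 and smaller than the number of sentences, since the silhouette score is undefined otherwise.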