Imagine that you are building software for transcribing speech to text. The speech transcription part works perfectly, but it cannot transcribe punctuation. The task is to train a predictive model that ingests a sequence of text and adds punctuation (period, comma, or question mark) in the appropriate locations. This task is important for all downstream data-processing jobs.
Example input:
this is a string of text with no punctuation this is a new sentence
Example output:
this is a string of text with no punctuation <period> this is a new sentence <period>
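One way to frame this is as per-word tagging: every word is labeled with the punctuation token that should follow it, or a "none" label. Below is a minimal Python sketch of that framing; it is only an illustration of the label encoding, not the repo's actual preprocessing, and the <none> label name is my own assumption.

def text_to_examples(annotated_text):
    # Convert text annotated with <period>/<comma>/<question_mark> tokens
    # into (word, label) pairs suitable for a sequence tagger.
    punctuation = {"<period>", "<comma>", "<question_mark>"}
    tokens = annotated_text.split()
    examples = []
    for i, token in enumerate(tokens):
        if token in punctuation:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        label = nxt if nxt in punctuation else "<none>"
        examples.append((token, label))
    return examples

print(text_to_examples(
    "this is a string of text with no punctuation <period> "
    "this is a new sentence <period>"))
# [('this', '<none>'), ..., ('punctuation', '<period>'), ..., ('sentence', '<period>')]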
My solution is largely based on "Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration."
The architecture is defined as follows:
- Obtain word embeddings from GloVe.
- The word embeddings are then processed by densely connected Bi-LSTM layers.
- These Bi-LSTM layers are followed by an RNN with an attention mechanism and a conditional random field (CRF) log-likelihood loss.
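To make the architecture concrete, here is a minimal Keras sketch of the tagging model. It is a simplified stand-in under several assumptions (vocabulary size, embedding dimension, sequence length, two plain Bi-LSTM layers, a per-token softmax); the actual model uses densely connected Bi-LSTM layers, an attention mechanism, and a CRF loss, which are omitted here for brevity.

import tensorflow as tf

VOCAB_SIZE = 50000   # assumed vocabulary size
EMBED_DIM = 100      # assumed GloVe embedding dimension
NUM_TAGS = 4         # <none>, <period>, <comma>, <question_mark>
MAX_LEN = 200        # assumed maximum sequence length

inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
# In the real setup, the embedding matrix would be initialized from GloVe.
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(inputs)
# Two stacked Bi-LSTM layers standing in for the densely connected stack.
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
# Per-token softmax over punctuation tags (the repo uses a CRF layer instead).
outputs = tf.keras.layers.Dense(NUM_TAGS, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()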
The experiments are performed on the IWSLT dataset, which consists of TED Talk transcripts.
The detailed analysis can be found in this notebook.
First step, clone the repo:
git clone https://github.com/k9luo/Punctuation-Restoration.git
Second step, download the pretrained GloVe word embeddings and create a new conda virtual environment with setup.sh, or do these steps manually yourself. Note that running setup.sh will install the GPU version of TensorFlow:
sh setup.sh
Third step, activate the virtual environment:
conda activate restore_punct
Fourth step, add the new virtual environment to Jupyter Notebook:
python -m ipykernel install --user --name=restore_punct
Fifth step, run python main.py.
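Once training finishes, the trained tagger can be applied to new, unpunctuated text. The sketch below is hypothetical: model, word_to_id, and id_to_tag are assumed to come out of the training run and are not part of the repo's documented API.

import numpy as np

def restore_punctuation(text, model, word_to_id, id_to_tag, max_len=200):
    words = text.lower().split()
    ids = [word_to_id.get(w, word_to_id["<unk>"]) for w in words]
    padded = np.zeros((1, max_len), dtype="int32")
    padded[0, :len(ids)] = ids[:max_len]
    # Predict a punctuation tag for every word position.
    probs = model.predict(padded, verbose=0)[0]
    tags = [id_to_tag[int(t)] for t in probs.argmax(axis=-1)][:len(words)]
    restored = []
    for word, tag in zip(words, tags):
        restored.append(word)
        if tag != "<none>":
            restored.append(tag)
    return " ".join(restored)

# Example (assuming the objects above exist):
# restore_punctuation("this is a string of text with no punctuation this is a new sentence",
#                     model, word_to_id, id_to_tag)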