This research is based on the tutorial BERT Fine-Tuning Tutorial with PyTorch.
Under the training-bert folder you can find a Jupyter notebook in which I show how I fine-tuned the base uncased BERT model to classify duplicate questions from the Quora website.
In this research I use BERT with the Hugging Face PyTorch library to fine-tune a model that performs as well as possible on question-pair classification. The app is built using Streamlit.
First, let's talk about the model and the dataset:
Bidirectional Encoder Representations from Transformers (BERT) is a pretrained model released by Google in late 2018 (see original model code here) for NLP (Natural Language Processing) tasks. BERT was originally created by Jacob Devlin and his colleagues at Google and was pre-trained on two corpora: BookCorpus and English Wikipedia.
BERT consists of 12 Transformer encoder layers (or 24 for BERT-large). If you stack Transformer decoder layers instead, you get a GPT-style model that generates sentences.
You can find more information in these videos:
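For a pair task such as this one, the encoder stack receives both questions as a single sequence of the form [CLS] question 1 [SEP] question 2 [SEP], together with segment ids that mark which question each token belongs to. The sketch below illustrates that layout in plain Python. It is only an illustration: the function name `encode_pair` and the `max_len` parameter are hypothetical, and whitespace splitting stands in for the WordPiece tokenizer that BERT actually uses.

```python
from typing import Dict, List


def encode_pair(question1: str, question2: str, max_len: int = 32) -> Dict[str, List]:
    """Pack two questions into one BERT-style input sequence.

    Assumption: whitespace splitting replaces the real WordPiece tokenizer,
    so the tokens here are not the ones a real BERT tokenizer would produce.
    """
    tokens_a = question1.lower().split()
    tokens_b = question2.lower().split()

    # Layout for sentence pairs: [CLS] A [SEP] B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("pair is longer than max_len; truncate the questions first")

    # Segment ids: 0 covers [CLS], question 1 and its [SEP]; 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(tokens)

    # Pad every list to the same fixed length.
    pad = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    token_type_ids = token_type_ids + [0] * pad
    attention_mask = attention_mask + [0] * pad

    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair(
    "How do I learn Python?",
    "What is the best way to learn Python?",
    max_len=20,
)
print(encoded["tokens"])
```

In practice a library tokenizer handles this packing, including truncation, so the notebook does not need to build these lists by hand.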
Transformer Neural Networks - EXPLAINED! (Attention is all you need)
BERT Neural Network - EXPLAINED!
Quora is a question-and-answer website where questions are asked, answered, followed, and edited by Internet users, either factually or in the form of opinions. Quora was co-founded by former Facebook employees Adam D'Angelo and Charlie Cheever in June 2009. The website was made available to the public for the first time on June 21, 2010. Today the website is available in many languages.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question.
The goal is to predict which of the provided question pairs contain two questions with the same meaning. The ground truth is the set of labels supplied by human experts. The dataset itself can be downloaded from Kaggle: here.
See the following video:
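As a rough sketch of what the task looks like as data, the snippet below parses question pairs and their labels from CSV text. The column names `question1`, `question2` and `is_duplicate` are an assumption about the Kaggle file layout, the helper `load_pairs` is hypothetical, and the two rows are made up for illustration rather than taken from the real dataset.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Made-up rows for illustration only; they do not come from the real dataset.
# Assumed column layout: id, qid1, qid2, question1, question2, is_duplicate.
SAMPLE_CSV = """id,qid1,qid2,question1,question2,is_duplicate
0,1,2,How do I learn Python?,What is the best way to learn Python?,1
1,3,4,What is the capital of France?,How do I bake bread?,0
"""


def load_pairs(handle: TextIO) -> List[Tuple[str, str, int]]:
    """Read (question1, question2, label) triples from an open CSV file."""
    reader = csv.DictReader(handle)
    return [
        (row["question1"], row["question2"], int(row["is_duplicate"]))
        for row in reader
    ]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
# Share of pairs labelled as duplicates (label 1 means "same meaning").
duplicate_rate = sum(label for _, _, label in pairs) / len(pairs)
print(f"{len(pairs)} pairs, duplicate rate {duplicate_rate:.2f}")
```

With the real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.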
Clone the repo:
git clone https://github.com/idanmoradarthas/Quora-Questions-Pairs-App.git
cd Quora-Questions-Pairs-App
Go to the training folder, install the requirements, and run the notebook to create the model:
cd training-bert
pip install -r requirements.txt
jupyter notebook
Install the requirements in the main folder:
cd ..
pip install -r requirements.txt
Run Streamlit:
streamlit run app.py