This repository contains the code for our NLP final report, "Prospectus Summarization Using BertSum".
We use BertSum, a deep-learning model, to summarize IPO prospectuses. An IPO prospectus is an important source of information about a company: it covers the business model, competition, risks and opportunities, and financial situation. However, it is usually a very long legal document, and reading it in full is time-consuming. To help readers get a big-picture view of the company, we use BertSum to further condense the summary section of the prospectus. This is especially helpful when you are not familiar with the company or the industry it operates in.
See .\get_prospectus\get_prospectus.py. You need the company name and its CIK (Central Index Key) to get the 424B prospectus from the U.S. Securities and Exchange Commission website. The script typically creates 6 text files:
- raw prospectus
- raw summary
- cleaned prospectus
- cleaned summary
- cleaned prospectus with [CLS][SEP] tokens
- cleaned summary with [CLS][SEP] tokens
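The last two files mark sentence boundaries for BertSum. The sketch below shows one way such a file could be produced from cleaned text. It is an illustration only, not the code in get_prospectus.py: the function name `add_cls_sep`, the regex-based sentence splitter, and the exact separator string `" [CLS] [SEP] "` are assumptions.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?.
# Assumption: good enough for a demo; it will wrongly split on
# abbreviations such as "U.S." that are common in legal text.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def add_cls_sep(text: str) -> str:
    """Return one document on a single line, with sentences
    separated by ' [CLS] [SEP] ' (hypothetical helper)."""
    # Collapse newlines and repeated spaces so the document fits on one line.
    flat = " ".join(text.split())
    sentences = [s for s in SENTENCE_BOUNDARY.split(flat) if s]
    return " [CLS] [SEP] ".join(sentences)


if __name__ == "__main__":
    sample = "We design chips. Our risks include competition.\nWe were founded in 2010."
    print(add_cls_sep(sample))
```

A real pipeline would likely swap the regex for a proper sentence tokenizer, since prospectuses contain many abbreviations and numbered clauses.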
We compare the performance of BertSum with TextRank and TextTeaser. In the ./textrank and ./textteaser folders, you will find the code, raw text, and summarized text. Note that these two models are included only for reference: we want to show how powerful BertSum is compared with these older algorithms.
Model | TextTeaser & TextRank | BertSum |
---|---|---|
Categories | Extractive Only | Extractive & Abstractive |
Unit of Summarization | Sentences | Flexible |
Dealing with Polysemy Problem | No | Yes |
Training series for different cases | No | Yes |
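To make the "Extractive Only" and "Sentences" rows concrete, here is a minimal sketch of the TextRank idea: sentences are graph nodes, edge weights come from word overlap, and a PageRank-style iteration scores each sentence. This is not the code in ./textrank. The function names, the regex tokenizer, and the defaults (`damping=0.85`, `iterations=50`) are assumptions, and the similarity uses unique-word counts as a simplification of the original formula.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break on whitespace after ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word overlap normalised by the log of each sentence's unique-word count."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    if denom <= 0:  # both sentences have a single unique word
        return 0.0
    return len(wa & wb) / denom


def textrank(text, top_n=2, damping=0.85, iterations=50):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric weight matrix with no self-loops.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score,
            # proportional to the weight of the edge j -> i.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    demo = (
        "The company sells cloud software to banks. "
        "The company sells cloud software to hospitals. "
        "Weather was nice yesterday afternoon."
    )
    for sentence in textrank(demo, top_n=2):
        print(sentence)
```

Because the method can only return sentences that already exist in the input, it cannot rephrase or merge them, which is the limitation the table contrasts with BertSum's abstractive mode.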
We borrowed the idea from the papers Fine-tune BERT for Extractive Summarization and Text Summarization with Pretrained Encoders. The author has already provided trained models and preprocessed data, which can be downloaded directly from the author's Google Drive.
Pre-processed data: unzip the zipfile and put all .pt files into ./bert_data
CNN/DM Extractive: unzip the zipfile and put the .pt file into ./models/ext
CNN/DM Abstractive: unzip the zipfile and put the .pt file into ./models/abs
Python 3.6
torch==1.1.0
pytorch_transformers
tensorboardX
multiprocess
pyrouge
python train.py -task ext -mode test_text -test_from MODEL_PATH -text_src RAW_TEXT_PATH -result_path OUTPUT_PATH -log_file LOG_PATH -visible_gpus -1
python train.py -task abs -mode test_text -test_from MODEL_PATH -text_src RAW_TEXT_PATH -result_path OUTPUT_PATH -log_file LOG_PATH -visible_gpus -1
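When summarizing several prospectuses, it can be convenient to assemble these commands from Python. The sketch below is a hypothetical wrapper, not part of the repository: `build_command` and `run_summarizer` are made-up names, and the file paths in the demo are placeholders. Only the flags shown in the two commands above are used.

```python
import subprocess
import sys


def build_command(task, model_path, text_src, result_path, log_file, visible_gpus="-1"):
    """Build the argument list for train.py in test_text mode (hypothetical helper)."""
    if task not in ("ext", "abs"):
        raise ValueError("task must be 'ext' or 'abs'")
    return [
        sys.executable, "train.py",
        "-task", task,
        "-mode", "test_text",
        "-test_from", model_path,
        "-text_src", text_src,
        "-result_path", result_path,
        "-log_file", log_file,
        "-visible_gpus", visible_gpus,  # -1 runs on CPU, as in the commands above
    ]


def run_summarizer(**kwargs):
    """Run train.py; must be called from the directory that contains it."""
    subprocess.run(build_command(**kwargs), check=True)


if __name__ == "__main__":
    # Placeholder paths: replace with your own model, input, and output files.
    for task in ("ext", "abs"):
        cmd = build_command(
            task=task,
            model_path=f"./models/{task}/MODEL.pt",
            text_src="RAW_TEXT_PATH",
            result_path="OUTPUT_PATH",
            log_file="LOG_PATH",
        )
        print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.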