GitHub - guozhonghao1994/Prospectus_Summarization_Using_BertSum: MFIN7036 NLP Course Project

Prospectus Summarization Using BertSum

This code accompanies the NLP final report, Prospectus Summarization Using BertSum.

Goal

We aim to use BertSum, a deep-learning summarization model, to summarize IPO prospectuses. An IPO prospectus is an important source of information: it covers the company's business model, competition, risks and opportunities, and financial situation. However, it is usually a very long legal document, and reading it in full can be very time-consuming. To help readers form a big picture of the company, we use BertSum to further condense the summary section of the prospectus. This is especially helpful when you are not familiar with the company or the industry it operates in.

Web Scraping and Data Cleaning

See .\get_prospectus\get_prospectus.py. You need the company name and its CIK to get the 424B prospectus from the U.S. Securities and Exchange Commission (SEC) website. The script typically creates 6 text files:

raw prospectus
raw summary
cleaned prospectus
cleaned summary
cleaned prospectus with [CLS][SEP] tokens
cleaned summary with [CLS][SEP] tokens
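The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in get_prospectus.py: the function names are hypothetical, the URL is SEC EDGAR's legacy company-browse endpoint, and the ` [CLS] [SEP] ` sentence separator is an assumption about how the tokens are placed. No network request is made here; a real scraper would also need to send a descriptive User-Agent header, as the SEC asks of automated clients.

```python
import re
from urllib.parse import urlencode

# EDGAR's legacy company-browse endpoint (listing filings for one company).
EDGAR_BROWSE = "https://www.sec.gov/cgi-bin/browse-edgar"


def filing_list_url(cik, form_type="424B"):
    """Build the URL of the EDGAR page listing a company's filings of one type."""
    params = {
        "action": "getcompany",
        "CIK": str(cik).zfill(10),  # CIKs are zero-padded to 10 digits
        "type": form_type,
        "owner": "include",
        "count": 40,
    }
    return EDGAR_BROWSE + "?" + urlencode(params)


def clean_text(raw):
    """Collapse runs of whitespace (newlines, tabs, double spaces) into one space."""
    return re.sub(r"\s+", " ", raw).strip()


def add_bert_tokens(text):
    """Join sentences with ' [CLS] [SEP] ' (assumed separator convention).

    Sentences are split naively on ., ! or ? followed by whitespace, which
    will mis-split abbreviations such as 'Inc.' in real prospectus text.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", clean_text(text)) if s]
    return " [CLS] [SEP] ".join(sentences)
```

For example, `add_bert_tokens("We sell software.  Our risks are\nmany.")` returns `"We sell software. [CLS] [SEP] Our risks are many."`.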

TextRank & TextTeaser

We compare the performance of BertSum with TextRank and TextTeaser. In the ./textrank and ./textteaser folders, you will find the code, raw text and summarized text. Note that these two models are for reference only: we would like to show how powerful BertSum is compared with older algorithms.

Model	TextTeaser & TextRank	BertSum
Category	Extractive only	Extractive & abstractive
Unit of summarization	Sentences	Flexible
Handles polysemy	No	Yes
Can be trained for different cases	No	Yes
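To make the "extractive only, sentence-level" row concrete, here is a minimal sketch of the graph-ranking idea behind TextRank. It is an illustration under simplifying assumptions (regex sentence splitting, word-overlap similarity, a fixed number of PageRank iterations) and is not the code in ./textrank; all function names are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity normalised by the log of each sentence's size."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    """Return the top_n highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Weighted, undirected sentence graph (no self-loops).
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Weighted PageRank update: each sentence receives score from its
        # neighbours in proportion to the edge weight.
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```

Because the output can only be sentences copied from the input, a method like this cannot rephrase or merge content, which is the main contrast with BertSum's abstractive mode.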

Preparation

We borrowed the ideas from the papers Fine-tune BERT for Extractive Summarization and Text Summarization with Pretrained Encoders. The author has already released trained models and preprocessed data, which can be downloaded directly from the author's Google Drive.

Download the processed data

Pre-processed data: unzip the zip file and put all .pt files into ./bert_data

Trained models

CNN/DM Extractive: unzip the zip file and put the .pt file into ./models/ext

CNN/DM Abstractive: unzip the zip file and put the .pt file into ./models/abs
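After unzipping, a quick check like the one below can confirm that each folder actually contains at least one .pt file. This is a hypothetical helper, not part of the repository; the folder names are taken from the instructions above, and it assumes the check is run from the repository root.

```python
from pathlib import Path

# Folders that should each contain at least one .pt file after the downloads.
EXPECTED_DIRS = ("bert_data", "models/ext", "models/abs")


def missing_checkpoints(root="."):
    """Return the expected folders under `root` that contain no .pt file."""
    root = Path(root)
    missing = []
    for rel in EXPECTED_DIRS:
        folder = root / rel
        if not folder.is_dir() or not any(folder.glob("*.pt")):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_checkpoints():
        print(f"No .pt file found in ./{rel}")
```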

Package Requirements

Python 3.6

torch==1.1.0 pytorch_transformers tensorboardX multiprocess pyrouge
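To see which of these packages are importable in the current environment, a small check along these lines may help. It is a sketch rather than part of the repository: it only tests that each top-level module can be found, not that the installed version matches (for example torch==1.1.0).

```python
import importlib.util
import sys

# Import names of the packages listed above.
REQUIRED = ("torch", "pytorch_transformers", "tensorboardX", "multiprocess", "pyrouge")


def missing_packages(names=REQUIRED):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name in missing_packages():
        print(f"Missing package: {name}")
```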

Usage

Extractive

python train.py -task ext -mode test_text -test_from MODEL_PATH -text_src RAW_TEXT_PATH -result_path OUTPUT_PATH -log_file LOG_PATH -visible_gpus -1

Abstractive

python train.py -task abs -mode test_text -test_from MODEL_PATH -text_src RAW_TEXT_PATH -result_path OUTPUT_PATH -log_file LOG_PATH -visible_gpus -1
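The two commands differ only in the -task value, so they can be driven from one small Python wrapper. The sketch below uses only the flags shown above; the function names, the `cwd` parameter and any paths passed in are hypothetical, and it assumes train.py is run from the directory that contains it.

```python
import subprocess
import sys


def build_command(task, model_path, text_src, result_path, log_file, visible_gpus=-1):
    """Build the train.py argument list for summarizing raw text.

    task is "ext" (extractive) or "abs" (abstractive); visible_gpus=-1
    matches the commands above.
    """
    if task not in ("ext", "abs"):
        raise ValueError("task must be 'ext' or 'abs'")
    return [
        sys.executable, "train.py",
        "-task", task,
        "-mode", "test_text",
        "-test_from", model_path,
        "-text_src", text_src,
        "-result_path", result_path,
        "-log_file", log_file,
        "-visible_gpus", str(visible_gpus),
    ]


def summarize(task, model_path, text_src, result_path, log_file, cwd=None):
    """Run train.py; `cwd` should be the directory that contains train.py."""
    cmd = build_command(task, model_path, text_src, result_path, log_file)
    subprocess.run(cmd, cwd=cwd, check=True)  # raises CalledProcessError on failure
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems when a path contains spaces.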

