Chapterize by Jonathan Reeve is a command-line tool that splits English plain-text e-books from Project Gutenberg into chapters, removing both the chapter headings and any text that does not fall between headings.
It-Chapterize is an adaptation of Chapterize for the Italian language with additional minor changes concerning the output.
- All regular expressions were modified to detect the most likely Italian chapter headings
- Chapter headings are included at the beginning of each extracted chapter
- The value of the delta variable, used to remove chapter headings that are likely to be part of a table of contents, was increased
- An additional function removes short detected chapters, which are likely to be false positives or spurious text
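The changes listed above can be illustrated with a minimal Python sketch. This is not the code from `itchapterize.py`: the pattern `HEADING_RE`, the function names, the default `delta=8`, and the `min_lines` threshold are all assumptions chosen for illustration. It only shows the general idea of matching Italian headings, discarding headings that sit too close together (as in a table of contents), keeping each heading at the top of its chapter, and dropping chapters that are too short.

```python
import re

# Hypothetical pattern: "Capitolo" or "Cap." followed by a Roman or Arabic numeral.
# The real tool uses its own, broader set of regular expressions.
HEADING_RE = re.compile(r"^\s*(capitolo|cap\.)\s+([ivxlcdm]+|\d+)\b", re.IGNORECASE)


def find_headings(lines):
    """Return the indices of lines that look like chapter headings."""
    return [i for i, line in enumerate(lines) if HEADING_RE.match(line)]


def drop_toc_headings(heading_lines, delta=8):
    """Discard headings closer than `delta` lines to a neighbouring heading.

    Headings packed tightly together are assumed to belong to a table of contents.
    """
    too_close = set()
    for a, b in zip(heading_lines, heading_lines[1:]):
        if b - a < delta:
            too_close.update((a, b))
    return [h for h in heading_lines if h not in too_close]


def split_chapters(lines, heading_lines, min_lines=5):
    """Split at headings (heading kept as first line) and drop short chapters."""
    bounds = heading_lines + [len(lines)]
    chapters = [lines[start:end] for start, end in zip(bounds, bounds[1:])]
    return [chapter for chapter in chapters if len(chapter) >= min_lines]
```

A larger `delta` removes more table-of-contents entries but risks discarding genuine headings of very short chapters, so both thresholds are trade-offs rather than fixed rules.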
# Clone the repository
git clone https://github.com/GiuseppeDellaCorte/It-Chapterize.git
# Grab a copy of "I tre Moschettieri - Volume 1" from Project Gutenberg:
wget https://www.gutenberg.org/files/60641/60641-0.txt
# Run It-Chapterize on it:
python /path-to/itchapterize/itchapterize.py /path-to/60641-0.txt
It will create a new directory named 60641-0.txt-chapters in the current working directory, containing files 01.txt through 16.txt.
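To check the result quickly, the chapter files can be listed together with their word counts. The helper below is a sketch and not part of It-Chapterize: the function name `chapter_word_counts` is hypothetical, and it assumes the output files are UTF-8 encoded and end in `.txt`.

```python
from pathlib import Path


def chapter_word_counts(directory):
    """Return (file name, word count) pairs for the chapter files, sorted by name."""
    files = sorted(Path(directory).glob("*.txt"))
    # Assumption: the chapter files are UTF-8 text.
    return [(f.name, len(f.read_text(encoding="utf-8").split())) for f in files]
```

Calling `chapter_word_counts("60641-0.txt-chapters")` after the run above would return one pair per chapter file; unusually small counts are a hint that a file may hold spurious text rather than a real chapter.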
It-Chapterize has been tested on only a small set of Italian e-books, so it does not handle many possible Italian chapter headings.
It-Chapterize has been tested successfully on these Italian Project Gutenberg files:
- I tre moschettieri, vol. I
- I tre moschettieri, vol. II
- I tre moschettieri, vol. III
- Le avventure d'Alice nel paese delle meraviglie
- L'arte di far debiti
- Una sfida al Polo
It-Chapterize has also been tested on the Project Gutenberg files listed after this paragraph. It worked relatively well on them, but not perfectly: the output files include one or two false-positive chapters. In addition, for a few of them, spurious text is sometimes included, usually in the first or last extracted chapter. Manual correction of these errors takes around 1 to 2 minutes per parsed file.