This project was developed to train a neural network to detect tables inside documents. Once the model is trained, it can be used in IntelligentOCR to detect tables.
The project uses a pre-trained neural network offered by TensorFlow, together with a config file chosen according to the pre-trained model, to train with the TensorFlow Object Detection API.
Before we go on make sure you have everything installed to be able to use the project:
- Python 3
- Tensorflow (tested on r1.8)
- Its Object Detection API (remember to install the COCO API; if you are on Windows, see the note at the bottom of this README)
- Pillow
- opencv-python
- pandas
- pyprind
Notice: to complete the installation of the COCO API, you will have to clone the tensorflow/models repository. This will also be useful later to train your model, since all the training files lie in tensorflow/models.
If you missed any package, be safe with pip install -r requirements.txt
The dataset used is TableBank. Have a read at the paper if you want more information.
In our case, only a subset of the dataset was used: the one contained in the folder TableBank/TableBank_data/Detection_data, which has proper annotations. The other pictures, from the Recognition_data folder, are intended for understanding the structure of a table, like the position of headers and cells, so they are not used here.
Notice: if you wish, you can pick the .json labels and use those instead.
The original dataset comes as one big .json file. That was transformed into multiple .xml files for convenience. The annotations have the following format:
<?xml version='1.0' encoding='UTF-8'?>
<document filename="1401.0007_15">
  <tableRegion>
    <Coords points="85,396 510,396 85,495 510,495" />
  </tableRegion>
  <tableRegion>
    <Coords points="50,400 510,400 50,495 510,495" />
  </tableRegion>
</document>
The coordinates of the points are in this order:
x0,y0 x1,y0 x0,y1 x1,y1
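To make the format concrete, here is a minimal sketch (not part of the project's scripts; the file name is just the example above) of parsing one annotation into (xmin, ymin, xmax, ymax) boxes with Python's standard library:

import xml.etree.ElementTree as ET

tree = ET.parse('1401.0007_15.xml')  # hypothetical file holding the example above
for region in tree.getroot().iter('tableRegion'):
    points = region.find('Coords').get('points').split()
    # points come as "x0,y0 x1,y0 x0,y1 x1,y1"
    xs = [int(p.split(',')[0]) for p in points]
    ys = [int(p.split(',')[1]) for p in points]
    print(min(xs), min(ys), max(xs), max(ys))  # e.g. 85 396 510 495 for the first table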
The project hosted here contains the annotations, but not the pictures. The TableBank dataset is very big (around 25 GB), so it can't be stored here. So, to make it work, you have to load the correct pictures. How?
First get the dataset from the TableBank guys here.
Now, from the root folder of the TableBank archive, go to TableBank_data/Detection_data, take the images from both the data and latex folders, and put them in this project at the path dataset/Images.
Images in, you're ready to train!
There are also other datasets around, smaller but easier to get. You can use any dataset you want and train your model with it. Just put the pictures (.jpg) and annotations (.xml, see the format above) in the folders:
dataset/Annotations
dataset/Images
Some datasets, and ways to gather them:
- ICDAR2019
- UNLV dataset
- UNLV ground truth
- Marmot Dataset
- kaggle
- digitisation.eu
- IUPR
- NIST Datasets
- PRImA Datasets
- IUPR Dataset: camera captured document images
The project is made up of different parts that act together as a pipeline.
I have prepared two "constants" files: dataset_costants.py and inference_constants.py. The first contains all the constants useful to create the dataset, the second those needed to make inference with the frozen graph. If you just want to run the project, you should modify only those two files.
Since colors are not useful for table detection, we can convert all the images to 8-bit single-channel .jpeg images. This transformation is still under testing.
Use python dataset/img_to_jpeg.py after setting dataset_costants.py:
- DPI_EXTRACTION: output quality of the images;
- PATH_TO_IMAGES: path/to/dataset/images;
- IMAGES_EXTENSION: extension of the extracted images. The only one tested is .jpeg.
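As a rough idea of what the script does (a minimal sketch, not the actual img_to_jpeg.py; paths are placeholders), the conversion can be done with Pillow:

import os
from PIL import Image

IMAGES_PATH = 'dataset/Images'  # placeholder for PATH_TO_IMAGES

for name in os.listdir(IMAGES_PATH):
    path = os.path.join(IMAGES_PATH, name)
    # 'L' mode is 8-bit single-channel (grayscale)
    Image.open(path).convert('L').save(os.path.splitext(path)[0] + '.jpeg', 'JPEG')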
TensorFlow, instead, can build its own TFRecord files from csv information, so we need to convert the .xml files into .csv ones.
Use python dataset/generate_database_csv.py to do this conversion after setting dataset_costants.py:
- TRAIN_CSV_NAME: name for the .csv train output file;
- TEST_CSV_NAME: name for the .csv test output file;
- TRAIN_CSV_TO_PATH: folder path for TRAIN_CSV_NAME;
- TEST_CSV_TO_PATH: folder path for TEST_CSV_NAME;
- ANNOTATIONS_EXTENSION: extension of the annotations. In our case it's .xml;
- TRAINING_PERCENTAGE: percentage of images for training;
- TEST_PERCENTAGE: percentage of images for testing;
- TABLE_DICT: dictionary for data labels. For this project there is no reason to change it;
- MIN_WIDTH_BOX, MIN_HEIGHT_BOX: minimum dimensions to consider a box valid. Some networks don't digest little boxes well, so I put in this check.
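Roughly, the conversion writes one csv row per table box (a sketch only; the real generate_database_csv.py also handles the train/test split and the minimum-box check, and the column set assumed here is the usual one for object detection csv files):

import csv
import xml.etree.ElementTree as ET

with open('train_jpeg.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax'])
    tree = ET.parse('dataset/Annotations/1401.0007_15.xml')
    for region in tree.getroot().iter('tableRegion'):
        points = region.find('Coords').get('points').split()
        xs = [int(p.split(',')[0]) for p in points]
        ys = [int(p.split(',')[1]) for p in points]
        # 596x842 is a placeholder; the real script reads sizes from the images
        writer.writerow(['1401.0007_15.jpeg', 596, 842, 'table',
                         min(xs), min(ys), max(xs), max(ys)])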
.csv files and images are ready: now we need to create our TFRecord files to feed TensorFlow.
Use python generate_tf_records.py to create the train and test .record files that we will need later. No need to configure dataset_costants.py.
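For reference, this is the general shape of how a csv row becomes a record (a minimal, hypothetical sketch; the real generate_tf_records.py fills in the full set of features the Object Detection API expects, such as the encoded image bytes):

import tensorflow as tf

def to_tf_example(row):
    # row: dict with filename, width, height, xmin, ymin, xmax, ymax.
    # Coordinates are normalized to [0, 1], as the Object Detection API expects.
    feature = {
        'image/filename': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[row['filename'].encode('utf8')])),
        'image/object/bbox/xmin': tf.train.Feature(
            float_list=tf.train.FloatList(value=[row['xmin'] / row['width']])),
        'image/object/bbox/xmax': tf.train.Feature(
            float_list=tf.train.FloatList(value=[row['xmax'] / row['width']])),
        'image/object/bbox/ymin': tf.train.Feature(
            float_list=tf.train.FloatList(value=[row['ymin'] / row['height']])),
        'image/object/bbox/ymax': tf.train.Feature(
            float_list=tf.train.FloatList(value=[row['ymax'] / row['height']])),
        'image/object/class/label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[1])),  # our only class: table
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

writer = tf.python_io.TFRecordWriter('train_jpeg.record')  # TF r1.x API
# for each csv row: writer.write(to_tf_example(row).SerializeToString())
writer.close()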
This part can be a little tricky. Let's divide it into steps.
Look in the Model Zoo, where there are pre-trained models for you to use. All the info about how to configure and start a model can be found here.
First of all, choose a model, download it and save it in the folder trained_nets, where your pre-trained models will be kept.
Go to trained_models and create a new folder with the name you prefer. Create two files inside it:
- command.txt: here you will write the commands useful to start the training and export the graph;
- myconfig.config: this is a config file. You can find some examples here. Be careful to choose the correct config for your network. There are a few fields that you must set in it (see the sketch after this list):
  - fine_tune_checkpoint: the path to the model.ckpt file of the frozen graph;
  - input_path (x2): paths to the record files train_jpeg.record and test_jpeg.record we made before;
  - label_map_path (x2): path to the labels contained in the file object-detection.pbtxt.
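For orientation, the relevant fields look roughly like this inside the config (a sketch with placeholder paths; the rest of the file depends on the network you chose):

train_config {
  fine_tune_checkpoint: "trained_nets/your_model/model.ckpt"
}
train_input_reader {
  tf_record_input_reader {
    input_path: "path/to/train_jpeg.record"
  }
  label_map_path: "path/to/object-detection.pbtxt"
}
eval_input_reader {
  tf_record_input_reader {
    input_path: "path/to/test_jpeg.record"
  }
  label_map_path: "path/to/object-detection.pbtxt"
}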
Other parameters can be tuned too. It's not very clear how to do it; you can find a little info here. Once you've done your config, you can go to step 3.
Write this command in the command.txt file to reuse it.
Example of the command to launch from tensorflow/models/research/object_detection:
python model_main.py \
--pipeline_config_path="path/to/your/myconfig.config" \
--model_dir="path where to save your model; it's nice if it is the same folder as myconfig.config" \
--num_train_steps=10000 \
--num_eval_steps=500 \
--alsologtostderr
Other options are inside tensorflow/models/research/object_detection/model_main.py.
Notice: the paths must be your local paths.
Notice: if you have already trained this network, num_train_steps must be greater than the number of steps the network has already been trained for.
Monitor the training with TensorBoard to check how it is going.
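For example (a standard TensorBoard invocation; the logdir is a placeholder and should point at your model_dir):

tensorboard --logdir="path/to/your/model_dir"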
Use this command from tensorflow/models/research/object_detection:
python export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=path/to/automatically/created/pipeline.config \
--trained_checkpoint_prefix=path/to/last/model.ckpt-xxx \
--output_directory=path/to/output/dir
Now that you have your graph you can try it out:
Run inference_with_net.py after setting inference_costants.py:
- PATH_TO_GRAPH = os.path.abspath('trained_models/adam2_tb/frozen_adam2_tb/frozen_inference_graph.pb');
- TEST_SCORES: it's already set;
- NUM_CLASSES: the number of classes to detect. In our case it's 1; it's already set;
- PATH_TO_LABELS: path to the .pbtxt label file;
- MAX_NUM_BOXES: max number of boxes to be considered;
- PATH_TO_TEST_IMAGES_INPUT_FOLDER: input images for inference are taken from here;
- PATH_TO_TEST_IMAGES_OUTPUT_FOLDER: output images go here.
The output will be a set of images with boxes drawn on them representing the inferred tables, one image for each score. In addition, it will produce a "merged" version of the boxes, in which all the best vertically overlapping boxes are merged together to gain accuracy. TEST_SCORES is a list of numbers that tells the program which scores must be merged together. The procedure is better described in inference_with_net.py (see the sketch below).
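To give an idea of the merging step (a simplified sketch; the actual logic lives in inference_with_net.py and may differ), two boxes whose vertical ranges overlap can be collapsed into their union:

def vertically_overlap(a, b):
    # Boxes are (xmin, ymin, xmax, ymax); overlap when the y-ranges intersect.
    return a[1] <= b[3] and b[1] <= a[3]

def merge(a, b):
    # Union of the two boxes.
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

boxes = [(85, 396, 510, 495), (50, 400, 510, 495)]
merged = [boxes[0]]
for box in boxes[1:]:
    if vertically_overlap(merged[-1], box):
        merged[-1] = merge(merged[-1], box)
    else:
        merged.append(box)
print(merged)  # [(50, 396, 510, 495)]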
For every execution, a .log file will be produced and put in logs.
If you are on Windows: this comment will probably solve your problem. This clone provides a working source for the COCO API on Windows with Python 3.
Metrics of the best network trained with TableTrainNet: Adam2_tb. In orange, the training set; in blue, the eval set.