Extraction skipping every alternate line and also not extracting headers #4017

chelsud123 · 2024-11-04T14:31:02Z

Description of the bug

In the attached PDF, I am trying to extract the last table. PyMuPDF is extracting it, but skipping every alternate line and is also not extracting the column headers. The following items are attached:

Problem_document.pdf : The PDF from which the table has been extracted
Table_error.png : The erroneous extraction

How to reproduce the bug

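import pandas as pd
import pymupdf
from IPython.display import display

doc = pymupdf.open('Problem_document.pdf')
idx = 0  # page index; not stated in the original report, adjust as needed
page = doc[idx]
tabs = page.find_tables(add_lines=None)  # add_lines=None is the default
print(f"{len(tabs.tables)} table(s) found on page {idx}")
for tab in tabs.tables:
    # extract() returns the table as a list of rows (lists of cell strings)
    df_test = pd.DataFrame(tab.extract())
    display(df_test)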

PyMuPDF version

1.24.13

Operating system

Windows

Python version

3.9


JorjMcKie · 2024-12-01T07:21:22Z

Sorry for the delay.
We have tried various settings and concluded that the table finder cannot correctly detect either the overall arrangement of tables on this page or the structure of each individual table.
You will have to develop your own code, which could for instance supply additional information via the add_lines parameter to help the table finder.
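
As a rough illustration of the add_lines idea mentioned above, the sketch below derives one horizontal "virtual" line between each pair of consecutive text rows and passes those lines to the table finder. The helper names `row_separator_lines` and `find_tables_with_row_lines`, the `tol` value, and the `clip` rectangle are assumptions made up for this example; they are not part of PyMuPDF. It has not been run against the attached PDF, so it may or may not help with that particular page.

```python
def row_separator_lines(words, x0, x1, tol=3.0):
    """Return horizontal lines ((x0, y), (x1, y)) midway between text rows.

    `words` is any iterable of tuples whose first four items are
    (wx0, wy0, wx1, wy1), e.g. the output of page.get_text("words").
    Hypothetical helper, not a PyMuPDF function.
    """
    # Collect the vertical extent of every word, sorted top to bottom.
    spans = sorted((w[1], w[3]) for w in words)
    rows = []  # each entry is [top, bottom] of one text row
    for top, bottom in spans:
        if rows and top < rows[-1][1] - tol:
            # Overlaps the current row vertically: same row.
            rows[-1][1] = max(rows[-1][1], bottom)
        else:
            rows.append([top, bottom])
    lines = []
    for (_, prev_bottom), (next_top, _) in zip(rows, rows[1:]):
        y = (prev_bottom + next_top) / 2  # midpoint of the gap between rows
        lines.append(((x0, y), (x1, y)))
    return lines


def find_tables_with_row_lines(page, clip):
    """Run the table finder on `clip`, adding one virtual line per row gap.

    `page` is a pymupdf.Page; `clip` is a rect-like (x0, y0, x1, y1) that
    you choose yourself so that it covers only the table of interest.
    """
    words = page.get_text("words", clip=clip)
    lines = row_separator_lines(words, clip[0], clip[2])
    return page.find_tables(clip=clip, add_lines=lines)


# Possible usage (page index and clip rectangle are placeholders):
# import pymupdf
# doc = pymupdf.open("Problem_document.pdf")
# tabs = find_tables_with_row_lines(doc[0], (36, 400, 560, 760))
# for tab in tabs.tables:
#     print(tab.extract())
```

Restricting the search to a clip rectangle keeps text outside the table from producing spurious row lines. Column headers that still go undetected may need their own separator line, added in the same way.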

JorjMcKie added the wontfix (no intention to resolve) label Dec 1, 2024

JorjMcKie closed this as completed Dec 1, 2024
