
Extraction of text stops in the middle while working fine with PyMuPDF #191

Open
sebastiaanvduijn opened this issue Nov 21, 2024 · 1 comment


@sebastiaanvduijn

As an example, converting this PDF to Markdown https://cache.industry.siemens.com/dl/files/702/109768702/att_998757/v4/109768702_UserAdministration_WinCC_V7.5_en.pdf results in:

4.2.1 Configuration of Access Protection

The following section describes how to configure the access protection of a button.

  1. Open the WinCC "Graphics Designer" with a double click on the entry in the

project directory.

Figure 4-16

**2. The "Graphics Designer" opens with an empty image. Drag and drop a "button"

from the standard library into the image (1) and assign a suitable name (2).**

Note You can link different objects that have access protection with different
permissions. You can only assign one permission per object.

User Administration WinCC V7 5


In step 2, extraction stops in the middle of the block. The highlighted block should read as follows instead:

  2. The "Graphics Designer" opens with an empty image. Drag and drop a "button"
    from the standard library into the image (1) and assign a suitable name (2).
    Click on the "Authorizations" button (3). The window with the available
    permissions of the project opens. Select the "User Administration" permission
    (4). Confirm the selection of permissions with the "OK" button (5). Confirm the
    configuration of the button with the "OK" button (6).

This happens with multiple PDFs: the extracted data is incomplete, while text extraction of the same files with PyMuPDF works fine.
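To make the gap easy to check on other files, the two outputs can be compared sentence by sentence. The following is only a minimal sketch: `normalize` and `missing_sentences` are hypothetical helper names, and the two strings are copied from the excerpts above. In practice the reference text might come from PyMuPDF's `page.get_text()` and the candidate from the Markdown conversion.

```python
import re


def normalize(text: str) -> str:
    """Drop Markdown emphasis markers and collapse all whitespace runs."""
    text = text.replace("*", "")
    return re.sub(r"\s+", " ", text).strip()


def missing_sentences(reference: str, candidate: str) -> list[str]:
    """Return the sentences of `reference` that do not occur in `candidate`."""
    ref = normalize(reference)
    cand = normalize(candidate)
    # Naive sentence split: break after every period followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", ref) if s.strip()]
    return [s for s in sentences if s not in cand]


# Expected text (first four sentences of the block quoted above).
reference = (
    'The "Graphics Designer" opens with an empty image. Drag and drop a "button"\n'
    "from the standard library into the image (1) and assign a suitable name (2).\n"
    'Click on the "Authorizations" button (3). The window with the available\n'
    "permissions of the project opens."
)

# Truncated Markdown output as reported above.
candidate = (
    '**2. The "Graphics Designer" opens with an empty image. Drag and drop a "button"\n\n'
    "from the standard library into the image (1) and assign a suitable name (2).**"
)

missing = missing_sentences(reference, candidate)
for sentence in missing:
    print("missing:", sentence)
```

The sentence split is deliberately crude; it is enough to show which trailing sentences of a block were dropped.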

@sebastiaanvduijn

I have been trying to debug where this comes from. It seems the missing text is judged to be outside the clip rectangle, because the position test uses the middle point of the span. The code below works better, but now not all whitespace is stripped out. Hope this helps:

            for sno, s in enumerate(line["spans"]):  # the numbered spans
                sbbox = pymupdf.Rect(s["bbox"])  # span bbox as a Rect
                mpoint = (sbbox.tl + sbbox.br) / 2 # middle point

                margin = 0  # with 0 the clip expansion below is a no-op; raise it to grow the clip
                if not sbbox.intersects(clip):
                    # expand clip if the span is near the edge
                    if sbbox.x0 < clip.x0 or sbbox.x1 > clip.x1 or sbbox.y0 < clip.y0 or sbbox.y1 > clip.y1:
                        clip = pymupdf.Rect(clip.x0 - margin, clip.y0 - margin, clip.x1 + margin, clip.y1 + margin)
                    else:
                        print(f"Span {s['text']} skipped due to position")
                        continue
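The difference between the two tests can be illustrated without PyMuPDF. This is a sketch under assumptions: `Box`, `midpoint_inside`, and `overlaps` are hypothetical names, the coordinates are made up, and it only models the idea that a span is kept when its middle point lies inside the clip, as the comment above suggests. It is not the library's actual code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def midpoint_inside(span: Box, clip: Box) -> bool:
    """Keep the span only if its middle point lies inside the clip."""
    mx = (span.x0 + span.x1) / 2
    my = (span.y0 + span.y1) / 2
    return clip.x0 <= mx <= clip.x1 and clip.y0 <= my <= clip.y1


def overlaps(span: Box, clip: Box) -> bool:
    """Keep the span if it shares any area with the clip."""
    return (
        span.x0 < clip.x1 and span.x1 > clip.x0
        and span.y0 < clip.y1 and span.y1 > clip.y0
    )


clip = Box(50, 100, 300, 200)

# A tall span that starts inside the clip but extends below it:
# its middle point (y = 210) falls outside, so the midpoint test drops it.
straddling = Box(60, 190, 290, 230)

# A span far below the clip: both tests reject it.
far_away = Box(60, 300, 290, 320)

print(midpoint_inside(straddling, clip), overlaps(straddling, clip))
print(midpoint_inside(far_away, clip), overlaps(far_away, clip))
```

A span that straddles the clip border is dropped by the midpoint test but kept by the overlap test, which would match the symptom of a text block being cut off partway through.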
