Extraction of text stops in the middle while working fine with PyMuPDF #191

sebastiaanvduijn · 2024-11-21T15:38:48Z

As an example, converting this PDF to Markdown https://cache.industry.siemens.com/dl/files/702/109768702/att_998757/v4/109768702_UserAdministration_WinCC_V7.5_en.pdf results in:

4.2.1 Configuration of Access Protection

The following section describes how to configure the access protection of a button.

Open the WinCC "Graphics Designer" with a double click on the entry in the

project directory.

Figure 4-16

**2. The "Graphics Designer" opens with an empty image. Drag and drop a "button"

from the standard library into the image (1) and assign a suitable name (2).**

Note You can link different objects that have access protection with different
permissions. You can only assign one permission per object.

User Administration WinCC V7 5

In section 2, extraction stops in the middle of the block. The highlighted block should read as follows instead:

The "Graphics Designer" opens with an empty image. Drag and drop a "button"
from the standard library into the image (1) and assign a suitable name (2).
Click on the "Authorizations" button (3). The window with the available
permissions of the project opens. Select the "User Administration" permission
(4). Confirm the selection of permissions with the "OK" button (5). Confirm the
configuration of the button with the "OK" button (6).

This happens with multiple PDFs: the extracted data is incomplete, while text extraction of the same files works with PyMuPDF.

sebastiaanvduijn · 2024-11-21T22:31:04Z

I have been trying to debug where this comes from. It seems the missing text is treated as being in the wrong position because the check against the clip uses the span's middle point. The code below works better, but now not all whitespace is stripped out. Hope this helps:

            for sno, s in enumerate(line["spans"]):  # the numbered spans
                sbbox = pymupdf.Rect(s["bbox"])  # span bbox as a Rect
                mpoint = (sbbox.tl + sbbox.br) / 2  # middle point

                margin = 0  # with 0, the expansion below leaves clip unchanged
                if not sbbox.intersects(clip):
                    # expand clip if the span is near the edge
                    if (
                        sbbox.x0 < clip.x0
                        or sbbox.x1 > clip.x1
                        or sbbox.y0 < clip.y0
                        or sbbox.y1 > clip.y1
                    ):
                        clip = pymupdf.Rect(clip.x0 - margin, clip.y0 - margin, clip.x1 + margin, clip.y1 + margin)
                    else:
                        print(f"Span {s['text']} skipped due to position")
                        continue
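
To illustrate why a middle-point check can drop text that an overlap check would keep, here is a minimal sketch. It assumes the original check tests whether the span's middle point lies inside the clip, which I have not confirmed against the library source. Rectangles are plain `(x0, y0, x1, y1)` tuples instead of `pymupdf.Rect`, and the helper names `midpoint_in_clip` and `overlaps_clip` are hypothetical, not part of any library.

```python
# Rectangles are (x0, y0, x1, y1) tuples. All names here are hypothetical
# and only illustrate the two kinds of position check.

def midpoint_in_clip(span, clip):
    """Return True if the centre of `span` lies inside `clip`."""
    mx = (span[0] + span[2]) / 2
    my = (span[1] + span[3]) / 2
    return clip[0] <= mx <= clip[2] and clip[1] <= my <= clip[3]


def overlaps_clip(span, clip):
    """Return True if `span` and `clip` share a non-empty area."""
    return (
        span[0] < clip[2]
        and span[2] > clip[0]
        and span[1] < clip[3]
        and span[3] > clip[1]
    )


clip = (0, 0, 100, 50)

# A span that starts inside the clip but hangs over its bottom edge:
# its centre (y = 52) is already below the clip.
overhanging_span = (10, 40, 90, 64)

# A span that lies entirely below the clip.
outside_span = (10, 60, 90, 80)

print(midpoint_in_clip(overhanging_span, clip), overlaps_clip(overhanging_span, clip))
print(midpoint_in_clip(outside_span, clip), overlaps_clip(outside_span, clip))
```

The overhanging span fails the middle-point test but passes the overlap test, which would match the symptom of a block being cut off partway. The span that lies entirely outside fails both.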
