Page.delete_widget() doesn't fully remove the widget, other programs still detect the widgets #3478
Comments
This is not a bug!
import os
import pymupdf as fitz

OLD_FILE = "exported_from_libreoffice.pdf"
NEW_FILE = "deleted_signatures2.pdf"
if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)
_fdoc = fitz.open(OLD_FILE)
# iterate the pages
for page in _fdoc:
    # iterate the fields on this page
    # BUT: use widget xrefs for iteration!!!
    xrefs = [w.xref for w in page.widgets()]
    for xref in xrefs:
        field = page.load_widget(xref)
        n = field.field_name
        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            page.delete_widget(field)
# save the document updates
_fdoc.ez_save(NEW_FILE)
_fdoc.close()
# now re-opening the document to check if the fields I removed are still there or not according to PyMuPDF
_check_doc = fitz.open(NEW_FILE)
# iterate the pages again
for page in _check_doc:
    # iterate the fields on this page
    for field in page.widgets():
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            print(f"...'{n}' is still present")
_check_doc.close()
Another safe way of iteration:
field = page.first_widget
while field:
    n = field.field_name
    if n.startswith("sig") or n.startswith("init"):
        # delete_widget() returns the next widget on the page
        field = page.delete_widget(field)
    else:
        field = field.next
Unfortunately, both of the presented solutions give me the same end result as my original bug post. Even when I use the xrefs to iterate as you suggested in the modification, the fields that I mark for deletion are still present in some way in the file. The eSign platforms and pyHanko are still able to find them and try to utilize them as form fields, even when Acrobat doesn't display them. Is there something extra I can do while saving the file to make sure that any lingering references are removed from the xref table?
But the file check iteration shows that they are gone!
I see that the result PDF has some yellow and red rectangle graphics where there were fields before. These have nothing to do with widgets / fields - they are just vector graphics.
The graphic boxes are meant to stay; it's only the widgets/fields that I am trying to remove. I understand that PyMuPDF says they are gone. I'm just confused: if they are gone, how are the other programs able to see the removed widgets and their names/dimensions? In pyHanko, I can still retrieve that info even if PyMuPDF says they are not there, and the eSign platforms do the same. I must be misunderstanding something, or there is something off about the way LibreOffice is setting up the widgets in the first place. I'm not sure how to troubleshoot that at the moment.
Attached my output |
First, my apologies for not doing this initially. There might be a slight language barrier and I didn't give the most complete example that I could have given. I just tried your PDF and I get the same results. When I upload to an eSign platform, the removed fields are detected during the import. When I attempt to add signatures in those same locations with pyHanko, it also complains that the fields already exist in the PDF.
Here is my complete testing script for this that prints out some data along the way. This is a watered down version of what my tool is doing. First it stores the coords of signature/initial fields, deletes them, then uses pyHanko to add proper signature fields.
requirements.txt
test.py
import fitz, os
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
from pyhanko.sign.fields import SigFieldSpec, append_signature_field

OLD_FILE = "exported_from_libreoffice.pdf"
NEW_FILE = "deleted_signatures.pdf"
if os.path.exists(NEW_FILE):
    os.remove(NEW_FILE)
boxes = {}
doc = fitz.open(OLD_FILE)
# iterate the pages
print("Removing fields with PyMuPDF")
for page in doc:
    # store the page's height for placement
    _page_rect = page.bound()
    _page_height = _page_rect.y1
    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, remove it
        if n.startswith("sig") or n.startswith("init"):
            # PyMuPDF y coords go top-to-bottom, but pyHanko goes bottom-to-top
            # Subtract the y coords from the current page height for pyHanko
            boxes[n] = {
                "page": page.number,
                "box": (
                    field.rect.x0,
                    _page_height - field.rect.y0,
                    field.rect.x1,
                    _page_height - field.rect.y1
                )
            }
            print("Removing field: ", n)
            field = page.delete_widget(field)
        else:
            field = field.next
# save the document updates
doc.ez_save(NEW_FILE)
doc.close()
# now re-opening the document to check if the fields I removed are still there or not
check_doc = fitz.open(NEW_FILE)
# iterate the pages again
print("Checking PDF for removed fields with PyMuPDF")
found = 0
for page in check_doc:
    # iterate the fields on this page
    field = page.first_widget
    while field:
        n = field.field_name
        # if it's a signature, print to console
        if n.startswith("sig") or n.startswith("init"):
            found += 1
            print(f"...'{n}' is still present")
        field = field.next
if found == 0:
    print("PyMuPDF did not find any fields")
check_doc.close()
# now let's try to use pyHanko to add new signatures
# if we find that a field already exists, print the error
print("Adding signatures to new PDF with pyHanko")
found = 0
with open(NEW_FILE, 'rb+') as sig_doc:
    writer = IncrementalPdfFileWriter(sig_doc, strict=False)
    for name in boxes.keys():
        _dict = boxes[name]
        try:
            append_signature_field(writer, SigFieldSpec(
                sig_field_name=name,
                on_page=_dict["page"],
                box=_dict["box"]
            ))
        except Exception as e:
            found += 1
            print("ERROR: ", e)
    writer.write_in_place()
if found > 0:
    print(f"pyHanko found {found} fields")
Even when I run only the last for loop on the PDF you uploaded, I get the same results where pyHanko can still see the fields. Example of the full output that I see when I run this against my originally uploaded
I mentioned this before, but it's possible that something in LibreOffice is part of my issue. I'm trying to do some more testing on that today when I can to see if I can figure anything out. My bad if I'm annoying you, but I'm just baffled at how the other apps are able to still detect the fields if they are removed from the PDF. I will admit it could completely be a misunderstanding on my part about how things are done in PDFs, but from a basic logic standpoint it seems like some sort of reference to the deleted fields remains in the PDF.
I've tracked this down a little more specifically. Looks like the delete method deletes the widget from the annotations list of the PDF, but it remains in the fields list as an object. When pyHanko and the eSign platforms iterate, they are using the fields list and not the annotations list, so this is why it seems like the fields still remain in the PDF, but they don't appear in regular viewers. Would it not make sense to remove all references to a widget from the PDF entirely if it's meant to be deleted? Or is that operation not possible in PDF for some reason?
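One way to check this observation is to compare the form's fields list with the widgets that pages still reference. The following is only a sketch: the helper names are made up for illustration, and it assumes the catalog's /AcroForm entry is an indirect object and that the form is flat (every entry in /Fields is itself a widget annotation, with no /Kids hierarchy). Under those assumptions, any field xref that no page reports as a widget is a leftover.

```python
import re

# Matches indirect references such as "12 0 R" inside PDF source text.
REF = re.compile(r"(\d+)\s+(\d+)\s+R")


def parse_refs(array_src: str) -> list[int]:
    """Return the xref numbers found in a PDF array like '[12 0 R 15 0 R]'."""
    return [int(num) for num, _gen in REF.findall(array_src)]


def orphaned(field_xrefs, widget_xrefs):
    """Field entries that no page annotation refers to any more."""
    live = set(widget_xrefs)
    return [x for x in field_xrefs if x not in live]


def find_orphaned_fields(path: str) -> list[int]:
    """Hypothetical helper: list /Fields entries without a matching page widget."""
    import pymupdf  # imported lazily so the pure helpers above work without it

    doc = pymupdf.open(path)
    try:
        kind, value = doc.xref_get_key(doc.pdf_catalog(), "AcroForm")
        if kind != "xref":
            return []  # no form, or an inline dictionary (not handled in this sketch)
        form_xref = parse_refs(value)[0]
        kind, value = doc.xref_get_key(form_xref, "Fields")
        if kind == "xref":  # the array is stored in an object of its own
            value = doc.xref_object(parse_refs(value)[0])
        widget_xrefs = [w.xref for page in doc for w in page.widgets()]
        return orphaned(parse_refs(value), widget_xrefs)
    finally:
        doc.close()
```

Calling `find_orphaned_fields("deleted_signatures.pdf")` on the saved file would then list the xrefs that tools reading the fields list can still see, if the assumptions above hold for that file.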
I was beginning to suspect something like this. As to my own impression, 50% of the PDF viewers I am using look at the We should probably indeed make sure to also either remove the entry there too (perfectly possible) or empty the PDF object definition. Let me re-open this issue as an enhancement request.
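A minimal sketch of the first option mentioned here (removing the entry from the fields list as well), written with PyMuPDF's low-level xref functions. It is not necessarily the change that went into the library. The names `drop_refs`, `delete_widgets_fully` and `is_unwanted` are hypothetical, and the sketch assumes that /AcroForm is an indirect object whose /Fields value is a direct array holding only indirect references.

```python
import re

# Matches indirect references such as "12 0 R" inside PDF source text.
REF = re.compile(r"(\d+)\s+(\d+)\s+R")


def drop_refs(array_src: str, dead: set[int]) -> str:
    """Rebuild a PDF reference array without the entries whose xref is in `dead`.

    Assumption: the array contains only indirect references.
    """
    kept = [
        f"{num} {gen} R"
        for num, gen in REF.findall(array_src)
        if int(num) not in dead
    ]
    return "[" + " ".join(kept) + "]"


def delete_widgets_fully(doc, is_unwanted) -> set[int]:
    """Delete matching widgets from every page, then prune AcroForm /Fields.

    `doc` is an open pymupdf.Document; `is_unwanted` is a caller-supplied
    function that takes a widget and returns True if it should go.
    """
    dead = set()
    for page in doc:
        # Snapshot the xrefs first so deleting does not disturb the iteration.
        for xref in [w.xref for w in page.widgets()]:
            widget = page.load_widget(xref)
            if is_unwanted(widget):
                page.delete_widget(widget)
                dead.add(xref)
    kind, value = doc.xref_get_key(doc.pdf_catalog(), "AcroForm")
    if dead and kind == "xref":
        form_xref = int(value.split()[0])
        kind, value = doc.xref_get_key(form_xref, "Fields")
        if kind == "array":
            doc.xref_set_key(form_xref, "Fields", drop_refs(value, dead))
    return dead
```

With the script from this thread, a call such as `delete_widgets_fully(doc, lambda w: w.field_name.startswith(("sig", "init")))` followed by `doc.ez_save(NEW_FILE)` would replace the manual loop.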
Here is an upfront solution - I hope.
There might have been a GitHub malfunction when you posted the comment; it looks like it is linking back to this issue instead of a zip file. I'll definitely check it out when I can download it. Is it a dev version of the module, or code examples on how to empty an object definition using existing methods? And thanks for bearing with me on this. I know my terminology isn't completely accurate, I'm still learning PDF/Python and clearly have a ways to go.
My internet connection is terrible at the moment. So the ZIP upload was interrupted / incomplete. It is actually 2 statements only that you must add:
Yes! Absolutely perfect. When I first started hunting this issue down, this exact process is something I thought might need to be done; I just didn't do enough reading to connect all the dots, I guess. At least in my use case, adding these two lines removes the widgets entirely and pyHanko/eSign platforms no longer see the erroneous fields.
Thanks for the feedback!
Description of the bug
I am unable to completely delete widgets in any of my documents using Page.delete_widget(widget) and then Document.ez_save(). Although the resulting PDF looks like the fields are removed when viewed in a reader like Acrobat, the fields are still present in some lingering form and get picked up by other programs like eSign platforms.
This is causing me issues in my process, as I need to add proper signature fields in my documents created from LibreOffice, which doesn't have the ability to add signature fields to PDFs. What my app is doing is getting the box coords for the sig and init fields, deleting them from the PDF entirely, and then appending new signature fields using the box coords in pyHanko.
My only work around for the moment is to rename the fields I want to delete so that pyHanko creates brand new fields with new names. The issue with this, however, is that there is an extra text field that is underneath my signatures when I upload them to an eSign platform, even though PyMuPDF does not detect the fields are there.
There must be some remaining aspect of the deleted fields that is still present after the delete and save. I thought it was an issue in pyHanko for a while, but now I am seeing these "deleted" fields pop up in 4 different eSign platforms when importing fields automatically from the PDF. This leads me to believe that there is something not being removed from the PDF when a widget is deleted with PyMuPDF.
Note: the widgets being removed are basic text fields with nothing special about them other than a specific name scheme. I am able to affect everything else about these fields in PyMuPDF like the name, flags, etc. without issue.
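The coordinate hand-off described above (PyMuPDF rectangles in, pyHanko boxes out) can be sketched as a small helper. This is an illustration under assumptions, not code from the reporter's tool: the name `to_bottom_up` is hypothetical, the page is taken to be unrotated with its origin at (0, 0), and the target is assumed to expect boxes ordered as lower-left x, lower-left y, upper-right x, upper-right y.

```python
def to_bottom_up(rect, page_height):
    """Convert a top-down rectangle (x0, y0, x1, y1) to bottom-up box order.

    In a top-down system y0 is the top edge and y1 the bottom edge, so the
    bottom edge (y1) becomes the lower-left y after flipping.
    """
    x0, y0, x1, y1 = rect
    return (x0, page_height - y1, x1, page_height - y0)


# Example: a 100 x 50 field whose top edge is 20 units below the page top,
# on a page that is 800 units high.
box = to_bottom_up((10, 20, 110, 70), 800)
```

Flipping both y values and swapping their positions keeps the lower edge first, which matters for consumers that do not normalise the box themselves.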
How to reproduce the bug
My Process:
Expected:
Actual:
Screenshot from signNow after uploading deleted_signatures.pdf and having their system auto-import fields:
The same location in the same file, viewed through Acrobat, looks like there is no field there:
Notes:
pyHanko, the fields are still present in the file somewhere.
PyMuPDF version
1.24.3
Operating system
Windows
Python version
3.12