Memory Retention with fitz.page.get_pixmap() #3625

Open
nataliia-obraztsova opened this issue Jun 26, 2024 · 9 comments

Comments

@nataliia-obraztsova

nataliia-obraztsova commented Jun 26, 2024

Description of the bug

When processing larger PDF files the page.get_pixmap() method significantly increases memory usage and does not release it properly after completion. It results in a high memory footprint that persists until an even larger file is processed. This behavior can be observed from the memory profiling data provided below.

I implemented the operation as a function that is called in a loop, once per file. I set pix = None after each page and call doc.close() and fitz.TOOLS.store_shrink(100) for each document, as suggested in the similar issue #1430.
The profiles below show a significant increase in memory usage while processing file f1, and the high memory footprint persisted while processing later files.

If there is a method I could call to release the memory please let me know.

Relevant closed issue #1430.

processing file f0

Memory usage before function: 34.70 MB

Line # Mem usage Increment Occurrences Line Contents

34     34.9 MiB     34.9 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     35.1 MiB      0.1 MiB           1       file_stream = read_file(file_name)
37     35.7 MiB      0.6 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     35.7 MiB      0.0 MiB           1       try:
39     35.7 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     46.6 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     46.6 MiB      0.2 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     46.6 MiB      6.4 MiB           3               pix = page.get_pixmap()
45     46.6 MiB      4.8 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     46.6 MiB     -1.3 MiB           3               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48     46.6 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
49     46.6 MiB      0.8 MiB           3               img.save(img_byte_buff, format='JPEG')
50     46.6 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53     46.6 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58     46.6 MiB      0.0 MiB           1           doc.close()
59     46.6 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 39.10 MB
Memory usage difference total: 4.41 MB

processing file f1

Memory usage before function: 39.10 MB

Line # Mem usage Increment Occurrences Line Contents

34     39.1 MiB     39.1 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     44.4 MiB      5.2 MiB           1       file_stream = read_file(file_name)
37     44.4 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     44.4 MiB      0.0 MiB           1       try:
39     44.4 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    343.5 MiB    -11.1 MiB          33           for i in range(number_of_pages):
41    343.3 MiB    -11.1 MiB          32               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    343.5 MiB    288.0 MiB          32               pix = page.get_pixmap()
45    343.5 MiB    -11.1 MiB          32               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    343.5 MiB    -11.1 MiB          32               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48    343.5 MiB    -11.1 MiB          32               img_byte_buff = BytesIO()
49    343.5 MiB    -11.1 MiB          32               img.save(img_byte_buff, format='JPEG')
50    343.5 MiB    -11.1 MiB          32               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53    343.5 MiB    -11.1 MiB          32               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58    306.7 MiB    -36.8 MiB           1           doc.close()
59    306.7 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 301.36 MB
Memory usage difference total: 262.26 MB

processing file f2

Memory usage before function: 301.36 MB

Line # Mem usage Increment Occurrences Line Contents

34    301.4 MiB    301.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36    301.4 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37    301.4 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38    301.4 MiB      0.0 MiB           1       try:
39    301.4 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    301.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41    301.4 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    301.4 MiB      0.0 MiB           3               pix = page.get_pixmap()
45    301.4 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    301.4 MiB      0.0 MiB           3               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48    301.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
49    301.4 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
50    301.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53    301.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58    301.4 MiB      0.0 MiB           1           doc.close()
59    301.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 301.36 MB
Memory usage difference total: 0.00 MB

processing file f3

Memory usage before function: 301.36 MB

Line # Mem usage Increment Occurrences Line Contents

34    301.4 MiB    301.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36    301.4 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37    301.4 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38    301.4 MiB      0.0 MiB           1       try:
39    301.4 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    301.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41    301.4 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    301.4 MiB      0.0 MiB           3               pix = page.get_pixmap()
45    301.4 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    301.4 MiB      0.0 MiB           3               pix = None
47                                                     # Convert the PIL Image to a bytes-like object
48    301.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
49    301.4 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
50    301.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
51                                         
52                                                     # Encode the image bytes in base64 and decode to UTF-8 string
53    301.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
54                                         
55                                             except Exception as e:
56                                                 raise Exception(e.args)
57                                             finally:
58    301.4 MiB      0.0 MiB           1           doc.close()
59    301.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 301.36 MB
Memory usage difference total: 0.00 MB
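
The "Memory usage before/after function" figures above do not come from the profiler table itself. Below is a minimal sketch of how such figures could be collected on Linux. The helper names rss_mb and measure are hypothetical and not part of the original script, and reading /proc/self/statm is an assumption about how the numbers were obtained (the report may well have used another tool). The value is computed in MiB but printed as "MB" to match the labels above.

```python
import os


def rss_mb():
    """Return the current resident set size of this process (Linux only)."""
    # /proc/self/statm reports sizes in pages; the second field is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def measure(func, *args, **kwargs):
    """Call func and print memory before, after, and the difference."""
    before = rss_mb()
    print(f"Memory usage before function: {before:.2f} MB")
    result = func(*args, **kwargs)
    after = rss_mb()
    print(f"Memory usage after function: {after:.2f} MB")
    print(f"Memory usage difference total: {after - before:.2f} MB")
    return result, after - before
```

Usage would look like measure(render_page_to_image, file_name). Note that resident set size reflects what the process holds from the operating system, so it can stay high even after objects have been freed inside the process.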

How to reproduce the bug

import base64
from io import BytesIO

import fitz  # PyMuPDF
from PIL import Image


def read_file(file_name):
    # Read the whole file into memory; the with-block closes it even on error
    try:
        with open(file_name, 'rb') as file:
            return BytesIO(file.read())
    except Exception as e:
        raise Exception(f"There was an error processing the file(s) {e.args}")

def render_page_to_image(file_name):
    file_stream = read_file(file_name)
    doc = fitz.open(stream=file_stream, filetype="pdf")
    try:
        number_of_pages = doc.page_count
        for i in range(number_of_pages):
            page = doc.load_page(i)

            # Render the page to a pixmap (an image)
            pix = page.get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            pix = None
            # Convert the PIL Image to a bytes-like object
            img_byte_buff = BytesIO()
            img.save(img_byte_buff, format='JPEG')
            img_byte_arr = img_byte_buff.getvalue()

            # Encode the image bytes in base64 and decode to UTF-8 string
            rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')

    except Exception as e:
        raise Exception(e.args)
    finally:
        doc.close()
        fitz.TOOLS.store_shrink(100)


# Process the four test files f0..f3 (the f-prefix is needed to fill in {i})
for i in range(4):
    file_name = f'xxx.pdf_{i}.pdf'
    render_page_to_image(file_name)

PyMuPDF version

1.23.x or earlier

Operating system

Linux

Python version

3.11

@nataliia-obraztsova
Author

Adding fitz.TOOLS.store_shrink(100) after pix = None actually helped a lot. Here is a link to an older issue which I missed at first
#130
I still have some gradual increase so I'll leave the issue open for now.

@JorjMcKie
Collaborator

Can you please provide printouts with numbers updated after the mentioned adjustments?

In general, if a permanently low memory footprint is desired (for whatever reason), the store should be shrunk generously.
There are two reasons for this:

  1. MuPDF's strategy is to keep things in memory, especially objects that tend to be large, like images and fonts.
  2. Deleting Python objects is only one side of the coin: the underlying C object in MuPDF is not necessarily removed as well in every case.

@nataliia-obraztsova
Author

nataliia-obraztsova commented Jun 27, 2024

Below is the memory profile after the adjustments. Interestingly, while processing file f0, fitz.TOOLS.store_shrink(100) on line 47 seems to have made no difference: memory usage increased by only about 7 MiB, but it did not shrink back to the initial value. While processing file f1, fitz.TOOLS.store_shrink(100) on line 47 reduced memory usage a lot, but still not all of it: an additional 20.12 MB remained. After that, usage seems to plateau.

P.S. I have upgraded PyMuPDF to 1.24.7

memory profiling after adjustments

processing file f0

Memory usage before function: 53.28 MB

Line # Mem usage Increment Occurrences Line Contents

34     53.5 MiB     53.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     53.7 MiB      0.1 MiB           1       file_stream = read_file(file_name)
37     56.0 MiB      2.4 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     56.0 MiB      0.0 MiB           1       try:
39     56.0 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     67.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     67.4 MiB      0.2 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     67.4 MiB      7.0 MiB           3               pix = page.get_pixmap()
45     67.4 MiB      3.5 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     67.4 MiB      0.0 MiB           3               pix = None
47     67.4 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     67.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     67.4 MiB      0.6 MiB           3               img.save(img_byte_buff, format='JPEG')
51     67.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     67.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     67.4 MiB      0.0 MiB           1           doc.close()
60     67.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 60.41 MB
Memory usage difference total: 7.13 MB

processing file f1

Memory usage before function: 60.41 MB

Line # Mem usage Increment Occurrences Line Contents

34     60.4 MiB     60.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     65.7 MiB      5.2 MiB           1       file_stream = read_file(file_name)
37     65.7 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     65.7 MiB      0.0 MiB           1       try:
39     65.7 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    100.4 MiB    -70.7 MiB          33           for i in range(number_of_pages):
41    100.4 MiB    -56.1 MiB          32               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    145.3 MiB    194.0 MiB          32               pix = page.get_pixmap()
45    145.3 MiB   -289.6 MiB          32               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    145.3 MiB   -289.6 MiB          32               pix = None
47    100.4 MiB   -519.4 MiB          32               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49    100.4 MiB    -70.7 MiB          32               img_byte_buff = BytesIO()
50    100.4 MiB    -70.7 MiB          32               img.save(img_byte_buff, format='JPEG')
51    100.4 MiB    -70.7 MiB          32               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54    100.4 MiB    -70.7 MiB          32               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     85.8 MiB    -14.6 MiB           1           doc.close()
60     85.8 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 20.12 MB

processing file f2

Memory usage before function: 80.53 MB

Line # Mem usage Increment Occurrences Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB

processing file f3

Memory usage before function: 80.53 MB

Line # Mem usage Increment Occurrences Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB

@yoliax

yoliax commented Jun 28, 2024

I encountered the same issue! Memory leak!
I wrote a service using PyMuPDF to parse PDFs. Despite calling fitz.TOOLS.store_shrink(100) each time, the service crashes due to a memory leak after running for a while.

try:
    with fitz.Document(stream=data, filetype="pdf") as doc:
        ...
except Exception as e:
    logging...
finally:
    fitz.TOOLS.store_shrink(100)
    gc.collect()

other code:

zoom_x = request.imgsz / page_width
zoom_y = request.imgsz / page_height
zoom = min(zoom_x, zoom_y)

mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, colorspace="rgb", alpha=False)
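
The zoom calculation above scales the page uniformly so that its longer side fits the target size. Here is a small sketch of that arithmetic, plus a rough estimate of how large the resulting RGB sample buffer would be at 3 bytes per pixel without alpha. The function names and the 1024 pixel target are hypothetical, and the pixel dimensions are rounded here, which may differ slightly from how the library rounds.

```python
def fit_zoom(page_width, page_height, imgsz):
    """Return the uniform zoom that fits the page inside an imgsz x imgsz square."""
    return min(imgsz / page_width, imgsz / page_height)


def rgb_pixmap_bytes(page_width, page_height, zoom):
    """Estimate the sample buffer size of an RGB pixmap without alpha."""
    width_px = round(page_width * zoom)
    height_px = round(page_height * zoom)
    return width_px * height_px * 3  # 3 bytes per pixel: R, G, B


# A4 page in points (595 x 842) with a hypothetical target size of 1024 pixels.
zoom = fit_zoom(595, 842, 1024)
print(f"zoom: {zoom:.4f}")
print(f"estimated sample bytes: {rgb_pixmap_bytes(595, 842, zoom)}")
```

An estimate like this gives a per-page lower bound to compare against the increments a profiler reports for get_pixmap().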

@yoliax

yoliax commented Jun 28, 2024

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?

I'm looking for this PDF. I'll share it once I find it.

@JorjMcKie
Collaborator

> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
>
> I'm looking for this PDF. I'll share it once I find it.

Please do not mix different things in the same report!
If you find that example please open a separate issue.

@JorjMcKie
Collaborator

> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
>
> I'm looking for this PDF. I'll share it once I find it.

It seems that your "issue" comes from the fact that Page.get_image_info() places no restriction on the image bboxes and will include any image, even one that only intersects the MediaBox (and is not fully contained in it).
In contrast, text extraction restricts its results (text or image) to objects contained in the MediaBox.
If for whatever reason you need the two results to match, make sure to set the clip to the same value in both cases.
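
The difference between "intersects" and "fully contained" can be sketched with plain tuples. The helper names and the sample data below are hypothetical; the only assumption about the library is that each image entry carries a "bbox" of the form (x0, y0, x1, y1).

```python
def contains(clip, bbox):
    """True if bbox lies entirely inside clip; both are (x0, y0, x1, y1)."""
    return (clip[0] <= bbox[0] and clip[1] <= bbox[1]
            and bbox[2] <= clip[2] and bbox[3] <= clip[3])


def images_inside(image_infos, clip):
    """Keep only the entries whose "bbox" is fully contained in clip."""
    return [info for info in image_infos if contains(clip, info["bbox"])]


# Hypothetical data: one image on the page, one that only overlaps its edge.
page_rect = (0, 0, 595, 842)
infos = [
    {"bbox": (100, 100, 200, 200)},  # fully on the page
    {"bbox": (-50, 700, 60, 900)},   # only intersects the page
]
print(images_inside(infos, page_rect))
```

Filtering like this on the image side is one way to make the image list agree with what a containment-based extraction reports.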

@yoliax

yoliax commented Jun 29, 2024

> > Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
> > I'm looking for this PDF. I'll share it once I find it.
>
> It seems that your "issue" comes from the fact that Page.get_image_info() places no restriction on the image bboxes and will include any image, even one that only intersects the MediaBox (and is not fully contained in it). In contrast, text extraction restricts its results (text or image) to objects contained in the MediaBox. If for whatever reason you need the two results to match, make sure to set the clip to the same value in both cases.

Thank you very much, I will give it a try.

@HuJianE

HuJianE commented Nov 27, 2024

I am also using the page.get_pixmap() API, along with many others, and I am now facing a memory leak as well.
I am checking the calls one by one to find out what is causing the memory growth, since this is a Python web project.
Since you are seeing this with get_pixmap(), I guess it affects me too, but I have not tested that part yet.
I also suspect page.get_text('rawdict').
Just for your info.
