enhancement: use pdfium for rendering pdf #455

qued · 2026-01-13T21:24:33Z

Switch to pypdfium2 for rendering.

Note

Switches PDF rendering backend from pdf2image to pypdfium2 and updates APIs/call sites accordingly.

Replaces pdf2image with pypdfium2 in inference/layout.py; imports and calls updated
Introduces a new convert_pdf_to_image that supports filename or file-like input, page ranges (first_page/last_page), optional password, and DPI defaulting to PDF_RENDER_DPI (fallback 400); saves to output_folder when provided and returns paths when path_only
Updates DocumentLayout.from_file and PageLayout._get_image to use the new function
Bumps version to 1.1.5 and updates CHANGELOG.md noting performance improvements

Written by Cursor Bugbot for commit a9e68f4. This will update automatically on new commits.

cursor · 2026-01-15T01:10:25Z

unstructured_inference/inference/layout.py

+    try:
+        images: dict[int, Image.Image] = {}
+        if dpi is None:
+            dpi = int(os.environ.get("PDF_RENDER_DPI", 400))


Default DPI changed from 200 to 400

Medium Severity

The convert_pdf_to_image function's default DPI changed from 200 to 400. The old signature had dpi: int = 200, but the new signature uses dpi: Optional[int] = None with a fallback to int(os.environ.get("PDF_RENDER_DPI", 400)). External callers relying on the previous default will now receive images rendered at 4x the resolution (2x in each dimension), causing significantly higher memory usage and different output quality. Internal callers are unaffected since they explicitly pass dpi=pdf_image_dpi which defaults to 200.
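The effect of the new fallback can be sketched in isolation. The helper names `resolve_dpi` and `pixel_count` below are hypothetical and not part of the PR; `resolve_dpi` mirrors the `dpi is None` branch from the diff, and the US Letter page size (8.5 x 11 in) is an assumption chosen only to make the arithmetic concrete.

```python
import os
from typing import Optional


def resolve_dpi(dpi: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the fallback shown in the diff."""
    if dpi is None:
        # Environment values are strings, so cast after the lookup.
        dpi = int(os.environ.get("PDF_RENDER_DPI", 400))
    return dpi


def pixel_count(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Pixels in one rendered page of the given size (US Letter assumed)."""
    return round(width_in * dpi) * round(height_in * dpi)


# Doubling the DPI doubles each dimension, so the pixel count quadruples.
ratio = pixel_count(400) / pixel_count(200)
```

An explicit argument always wins over the environment variable in this sketch, which matches the observation that internal callers passing `dpi=pdf_image_dpi` are unaffected.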

cragwolfe · 2026-01-16T18:42:25Z

@claude , review

claude · 2026-01-16T18:42:40Z

Claude finished @cragwolfe's task

Code Review for pypdfium2 Migration

Reviewing the switch from pdf2image to pypdfium2 for PDF rendering.

Todo List:

Read repository guidelines (CLAUDE.md)
Read and analyze changed files
Review existing feedback from automated tools
Provide comprehensive code review feedback
Post final review

High-Priority Issues That Need Fixing

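🔴 Misleading Constructor Call - `unstructured_inference/inference/layout.py:102`

images: dict[int, Image.Image] = {}
# Later...
list[Image.Image](images.values())  # parameterized generic used as a constructor

Issue: list[Image.Image] is a parameterized generic meant for type annotations, not for construction. On Python 3.9+ the call is forwarded to list and runs without raising, but the Python documentation describes this pattern as discouraged, and the type parameter is not enforced at runtime.
Fix: Should be list(images.values())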
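A quick check, assuming Python 3.9 or later, of how the two spellings behave. The dictionary of strings is a stand-in for the page-number-to-image mapping; none of the variable names come from the PR.

```python
# Stand-in for the {page_number: image} mapping built during rendering.
pages = {2: "second page", 1: "first page"}

# Calling a parameterized generic forwards to the origin class (list),
# so this runs on Python 3.9+, but the type parameter is not enforced.
via_alias = list[str](pages.values())

# The plain constructor says the same thing without the misleading subscript.
via_list = list(pages.values())

# Both preserve insertion order; sort the keys if page order matters.
ordered = [pages[number] for number in sorted(pages)]
```

The plain `list(...)` call is the clearer choice, and sorting by key is only needed if pages can be inserted out of order.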

🔴 Resource Leak - `unstructured_inference/inference/layout.py:427-445`

Issue: Page objects are not properly closed after rendering, causing memory accumulation for large PDFs.
Fix: Add page.close() in the finally block:

try:
    bitmap = page.render(scale=scale)  # render bitmap
finally:
    page.close()  # release the page even if rendering fails

🔴 Exception Handling Bug - `unstructured_inference/inference/layout.py:443`

Issue: If page.render() raises an exception, bitmap is undefined but bitmap.close() is still called in the finally block, causing a NameError that masks the original error.
Fix: Initialize bitmap = None before the try block and check if it exists before closing.
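The two cleanup fixes above can be combined in one pattern. The sketch below uses stand-in classes (`FakePage`, `FakeBitmap`) rather than pypdfium2 objects so that it runs without the library; the class names and `render_page` are hypothetical, and only the `render()`/`close()` shape is taken from the review text.

```python
from typing import Optional


class FakeBitmap:
    """Stand-in for a rendered bitmap; records whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class FakePage:
    """Stand-in for a PDF page whose render() can be made to fail."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.closed = False

    def render(self, scale: float) -> FakeBitmap:
        if self.fail:
            raise RuntimeError("render failed")
        return FakeBitmap()

    def close(self) -> None:
        self.closed = True


def render_page(page: FakePage, scale: float) -> bool:
    """Render one page, always releasing the page and any bitmap created."""
    bitmap: Optional[FakeBitmap] = None  # bound before try so finally can test it
    try:
        bitmap = page.render(scale=scale)
        return True  # real code would convert the bitmap to an image here
    finally:
        if bitmap is not None:
            bitmap.close()
        page.close()
```

Because `bitmap` is bound before the `try`, a failure inside `render()` propagates as the original exception instead of being masked by a `NameError` in the `finally` block.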

Medium-Priority Issues

🟡 Breaking API Change - Default DPI Changed

Issue: The default DPI changed from 200 to 400 (via PDF_RENDER_DPI fallback), which will cause different rendering behavior for external callers.
Impact: Images will be rendered at 4x resolution (2x each dimension), significantly increasing memory usage.
Recommendation: Either maintain backward compatibility by defaulting to 200 or document this breaking change prominently.

🟡 Scale Precision Loss - `unstructured_inference/inference/layout.py:426`

scale = dpi / 72.0  # Should be used as-is
# But then it's cast to int elsewhere in some contexts

Issue: Converting scale to integer truncates precision, causing incorrect DPI rendering. With DPI=200, int(200 / 72.0) is 2 instead of ~2.78, resulting in 144 DPI instead of 200 (rounding instead would give 3, or 216 DPI).
Fix: Use the float scale directly without integer conversion.
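The arithmetic behind this issue is easy to verify. The sketch below assumes, as the snippet above does, that scale is expressed relative to 72 units per inch; `effective_dpi` is a hypothetical helper name.

```python
POINTS_PER_INCH = 72.0  # scale is expressed relative to 72 units per inch


def effective_dpi(scale: float) -> float:
    """DPI actually produced when rendering at the given scale."""
    return scale * POINTS_PER_INCH


dpi = 200
float_scale = dpi / POINTS_PER_INCH   # about 2.78, the intended scale
int_scale = int(float_scale)          # truncates to 2
rounded_scale = round(float_scale)    # rounds to 3

dpi_from_float = effective_dpi(float_scale)      # approximately 200
dpi_from_int = effective_dpi(int_scale)          # 144.0
dpi_from_rounded = effective_dpi(rounded_scale)  # 216.0
```

Either integer conversion lands well away from the requested 200 DPI, which is why the float scale should be passed through unchanged.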

🟡 Performance Regression - Delayed Parameter Validation

Issue: The validation if path_only and not output_folder was moved from the function start to after all PDF processing. For large PDFs, this wastes CPU and memory before failing.
Fix: Move validation back to the beginning of the function.
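A minimal sketch of validating before any rendering work starts. The function `convert_sketch`, its `render_pages` callback, and the error message are hypothetical; only the `path_only` / `output_folder` check comes from the review text.

```python
from typing import Callable, List, Optional


def convert_sketch(
    render_pages: Callable[[], List[str]],
    output_folder: Optional[str] = None,
    path_only: bool = False,
) -> List[str]:
    """Fail fast on invalid arguments, then do the expensive work."""
    # Cheap argument check first, so a bad call never touches the PDF.
    if path_only and not output_folder:
        raise ValueError("output_folder must be specified if path_only is true")
    # Only now run the (potentially slow, memory-heavy) rendering step.
    return render_pages()
```

With the check at the top, an invalid call costs a single comparison instead of a full render of a large document.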

Code Quality Observations

✅ Good Practices

Proper resource management with try/finally blocks for PDF and bitmap objects
Support for both file paths and file-like objects
Password protection support maintained
Comprehensive parameter validation

📝 Minor Improvements

Type Consistency: Mixed usage of list[Type] and List[Type] - consider standardizing on one approach
Error Messages: Could be more specific about what failed (e.g., which page failed to render)
Documentation: The function docstring could better explain the pypdfium2 migration benefits

Performance & Memory Impact

✅ Positive Changes

pypdfium2 typically has better performance than pdf2image
Proper resource cleanup (when bugs are fixed)

⚠️ Potential Concerns

Default DPI increase will use 4x more memory
Resource leaks from unclosed page objects
Delayed validation could waste resources on invalid parameters

Testing Recommendations

Unit Tests: Verify the type annotation fix with runtime calls
Integration Tests: Test with large PDFs to ensure memory doesn't accumulate
Error Handling Tests: Verify proper cleanup when rendering fails
Performance Tests: Compare memory usage and speed vs. pdf2image
API Compatibility Tests: Ensure external callers get expected DPI

Overall Assessment

This migration to pypdfium2 is a good architectural choice for performance, but the implementation has several critical bugs that need immediate attention. The resource leaks and runtime errors will cause production issues. Once these are fixed, this should be a solid improvement.

Recommendation: Fix the critical issues before merging, as they will cause runtime failures and memory problems in production.

cragwolfe

(though did Claude get confused by resolved cursor comments?)

qued added 7 commits January 9, 2026 18:10

refactor all rendering to be done using the same function, and use pdfium

0b69e2a

update changelog and version

f7a7d2c

explicit call

2e158e2

format

ee60936

fix typing

97d5c85

update for parity w unstructured

0546503

needs to be int

cafb185

qued marked this pull request as ready for review January 15, 2026 00:40

cursor bot reviewed Jan 15, 2026

qued added 3 commits January 14, 2026 18:48

casting to int doesn't make sense for scale, gah

d2ab4aa

typing

08dcf3e

early return

c39c393

cursor bot reviewed Jan 15, 2026

qued enabled auto-merge (squash) January 15, 2026 02:42

close pages after use

fbd81df

cursor bot reviewed Jan 15, 2026

bitmap needs to exist outside the try

a9e68f4

cragwolfe approved these changes Jan 16, 2026

qued merged commit 72ba3d8 into main Jan 16, 2026
17 checks passed

qued deleted the enhancement/use-pdfium-for-rendering-pdf branch January 16, 2026 19:09

4 participants
