Skip to content

Conversation

@qued
Copy link
Contributor

@qued qued commented Jan 13, 2026

Switch to pypdfium2 for rendering.


Note

Switches PDF rendering backend from pdf2image to pypdfium2 and updates APIs/call sites accordingly.

  • Replaces pdf2image with pypdfium2 in inference/layout.py; imports and calls updated
  • Introduces a new convert_pdf_to_image that supports filename or file-like input, page ranges (first_page/last_page), optional password, and DPI defaulting to PDF_RENDER_DPI (fallback 400); saves to output_folder when provided and returns paths when path_only
  • Updates DocumentLayout.from_file and PageLayout._get_image to use the new function
  • Bumps version to 1.1.5 and updates CHANGELOG.md noting performance improvements

Written by Cursor Bugbot for commit a9e68f4. This will update automatically on new commits. Configure here.

@qued qued marked this pull request as ready for review January 15, 2026 00:40
try:
images: dict[int, Image.Image] = {}
if dpi is None:
dpi = int(os.environ.get("PDF_RENDER_DPI", 400))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default DPI changed from 200 to 400

Medium Severity

The convert_pdf_to_image function's default DPI changed from 200 to 400. The old signature had dpi: int = 200, but the new signature uses dpi: Optional[int] = None with a fallback to int(os.environ.get("PDF_RENDER_DPI", 400)). External callers relying on the previous default will now receive images rendered at 4x the resolution (2x in each dimension), causing significantly higher memory usage and different output quality. Internal callers are unaffected since they explicitly pass dpi=pdf_image_dpi which defaults to 200.

Fix in Cursor Fix in Web

@qued qued enabled auto-merge (squash) January 15, 2026 02:42
@cragwolfe
Copy link
Contributor

@claude , review

@claude
Copy link

claude bot commented Jan 16, 2026

Claude finished @cragwolfe's task —— View job


Code Review for pypdfium2 Migration

Reviewing the switch from pdf2image to pypdfium2 for PDF rendering.

Todo List:

  • Read repository guidelines (CLAUDE.md)
  • Read and analyze changed files
  • Review existing feedback from automated tools
  • Provide comprehensive code review feedback
  • Post final review

High-Priority Issues That Need Fixing

🔴 Critical Runtime Error - unstructured_inference/inference/layout.py:102

images: list[Image.Image] = []
# Later...
list[Image.Image](images.values())  # Line causing the error

Issue: list[Image.Image] is a type hint, not a constructor. This will raise TypeError: 'types.GenericAlias' object is not callable at runtime.
Fix: Should be list(images.values())

🔴 Resource Leak - unstructured_inference/inference/layout.py:427-445

Issue: Page objects are not properly closed after rendering, causing memory accumulation for large PDFs.
Fix: Add page.close() in the finally block:

try:
    # render bitmap
finally:
    page.close()

🔴 Exception Handling Bug - unstructured_inference/inference/layout.py:443

Issue: If page.render() raises an exception, bitmap is undefined but bitmap.close() is still called in the finally block, causing a NameError that masks the original error.
Fix: Initialize bitmap = None before the try block and check if it exists before closing.

Medium-Priority Issues

🟡 Breaking API Change - Default DPI Changed

Issue: The default DPI changed from 200 to 400 (via PDF_RENDER_DPI fallback), which will cause different rendering behavior for external callers.
Impact: Images will be rendered at 4x resolution (2x each dimension), significantly increasing memory usage.
Recommendation: Either maintain backward compatibility by defaulting to 200 or document this breaking change prominently.

🟡 Scale Precision Loss - unstructured_inference/inference/layout.py:426

scale = dpi / 72.0  # Should be used as-is
# But then it's cast to int elsewhere in some contexts

Issue: Converting scale to integer truncates precision, causing incorrect DPI rendering. With default DPI=200, scale becomes 3 instead of ~2.78, resulting in ~216 DPI instead of 200.
Fix: Use the float scale directly without integer conversion.

🟡 Performance Regression - Delayed Parameter Validation

Issue: The validation if path_only and not output_folder was moved from the function start to after all PDF processing. For large PDFs, this wastes CPU and memory before failing.
Fix: Move validation back to the beginning of the function.

Code Quality Observations

Good Practices

  • Proper resource management with try/finally blocks for PDF and bitmap objects
  • Support for both file paths and file-like objects
  • Password protection support maintained
  • Comprehensive parameter validation

📝 Minor Improvements

  1. Type Consistency: Mixed usage of list[Type] and List[Type] - consider standardizing on one approach
  2. Error Messages: Could be more specific about what failed (e.g., which page failed to render)
  3. Documentation: The function docstring could better explain the pypdfium2 migration benefits

Performance & Memory Impact

Positive Changes

  • pypdfium2 typically has better performance than pdf2image
  • Proper resource cleanup (when bugs are fixed)

⚠️ Potential Concerns

  • Default DPI increase will use 4x more memory
  • Resource leaks from unclosed page objects
  • Delayed validation could waste resources on invalid parameters

Testing Recommendations

  1. Unit Tests: Verify the type annotation fix with runtime calls
  2. Integration Tests: Test with large PDFs to ensure memory doesn't accumulate
  3. Error Handling Tests: Verify proper cleanup when rendering fails
  4. Performance Tests: Compare memory usage and speed vs. pdf2image
  5. API Compatibility Tests: Ensure external callers get expected DPI

Overall Assessment

This migration to pypdfium2 is a good architectural choice for performance, but the implementation has several critical bugs that need immediate attention. The resource leaks and runtime errors will cause production issues. Once these are fixed, this should be a solid improvement.

Recommendation: Fix the critical issues before merging, as they will cause runtime failures and memory problems in production.

Copy link
Contributor

@cragwolfe cragwolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though did Claude get confused by resolved cursor comments?)

@qued qued merged commit 72ba3d8 into main Jan 16, 2026
17 checks passed
@qued qued deleted the enhancement/use-pdfium-for-rendering-pdf branch January 16, 2026 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants