@silverl silverl commented Jan 9, 2026

This PR was researched and coded by Claude Code.

Summary

Fixes #187 - MacRomanEncoding fonts incorrectly decoded using WinAnsi mappings.

When a PDF embeds a subsetted font (e.g., PXAAAB+ArialMT) with a declared encoding like /Encoding /MacRomanEncoding, the parser was ignoring this declaration and using character mappings from the matched base font's .afm file instead.

Root cause: In get_correct_character(), when bfonts.has_corresponding_font() matches a known font, the code used fm.to_utf8(c) which applies the base font's internal mapping (from the .afm file). For Arial, this is effectively WinAnsi encoding, causing byte 0xa1 to become ¡ (exclamdown) instead of ° (degree symbol) for MacRoman-encoded fonts.

The Fix

Part 1 (main fix): In get_correct_character(), check if the font declares a specific encoding (MacRoman, WinAnsi, MacExpert, or Standard) before falling back to base font mappings. If a declared encoding exists, use the encoding table instead.

Part 2 (defensive): In init_encoding(), extract /BaseEncoding from encoding dictionary objects. Added .is_string() type check for safety.
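
A rough Python sketch of the Part 2 idea, not the actual C++ in init_encoding(): a font's /Encoding entry may be a plain name or a dictionary carrying /BaseEncoding, and the value is only trusted if it is a string. The function name base_encoding_name and the dict-based stand-in for a parsed PDF object are assumptions made for illustration.

```python
from typing import Any, Optional


def base_encoding_name(encoding_obj: Any) -> Optional[str]:
    """Return the declared encoding name, or None if none can be trusted.

    Hypothetical helper: `encoding_obj` stands in for a parsed /Encoding
    entry, modelled here as either a str (a PDF name) or a dict.
    """
    # Case 1: /Encoding is a name such as "/MacRomanEncoding".
    if isinstance(encoding_obj, str):
        return encoding_obj
    # Case 2: /Encoding is a dictionary that may carry /BaseEncoding.
    if isinstance(encoding_obj, dict):
        base = encoding_obj.get("/BaseEncoding")
        # Defensive type check, analogous to the .is_string() guard.
        if isinstance(base, str):
            return base
    return None
```

Returning None rather than raising lets the caller fall through to the next decoding strategy when the entry is missing or malformed.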

Test Impact

Some regression tests will fail because their ground truth files were generated with the buggy code. Affected PDFs (like form_fields.pdf) contain MacRoman-encoded fonts that now decode correctly. To update:

  1. Set GENERATE = True in test_parse.py and test_parse_v2.py
  2. Run tests to regenerate ground truth
  3. Set GENERATE = False
  4. Commit updated ground truth files
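
The regenerate-then-compare workflow above follows a common ground-truth pattern, sketched below. This is not the code in test_parse.py; the helper name check_against_ground_truth, the JSON file layout, and the generate parameter are assumptions for illustration.

```python
import json
from pathlib import Path

GENERATE = False  # flip to True once to rewrite ground truth, then back


def check_against_ground_truth(
    result: dict, truth_path: Path, generate: bool = GENERATE
) -> bool:
    """Write `result` as the new ground truth, or compare against it."""
    if generate:
        # Regeneration mode: overwrite the stored expectation.
        truth_path.write_text(
            json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        return True
    # Normal mode: the parser output must equal the stored expectation.
    expected = json.loads(truth_path.read_text(encoding="utf-8"))
    return result == expected
```

With this pattern, a ground truth file generated by buggy code keeps passing until it is regenerated, which is why the affected tests need the steps listed above.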

A minimal test PDF (macroman_encoding_bug_demo.pdf) demonstrating this bug is attached to issue #187.

Example

PDF with /Encoding /MacRomanEncoding and byte 0xa1:

  • Before: ¡ (WinAnsi mapping from Arial.afm)
  • After: ° (correct MacRoman mapping)
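
The before/after difference can be reproduced with Python's built-in codecs. This is only an approximation: cp1252 and mac_roman are close to, but not identical with, the PDF WinAnsiEncoding and MacRomanEncoding tables, though they agree with the values quoted here for byte 0xa1.

```python
# Decode the same byte under the two encodings discussed in this PR.
raw = bytes([0xA1])

as_winansi = raw.decode("cp1252")      # stand-in for WinAnsiEncoding
as_macroman = raw.decode("mac_roman")  # stand-in for MacRomanEncoding

print(f"WinAnsi-like: {as_winansi!r}, MacRoman-like: {as_macroman!r}")
```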

Why This Is Correct

Per the PDF specification (Section 5.5.5), a declared encoding overrides any default encoding associated with the font. The fix ensures the parser respects this hierarchy:

  1. ToUnicode CMap (if present) - unchanged
  2. Differences array (if present) - unchanged
  3. Declared encoding (MacRoman/WinAnsi/etc.) - NEW: takes precedence
  4. Base font mapping - fallback for UNKNOWN encoding only
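
The hierarchy above can be sketched in Python as a chain of fallbacks. This is an illustration under stated assumptions, not the C++ in get_correct_character(): FontInfo, decode_char, and DECLARED_CODECS are hypothetical names, only two of the four declared encodings are mapped, and the Python codecs merely approximate the PDF encoding tables.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical mapping from declared PDF encodings to Python codecs.
DECLARED_CODECS = {
    "MacRomanEncoding": "mac_roman",
    "WinAnsiEncoding": "cp1252",
}


@dataclass
class FontInfo:
    """Hypothetical, simplified view of a parsed PDF font."""
    to_unicode: Dict[int, str] = field(default_factory=dict)
    differences: Dict[int, str] = field(default_factory=dict)
    declared_encoding: Optional[str] = None
    base_font_map: Dict[int, str] = field(default_factory=dict)


def decode_char(font: FontInfo, code: int) -> str:
    # 1. ToUnicode CMap wins when it covers the code.
    if code in font.to_unicode:
        return font.to_unicode[code]
    # 2. Differences array overrides individual codes.
    if code in font.differences:
        return font.differences[code]
    # 3. A declared encoding takes precedence over the base font.
    codec = DECLARED_CODECS.get(font.declared_encoding or "")
    if codec is not None:
        return bytes([code]).decode(codec, errors="replace")
    # 4. Base font mapping is the fallback for unknown encodings only.
    return font.base_font_map.get(code, "\ufffd")
```

For a font with declared_encoding="MacRomanEncoding" and a base font map that sends 0xA1 to ¡, decode_char returns ° because step 3 is reached before step 4.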

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

DCO Check Passed

Thanks @silverl, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Jan 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
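
To check a title against this rule locally, the same pattern can be applied with Python's re module. The helper name is_conventional and the sample titles are made up for illustration; Mergify's own matching semantics are assumed to be equivalent for this anchored pattern.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


for sample in ("fix: respect declared font encoding", "Update font handling"):
    print(sample, "->", is_conventional(sample))
```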

When a PDF font declares a specific encoding (MacRoman, WinAnsi, etc.)
but matches a known base font like Arial, the code was using the base
font's built-in character mapping instead of the declared encoding.

This caused character 0xa1 to be extracted as ¡ (WinAnsi) instead of °
(MacRoman) for fonts declaring MacRomanEncoding.

The fix checks if the font declares a specific encoding before falling
back to the base font's mapping.

Fixes: docling-project#187
Signed-off-by: Larry Silverman <lsilverman@trackabout.com>
@silverl silverl force-pushed the fix/respect-declared-font-encoding branch from 7b34741 to 5fe1caa Compare January 9, 2026 20:32
Ground truth files were generated with the buggy WinAnsi fallback
behavior. Regenerating them captures the correct character decoding
after respecting declared font encodings.

Signed-off-by: Larry Silverman <lsilverman@trackabout.com>
@silverl
Author

silverl commented Jan 9, 2026

Note on Ground Truth Diff Size

The ground truth regeneration commit shows a large diff (+430k/-110k lines, 87 files). This is expected and here's why:

Encoding Fix Changes (the actual fix)

These are the semantic changes from the MacRoman encoding fix:

  • l'ºle → l'Île (French: Prince Edward Island)
  • (French: "numéro" abbreviation)
  • ¸ → À (A with grave accent)

Structural Changes (unrelated to this PR)

The v2 ground truth was last regenerated at v4.1.0 (commit 8872e73). Since then, the parser has added new output fields:

Version | Feature
v4.3.0  | line_cells field added to page output
v4.4.0  | Sanitation parameters updated
v4.5.0  | Performance tooling additions

Regenerating ground truth now captures these accumulated improvements, causing the large diff. The encoding fix is correct and all 14 tests pass.

@PeterStaar-IBM
Member

@silverl Can you also provide a test case? Add one (or more) single-page PDFs in which this occurs. I always like to have regression tests for these font updates.

@PeterStaar-IBM PeterStaar-IBM self-requested a review January 11, 2026 05:54
@PeterStaar-IBM
Member

It would be interesting to see if it also solves this issue: #140

@silverl
Author

silverl commented Jan 12, 2026

@silverl Can you also provide a test case? Add one (or more) single-page PDFs in which this occurs. I always like to have regression tests for these font updates.

Hi, @PeterStaar-IBM. I attached such a file to Issue #187 when logging the issue. Further, there are already files in the existing test suite that have the same issue (French language), but were approved as passing tests. I believe this is called out in my PR. Is there something more you're looking for? Perhaps I'm misunderstanding your request. Thanks.

@PeterStaar-IBM
Member

@silverl I took your changes in this PR so we can see the real differences in the regression.

@PeterStaar-IBM
Member

@silverl Thanks for this PR. It pointed me in the right direction, and I have now:

  1. updated your code slightly here (fix: updated the font-parsing #193). It was missing a small part needed to make sure we don't break other regression tests
  2. added the test PDF as a regression test

@silverl
Author

silverl commented Jan 13, 2026

That's great news, thanks @PeterStaar-IBM. I checked on PyPI, but no build has appeared yet. It looks like the GitHub action might have stalled out. The 4.7.2 build took 1h15m, but the current build is still going after 6 hours.

@PeterStaar-IBM
Member

@silverl It seems the macOS x86 build was blocking us; it should now be published.

@silverl
Author

silverl commented Jan 14, 2026

Confirmed fixed! I've synced up with pypi and docling-parse 4.7.3 and it's working great.

Development

Successfully merging this pull request may close these issues.

MacRomanEncoding ignored when font matches known base font (e.g., Arial)