fix: respect declared font encoding over base font mapping #188

silverl · 2026-01-09T20:12:33Z

This PR was researched and coded by Claude Code.

Summary

Fixes #187 - MacRomanEncoding fonts incorrectly decoded using WinAnsi mappings.

When a PDF embeds a subsetted font (e.g., PXAAAB+ArialMT) with a declared encoding like /Encoding /MacRomanEncoding, the parser was ignoring this declaration and using character mappings from the matched base font's .afm file instead.

Root cause: In get_correct_character(), when bfonts.has_corresponding_font() matches a known font, the code used fm.to_utf8(c) which applies the base font's internal mapping (from the .afm file). For Arial, this is effectively WinAnsi encoding, causing byte 0xa1 to become ¡ (exclamdown) instead of ° (degree symbol) for MacRoman-encoded fonts.

The Fix

Part 1 (main fix): In get_correct_character(), check if the font declares a specific encoding (MacRoman, WinAnsi, MacExpert, or Standard) before falling back to base font mappings. If a declared encoding exists, use the encoding table instead.

Part 2 (defensive): In init_encoding(), extract /BaseEncoding from encoding dictionary objects. Added .is_string() type check for safety.
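
The lookup in Part 2 can be illustrated with a small Python sketch. This is not the parser's code: it assumes the font dictionary is a plain Python dict, and `declared_encoding` and `KNOWN_ENCODINGS` are hypothetical names chosen for the example. The set of encodings follows the list given in Part 1.

```python
from typing import Optional

# Encoding names listed in Part 1 of this PR description.
KNOWN_ENCODINGS = {
    "/MacRomanEncoding",
    "/WinAnsiEncoding",
    "/MacExpertEncoding",
    "/StandardEncoding",
}


def declared_encoding(font: dict) -> Optional[str]:
    """Return the encoding a font declares, or None if it declares none.

    Hypothetical helper: `font` stands in for a parsed PDF font dictionary.
    """
    enc = font.get("/Encoding")

    # Case 1: /Encoding is a plain name, e.g. /MacRomanEncoding.
    if isinstance(enc, str):
        return enc if enc in KNOWN_ENCODINGS else None

    # Case 2: /Encoding is a dictionary that may carry /BaseEncoding.
    if isinstance(enc, dict):
        base = enc.get("/BaseEncoding")
        # Type check first: a malformed PDF could store a non-string here.
        if isinstance(base, str) and base in KNOWN_ENCODINGS:
            return base

    return None
```

Returning `None` for anything unexpected keeps the caller on the existing base-font fallback path rather than raising on malformed input.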

Test Impact

Some regression tests will fail because their ground truth files were generated with the buggy code. Affected PDFs (like form_fields.pdf) contain MacRoman-encoded fonts that now decode correctly. To update:

1. Set GENERATE = True in test_parse.py and test_parse_v2.py
2. Run tests to regenerate ground truth
3. Set GENERATE = False
4. Commit updated ground truth files
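
For readers unfamiliar with this pattern, here is a minimal sketch of how a `GENERATE` switch typically works in a ground-truth test. It is an assumption about the general technique, not a copy of test_parse.py; the helper name `check_against_ground_truth` and the JSON file layout are hypothetical.

```python
import json
from pathlib import Path

# When True, the test overwrites the stored ground truth instead of comparing.
GENERATE = False


def check_against_ground_truth(actual: dict, gt_path: Path,
                               generate: bool = GENERATE) -> bool:
    """Compare parser output with stored ground truth, or regenerate it.

    Hypothetical helper for illustration only.
    """
    if generate:
        # Regeneration mode: store the current output as the new reference.
        gt_path.write_text(
            json.dumps(actual, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        return True

    # Normal mode: the current output must equal the stored reference.
    expected = json.loads(gt_path.read_text(encoding="utf-8"))
    return actual == expected
```

With this shape, a ground truth file written by buggy code keeps passing until it is regenerated, which is why the affected files need to be updated after the fix.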

A minimal test PDF (macroman_encoding_bug_demo.pdf) demonstrating this bug is attached to issue #187.

Example

PDF with /Encoding /MacRomanEncoding and byte 0xa1:

Before: ¡ (WinAnsi mapping from Arial.afm)
After: ° (correct MacRoman mapping)
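
The same difference can be reproduced outside the parser with Python's built-in codecs. This is only an approximation: `cp1252` and `mac_roman` are Python's codecs and are assumed here to be close stand-ins for the PDF WinAnsiEncoding and MacRomanEncoding tables; they agree with them for byte 0xa1.

```python
raw = bytes([0xA1])

# Python codecs used as stand-ins for the PDF encoding tables (assumption).
as_winansi = raw.decode("cp1252")
as_macroman = raw.decode("mac_roman")

print(f"WinAnsi:  {as_winansi} (U+{ord(as_winansi):04X})")    # ¡ (U+00A1)
print(f"MacRoman: {as_macroman} (U+{ord(as_macroman):04X})")  # ° (U+00B0)
```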

Why This Is Correct

Per PDF specification (Section 5.5.5), a declared encoding overrides any default encoding associated with the font. The fix ensures the parser respects this hierarchy:

1. ToUnicode CMap (if present) - unchanged
2. Differences array (if present) - unchanged
3. Declared encoding (MacRoman/WinAnsi/etc.) - NEW: takes precedence
4. Base font mapping - fallback for UNKNOWN encoding only
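
The hierarchy above can be sketched as a single lookup function. This is a simplified Python illustration, not the parser's implementation: `decode_byte` and its parameters are hypothetical, the Differences array is modelled as a direct code-to-character map (in a real PDF it maps codes to glyph names), and the declared encoding is represented by a Python codec name.

```python
from typing import Dict, Optional


def decode_byte(
    code: int,
    to_unicode: Optional[Dict[int, str]] = None,
    differences: Optional[Dict[int, str]] = None,
    declared_codec: Optional[str] = None,
    base_font_map: Optional[Dict[int, str]] = None,
) -> str:
    """Resolve one character code (0-255) using the precedence order above."""
    # 1. ToUnicode CMap wins when it covers the code.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # 2. Differences array overrides individual codes.
    if differences and code in differences:
        return differences[code]
    # 3. Declared encoding, if any, is used before the base font.
    if declared_codec is not None:
        return bytes([code]).decode(declared_codec, errors="replace")
    # 4. Base font mapping only when no encoding is declared.
    if base_font_map and code in base_font_map:
        return base_font_map[code]
    return "\ufffd"  # Unicode replacement character for unmapped codes
```

For example, `decode_byte(0xA1, declared_codec="mac_roman", base_font_map={0xA1: "¡"})` resolves through step 3 and never reaches the base font map.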

github-actions · 2026-01-09T20:12:44Z

✅ DCO Check Passed

Thanks @silverl, all your commits are properly signed off. 🎉

mergify · 2026-01-09T20:13:08Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
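
As an illustration of what this rule accepts, the pattern can be tried in Python. This assumes that Mergify's `~=` operator behaves like a regular-expression search with Python-compatible syntax; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix: respect declared font encoding over base font mapping"))
```

The type is case-sensitive and must be followed by a colon, optionally after a parenthesised scope and a `!` marker.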

When a PDF font declares a specific encoding (MacRoman, WinAnsi, etc.) but matches a known base font like Arial, the code was using the base font's built-in character mapping instead of the declared encoding. This caused character 0xa1 to be extracted as ¡ (WinAnsi) instead of ° (MacRoman) for fonts declaring MacRomanEncoding. The fix checks if the font declares a specific encoding before falling back to the base font's mapping.

Fixes: docling-project#187
Signed-off-by: Larry Silverman <lsilverman@trackabout.com>

Ground truth files were generated with the buggy WinAnsi fallback behavior. Regenerating them captures the correct character decoding after respecting declared font encodings.

Signed-off-by: Larry Silverman <lsilverman@trackabout.com>

silverl · 2026-01-09T21:33:56Z

Note on Ground Truth Diff Size

The ground truth regeneration commit shows a large diff (+430k/-110k lines, 87 files). This is expected and here's why:

Encoding Fix Changes (the actual fix)

These are the semantic changes from the MacRoman encoding fix:

l'ºle → l'Île (French: Prince Edward Island)
N¡ → N° (French: "numéro" abbreviation)
¸ → À (A with grave accent)

Structural Changes (unrelated to this PR)

The v2 ground truth was last regenerated at v4.1.0 (commit 8872e73). Since then, the parser has added new output fields:

Version	Feature
v4.3.0	`line_cells` field added to page output
v4.4.0	Sanitation parameters updated
v4.5.0	Performance tooling additions

Regenerating ground truth now captures these accumulated improvements, causing the large diff. The encoding fix is correct and all 14 tests pass.

PeterStaar-IBM · 2026-01-11T05:54:44Z

@silverl Can you provide also a test-case? Add one (or more) single page pdf's in which this occurs. I always like to have regression tests for these font updates.

PeterStaar-IBM · 2026-01-11T07:32:40Z

would be interesting to see if it also solves this issue: #140

silverl · 2026-01-12T14:18:03Z

> @silverl Can you provide also a test-case? Add one (or more) single page pdf's in which this occurs. I always like to have regression tests for these font updates.

Hi, @PeterStaar-IBM. I attached such a file to Issue #187 when logging the issue. Further, there are already files in the existing test suite that have the same issue (French language), but were approved as passing tests. I believe this is called out in my PR. Is there something more you're looking for? Perhaps I'm misunderstanding your request. Thanks.

PeterStaar-IBM · 2026-01-12T16:19:18Z

@silverl I took your changes in this PR so we can see the real differences in the regression.

PeterStaar-IBM · 2026-01-13T05:26:17Z

@silverl Thanks for this PR, it pointed me in the right direction. I have now:

- updated your code slightly here (fix: updated the font-parsing #193). It was missing a small part needed to make sure we don't break other regression tests
- added the test PDF as a regression test

silverl · 2026-01-13T15:35:19Z

That's great news, thanks @PeterStaar-IBM. I checked on PyPI, but no build has appeared yet. It looks like the GitHub action might have stalled. The 4.7.2 build took 1h15m, but the current one is still running after 6 hours.

PeterStaar-IBM · 2026-01-14T15:44:11Z

@silverl It seems the macOS x86 build was blocking us; it should now be published.

silverl · 2026-01-14T19:45:40Z

Confirmed fixed! I've synced up with pypi and docling-parse 4.7.3 and it's working great.

silverl force-pushed the fix/respect-declared-font-encoding branch from 7b34741 to 5fe1caa Compare January 9, 2026 20:32

PeterStaar-IBM self-requested a review January 11, 2026 05:54

PeterStaar-IBM closed this Jan 13, 2026
