-
Notifications
You must be signed in to change notification settings - Fork 51
fix: respect declared font encoding over base font mapping #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: respect declared font encoding over base font mapping #188
Conversation
|
✅ DCO Check Passed Thanks @silverl, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
When a PDF font declares a specific encoding (MacRoman, WinAnsi, etc.) but matches a known base font like Arial, the code was using the base font's built-in character mapping instead of the declared encoding. This caused character 0xa1 to be extracted as ¡ (WinAnsi) instead of ° (MacRoman) for fonts declaring MacRomanEncoding. The fix checks if the font declares a specific encoding before falling back to the base font's mapping. Fixes: docling-project#187 Signed-off-by: Larry Silverman <lsilverman@trackabout.com>
7b34741 to
5fe1caa
Compare
Ground truth files were generated with the buggy WinAnsi fallback behavior. Regenerating them captures the correct character decoding after respecting declared font encodings. Signed-off-by: Larry Silverman <lsilverman@trackabout.com>
Note on Ground Truth Diff SizeThe ground truth regeneration commit shows a large diff (+430k/-110k lines, 87 files). This is expected and here's why: Encoding Fix Changes (the actual fix)These are the semantic changes from the MacRoman encoding fix:
Structural Changes (unrelated to this PR)The v2 ground truth was last regenerated at v4.1.0 (commit
Regenerating ground truth now captures these accumulated improvements, causing the large diff. The encoding fix is correct and all 14 tests pass. |
|
@silverl Can you provide also a test-case? Add one (or more) single page pdf's in which this occurs. I always like to have regression tests for these font updates. |
|
would be interesting to see if it also solves this issue: #140 |
Hi, @PeterStaar-IBM. I attached such a file to Issue #187 when logging the issue. Further, there are already files in the existing test suite that have the same issue (French language), but were approved as passing tests. I believe this is called out in my PR. Is there something more you're looking for? Perhaps I'm misunderstanding your request. Thanks. |
|
@silverl I took your changes in this PR so we can see the real differences in the regression. |
|
@silverl Thanks for this PR, it gave me the good direction and I have now
|
|
That's great news, thanks @PeterStaar-IBM. I checked on pypi, but no build has appeared yet. It looks like the GitHub action might have stalled out. The 4.7.2 took 1h15m, but the current build is still going after 6 hours. |
|
@silverl seems the macosx x86 was blocking us, it should now be published |
|
Confirmed fixed! I've synced up with pypi and docling-parse 4.7.3 and it's working great. |
Summary
Fixes #187 - MacRomanEncoding fonts incorrectly decoded using WinAnsi mappings.
When a PDF embeds a subsetted font (e.g.,
PXAAAB+ArialMT) with a declared encoding like/Encoding /MacRomanEncoding, the parser was ignoring this declaration and using character mappings from the matched base font's.afmfile instead.Root cause: In
get_correct_character(), whenbfonts.has_corresponding_font()matches a known font, the code usedfm.to_utf8(c)which applies the base font's internal mapping (from the.afmfile). For Arial, this is effectively WinAnsi encoding, causing byte0xa1to become¡(exclamdown) instead of°(degree symbol) for MacRoman-encoded fonts.The Fix
Part 1 (main fix): In
get_correct_character(), check if the font declares a specific encoding (MacRoman, WinAnsi, MacExpert, or Standard) before falling back to base font mappings. If a declared encoding exists, use the encoding table instead.Part 2 (defensive): In
init_encoding(), extract/BaseEncodingfrom encoding dictionary objects. Added.is_string()type check for safety.Test Impact
Some regression tests will fail because their ground truth files were generated with the buggy code. Affected PDFs (like
form_fields.pdf) contain MacRoman-encoded fonts that now decode correctly. To update:GENERATE = Trueintest_parse.pyandtest_parse_v2.pyGENERATE = FalseA minimal test PDF (
macroman_encoding_bug_demo.pdf) demonstrating this bug is attached to issue #187.Example
PDF with
/Encoding /MacRomanEncodingand byte0xa1:¡(WinAnsi mapping from Arial.afm)°(correct MacRoman mapping)Why This Is Correct
Per PDF specification (Section 5.5.5), a declared encoding overrides any default encoding associated with the font. The fix ensures the parser respects this hierarchy: