@PeterStaar-IBM PeterStaar-IBM commented Jan 10, 2026

Removal of v1 dependencies

All references to v1 have been removed.

Removal of bare v2 test dependencies

  1. Removed test_parse_v2 and brought test_parse up to the same coverage level.

  2. Exposed Annotations in PdfDocument API (docling_parse/pdf_parser.py)

Added structured Pydantic models:

  • PdfTocEntry: Recursive model for table of contents entries with title, level, page, and children
  • PdfAnnotations: Main model containing form (dict), language (str), meta_xml (str), and table_of_contents (list of PdfTocEntry)
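
For readers who have not opened the diff, the two models described above might look roughly like the sketch below. It assumes Pydantic v2; the exact field types and the defaults are assumptions made for illustration, since the description only gives field names and rough types.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field


class PdfTocEntry(BaseModel):
    """One table-of-contents entry; children nest recursively."""

    title: str
    level: int
    page: int
    # Default to an empty list so leaf entries need no explicit children.
    children: List["PdfTocEntry"] = Field(default_factory=list)


class PdfAnnotations(BaseModel):
    """Document-level annotations (defaults here are assumptions)."""

    form: Dict[str, Any] = Field(default_factory=dict)
    language: str = ""
    meta_xml: str = ""
    table_of_contents: List[PdfTocEntry] = Field(default_factory=list)


# Build a small two-level table of contents by hand.
toc = PdfTocEntry(
    title="1 Introduction",
    level=0,
    page=1,
    children=[PdfTocEntry(title="1.1 Scope", level=1, page=2)],
)
annotations = PdfAnnotations(language="en", table_of_contents=[toc])
```

The recursive `children` field is what lets a nested outline validate in one pass instead of being flattened.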

Added to PdfDocument class:

  • _annotations cache field in __init__
  • get_annotations() method that retrieves and caches annotations from the underlying parser
  • _to_pdf_toc_entry() helper method to convert raw TOC dicts to structured PdfTocEntry objects
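
The retrieve-once-then-cache behaviour and the recursive TOC conversion listed above can be sketched as follows. This is a dependency-free illustration, not the code in docling_parse/pdf_parser.py: `TocEntry`, `FakeParser`, and `Document` are hypothetical stand-ins (a dataclass replaces the Pydantic model), and the raw dict keys are assumed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TocEntry:
    """Dependency-free stand-in for the PR's PdfTocEntry model."""

    title: str
    level: int
    page: int
    children: List["TocEntry"] = field(default_factory=list)


class FakeParser:
    """Hypothetical underlying parser; counts how often it is queried."""

    def __init__(self) -> None:
        self.calls = 0

    def get_annotations(self) -> Dict[str, Any]:
        self.calls += 1
        return {
            "language": "en",
            "table_of_contents": [
                {"title": "1 Introduction", "level": 0, "page": 1,
                 "children": [{"title": "1.1 Scope", "level": 1, "page": 2}]},
            ],
        }


class Document:
    """Illustrative wrapper, not the real PdfDocument."""

    def __init__(self, parser: FakeParser) -> None:
        self._parser = parser
        # Cache field: stays None until annotations are first requested.
        self._annotations: Optional[Dict[str, Any]] = None

    def get_annotations(self) -> Dict[str, Any]:
        if self._annotations is None:
            raw = self._parser.get_annotations()
            toc = [self._to_toc_entry(e) for e in raw.get("table_of_contents") or []]
            self._annotations = {**raw, "table_of_contents": toc}
        return self._annotations

    def _to_toc_entry(self, raw: Dict[str, Any]) -> TocEntry:
        # Recurse into children so nested outlines keep their shape.
        return TocEntry(
            title=raw.get("title", ""),
            level=raw.get("level", 0),
            page=raw.get("page", 0),
            children=[self._to_toc_entry(c) for c in raw.get("children") or []],
        )


parser = FakeParser()
doc = Document(parser)
first = doc.get_annotations()
second = doc.get_annotations()  # served from the cache, parser not called again
```
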

  3. Added 9 Comprehensive Tests (tests/test_parse.py)

All tests now match or exceed the coverage of test_parse_v2.py:

  1. test_load_from_bytesio_lazy: Tests loading PDFs from BytesIO with lazy=True
  2. test_load_from_bytesio_eager: Tests loading PDFs from BytesIO with lazy=False
  3. test_list_loaded_keys_lifecycle: Tests document key management (load/unload lifecycle)
  4. test_get_page_individually: Tests accessing specific pages without loading all pages
  5. test_unload_individual_pages: Tests unloading specific page ranges
  6. test_boundary_types: Tests loading with CROP_BOX and MEDIA_BOX boundary types
  7. test_lazy_vs_eager_pages_identical: Verifies lazy and eager loading produce identical results
  8. test_get_annotations: Tests the new annotations API
  9. test_annotations_match_v2_groundtruth: Verifies annotations match v2 parser groundtruth files

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

mergify bot commented Jan 10, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
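
To see why the original title would not have satisfied this rule while the retitled ones do, the pattern can be checked locally. This is a small sketch using Python's `re` module; `is_conventional` is a hypothetical helper name, and Mergify's own matching engine may differ in details.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


titles = [
    "Refactor to remove v1",
    "feat!: Refactor to remove v1",
    "feat!: Remove deprecated v1 api",
]
results = {title: is_conventional(title) for title in titles}
```

The `!` after the type marks a breaking change, which fits a PR that removes the v1 API.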

@PeterStaar-IBM PeterStaar-IBM changed the title from "Refactor to remove v1" to "feat!: Refactor to remove v1" Jan 10, 2026

github-actions bot commented Jan 10, 2026

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

@dolfim-ibm dolfim-ibm changed the title from "feat!: Refactor to remove v1" to "feat!: Remove deprecated v1 api" Jan 12, 2026
dolfim-ibm previously approved these changes Jan 12, 2026

@dolfim-ibm dolfim-ibm left a comment

lgtm
