fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata #2039

Open: wants to merge 4 commits into main

AndrewTsai0406
Contributor

I fixed the OCR bounding box misalignment issue by reading the rotation metadata from scanned PDF pages and applying the necessary adjustments in the pypdfium2 backend. This ensures that bounding boxes align correctly with the visually rotated pages during OCR detection.
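
To illustrate the kind of adjustment described above, here is a minimal sketch of mapping a bounding box from unrotated page space into the space of the visually rotated page. It is not the code from this pull request: the function name `rotate_rect`, the `(left, bottom, right, top)` tuple layout, and the bottom-left origin are assumptions made for illustration. In pypdfium2 the rotation value would presumably come from something like `page.get_rotation()`, which reports clockwise degrees.

```python
from typing import Tuple

# Hypothetical rectangle layout: (left, bottom, right, top), origin at bottom-left.
Rect = Tuple[float, float, float, float]


def rotate_rect(rect: Rect, page_width: float, page_height: float, rotation: int) -> Rect:
    """Map a rect from unrotated page space to the clockwise-rotated page.

    page_width and page_height are the dimensions of the UNROTATED page.
    rotation is the clockwise page rotation in degrees (0, 90, 180 or 270).
    """
    left, bottom, right, top = rect

    def map_point(x: float, y: float) -> Tuple[float, float]:
        if rotation == 0:
            return x, y
        if rotation == 90:
            # The old bottom-left corner ends up at the new top-left corner.
            return y, page_width - x
        if rotation == 180:
            return page_width - x, page_height - y
        if rotation == 270:
            # The old bottom-left corner ends up at the new bottom-right corner.
            return page_height - y, x
        raise ValueError(f"unsupported rotation: {rotation}")

    x0, y0 = map_point(left, bottom)
    x1, y1 = map_point(right, top)
    # Rotation can swap which corner is "lower left", so normalise the result.
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)
```

For an A4 page of 595 x 842 points with a 90 degree rotation, the rotated page is 842 wide and 595 high, so a box near the unrotated bottom-left corner lands near the top-left corner of the displayed page.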

Issue resolved by this Pull Request:
Resolves #2038

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Contributor

github-actions bot commented Aug 5, 2025

DCO Check Passed

Thanks @AndrewTsai0406, all your commits are properly signed off. 🎉

mergify bot commented Aug 5, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@AndrewTsai0406 AndrewTsai0406 changed the title Fix OCR bounding box misalignment caused by rotation metadata fix(pypdfium2) Fix OCR bounding box misalignment caused by rotation metadata Aug 5, 2025
@AndrewTsai0406 AndrewTsai0406 changed the title fix(pypdfium2) Fix OCR bounding box misalignment caused by rotation metadata fix(pypdfium2): Fix OCR bounding box misalignment caused by rotation metadata Aug 5, 2025
codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling/backend/pypdfium2_backend.py 0.00% 7 Missing ⚠️

@AndrewTsai0406
Contributor Author

AndrewTsai0406 commented Aug 5, 2025

[Image: Screenshot 2025-08-05 at 19 22 07]

Added a test case that checks for a word that is missing from the output of the original pypdfium2 backend OCR pipeline.

@AndrewTsai0406 AndrewTsai0406 changed the title fix(pypdfium2): Fix OCR bounding box misalignment caused by rotation metadata fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata Aug 5, 2025
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
@AndrewTsai0406 AndrewTsai0406 force-pushed the bug/fix_pypdfium2_scanned_pdf_ocr_bbox_detection branch from 2a95dbd to ff094b1 Compare August 5, 2025 13:01
Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>
@AndrewTsai0406
Contributor Author

AndrewTsai0406 commented Aug 7, 2025

OCR results are usually not deterministic. Can anyone help me with the testing, please? 😢

doc_pred = DoclingDocument(schema_name='DoclingDocument', version='1.5.0', name='ocr_test_rotation_mismatch', origin=DocumentOrig...], key_value_items=[], form_items=[], pages={1: PageItem(size=Size(width=595.0, height=842.0), image=None, page_no=1)})
doc_true = DoclingDocument(schema_name='DoclingDocument', version='1.5.0', name='ocr_test_rotation_mismatch', origin=DocumentOrig...], key_value_items=[], form_items=[], pages={1: PageItem(size=Size(width=595.0, height=842.0), image=None, page_no=1)})
fuzzy = True

    def verify_docitems(doc_pred: DoclingDocument, doc_true: DoclingDocument, fuzzy: bool):
>       assert len(doc_pred.texts) == len(doc_true.texts), "Text lengths do not match."
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       AssertionError: Text lengths do not match.

tests/verify_utils.py:231: AssertionError
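
One way to make such a test less sensitive to OCR variation is to assert that an expected word is present, allowing near matches, rather than requiring identical text counts. The following is only a sketch under that assumption: the helper `contains_word` and its `min_ratio` threshold are hypothetical and are not part of `tests/verify_utils.py`.

```python
from difflib import SequenceMatcher
from typing import Iterable


def contains_word(texts: Iterable[str], expected: str, min_ratio: float = 0.8) -> bool:
    """Return True if any token in `texts` is close enough to `expected`.

    Tokens are compared case-insensitively with difflib's similarity ratio,
    so small OCR slips (one wrong character) can still count as a match.
    """
    expected = expected.lower()
    for text in texts:
        for token in text.lower().split():
            # Drop punctuation that OCR often attaches to a word.
            token = token.strip(".,;:!?")
            if SequenceMatcher(None, token, expected).ratio() >= min_ratio:
                return True
    return False
```

A test could then check, for example, `contains_word(texts, "number")` on the list of extracted text strings; a suitable threshold depends on the OCR engine and would need tuning.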

Development

Successfully merging this pull request may close these issues.

Rotation metadata on scanned PDFs not handled by get_bitmap_rects in pypdfium2 backend