fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata #2039

AndrewTsai0406 · 2025-08-05T09:29:02Z

I fixed the OCR bounding box misalignment issue by reading the rotation metadata from scanned PDF pages and applying the necessary adjustments in the pypdfium2 backend. This ensures that bounding boxes align correctly with the visually rotated pages during OCR detection.
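For readers who want to see the kind of adjustment involved, here is a minimal sketch of mapping a bounding box from unrotated page coordinates into the coordinates of a page displayed with a clockwise rotation of 0, 90, 180 or 270 degrees. It is not the code from this PR: the function name `rotate_bbox`, the `(left, top, right, bottom)` tuple layout and the top-left origin are all assumptions made for illustration.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom), top-left origin


def rotate_bbox(bbox: BBox, page_width: float, page_height: float, rotation: int) -> BBox:
    """Map a bbox from the unrotated page into the clockwise-rotated view.

    Hypothetical helper for illustration only; `page_width` and `page_height`
    are the dimensions of the unrotated page.
    """
    rotation %= 360
    if rotation not in (0, 90, 180, 270):
        raise ValueError("rotation must be a multiple of 90 degrees")

    def rotate_point(x: float, y: float) -> Tuple[float, float]:
        if rotation == 90:
            return page_height - y, x
        if rotation == 180:
            return page_width - x, page_height - y
        if rotation == 270:
            return y, page_width - x
        return x, y

    left, top, right, bottom = bbox
    x0, y0 = rotate_point(left, top)
    x1, y1 = rotate_point(right, bottom)
    # Corners can swap order after rotation, so normalise them again.
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)
```

For 90 and 270 degrees the rotated view has its width and height swapped relative to the unrotated page, which is why the page size has to be passed in explicitly.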

Issue resolved by this Pull Request:
Resolves #2038

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

github-actions · 2025-08-05T09:29:12Z

✅ DCO Check Passed

Thanks @AndrewTsai0406, all your commits are properly signed off. 🎉

mergify · 2025-08-05T09:29:36Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
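As a rough illustration of why the earlier titles of this PR would not have satisfied the rule, the pattern can be tried locally. This sketch assumes the `~=` operator behaves like a Python `re.search`; the helper name `is_conventional` is made up for the example.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# The three titles this PR went through, oldest first.
titles = [
    "Fix OCR bounding box misalignment caused by rotation metadata",
    "fix(pypdfium2) Fix OCR bounding box misalignment caused by rotation metadata",
    "fix(pypdfium2): Fix OCR bounding box misalignment caused by rotation metadata",
]

for title in titles:
    print(is_conventional(title), title)
```

The first title has no type prefix (the match is case-sensitive) and the second is missing the colon after the scope, so only the third form matches.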

codecov · 2025-08-05T10:00:44Z

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/pypdfium2_backend.py	0.00%	7 Missing ⚠️

📢 Thoughts on this report? Let us know!

AndrewTsai0406 · 2025-08-05T11:39:59Z

Added a test case that looks for a word that doesn't exist in the result from the original pypdfium2 backend OCR pipeline.

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

AndrewTsai0406 · 2025-08-07T06:54:55Z

OCR results are usually not deterministic. Can anyone help me with testing, please? 😢

doc_pred = DoclingDocument(schema_name='DoclingDocument', version='1.5.0', name='ocr_test_rotation_mismatch', origin=DocumentOrig...], key_value_items=[], form_items=[], pages={1: PageItem(size=Size(width=595.0, height=842.0), image=None, page_no=1)})
doc_true = DoclingDocument(schema_name='DoclingDocument', version='1.5.0', name='ocr_test_rotation_mismatch', origin=DocumentOrig...], key_value_items=[], form_items=[], pages={1: PageItem(size=Size(width=595.0, height=842.0), image=None, page_no=1)})
fuzzy = True

    def verify_docitems(doc_pred: DoclingDocument, doc_true: DoclingDocument, fuzzy: bool):
>       assert len(doc_pred.texts) == len(doc_true.texts), "Text lengths do not match."
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       AssertionError: Text lengths do not match.

tests/verify_utils.py:231: AssertionError
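One possible way to make such a test less sensitive to OCR noise is to check for the presence of a few expected words with a similarity threshold instead of comparing the full document structure. The sketch below is only a suggestion under that assumption; `contains_word` and its `threshold` parameter are hypothetical and not part of the repository's test utilities.

```python
from difflib import SequenceMatcher


def contains_word(ocr_text: str, word: str, threshold: float = 0.8) -> bool:
    """Return True if any whitespace-separated token is similar enough to `word`.

    Hypothetical helper: tolerates small OCR character errors by comparing
    tokens with difflib's similarity ratio rather than exact equality.
    """
    target = word.lower()
    for token in ocr_text.lower().split():
        if SequenceMatcher(None, token, target).ratio() >= threshold:
            return True
    return False
```

A test could then assert that a word only readable on the correctly rotated page is found in the exported text, without requiring the number of text items to match the ground truth exactly.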

AndrewTsai0406 changed the title ~~Fix OCR bounding box misalignment caused by rotation metadata~~ fix(pypdfium2) Fix OCR bounding box misalignment caused by rotation metadata Aug 5, 2025

AndrewTsai0406 changed the title ~~fix(pypdfium2) Fix OCR bounding box misalignment caused by rotation metadata~~ fix(pypdfium2): Fix OCR bounding box misalignment caused by rotation metadata Aug 5, 2025

AndrewTsai0406 changed the title ~~fix(pypdfium2): Fix OCR bounding box misalignment caused by rotation metadata~~ fix(pypdfium2): Fix OCR bounding box misalignment caused by mismatched rotation metadata Aug 5, 2025

AndrewTsai0406 added 3 commits August 5, 2025 21:01

Fix OCR bounding box misalignment caused by rotation metadata

f6ea505

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

Add rotation-mismatch scanned pdf test case

a6ee0a8

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

add ground truth for ocr_test_rotation_mismatch.pdf

ff094b1

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

AndrewTsai0406 force-pushed the bug/fix_pypdfium2_scanned_pdf_ocr_bbox_detection branch from 2a95dbd to ff094b1 Compare August 5, 2025 13:01

add ground truth for ocr_test_rotation_mismatch.pdf

706bce9

Signed-off-by: AndrewTsai0406 <tsai247365@gmail.com>

PeterStaar-IBM requested review from cau-git, vagenas and maxmnemonic August 11, 2025 12:39

PeterStaar-IBM assigned AndrewTsai0406 Aug 11, 2025
