Rotation metadata on scanned PDFs not handled by get_bitmap_rects in pypdfium2 backend #2038

@AndrewTsai0406

Bug

While working with scanned PDF files, I noticed that some pages include rotation metadata rather than being physically rotated. In other words, the page you see in a viewer is visually rotated, but the underlying bitmap remains unrotated.

This causes an issue when performing OCR page detection with the pypdfium2 backend, as its get_bitmap_rects method does not account for the rotation metadata. As a result, the detected coordinates are misaligned with the displayed orientation of the page.
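To illustrate the kind of mapping that is missing, here is a minimal sketch of how a rectangle detected on the unrotated bitmap could be transformed into the displayed orientation. It is not docling's or pypdfium2's code: `rotate_rect` is a hypothetical helper, and it assumes image-style coordinates (origin at the top-left, y increasing downward) and a clockwise rotation in multiples of 90 degrees, which is how the PDF `/Rotate` entry is defined.

```python
from typing import Tuple

# (left, top, right, bottom) with the origin at the top-left corner
Rect = Tuple[float, float, float, float]


def rotate_rect(rect: Rect, page_w: float, page_h: float, rotation: int) -> Rect:
    """Map a rect from the unrotated page to the page as displayed.

    Hypothetical helper for illustration only.
    page_w and page_h are the dimensions of the UNROTATED page;
    rotation is the clockwise page rotation in degrees (0, 90, 180, 270).
    """
    left, top, right, bottom = rect
    rotation %= 360
    corners = [(left, top), (right, bottom)]

    if rotation == 0:
        pts = corners
    elif rotation == 90:
        # Displayed page is page_h wide; old top edge becomes the right edge
        pts = [(page_h - y, x) for x, y in corners]
    elif rotation == 180:
        pts = [(page_w - x, page_h - y) for x, y in corners]
    elif rotation == 270:
        # Old top edge becomes the left edge
        pts = [(y, page_w - x) for x, y in corners]
    else:
        raise ValueError("rotation must be a multiple of 90 degrees")

    # Re-normalize so that left <= right and top <= bottom
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))
```

For a 90 or 270 degree rotation the width and height of the rectangle swap, which is consistent with the misplaced boxes visible in the debug images.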

For example, I have two scanned PDFs: one includes rotation metadata and the other does not. When I enable settings.debug.visualize_ocr = True and inspect the debug output, the bounding boxes are significantly misaligned. Because these scanned pages contain no directly extractable text elements, the misalignment leaves many result items with missing information.

Expected:

Image

What I get:

Image

Steps to reproduce

...

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable debug visualizations to inspect layout, cell, and OCR bounding boxes
settings.debug.visualize_layout = True
settings.debug.visualize_raw_layout = True
# settings.debug.visualize_tables = True
settings.debug.visualize_cells = True
settings.debug.visualize_ocr = True

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.images_scale = 1.0
pipeline_options.generate_picture_images = True
pipeline_options.generate_table_images = True
pipeline_options.table_structure_options = TableStructureOptions(mode=TableFormerMode.ACCURATE)
ocr_options = RapidOcrOptions(force_full_page_ocr=False)
pipeline_options.ocr_options = ocr_options

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend)
    }
)

pdf_paths = [
    './pdfs/scan_smpl.pdf',
    './pdfs/scan_smpl_rotated.pdf'
]

for pdf_path in pdf_paths:
    conv_res = doc_converter.convert(pdf_path)
    # Print inside the loop so both documents are exported, not only the last one
    print(conv_res.document.export_to_markdown())

Use the following two PDFs for comparison:

scan_smpl.pdf
scan_smpl_rotated.pdf

Docling version

v2.43.0

Python version

3.12
