Description
Bug
While working with scanned PDF files, I noticed that some pages include rotation metadata rather than being physically rotated. In other words, the page you see in a viewer is visually rotated, but the underlying bitmap remains unrotated.
This causes an issue when performing OCR page detection with the pypdfium2 backend, as its get_bitmap_rects method does not account for the rotation metadata. As a result, the detected coordinates are misaligned with the displayed orientation of the page.
For example, I have two scanned PDFs: one includes rotation metadata and the other does not. With settings.debug.visualize_ocr = True, the debug output for the file with rotation metadata shows bounding boxes that are significantly misaligned. Because scanned pages contain no directly extractable text elements, the misaligned OCR cells leave many result items with missing text.
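To make the expected behaviour concrete, here is a minimal sketch of the coordinate mapping a rotation-aware backend would need. It is not Docling's or pypdfium2's actual code: the helper name `rotate_rect` is hypothetical, and it assumes top-left-origin coordinates, a clockwise rotation in multiples of 90 degrees (as the PDF /Rotate entry specifies), and rectangles reported in the unrotated page space. If the rectangles are reported in the other space, the inverse mapping applies.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom), top-left origin


def rotate_rect(rect: Rect, page_w: float, page_h: float, rotation: int) -> Rect:
    """Map a rect from unrotated page space to the displayed (rotated) space.

    Hypothetical helper for illustration only.
    `page_w` and `page_h` are the dimensions of the unrotated page.
    `rotation` is the clockwise page rotation in degrees (0, 90, 180, 270).
    """
    l, t, r, b = rect
    rotation %= 360
    if rotation == 0:
        return (l, t, r, b)
    if rotation == 90:
        # Displayed page is page_h wide; the old left edge becomes the top edge.
        return (page_h - b, l, page_h - t, r)
    if rotation == 180:
        return (page_w - r, page_h - b, page_w - l, page_h - t)
    if rotation == 270:
        # Displayed page is page_w tall; the old top edge becomes the left edge.
        return (t, page_w - r, b, page_w - l)
    raise ValueError(f"unsupported rotation: {rotation}")


# A 100x200 page with a box near the top-left corner.
print(rotate_rect((10, 20, 30, 60), 100, 200, 90))
```

With pypdfium2, the rotation value could presumably be read per page via `PdfPage.get_rotation()` and passed into a mapping like this one.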
Expected:

What I get:

Steps to reproduce
...
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
# Enable debug visualizations; visualize_ocr draws the OCR bounding boxes.
settings.debug.visualize_layout = True
settings.debug.visualize_raw_layout = True
# settings.debug.visualize_tables = True
settings.debug.visualize_cells = True
settings.debug.visualize_ocr = True
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.images_scale = 1.0
pipeline_options.generate_picture_images = True
pipeline_options.generate_table_images = True
pipeline_options.table_structure_options = TableStructureOptions(mode=TableFormerMode.ACCURATE)
ocr_options = RapidOcrOptions(force_full_page_ocr=False)
pipeline_options.ocr_options = ocr_options
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        )
    }
)
pdf_paths = [
    './pdfs/scan_smpl.pdf',
    './pdfs/scan_smpl_rotated.pdf',
]
for pdf_path in pdf_paths:
    conv_res = doc_converter.convert(pdf_path)
    print(conv_res.document.export_to_markdown())
Use the following two PDFs for comparison:
scan_smpl.pdf
scan_smpl_rotated.pdf
Docling version
v2.43.0
Python version
3.12