Markdown conversion of legal documents with line-numbers in front of every line of text #2046

andytolle · 2025-08-06T13:16:59Z

andytolle
Aug 6, 2025

I'm trying to get some ideas on how to better convert these kinds of documents to markdown:

The issue I'm running into is that in Markdown this is recognized as a table, which would be fine, but about 50% of the time each paragraph is duplicated. An example is this:

| 548       | that one or more of the criteria for ...  | that one or more of the criteria for ... |
|           | within the Center should discuss their ... | within the Center should discuss their ... |
| 549       | officials resolution. Such discussions ... | officials resolution. Such discussions ... |

If someone has an idea... no matter how wild: please let me know.

Thanks

@andytolle · 2025-08-06T13:24:47Z

dosubot[bot]
bot Aug 6, 2025

Hi @andytolle! I'm Dosu and I’m helping the docling team.

The duplication you're seeing in Markdown tables is a known issue that often comes from how the document structure is parsed and serialized—especially when handling tables, line numbers, or formatted text. In Docling and similar tools, this can happen if the chunking or serialization logic merges multiple paragraphs or lines, or if the backend emits both raw and processed lines for the same content. Sometimes, duplication is also caused by the way bold or formatted text is handled in the backend, as seen in some PDF and DOCX conversions, but recent updates have fixed many of these bugs.

If you're using Docling or a similar pipeline, you can address this by customizing the Markdown table serializer. For example, subclassing MarkdownTableSerializer lets you control how paragraphs and line breaks are handled in table cells, which can help prevent duplication or merging issues. Here’s a minimal example:

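from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import MarkdownTableSerializer
from tabulate import tabulate


class CustomMarkdownTableSerializer(MarkdownTableSerializer):
    def serialize(self, *, item, doc_serializer, doc, **kwargs):
        # Flatten each cell to one line: a blank line between paragraphs becomes a space.
        rows = [
            [
                col.text.replace('\n\n', ' ')
                for col in row
            ]
            for row in item.data.grid
        ]
        # The first grid row is used as the header; an empty grid yields empty text.
        table_text = (
            tabulate(rows[1:], headers=rows[0], tablefmt="github") if rows else ""
        )
        # create_ser_result is imported from serializer.common rather than called on doc_serializer.
        return create_ser_result(text=table_text, span_source=item)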

This approach helps preserve paragraph separation and avoids merging or duplicating content in table cells, though pure Markdown tables don't support true line breaks inside cells. If you need more separation, you can use <br> tags, but that makes the Markdown less "pure" and may not render the same way everywhere. More details and discussion are available here.
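To make the trade-off between a plain space and a <br> tag concrete, here is a minimal stdlib-only sketch of a cell formatter. The name format_cell and its use_br flag are hypothetical and not part of Docling; the function only shows how paragraph breaks inside one cell could be flattened either way.

```python
import re


def format_cell(text: str, use_br: bool = False) -> str:
    """Flatten a multi-paragraph cell so it fits on one Markdown table row."""
    # Blank lines mark paragraph boundaries; collapse other whitespace runs.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    paragraphs = [p for p in paragraphs if p]
    # Join with <br> to keep a visible break, or with a space for "pure" Markdown.
    joined = ("<br>" if use_br else " ").join(paragraphs)
    # A literal pipe would end the cell early, so escape it.
    return joined.replace("|", "\\|")
```

Calling format_cell(col.text) inside a custom serializer would replace the plain .replace('\n\n', ' ') call; whether <br> renders as a line break depends on the Markdown viewer.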

If customizing the serializer isn't an option, you can also try a post-processing step to deduplicate repeated lines or paragraphs after conversion. Tools like IBM's data-prep-kit have deduplication steps that can help with this.
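As a rough illustration of such a post-processing step, the sketch below drops a cell when it repeats the cell immediately before it in the same Markdown table row, which matches the duplicated rows shown in the question. The function name dedupe_row is hypothetical, the code is not taken from data-prep-kit, and it assumes cells contain no escaped pipes. It would also collapse separator rows and genuinely repeated values, so it should only be applied to rows known to come from line-numbered text.

```python
def dedupe_row(line: str) -> str:
    """Drop cells that repeat the previous cell in one Markdown table row."""
    stripped = line.strip()
    if not (stripped.startswith("|") and stripped.endswith("|")):
        return line  # not a table row, leave it untouched
    cells = [c.strip() for c in stripped[1:-1].split("|")]
    kept = []
    for cell in cells:
        # Skip a non-empty cell that is identical to the one just kept.
        if cell and kept and cell == kept[-1]:
            continue
        kept.append(cell)
    return "| " + " | ".join(kept) + " |"
```

Applied line by line to the converted Markdown, this turns a three-column row with two identical text cells into a two-column row (line number plus text).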

If you can share more about your input format, the tool or script you're using, and whether you need to keep the table format, I can suggest more targeted solutions or alternative Markdown representations (like code blocks or lists), each with their own trade-offs.

To reply, just mention @dosu.


andytolle · 2025-08-11T06:54:51Z

andytolle
Aug 11, 2025
Author

@dosu I attached a pdf file that was challenging to convert: 58100932dft_0.pdf

I don't need table formatting for 'normal text lines' that are prefixed with a left-margin line number.
However, if there are tables in the document, I would like them to be properly formatted as Markdown tables.

From what I saw, post-processing is a bad idea: the spurious tables can interfere with detection of other parts of the document and change the document flow (text missing or in the wrong order), so it should be a pre-processing solution.

In the long run: I understand that docling has models for internally marking regions. One way to make these kinds of tasks easier might be to allow a manually defined region (per page) to be excluded.
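To sketch what such a per-page exclusion could look like as a pre-processing step, the snippet below filters extracted words before any layout analysis. Everything in it is an assumption for illustration: the Word class, the margins mapping, and drop_margin_numbers are hypothetical names, not docling APIs, and the coordinates would come from whatever PDF text extractor is in use.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word, in page units
    x1: float  # right edge of the word
    page: int


def drop_margin_numbers(words: list[Word], margins: dict[int, float]) -> list[Word]:
    """Remove purely numeric words that lie entirely inside a page's left strip.

    margins maps a page number to the right edge of the excluded strip;
    pages without an entry are left unchanged.
    """
    kept = []
    for w in words:
        limit = margins.get(w.page)
        in_strip = limit is not None and w.x1 <= limit
        if in_strip and w.text.strip().isdigit():
            continue  # a line number in the excluded margin
        kept.append(w)
    return kept
```

Restricting the filter to numeric words inside the strip is a deliberate choice: body text that happens to start near the margin, or numbers in the running text, are kept.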

