Replies: 2 comments
-
Hi @andytolle! I'm Dosu and I'm helping the docling team.

The duplication you're seeing in Markdown tables is a known issue that often comes from how the document structure is parsed and serialized, especially when handling tables, line numbers, or formatted text. In Docling and similar tools, this can happen if the chunking or serialization logic merges multiple paragraphs or lines, or if the backend emits both raw and processed lines for the same content. Duplication is sometimes also caused by the way the backend handles bold or formatted text, as seen in some PDF and DOCX conversions, but recent updates have fixed many of these bugs.

If you're using Docling or a similar pipeline, you can address this by customizing the Markdown table serializer. For example, you can subclass MarkdownTableSerializer:

```python
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import MarkdownTableSerializer
from tabulate import tabulate


class CustomMarkdownTableSerializer(MarkdownTableSerializer):
    def serialize(self, *, item, doc_serializer, doc, **kwargs):
        # Collect the text of every cell, one list per table row.
        rows = [
            [
                col.text.replace("\n\n", " ")  # Add space between paragraphs
                for col in row
            ]
            for row in item.data.grid
        ]
        # Treat the first row as the header and render a GitHub-style table.
        table_text = tabulate(rows[1:], headers=rows[0], tablefmt="github")
        return create_ser_result(text=table_text, span_source=item)
```

This approach helps preserve paragraph separation and avoids merging or duplicating content in table cells, though pure Markdown tables don't support true line breaks inside cells. If you need more separation, you can use [...].

If customizing the serializer isn't an option, you can also try a post-processing step to deduplicate repeated lines or paragraphs after conversion. Tools like IBM's data-prep-kit have deduplication steps that can help with this.

If you can share more about your input format, the tool or script you're using, and whether you need to keep the table format, I can suggest more targeted solutions or alternative Markdown representations (like code blocks or lists), each with their own trade-offs.

To reply, just mention @dosu.
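The post-processing idea mentioned in this reply could look roughly like the sketch below. It is a minimal illustration in plain Python, not part of Docling or data-prep-kit; the function name `drop_repeated_paragraphs` is made up here, and it assumes that a duplicate always directly follows its original and that paragraphs are separated by blank lines.

```python
import re


def drop_repeated_paragraphs(markdown: str) -> str:
    """Drop a paragraph when it repeats the paragraph directly before it.

    Paragraphs are compared after collapsing whitespace and ignoring case,
    so small spacing differences do not hide a duplicate.
    """
    paragraphs = re.split(r"\n\s*\n", markdown.strip())
    kept = []
    previous_key = None
    for paragraph in paragraphs:
        key = " ".join(paragraph.split()).casefold()
        if key and key == previous_key:
            continue  # same text as the previous paragraph: skip it
        kept.append(paragraph.strip())
        previous_key = key
    return "\n\n".join(kept)
```

Only adjacent repeats are removed, so text that legitimately recurs later in the document is kept. It cannot repair text that is missing or out of order after conversion.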
-
@dosu I attached a PDF file that was challenging to convert: 58100932dft_0.pdf

I don't need table formatting for 'normal text lines' that are prepended with a left-margin line number.

From what I saw, post-processing is a bad idea: the tables can interfere with the detection of some parts of the document and can change the document flow (text going missing or ending up in the wrong order). It should be a pre-processing solution.

In the long run: I understand that docling has models for internally marking regions. A possible way to make these kinds of tasks easier might be to allow a manually defined region (per page) that is excluded from conversion.
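To make the per-page exclusion idea concrete, here is a minimal sketch in plain Python. It does not use any Docling API: `Box`, `TextItem`, and `exclude_regions` are hypothetical names, and the sketch assumes that text items with page numbers and bounding boxes (top-left origin, points) are available before layout analysis runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; the origin is assumed to be the top-left corner."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this box."""
        return (
            self.left <= other.left
            and self.top <= other.top
            and self.right >= other.right
            and self.bottom >= other.bottom
        )


@dataclass(frozen=True)
class TextItem:
    page: int
    box: Box
    text: str


def exclude_regions(items, excluded):
    """Keep only items that do not fall inside an excluded region of their page.

    `excluded` maps a page number to a list of Box regions to ignore.
    """
    return [
        item
        for item in items
        if not any(region.contains(item.box) for region in excluded.get(item.page, []))
    ]


# Hypothetical page: a line number in the left margin plus the body text.
items = [
    TextItem(page=1, box=Box(20, 100, 35, 112), text="12"),
    TextItem(page=1, box=Box(72, 100, 500, 112), text="Body text on line 12"),
]
left_margin = {1: [Box(0, 0, 60, 792)]}
print([item.text for item in exclude_regions(items, left_margin)])
```

Dropping the margin items before layout analysis would mean the line numbers never reach table detection, which is the pre-processing behaviour asked for here.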
-
I'm trying to get some ideas on how to better convert these kinds of documents to markdown:

The issue I'm running into is that in Markdown this is recognized as a table, which would be fine, but about 50% of the time each paragraph is duplicated. An example is this:
If someone has an idea, no matter how wild, please let me know.
Thanks
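As one rough idea for output that has already been converted, the sketch below turns a two-column "line number | text" Markdown table back into plain lines. It is a fallback illustration only: the function name is made up, it assumes each data row looks like `| 12 | some text |`, and it cannot recover text that was lost or reordered during conversion.

```python
import re

# A data row whose first cell is a line number, e.g. "| 12 | some text |".
NUMBERED_ROW = re.compile(r"^\|\s*(\d+)\s*\|(.*)\|\s*$")
# A separator or empty row such as "|---|---|" or "|   |   |".
FILLER_ROW = re.compile(r"\|[\s:|-]+\|?\s*")


def flatten_line_numbered_table(markdown: str) -> str:
    """Replace line-numbered table rows with their text cell only."""
    out = []
    for line in markdown.splitlines():
        match = NUMBERED_ROW.match(line)
        if match:
            out.append(match.group(2).strip())  # keep the text, drop the number
        elif FILLER_ROW.fullmatch(line):
            continue  # drop table separators and empty rows
        else:
            out.append(line)  # leave everything else untouched
    return "\n".join(out)
```

Rows whose first cell is not a number are left as they are, so genuine tables elsewhere in the document are not flattened.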