KAG Document Builder using Layout Extraction

Introduction from Chuck's KAG-DG-Builder.ipynb

KAG (Knowledge Augmented Generation) is a framework that enhances traditional RAG systems through several key innovations:

  1. Knowledge Representation:
  • Uses LLMFriSPG, a knowledge representation framework that bridges text chunks and structured knowledge
  • Maintains hierarchy between data, information and knowledge layers
  • Supports both schema-free information extraction and schema-constrained expert knowledge
  2. Mutual Indexing:
  • Creates bidirectional links between graph structures and text chunks
  • Combines semantic chunking with information extraction
  • Enables cross-document linking through entities and relations
  3. Logical Form Solver:
  • Breaks down complex queries into executable logical forms
  • Integrates multiple types of reasoning:
    • Graph-based reasoning over structured knowledge
    • Text retrieval for unstructured content
    • Mathematical and logical operations
  • Uses multi-round reflection to refine answers
  4. Knowledge Alignment:
  • Enhances both indexing and retrieval through semantic reasoning
  • Standardizes knowledge instances and links them to concepts
  • Completes semantic relations between concepts
  • Reduces misalignment between knowledge granularities
  5. Model Capabilities:
  • Enhances three core abilities:
    • Natural Language Understanding (NLU)
    • Natural Language Inference (NLI)
    • Natural Language Generation (NLG)
  • Integrates retrieval capabilities for one-pass inference

KAG shows significant improvements over traditional RAG approaches in experimental results, with particular strengths in professional domain applications like e-government and healthcare Q&A where knowledge accuracy and logical reasoning are crucial.
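As a rough illustration of the mutual indexing idea (item 2 above), a bidirectional index can be as simple as two dictionaries kept in sync. This is a toy sketch, not KAG code; the entity names are made up, and the chunk ids follow the articleID#paraCode#idInPara convention used later on this page.

    from collections import defaultdict
    from typing import List

    # Entities point to the chunks that mention them, and chunks point back,
    # so retrieval can hop graph -> text or text -> graph.
    entity_to_chunks = defaultdict(set)
    chunk_to_entities = defaultdict(set)

    def index_chunk(chunk_id: str, entities: List[str]) -> None:
        for entity in entities:
            entity_to_chunks[entity].add(chunk_id)
            chunk_to_entities[chunk_id].add(entity)

    index_chunk("doc1#p1#1", ["Acme Corp", "4/16/90"])
    index_chunk("doc2#p1#1", ["Acme Corp"])

    # Cross-document linking: every chunk that mentions "Acme Corp"
    print(entity_to_chunks["Acme Corp"])  # {'doc1#p1#1', 'doc2#p1#1'}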

Structures

Extraction from Clusters (chunks)

{
    "document_type": "form",
    "layout": {
        "regions": [
            {
                "id": "r1",
                "name": "header",
                "type": "header",
                "order": 1,
                "position": "top of page"
            }
        ],
        "reading_flow": ["r1", "r2"],
        "spatial_relationships": [
            {
                "from_region": "r1",
                "to_region": "r2",
                "relationship": "above"
            },
            {
                "from_region": "r3",
                "to_region": "r4",
                "relationship": "left_of"
            }
        ]
    }
}
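The relationship objects from this layout are consumed later with attribute access (r.from_region, r.relationship), so one way to load the JSON is a typed mirror like the sketch below. Field names follow the example above; the models themselves are an assumption, separate from the builder's own Pydantic models shown further down.

    from typing import List

    from pydantic import BaseModel

    class Region(BaseModel):
        id: str
        name: str
        type: str
        order: int
        position: str

    class SpatialRelationship(BaseModel):
        from_region: str
        to_region: str
        relationship: str  # e.g. "above", "left_of"

    class Layout(BaseModel):
        regions: List[Region]
        reading_flow: List[str]
        spatial_relationships: List[SpatialRelationship]

    class LayoutExtraction(BaseModel):
        document_type: str
        layout: Layout

LayoutExtraction.model_validate(data) (Pydantic v2; LayoutExtraction.parse_obj(data) on v1) then gives typed access to the regions and relationships.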

Graph

Document chunk with relationships

    {
        'id': mk_did('chunk', content['text']),
        'type': content['content_type'],
        'content': content['text'],
        'region_id': content['region_id'],
        'relationships': [
            {'type': r.relationship, 'target': r.to_region}
            for r in rels if r.from_region == content['region_id']
        ]
    }
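The dict above is built once per region. mk_did and the surrounding loop are not shown on this page, so the sketch below fills them in with assumptions: a content-hash id helper and a plain comprehension over the extracted contents and relationships.

    import hashlib

    def mk_did(kind: str, text: str) -> str:
        # Assumed helper: derive a stable id from the content itself.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
        return f"{kind}-{digest}"

    def build_chunks(contents, rels):
        # contents: one dict per region (text, content_type, region_id)
        # rels: relationship objects with from_region / to_region / relationship
        return [
            {
                'id': mk_did('chunk', content['text']),
                'type': content['content_type'],
                'content': content['text'],
                'region_id': content['region_id'],
                'relationships': [
                    {'type': r.relationship, 'target': r.to_region}
                    for r in rels if r.from_region == content['region_id']
                ]
            }
            for content in contents
        ]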

Document metadata

    {
        'source': img_name,
        'created': datetime.now().isoformat(),
        'document_type': doc_type,
        'num_regions': len(contents),
        'analysis_version': '1.0',
        'region_ids': [c['region_id'] for c in contents],
        'content_types': list(set(c['content_type'] for c in contents))
    }

Context

DOCUMENT_CONTEXT = {
    "@vocab": "http://schema.org/",
    "doc": "http://example.org/document#",
    "chunk": "http://example.org/chunk#",
    "id": "@id",
    "type": "@type",
    "relationships": {
        "@id": "chunk:hasRelation",
        "@type": "@id"
    }
}
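DOCUMENT_CONTEXT is a JSON-LD @context: it maps id/type to @id/@type and expands relationships to chunk:hasRelation with IRI-valued targets. A minimal sketch of attaching it when serializing (the node values are placeholders, and no JSON-LD processor is assumed):

    import json

    doc_jsonld = {
        "@context": DOCUMENT_CONTEXT,
        "id": "doc:82092117",
        "type": "doc:FaxCoverSheet",
        "relationships": ["chunk:r2"],
    }
    print(json.dumps(doc_jsonld, indent=2))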

OCR/Layout

{
  "6": {
    "name": "Cluster 6",
    "text": "Date: 4/16/90\nDate 5/2/90\n,Date,5/2/90\n,Date,5/3/90\n,Date,\n,Date,\n,Date,5/3/90\nDate 5/3/90\nDate 5/3/90",
    "boundingBox": [
      80,
      69,
      623,
      805
    ],
    "key_value_pairs": [
      {
        "key": "Date",
        "value": "5/2/90"
      },
      [
        {
          "key": "Date",
          "value": "5/2/90"
        }
      ],
      [
        {
          "key": "Date",
          "value": "5/3/90"
        }
      ],
      [
        {
          "key": "Date",
          "value": "5/3/90"
        }
      ],
      {
        "key": "Date",
        "value": "5/3/90"
      },
      {
        "key": "Date",
        "value": "5/3/90"
      }
    ]
  }
}
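Note that key_value_pairs above mixes bare {key, value} dicts with single-element lists of dicts. A small normalization pass (an assumption about downstream handling, not existing builder code) flattens them before use:

    def flatten_kv_pairs(raw_pairs):
        """Yield {key, value} dicts whether or not the OCR output wrapped them in lists."""
        for item in raw_pairs:
            if isinstance(item, dict):
                yield item
            elif isinstance(item, list):
                yield from (p for p in item if isinstance(p, dict))

    cluster = layout["6"]  # assuming the OCR/layout JSON above was loaded into `layout`
    pairs = list(flatten_kv_pairs(cluster["key_value_pairs"]))
    # [{'key': 'Date', 'value': '5/2/90'}, {'key': 'Date', 'value': '5/2/90'}, ...]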

Pydantic structure

Chunks

#| export
from typing import Dict, List

from pydantic import BaseModel, Field

class DocumentChunk(BaseModel):
    id: str  # KAG format: articleID#paraCode#idInPara
    mainText: str
    summary: str
    description: str
    type: str  # Keep for now, will evolve in KGfr
    region_id: str  # Keep for layout tracking
    relationships: List[Dict]
    # KAG required properties
    supporting_chunks: List[str] = Field(default_factory=list)
    belongTo: str  # Basic concept for now

DOCUMENT_CONTEXT = {
    "@vocab": "http://schema.org/",
    "doc": "http://example.org/document#",
    "chunk": "http://example.org/chunk#",
    "id": "@id",
    "type": "@type",
    "relationships": {
        "@id": "chunk:hasRelation",
        "@type": "@id"
    }
}

class DocumentGraph(BaseModel):
    id: str
    document_type: str
    chunks: List[DocumentChunk]
    metadata: Dict
    context: Dict = DOCUMENT_CONTEXT
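
A short usage sketch tying the models together; the literal values are placeholders drawn from the examples on this page (header region r1 of 82092117.png), not real pipeline output.

    chunk = DocumentChunk(
        id="82092117#p1#1",  # articleID#paraCode#idInPara
        mainText="FAX COVER SHEET",
        summary="Title line of the fax cover sheet",
        description="Header region at the top of the page",
        type="header",
        region_id="r1",
        relationships=[{"type": "above", "target": "r2"}],
        belongTo="fax cover sheet",
    )

    doc = DocumentGraph(
        id="82092117",
        document_type="fax cover sheet",
        chunks=[chunk],
        metadata={"source": "82092117.png", "analysis_version": "1.0"},
    )

    print(doc.model_dump_json(indent=2))  # .json(indent=2) on Pydantic v1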

Code

  1. cluster_paragraphs -> Region region_id, paragraphs
  2. group_paragraphs_by_cluster -> Region mainText , boundingBox, key_value_pairs
  3. compute_cluster_relationships -> Region relationships
  4. get_semantic_designation -> Region type
  5. sort_clusters_reading_order -> Document Metadata readingOrder

Region requires: description, belongTo, supporting_chunks, id.

Document requires: chunks, metadata, context, document_type, id.
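
A rough orchestration of the five steps above, filling the required Region and Document fields; the call signatures are assumptions (the page only states what each step produces), so treat this as an outline rather than the actual notebook code. mk_did is the assumed id helper sketched earlier.

    def build_document_graph(img_name: str, paragraphs, doc_type: str) -> DocumentGraph:
        clusters = cluster_paragraphs(paragraphs)              # 1. region_id, paragraphs
        regions = group_paragraphs_by_cluster(clusters)        # 2. mainText, boundingBox, key_value_pairs
        rels = compute_cluster_relationships(regions)          # 3. relationships
        for region in regions:
            region["content_type"] = get_semantic_designation(region)  # 4. type
        reading_order = sort_clusters_reading_order(regions)   # 5. readingOrder

        chunks = [
            DocumentChunk(
                id=mk_did("chunk", region["mainText"]),
                mainText=region["mainText"],
                summary="",        # left for a later enrichment pass
                description="",    # required field, likewise filled later
                type=region["content_type"],
                region_id=region["region_id"],
                relationships=[
                    {"type": r.relationship, "target": r.to_region}
                    for r in rels if r.from_region == region["region_id"]
                ],
                belongTo=doc_type,
            )
            for region in regions
        ]

        metadata = {
            "source": img_name,
            "document_type": doc_type,
            "readingOrder": reading_order,
            "region_ids": [r["region_id"] for r in regions],
        }
        return DocumentGraph(id=mk_did("doc", img_name), document_type=doc_type,
                             chunks=chunks, metadata=metadata)

For a sample page (82092117.png), the context and metadata blocks of the resulting graph look like this: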

    "context": {
        "@vocab": "http://schema.org/",
        "doc": "http://example.org/document#",
        "chunk": "http://example.org/chunk#",
        "id": "@id",
        "type": "@type",
        "relationships": {
            "@id": "chunk:hasRelation",
            "@type": "@id"
        }
    }
"metadata": {
        "source": "82092117.png",
        "created": "2025-02-22T16:53:01.911217",
        "document_type": "fax cover sheet",
        "num_regions": 7,
        "analysis_version": "1.0",
        "region_ids": [
            "r1",
            "r2",
            "r3",
            "r4",
            "r5",
            "r6",
            "r7"
        ],
        "content_types": [
            "contact information",
            "header",
            "instructions",
            "notice",
            "note",
            "footer"
        ]
    }
  1. Will add keyConcepts to Region.
  2. Add embeddingVector to Region?
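
One possible shape for those two additions, extending the DocumentChunk (Region) model above; both names and types are provisional, not a committed design.

    from typing import List, Optional

    from pydantic import Field

    class DocumentChunkWithConcepts(DocumentChunk):
        keyConcepts: List[str] = Field(default_factory=list)  # planned addition 1
        embeddingVector: Optional[List[float]] = None          # planned addition 2, e.g. a 384- or 768-dim sentence embedding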

Region Embeddings

Below are a few popular embedding models you might consider for generating vector representations of your clusters. Each has different trade-offs in terms of accuracy, dimensionality, domain suitability, and cost.


1. Sentence Transformers (Hugging Face)

  • all-MiniLM-L6-v2
    A lightweight and fast model (384-dimensional embeddings) that still achieves reasonable performance across many semantic tasks. Ideal if you have limited compute resources or need quicker inference.

  • all-mpnet-base-v2
    A larger model (768-dimensional embeddings) with strong performance on semantic similarity tasks. Good if you can afford slightly more compute for better quality.

Pros:

  • Easy to use with the sentence-transformers library in Python.
  • Good community support; many pre-trained variants.
  • Often free to run locally (once you’ve downloaded the model).

Cons:

  • Must run them on your own hardware/cloud GPU (or CPU at slower speeds).
  • If domain is very specialized, you might prefer a domain-specific model or do fine-tuning.

2. OpenAI Embeddings

  • text-embedding-ada-002
    A general-purpose embedding model from OpenAI that offers strong performance on semantic similarity, clustering, and classification tasks.

Pros:

  • High-quality embeddings with minimal setup or tuning.
  • Simple API usage (just send text via their endpoint, get vectors).
  • 1536-dimensional embeddings with strong consistency across tasks.

Cons:

  • Paid model (though relatively cost-effective).
  • Data must be sent to OpenAI’s servers (which may be a privacy concern unless you have an enterprise agreement or handle ephemeral usage).

3. Domain-Specific or Fine-Tuned Models

If your document clusters are in a highly specialized domain (e.g., legal, biomedical), you might want:

  • BioBERT or LegalBERT variants (if relevant).
  • Fine-tuning a general model (e.g., all-mpnet-base-v2) on a small domain-specific dataset to boost performance.

4. Recommendation

  1. Start Simple: If you have moderate GPU resources or are comfortable with cloud solutions, try a Sentence Transformers model like all-mpnet-base-v2 or all-MiniLM-L6-v2.
  2. Hosted API: Use OpenAI’s text-embedding-ada-002 if you prefer an off-the-shelf cloud API with minimal infrastructure (and the data can be safely uploaded).
  3. Domain-Specific: If your documents are niche (medical, legal, etc.), check for specialized models in sentence-transformers or Hugging Face.

5. Usage Tips

  1. Chunk Preprocessing:
    • Ensure each cluster’s text is concise yet representative.
    • Convert any stray line breaks or non-UTF-8 chars to avoid tokenization issues.
  2. Normalization:
    • Decide on consistent lowercasing, punctuation removal, etc., if beneficial.
    • Keep in mind some embeddings handle text casing well; check the model’s best practices.
  3. Vector Storage:
    • For many clusters, store embeddings in a vector database (e.g., FAISS, Milvus, Pinecone) to quickly do similarity queries or clustering.
  4. Evaluation:
    • Test whether embeddings meaningfully separate or group clusters in the ways you expect. If not, consider a different model or domain adaptation.

Bottom line: For general English text across many domains, OpenAI’s text-embedding-ada-002 or all-mpnet-base-v2 from sentence-transformers are top picks. They strike a good balance of performance, dimension size, and ease of use for cluster-level embeddings.
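
As a concrete starting point for the tips above, a minimal sketch using sentence-transformers with a FAISS index (assuming both packages are installed; the texts are placeholders standing in for each region's mainText):

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, light and fast

    # One string per region/cluster, e.g. chunk.mainText for each DocumentChunk
    texts = ["FAX COVER SHEET", "Date: 4/16/90 ...", "Please deliver to ..."]
    embeddings = model.encode(texts, normalize_embeddings=True)

    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
    index.add(np.asarray(embeddings, dtype="float32"))

    # Find the regions most similar to a question
    query = model.encode(["When was the fax sent?"], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
    print(ids[0], scores[0])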