KAG Document Builder using Layout Extraction

Introduction from Chuck's KAG-DG-Builder.ipynb

KAG (Knowledge Augmented Generation) is a framework that enhances traditional RAG systems through several key innovations:

  1. Knowledge Representation:
  • Uses LLMFriSPG, a knowledge representation framework that bridges text chunks and structured knowledge
  • Maintains hierarchy between data, information and knowledge layers
  • Supports both schema-free information extraction and schema-constrained expert knowledge
  2. Mutual Indexing:
  • Creates bidirectional links between graph structures and text chunks
  • Combines semantic chunking with information extraction
  • Enables cross-document linking through entities and relations
  3. Logical Form Solver:
  • Breaks down complex queries into executable logical forms
  • Integrates multiple types of reasoning:
    • Graph-based reasoning over structured knowledge
    • Text retrieval for unstructured content
    • Mathematical and logical operations
  • Uses multi-round reflection to refine answers
  4. Knowledge Alignment:
  • Enhances both indexing and retrieval through semantic reasoning
  • Standardizes knowledge instances and links them to concepts
  • Completes semantic relations between concepts
  • Reduces misalignment between knowledge granularities
  5. Model Capabilities:
  • Enhances three core abilities:
    • Natural Language Understanding (NLU)
    • Natural Language Inference (NLI)
    • Natural Language Generation (NLG)
  • Integrates retrieval capabilities for one-pass inference

KAG shows significant improvements over traditional RAG approaches in experimental results, with particular strengths in professional domain applications like e-government and healthcare Q&A where knowledge accuracy and logical reasoning are crucial.
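As a rough illustration of the mutual indexing idea (item 2 above), a bidirectional index can be as simple as two dictionaries kept in sync. This is a toy sketch, not KAG code; the entity names are made up, and the chunk ids follow the articleID#paraCode#idInPara convention used later on this page.

    from collections import defaultdict
    from typing import List

    # Entities point to the chunks that mention them, and chunks point back,
    # so retrieval can hop graph -> text or text -> graph.
    entity_to_chunks = defaultdict(set)
    chunk_to_entities = defaultdict(set)

    def index_chunk(chunk_id: str, entities: List[str]) -> None:
        for entity in entities:
            entity_to_chunks[entity].add(chunk_id)
            chunk_to_entities[chunk_id].add(entity)

    index_chunk("doc1#p1#1", ["Acme Corp", "4/16/90"])
    index_chunk("doc2#p1#1", ["Acme Corp"])

    # Cross-document linking: every chunk that mentions "Acme Corp"
    print(entity_to_chunks["Acme Corp"])  # {'doc1#p1#1', 'doc2#p1#1'}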

Structures

Extraction from Clusters (chunks)

{
    "document_type": "form",
    "layout": {
        "regions": [
            {
                "id": "r1",
                "name": "header",
                "type": "header",
                "order": 1,
                "position": "top of page"
            }
        ],
        "reading_flow": ["r1", "r2"],
        "spatial_relationships": [
            {
                "from_region": "r1",
                "to_region": "r2",
                "relationship": "above"
            },
            {
                "from_region": "r3",
                "to_region": "r4",
                "relationship": "left_of"
            }
        ]
    }
}
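The relationship objects from this layout are consumed later with attribute access (r.from_region, r.relationship), so one way to load the JSON is a typed mirror like the sketch below. Field names follow the example above; the models themselves are an assumption, separate from the builder's own Pydantic models shown further down.

    from typing import List

    from pydantic import BaseModel

    class Region(BaseModel):
        id: str
        name: str
        type: str
        order: int
        position: str

    class SpatialRelationship(BaseModel):
        from_region: str
        to_region: str
        relationship: str  # e.g. "above", "left_of"

    class Layout(BaseModel):
        regions: List[Region]
        reading_flow: List[str]
        spatial_relationships: List[SpatialRelationship]

    class LayoutExtraction(BaseModel):
        document_type: str
        layout: Layout

LayoutExtraction.model_validate(data) (Pydantic v2; LayoutExtraction.parse_obj(data) on v1) then gives typed access to the regions and relationships.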

Graph

Document chunk with relationships

    {
        'id': mk_did('chunk', content['text']),
        'type': content['content_type'],
        'content': content['text'],
        'region_id': content['region_id'],
        'relationships': [
            {'type': r.relationship, 'target': r.to_region}
            for r in rels if r.from_region == content['region_id']
        ]
    }
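The dict above is built once per region. mk_did and the surrounding loop are not shown on this page, so the sketch below fills them in with assumptions: a content-hash id helper and a plain comprehension over the extracted contents and relationships.

    import hashlib

    def mk_did(kind: str, text: str) -> str:
        # Assumed helper: derive a stable id from the content itself.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
        return f"{kind}-{digest}"

    def build_chunks(contents, rels):
        # contents: one dict per region (text, content_type, region_id)
        # rels: relationship objects with from_region / to_region / relationship
        return [
            {
                'id': mk_did('chunk', content['text']),
                'type': content['content_type'],
                'content': content['text'],
                'region_id': content['region_id'],
                'relationships': [
                    {'type': r.relationship, 'target': r.to_region}
                    for r in rels if r.from_region == content['region_id']
                ]
            }
            for content in contents
        ]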

Document metadata

    {
        'source': img_name,
        'created': datetime.now().isoformat(),
        'document_type': doc_type,
        'num_regions': len(contents),
        'analysis_version': '1.0',
        'region_ids': [c['region_id'] for c in contents],
        'content_types': list(set(c['content_type'] for c in contents))
    }

Context

DOCUMENT_CONTEXT = {
    "@vocab": "http://schema.org/",
    "doc": "http://example.org/document#",
    "chunk": "http://example.org/chunk#",
    "id": "@id",
    "type": "@type",
    "relationships": {
        "@id": "chunk:hasRelation",
        "@type": "@id"
    }
}
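DOCUMENT_CONTEXT is a JSON-LD @context: it maps id/type to @id/@type and expands relationships to chunk:hasRelation with IRI-valued targets. A minimal sketch of attaching it when serializing (the node values are placeholders, and no JSON-LD processor is assumed):

    import json

    doc_jsonld = {
        "@context": DOCUMENT_CONTEXT,
        "id": "doc:82092117",
        "type": "doc:FaxCoverSheet",
        "relationships": ["chunk:r2"],
    }
    print(json.dumps(doc_jsonld, indent=2))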

OCR/Layout

{
  "6": {
    "name": "Cluster 6",
    "text": "Date: 4/16/90\nDate 5/2/90\n,Date,5/2/90\n,Date,5/3/90\n,Date,\n,Date,\n,Date,5/3/90\nDate 5/3/90\nDate 5/3/90",
    "boundingBox": [
      80,
      69,
      623,
      805
    ],
    "key_value_pairs": [
      {
        "key": "Date",
        "value": "5/2/90"
      },
      [
        {
          "key": "Date",
          "value": "5/2/90"
        }
      ],
      [
        {
          "key": "Date",
          "value": "5/3/90"
        }
      ],
      [
        {
          "key": "Date",
          "value": "5/3/90"
        }
      ],
      {
        "key": "Date",
        "value": "5/3/90"
      },
      {
        "key": "Date",
        "value": "5/3/90"
      }
    ]
  }
}
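Note that key_value_pairs above mixes bare {key, value} dicts with single-element lists of dicts. A small normalization pass (an assumption about downstream handling, not existing builder code) flattens them before use:

    def flatten_kv_pairs(raw_pairs):
        """Yield {key, value} dicts whether or not the OCR output wrapped them in lists."""
        for item in raw_pairs:
            if isinstance(item, dict):
                yield item
            elif isinstance(item, list):
                yield from (p for p in item if isinstance(p, dict))

    cluster = layout["6"]  # assuming the OCR/layout JSON above was loaded into `layout`
    pairs = list(flatten_kv_pairs(cluster["key_value_pairs"]))
    # [{'key': 'Date', 'value': '5/2/90'}, {'key': 'Date', 'value': '5/2/90'}, ...]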

Pydantic structure

Chunks

#| export
from typing import Dict, List

from pydantic import BaseModel, Field

class DocumentChunk(BaseModel):
    id: str  # KAG format: articleID#paraCode#idInPara
    mainText: str
    summary: str
    description: str
    type: str  # Keep for now, will evolve in KGfr
    region_id: str  # Keep for layout tracking
    relationships: List[Dict]
    # KAG required properties
    supporting_chunks: List[str] = Field(default_factory=list)
    belongTo: str  # Basic concept for now

DOCUMENT_CONTEXT = {
    "@vocab": "http://schema.org/",
    "doc": "http://example.org/document#",
    "chunk": "http://example.org/chunk#",
    "id": "@id",
    "type": "@type",
    "relationships": {
        "@id": "chunk:hasRelation",
        "@type": "@id"
    }
}

class DocumentGraph(BaseModel):
    id: str
    document_type: str
    chunks: List[DocumentChunk]
    metadata: Dict
    context: Dict = DOCUMENT_CONTEXT
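
A short usage sketch tying the models together; the literal values are placeholders drawn from the examples on this page (header region r1 of 82092117.png), not real pipeline output.

    chunk = DocumentChunk(
        id="82092117#p1#1",  # articleID#paraCode#idInPara
        mainText="FAX COVER SHEET",
        summary="Title line of the fax cover sheet",
        description="Header region at the top of the page",
        type="header",
        region_id="r1",
        relationships=[{"type": "above", "target": "r2"}],
        belongTo="fax cover sheet",
    )

    doc = DocumentGraph(
        id="82092117",
        document_type="fax cover sheet",
        chunks=[chunk],
        metadata={"source": "82092117.png", "analysis_version": "1.0"},
    )

    print(doc.model_dump_json(indent=2))  # .json(indent=2) on Pydantic v1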

Code

  1. cluster_paragraphs -> Region region_id, paragraphs
  2. group_paragraphs_by_cluster -> Region mainText , boundingBox, key_value_pairs
  3. compute_cluster_relationships -> Region relationships
  4. get_semantic_designation -> Region type
  5. sort_clusters_reading_order -> Document Metadata readingOrder

Region requires: description, belongTo, supporting_chunks, id.

Document requires: chunks, metadata, context, document_type, id.
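
A rough orchestration of the five steps above, filling the required Region and Document fields; the call signatures are assumptions (the page only states what each step produces), so treat this as an outline rather than the actual notebook code. mk_did is the assumed id helper sketched earlier.

    def build_document_graph(img_name: str, paragraphs, doc_type: str) -> DocumentGraph:
        clusters = cluster_paragraphs(paragraphs)              # 1. region_id, paragraphs
        regions = group_paragraphs_by_cluster(clusters)        # 2. mainText, boundingBox, key_value_pairs
        rels = compute_cluster_relationships(regions)          # 3. relationships
        for region in regions:
            region["content_type"] = get_semantic_designation(region)  # 4. type
        reading_order = sort_clusters_reading_order(regions)   # 5. readingOrder

        chunks = [
            DocumentChunk(
                id=mk_did("chunk", region["mainText"]),
                mainText=region["mainText"],
                summary="",        # left for a later enrichment pass
                description="",    # required field, likewise filled later
                type=region["content_type"],
                region_id=region["region_id"],
                relationships=[
                    {"type": r.relationship, "target": r.to_region}
                    for r in rels if r.from_region == region["region_id"]
                ],
                belongTo=doc_type,
            )
            for region in regions
        ]

        metadata = {
            "source": img_name,
            "document_type": doc_type,
            "readingOrder": reading_order,
            "region_ids": [r["region_id"] for r in regions],
        }
        return DocumentGraph(id=mk_did("doc", img_name), document_type=doc_type,
                             chunks=chunks, metadata=metadata)

For a sample page (82092117.png), the context and metadata blocks of the resulting graph look like this: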

    "context": {
        "@vocab": "http://schema.org/",
        "doc": "http://example.org/document#",
        "chunk": "http://example.org/chunk#",
        "id": "@id",
        "type": "@type",
        "relationships": {
            "@id": "chunk:hasRelation",
            "@type": "@id"
        }
    }
"metadata": {
        "source": "82092117.png",
        "created": "2025-02-22T16:53:01.911217",
        "document_type": "fax cover sheet",
        "num_regions": 7,
        "analysis_version": "1.0",
        "region_ids": [
            "r1",
            "r2",
            "r3",
            "r4",
            "r5",
            "r6",
            "r7"
        ],
        "content_types": [
            "contact information",
            "header",
            "instructions",
            "notice",
            "note",
            "footer"
        ]
    }
  1. Will add keyConcepts to Region.
  2. Add embeddingVector to Region?
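
One possible shape for those two additions, extending the DocumentChunk (Region) model above; both names and types are provisional, not a committed design.

    from typing import List, Optional

    from pydantic import Field

    class DocumentChunkWithConcepts(DocumentChunk):
        keyConcepts: List[str] = Field(default_factory=list)  # planned addition 1
        embeddingVector: Optional[List[float]] = None          # planned addition 2, e.g. a 384- or 768-dim sentence embedding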

Region Embeddings

Below are a few popular embedding models you might consider for generating vector representations of your clusters. Each has different trade-offs in terms of accuracy, dimensionality, domain suitability, and cost.


1. Sentence Transformers (Hugging Face)

  • all-MiniLM-L6-v2
    A lightweight and fast model (384-dimensional embeddings) that still achieves reasonable performance across many semantic tasks. Ideal if you have limited compute resources or need quicker inference.

  • all-mpnet-base-v2
    A larger model (768-dimensional embeddings) with strong performance on semantic similarity tasks. Good if you can afford slightly more compute for better quality.

Pros:

  • Easy to use with the sentence-transformers library in Python.
  • Good community support; many pre-trained variants.
  • Often free to run locally (once you’ve downloaded the model).

Cons:

  • Must run them on your own hardware/cloud GPU (or CPU at slower speeds).
  • If domain is very specialized, you might prefer a domain-specific model or do fine-tuning.

2. OpenAI Embeddings

  • text-embedding-ada-002
    A general-purpose embedding model from OpenAI that offers strong performance on semantic similarity, clustering, and classification tasks.

Pros:

  • High-quality embeddings with minimal setup or tuning.
  • Simple API usage (just send text via their endpoint, get vectors).
  • 1536-dimensional embeddings with strong consistency across tasks.

Cons:

  • Paid model (though relatively cost-effective).
  • Data must be sent to OpenAI’s servers (which may be a privacy concern unless you have an enterprise agreement or handle ephemeral usage).

3. Domain-Specific or Fine-Tuned Models

If your document clusters are in a highly specialized domain (e.g., legal, biomedical), you might want:

  • BioBERT or LegalBERT variants (if relevant).
  • Fine-tuning a general model (e.g., all-mpnet-base-v2) on a small domain-specific dataset to boost performance.

4. Recommendation

  1. Start Simple: If you have moderate GPU resources or are comfortable with cloud solutions, try a Sentence Transformers model like all-mpnet-base-v2 or all-MiniLM-L6-v2.
  2. Hosted API: Use OpenAI’s text-embedding-ada-002 if you prefer an off-the-shelf cloud API with minimal infrastructure (and the data can be safely uploaded).
  3. Domain-Specific: If your documents are niche (medical, legal, etc.), check for specialized models in sentence-transformers or Hugging Face.

5. Usage Tips

  1. Chunk Preprocessing:
    • Ensure each cluster’s text is concise yet representative.
    • Convert any stray line breaks or non-UTF-8 chars to avoid tokenization issues.
  2. Normalization:
    • Decide on consistent lowercasing, punctuation removal, etc., if beneficial.
    • Keep in mind some embeddings handle text casing well; check the model’s best practices.
  3. Vector Storage:
    • For many clusters, store embeddings in a vector database (e.g., FAISS, Milvus, Pinecone) to quickly do similarity queries or clustering.
  4. Evaluation:
    • Test whether embeddings meaningfully separate or group clusters in the ways you expect. If not, consider a different model or domain adaptation.

Bottom line: For general English text across many domains, OpenAI’s text-embedding-ada-002 or all-mpnet-base-v2 from sentence-transformers are top picks. They strike a good balance of performance, dimension size, and ease of use for cluster-level embeddings.
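
As a concrete starting point for the tips above, a minimal sketch using sentence-transformers with a FAISS index (assuming both packages are installed; the texts are placeholders standing in for each region's mainText):

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, light and fast

    # One string per region/cluster, e.g. chunk.mainText for each DocumentChunk
    texts = ["FAX COVER SHEET", "Date: 4/16/90 ...", "Please deliver to ..."]
    embeddings = model.encode(texts, normalize_embeddings=True)

    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
    index.add(np.asarray(embeddings, dtype="float32"))

    # Find the regions most similar to a question
    query = model.encode(["When was the fax sent?"], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
    print(ids[0], scores[0])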