KAG Document Builder using Layout Extraction
KAG (Knowledge Augmented Generation) is a framework that enhances traditional RAG systems through several key innovations:

- Knowledge Representation:
  - Uses LLMFriSPG, a knowledge representation framework that bridges text chunks and structured knowledge
  - Maintains a hierarchy between the data, information, and knowledge layers
  - Supports both schema-free information extraction and schema-constrained expert knowledge
- Mutual Indexing (sketched in code below):
  - Creates bidirectional links between graph structures and text chunks
  - Combines semantic chunking with information extraction
  - Enables cross-document linking through entities and relations
- Logical Form Solver:
  - Breaks down complex queries into executable logical forms
  - Integrates multiple types of reasoning:
    - Graph-based reasoning over structured knowledge
    - Text retrieval for unstructured content
    - Mathematical and logical operations
  - Uses multi-round reflection to refine answers
- Knowledge Alignment:
  - Enhances both indexing and retrieval through semantic reasoning
  - Standardizes knowledge instances and links them to concepts
  - Completes semantic relations between concepts
  - Reduces misalignment between knowledge granularities
- Model Capabilities:
  - Enhances three core abilities:
    - Natural Language Understanding (NLU)
    - Natural Language Inference (NLI)
    - Natural Language Generation (NLG)
  - Integrates retrieval capabilities for one-pass inference

KAG shows significant improvements over traditional RAG approaches in experimental results, with particular strengths in professional domains such as e-government and healthcare Q&A, where knowledge accuracy and logical reasoning are crucial.
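To make the mutual-indexing idea concrete, here is a minimal sketch (not KAG's actual implementation): a bidirectional map between text chunks and the entities extracted from them, so reasoning can hop from a graph node to its supporting chunks and back.

```python
from collections import defaultdict

class MutualIndex:
    """Toy bidirectional index between text chunks and graph entities.
    Illustrative only; KAG's real mutual indexing is much richer."""

    def __init__(self):
        self.chunk_to_entities = defaultdict(set)  # chunk id -> entities
        self.entity_to_chunks = defaultdict(set)   # entity   -> chunk ids

    def link(self, chunk_id: str, entity: str) -> None:
        # Record the link in both directions so graph reasoning can
        # reach supporting text, and retrieval can reach the graph.
        self.chunk_to_entities[chunk_id].add(entity)
        self.entity_to_chunks[entity].add(chunk_id)

index = MutualIndex()
index.link("doc1#p1#0", "Knowledge Graph")
index.link("doc2#p3#1", "Knowledge Graph")
print(index.entity_to_chunks["Knowledge Graph"])  # chunks from both documents
```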
Example layout-analysis output (abbreviated: only region r1 is listed here, though the reading flow and spatial relationships reference further regions):

```json
{
  "document_type": "form",
  "layout": {
    "regions": [
      {
        "id": "r1",
        "name": "header",
        "type": "header",
        "order": 1,
        "position": "top of page"
      }
    ],
    "reading_flow": ["r1", "r2"],
    "spatial_relationships": [
      {"from_region": "r1", "to_region": "r2", "relationship": "above"},
      {"from_region": "r3", "to_region": "r4", "relationship": "left_of"}
    ]
  }
}
```
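A minimal sketch of walking this layout, assuming the exact field names above:

```python
layout = {
    "regions": [{"id": "r1", "name": "header", "type": "header",
                 "order": 1, "position": "top of page"}],
    "reading_flow": ["r1", "r2"],
    "spatial_relationships": [
        {"from_region": "r1", "to_region": "r2", "relationship": "above"},
        {"from_region": "r3", "to_region": "r4", "relationship": "left_of"},
    ],
}

# Index regions by id, then walk the reading flow, emitting any
# spatial relationships that originate at each region.
regions = {r["id"]: r for r in layout["regions"]}
for rid in layout["reading_flow"]:
    name = regions.get(rid, {}).get("name", "<not in regions>")
    rels = [f'{s["relationship"]} {s["to_region"]}'
            for s in layout["spatial_relationships"] if s["from_region"] == rid]
    print(rid, name, "; ".join(rels))
```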
```python
# One chunk record per extracted region; relationships are projected
# from the layout's spatial relationships for that region.
{
    'id': mk_did('chunk', content['text']),
    'type': content['content_type'],
    'content': content['text'],
    'region_id': content['region_id'],
    'relationships': [
        {'type': r.relationship, 'target': r.to_region}
        for r in rels if r.from_region == content['region_id']
    ]
}
```
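`mk_did` is a repo helper not shown here; a plausible stand-in (an assumption, included only so the fragment above is self-contained) derives a stable id from a prefix and a content hash:

```python
import hashlib

def mk_did(prefix: str, text: str) -> str:
    """Hypothetical stand-in for the repo's mk_did helper: a
    deterministic id built from a prefix and a short content hash."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"

print(mk_did("chunk", "TO: J. Smith"))  # chunk-<12 hex chars>
```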
```python
# Document-level metadata recorded for each processed image.
{
    'source': img_name,
    'created': datetime.now().isoformat(),
    'document_type': doc_type,
    'num_regions': len(contents),
    'analysis_version': '1.0',
    'region_ids': [c['region_id'] for c in contents],
    'content_types': list(set(c['content_type'] for c in contents))
}
```
```json
{
  "6": {
    "name": "Cluster 6",
    "text": "Date: 4/16/90\nDate 5/2/90\n,Date,5/2/90\n,Date,5/3/90\n,Date,\n,Date,\n,Date,5/3/90\nDate 5/3/90\nDate 5/3/90",
    "boundingBox": [80, 69, 623, 805],
    "key_value_pairs": [
      {"key": "Date", "value": "5/2/90"},
      {"key": "Date", "value": "5/2/90"},
      {"key": "Date", "value": "5/3/90"},
      {"key": "Date", "value": "5/3/90"},
      {"key": "Date", "value": "5/3/90"},
      {"key": "Date", "value": "5/3/90"}
    ]
  }
}
```
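The key-value extraction step itself is not shown here; a plausible regex-based sketch (an assumption, not the actual implementation) that recovers the Date pairs from noisy OCR text like the above:

```python
import re

def extract_date_pairs(text: str):
    """Hypothetical sketch: pull 'Date m/d/yy' pairs out of noisy OCR text."""
    pairs = []
    for line in text.splitlines():
        m = re.search(r"Date\D*?(\d{1,2}/\d{1,2}/\d{2})", line)
        if m:
            pairs.append({"key": "Date", "value": m.group(1)})
    return pairs

ocr_text = "Date: 4/16/90\nDate 5/2/90\n,Date,5/2/90\n,Date,\nDate 5/3/90"
print(extract_date_pairs(ocr_text))
# [{'key': 'Date', 'value': '4/16/90'}, {'key': 'Date', 'value': '5/2/90'}, ...]
```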
```python
#| export
from typing import Dict, List

from pydantic import BaseModel, Field


class DocumentChunk(BaseModel):
    id: str            # KAG id format: articleID#paraCode#idInPara
    mainText: str
    summary: str
    description: str
    type: str          # Keep for now; will evolve in KGfr
    region_id: str     # Keep for layout tracking
    relationships: List[Dict]
    # KAG-required properties
    supporting_chunks: List[str] = Field(default_factory=list)
    belongTo: str      # Basic concept for now


DOCUMENT_CONTEXT = {
    "@vocab": "http://schema.org/",
    "doc": "http://example.org/document#",
    "chunk": "http://example.org/chunk#",
    "id": "@id",
    "type": "@type",
    "relationships": {
        "@id": "chunk:hasRelation",
        "@type": "@id"
    }
}


class DocumentGraph(BaseModel):
    id: str
    document_type: str
    chunks: List[DocumentChunk]
    metadata: Dict
    context: Dict = DOCUMENT_CONTEXT
```
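A quick usage sketch (field values are made up) showing how a chunk and graph serialize, with the JSON-LD context applied as the default:

```python
chunk = DocumentChunk(
    id="82092117#p1#0",               # articleID#paraCode#idInPara
    mainText="FAX COVER SHEET",
    summary="Header of the fax cover sheet",
    description="Top-of-page header region",
    type="header",
    region_id="r1",
    relationships=[{"type": "above", "target": "r2"}],
    belongTo="fax cover sheet",
)

doc = DocumentGraph(
    id="82092117",
    document_type="fax cover sheet",
    chunks=[chunk],
    metadata={"source": "82092117.png", "num_regions": 1},
)
print(doc.model_dump_json(indent=2))  # pydantic v2; use doc.json() on v1
```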
- `cluster_paragraphs` -> Region `region_id`, `paragraphs`
- `group_paragraphs_by_cluster` -> Region `mainText`, `boundingBox`, `key_value_pairs`
- `compute_cluster_relationships` -> Region `relationships`
- `get_semantic_designation` -> Region `type`
- `sort_clusters_reading_order` -> Document Metadata `readingOrder`

Region requires: `description`, `belongTo`, `supporting_chunks`, `id`.
Document requires: `chunks`, `metadata`, `context`, `document_type`, `id`.
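A sketch of how these steps might chain together; the signatures and intermediate shapes are assumptions for illustration, not the repo's actual API:

```python
def build_document_graph(doc_id: str, image_path: str) -> DocumentGraph:
    """Illustrative wiring of the pipeline steps listed above."""
    clusters = cluster_paragraphs(image_path)          # region_id, paragraphs
    regions = group_paragraphs_by_cluster(clusters)    # mainText, boundingBox, key_value_pairs
    rels = compute_cluster_relationships(regions)      # relationships, keyed by region_id
    ordered = sort_clusters_reading_order(regions)     # feeds metadata readingOrder

    chunks = [
        DocumentChunk(
            id=f"{doc_id}#p{i}#0",
            mainText=r["mainText"],
            summary="",                    # not yet produced by the pipeline
            description="",                # Region requires: description
            type=get_semantic_designation(r),
            region_id=r["region_id"],
            relationships=rels.get(r["region_id"], []),
            belongTo="document",           # Region requires: belongTo
        )
        for i, r in enumerate(ordered)
    ]
    return DocumentGraph(
        id=doc_id,
        document_type="form",
        chunks=chunks,
        metadata={"readingOrder": [r["region_id"] for r in ordered]},
    )
```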
"context": {
"@vocab": "http://schema.org/",
"doc": "http://example.org/document#",
"chunk": "http://example.org/chunk#",
"id": "@id",
"type": "@type",
"relationships": {
"@id": "chunk:hasRelation",
"@type": "@id"
}
}
"metadata": {
"source": "82092117.png",
"created": "2025-02-22T16:53:01.911217",
"document_type": "fax cover sheet",
"num_regions": 7,
"analysis_version": "1.0",
"region_ids": [
"r1",
"r2",
"r3",
"r4",
"r5",
"r6",
"r7"
],
"content_types": [
"contact information",
"header",
"instructions",
"notice",
"note",
"footer"
]
}
- Will add `keyConcepts` to Region.
- Add `embeddingVector` to Region?
Below are a few popular embedding models you might consider for generating vector representations of your clusters. Each has different trade-offs in terms of accuracy, dimensionality, domain suitability, and cost.
- `all-MiniLM-L6-v2`: A lightweight and fast model (384-dimensional embeddings) that still achieves reasonable performance across many semantic tasks. Ideal if you have limited compute resources or need quicker inference.
- `all-mpnet-base-v2`: A larger model (768-dimensional embeddings) with strong performance on semantic similarity tasks. Good if you can afford slightly more compute for better quality.
Pros:
- Easy to use with the `sentence-transformers` library in Python.
- Good community support; many pre-trained variants.
- Free to run locally (once you've downloaded the model).

Cons:
- Must run them on your own hardware/cloud GPU (or on CPU at slower speeds).
- If your domain is very specialized, you might prefer a domain-specific model or fine-tuning.
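A minimal sketch with the `sentence-transformers` API (the cluster texts are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # or "all-mpnet-base-v2"

cluster_texts = ["FAX COVER SHEET", "Date: 4/16/90"]  # one string per region
vectors = model.encode(cluster_texts, normalize_embeddings=True)
print(vectors.shape)  # (2, 384) for MiniLM; mpnet yields 768 dimensions
```

Vectors like these are what the proposed (not yet adopted) `embeddingVector` field on Region would hold.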
- `text-embedding-ada-002`: A general-purpose embedding model from OpenAI with strong performance on semantic similarity, clustering, and classification tasks.

Pros:
- High-quality embeddings with minimal setup or tuning.
- Simple API usage (send text to their endpoint, get vectors back).
- 1536-dimensional embeddings with strong consistency across tasks.

Cons:
- Paid model (though relatively cost-effective).
- Data must be sent to OpenAI's servers (a privacy concern unless you have an enterprise agreement or handle ephemeral usage).
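The equivalent call against OpenAI's embeddings endpoint (requires `OPENAI_API_KEY` in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["FAX COVER SHEET", "Date: 4/16/90"],
)
vectors = [item.embedding for item in resp.data]
print(len(vectors[0]))  # 1536
```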
If your document clusters are in a highly specialized domain (e.g., legal, biomedical), you might want:
- BioBERT or LegalBERT variants (if relevant).
- Fine-tuning a general model (e.g., `all-mpnet-base-v2`) on a small domain-specific dataset to boost performance.
- Start Simple: If you have moderate GPU resources or are comfortable with cloud solutions, try a Sentence Transformers model like `all-mpnet-base-v2` or `all-MiniLM-L6-v2`.
- Cloud API: Use OpenAI's `text-embedding-ada-002` if you prefer an off-the-shelf cloud API with minimal infrastructure (and the data can be safely uploaded).
- Domain-Specific: If your documents are niche (medical, legal, etc.), check for specialized models in `sentence-transformers` or on Hugging Face.
- Chunk Preprocessing:
  - Ensure each cluster's text is concise yet representative.
  - Convert any stray line breaks or non-UTF-8 characters to avoid tokenization issues.
- Normalization:
  - Decide on consistent lowercasing, punctuation removal, etc., if beneficial.
  - Keep in mind that some embedding models handle text casing well; check the model's best practices.
- Vector Storage:
  - For many clusters, store embeddings in a vector database (e.g., FAISS, Milvus, Pinecone) to quickly run similarity queries or clustering (see the FAISS sketch after this list).
- Evaluation:
  - Test whether embeddings meaningfully separate or group clusters in the ways you expect. If not, consider a different model or domain adaptation.
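A minimal FAISS sketch for the storage step, assuming L2-normalized 384-dimensional MiniLM vectors so that inner product equals cosine similarity (the random vectors stand in for real embeddings):

```python
import faiss
import numpy as np

dim = 384  # all-MiniLM-L6-v2; use 768 for all-mpnet-base-v2
index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors

vectors = np.random.rand(100, dim).astype("float32")  # stand-in embeddings
faiss.normalize_L2(vectors)
index.add(vectors)

scores, ids = index.search(vectors[:1], k=5)  # 5 nearest clusters to the first
print(ids[0], scores[0])
```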
Bottom line: For general English text across many domains, OpenAI's `text-embedding-ada-002` or `all-mpnet-base-v2` from sentence-transformers are top picks. They strike a good balance of performance, dimension size, and ease of use for cluster-level embeddings.