ContextGem - Effortless LLM extraction from documents
====================================================================================================

Copyright (c) 2025 Shcherbak AI AS
All rights reserved
Developed by Sergii Shcherbak

This software is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


# ==== Documentation Content ====


# ==== motivation ====

Why ContextGem?
***************

ContextGem is an LLM framework designed to strike the right balance between ease of use, customizability, and accuracy for structured data and insights extraction from documents.

ContextGem offers the **easiest and fastest way** to build LLM extraction workflows for document analysis through powerful abstractions of the most time-consuming parts.


⏱️ Development Overhead of Other Frameworks
===========================================

Most popular LLM frameworks for extracting structured data from documents require extensive boilerplate code to extract even basic information. As a developer using these frameworks, you're typically expected to:

📝 Prompt Engineering

* Write custom prompts from scratch for each extraction scenario

* Maintain different prompt templates for different extraction workflows

* Adapt prompts manually when extraction requirements change

🔧 Technical Implementation

* Define your own data models and implement validation logic

* Implement complex chaining for multi-LLM workflows

* Implement nested context extraction logic (*e.g.
  document > sections > paragraphs > entities*)

* Configure text segmentation logic for correct reference mapping

* Configure concurrent I/O processing logic to speed up complex extraction workflows

**Result:** All these requirements significantly increase development time and complexity.


💡 The ContextGem Solution
==========================

ContextGem addresses these challenges by providing a flexible, intuitive framework that extracts structured data and insights from documents with minimal effort. The most complex and time-consuming parts are handled with **powerful abstractions**, eliminating boilerplate code and reducing development overhead.

With ContextGem, you benefit from a "batteries included" approach, coupled with simple, intuitive syntax.


ContextGem and Other Open-Source LLM Frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-----+-----------------------------------------------+----------------+----------------------+
|     | Key built-in abstractions                     | **ContextGem** | Other frameworks*    |
|=====|===============================================|================|======================|
| 💎  | **Automated dynamic prompts** Automatically   | 🟢             | ◯                    |
|     | constructs comprehensive prompts for your     |                |                      |
|     | specific extraction needs.                    |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Automated data modelling and validators**   | 🟢             | ◯                    |
|     | Automatically creates data models and         |                |                      |
|     | validation logic.                             |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Precise granular reference mapping          | 🟢             | ◯                    |
|     | (paragraphs & sentences)** Automatically      |                |                      |
|     | maps extracted data to the relevant parts of  |                |                      |
|     | the document, which always match the source   |                |                      |
|     | document, with customizable granularity.      |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Justifications (reasoning backing the       | 🟢             | ◯                    |
|     | extraction)** Automatically provides          |                |                      |
|     | justifications for each extraction, with      |                |                      |
|     | customizable granularity.                     |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Neural segmentation (SaT)** Automatically   | 🟢             | ◯                    |
|     | segments the document into paragraphs and     |                |                      |
|     | sentences using state-of-the-art SaT models,  |                |                      |
|     | compatible with many languages.               |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Multilingual support (I/O without           | 🟢             | ◯                    |
|     | prompting)** Supports multiple languages in   |                |                      |
|     | input and output without additional           |                |                      |
|     | prompting.                                    |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Single, unified extraction pipeline         | 🟢             | 🟡                   |
|     | (declarative, reusable, fully serializable)** |                |                      |
|     | Lets you define a complete extraction         |                |                      |
|     | workflow in a single, unified, reusable       |                |                      |
|     | pipeline, using simple declarative syntax.    |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Grouped LLMs with role-specific tasks**     | 🟢             | 🟡                   |
|     | Lets you easily group LLMs with different     |                |                      |
|     | roles to process role-specific tasks in the   |                |                      |
|     | pipeline.                                     |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Nested context extraction** Automatically   | 🟢             | 🟡                   |
|     | manages nested context based on the pipeline  |                |                      |
|     | definition (e.g. document > aspects >         |                |                      |
|     | sub-aspects > concepts).                      |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Unified, fully serializable results storage | 🟢             | 🟡                   |
|     | model (document)** All extraction results     |                |                      |
|     | are stored on the document object, including  |                |                      |
|     | aspects, sub-aspects, and concepts. This      |                |                      |
|     | object is fully serializable, and all the     |                |                      |
|     | extraction results can be restored, with just |                |                      |
|     | one line of code.                             |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Extraction task calibration with examples** | 🟢             | 🟡                   |
|     | Lets you easily define and attach output      |                |                      |
|     | examples that guide the LLM's extraction      |                |                      |
|     | behavior, without manually modifying prompts. |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Built-in concurrent I/O processing**        | 🟢             | 🟡                   |
|     | Automatically manages concurrent I/O          |                |                      |
|     | processing to speed up complex extraction     |                |                      |
|     | workflows, with a simple switch               |                |                      |
|     | ("use_concurrency=True").                     |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Automated usage & costs tracking**          | 🟢             | 🟡                   |
|     | Automatically tracks usage (calls, tokens,    |                |                      |
|     | costs) of all LLM calls.                      |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Fallback and retry logic** Built-in retry   | 🟢             | 🟢                   |
|     | logic and easily attachable fallback LLMs.    |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+
| 💎  | **Multiple LLM providers** Compatible with a  | 🟢             | 🟢                   |
|     | wide range of commercial and locally hosted   |                |                      |
|     | LLMs.                                         |                |                      |
+-----+-----------------------------------------------+----------------+----------------------+

🟢 - fully supported - no additional setup required

🟡 - partially supported - requires additional setup

◯ - not supported - requires custom logic

* See ContextGem and other frameworks for specific implementation examples comparing ContextGem with other popular open-source LLM frameworks. (Comparison as of 24 March 2025.)


🎯 Focused Approach
===================

ContextGem is intentionally optimized for **in-depth single-document analysis** to deliver maximum extraction accuracy and precision. While this focused approach enables superior results for individual documents, ContextGem currently does not support cross-document querying or corpus-wide information retrieval. For these use cases, modern RAG frameworks (e.g. LlamaIndex) remain more appropriate.


# ==== vs_other_frameworks ====

ContextGem and other frameworks
*******************************

Due to ContextGem's powerful abstractions, it is the **easiest and fastest way** to build LLM extraction workflows for document analysis.


✏️ Basic Example
================

Below is a basic example of an extraction workflow - *extraction of anomalies from a document* - implemented side-by-side in ContextGem and other frameworks. (All implementations are self-contained. Comparison as of 24 March 2025.)

Even implementing this basic extraction workflow requires significantly more effort in other frameworks:

* 🔧 **Manual model definition**: Developers must define Pydantic validation models for structured output

* 📝 **Prompt engineering**: Crafting comprehensive prompts that guide the LLM effectively

* 🔄 **Output parsing logic**: Setting up parsers to handle the LLM's response

* 📄 **Reference mapping**: Writing custom logic for mapping references in the source document

In contrast, ContextGem handles all these complexities automatically.
Users simply describe what to extract in natural language, provide basic configuration parameters, and the framework takes care of the rest.

-[ **ContextGem** ]-

⚡ Fastest way

ContextGem is the fastest and easiest way to implement an LLM extraction workflow. All the boilerplate code is handled behind the scenes.

**Major time savers:**

* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that requires minimal code

* 📝 **Automatic prompt engineering**: ContextGem automatically constructs a prompt tailored to the extraction task

* 🔄 **Automatic model definition**: ContextGem automatically defines the Pydantic model for structured output

* 🧩 **Automatic output parsing**: ContextGem automatically parses the LLM's response

* 🔍 **Automatic reference tracking**: Precise references are automatically extracted and mapped to the original document

* 📏 **Flexible reference granularity**: References can be tracked at different levels (paragraphs, sentences)

Anomaly extraction example (ContextGem)

   # Quick Start Example - Extracting anomalies from a document, with source references and justifications

   import os

   from contextgem import Document, DocumentLLM, StringConcept


   # Sample document text (shortened for brevity)
   doc = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # 💎 anomaly
           "Time-traveling dinosaurs will review all deliverables before acceptance.\n"  # 💎 another anomaly
           "This agreement is governed by the laws of Norway...\n"
       ),
   )

   # Attach a document-level concept
   doc.concepts = [
       StringConcept(
           name="Anomalies",  # in longer contexts, this concept is hard to capture with RAG
           description="Anomalies in the document",
           add_references=True,
           reference_depth="sentences",
           add_justifications=True,
           justification_depth="brief",
           # see the docs for more configuration options
       )
       # add more concepts to the document, if needed
       # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
   ]
   # Or use `doc.add_concepts([...])`

   # Define an LLM for extracting information from the document
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # or another provider/LLM
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key for the LLM provider
       # see the docs for more configuration options
   )

   # Extract information from the document
   doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`

   # Access extracted information in the document object
   anomalies_concept = doc.concepts[0]
   # or `doc.get_concept_by_name("Anomalies")`
   for item in anomalies_concept.extracted_items:
       print("Anomaly:")
       print(f" {item.value}")
       print("Justification:")
       print(f" {item.justification}")
       print("Reference paragraphs:")
       for p in item.reference_paragraphs:
           print(f" - {p.raw_text}")
       print("Reference sentences:")
       for s in item.reference_sentences:
           print(f" - {s.raw_text}")
       print()

-[ LangChain ]-

LangChain is a popular and versatile framework for building LLM applications through composable components. It offers excellent flexibility and a rich ecosystem of integrations. While powerful, feature-rich, and widely adopted in the industry, it requires more manual configuration and setup work for structured data extraction tasks compared to ContextGem's streamlined approach.
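
To illustrate what manual reference mapping can involve, the following minimal sketch checks whether a sentence recited by an LLM actually occurs in the source text, since a recited reference may be paraphrased or invented. It uses only the Python standard library. The function names ("split_sentences", "find_reference") and the one-sentence-per-line splitting are assumptions made for this example; they are not part of LangChain or ContextGem.

```python
import re
from typing import Optional


def split_sentences(text: str) -> list[str]:
    """Naive segmentation: treat each non-empty line as one sentence.

    This is an assumption of the sketch; real documents need a proper
    sentence segmenter.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def _normalize(sentence: str) -> str:
    """Collapse whitespace and lowercase, so trivial differences are ignored."""
    return re.sub(r"\s+", " ", sentence).strip().lower()


def find_reference(recited: str, source_text: str) -> Optional[str]:
    """Return the source sentence matching the recited one, or None.

    None means the recited reference could not be located in the source,
    i.e. it should not be trusted as a citation.
    """
    target = _normalize(recited)
    for sentence in split_sentences(source_text):
        if _normalize(sentence) == target:
            return sentence
    return None


document_text = (
    "Consultancy Agreement\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"
    "This agreement is governed by the laws of Norway...\n"
)

# A recited reference that differs only in case and spacing is still found.
found = find_reference(
    "the purple  elephant danced gracefully on the moon while eating ice cream.",
    document_text,
)
# A sentence that is not in the document is rejected.
missing = find_reference("Dinosaurs approve all invoices.", document_text)
print(found)
print(missing)
```

Exact matching after normalization is deliberately strict: a paraphrased reference is rejected rather than mapped to a "closest" sentence. Fuzzy matching would be a separate design decision with its own failure modes.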
**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic validation models for structured output

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's response

* 🔍 **Manual reference mapping**: Writing custom logic for mapping references

Anomaly extraction example (LangChain)

   # LangChain implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent

   from langchain.output_parsers import PydanticOutputParser
   from langchain.prompts import PromptTemplate
   from langchain_core.runnables import RunnableLambda, RunnablePassthrough
   from langchain_openai import ChatOpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       """An anomaly found in the document."""

       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       reference: str = Field(
           description="The sentence containing the anomaly"
       )  # LLM reciting a reference is error-prone and unreliable


   class AnomaliesList(BaseModel):
       """List of anomalies found in the document."""

       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   def extract_anomalies_with_langchain(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using LangChain.

       Args:
           document_text: The text content of the document
           api_key: OpenAI API key (defaults to environment variable)

       Returns:
           List of extracted anomalies with justifications and references
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=openai_api_key, temperature=0)

       # Create a parser for structured output
       parser = PydanticOutputParser(pydantic_object=AnomaliesList)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       template = dedent(
           """
           You are an expert document analyzer. Your task is to identify any anomalies in the document.
           Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
           with the rest of the document's context and purpose.

           Document:
           {document_text}

           Identify all anomalies in the document. For each anomaly, provide:
           1. The anomalous text
           2. A brief justification explaining why it's an anomaly
           3. The complete sentence containing the anomaly for reference

           {format_instructions}
           """
       )

       prompt = PromptTemplate(
           template=template,
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       # Create a runnable chain
       chain = (
           {"document_text": lambda x: x}
           | RunnablePassthrough.assign()
           | prompt
           | llm
           | RunnableLambda(lambda x: parser.parse(x.content))
       )

       # Run the chain and extract anomalies
       parsed_output = chain.invoke(document_text)
       return parsed_output.anomalies


   # Example usage
   # Sample document text (shortened for brevity)
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   # Extract anomalies
   anomalies = extract_anomalies_with_langchain(document_text)

   # Print results
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")

-[ LlamaIndex ]-

LlamaIndex is a powerful and versatile framework for building LLM applications with data, particularly excelling at RAG workflows and document retrieval. It offers a comprehensive set of tools for data indexing and querying. While highly effective for its intended use cases, for structured data extraction tasks (non-RAG setup), it requires more manual configuration and setup work compared to ContextGem's streamlined approach.
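
As a rough picture of the manual output parsing and validation logic referred to in this comparison, the sketch below turns a raw LLM reply into typed objects using only the Python standard library. It is not how LlamaIndex, LangChain, or ContextGem parse output internally; the "Anomaly" dataclass and "parse_anomalies" function are hypothetical names chosen for this example, and the expected JSON shape is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Anomaly:
    text: str
    justification: str


def parse_anomalies(raw: str) -> list[Anomaly]:
    """Parse a reply expected to contain {"anomalies": [{...}, ...]}.

    Raises ValueError when no valid JSON object is found or when a
    required field is missing, so callers can retry or fall back.
    """
    # Models often add prose around the JSON; keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    data = json.loads(raw[start : end + 1])  # JSONDecodeError is a ValueError

    items = data.get("anomalies")
    if not isinstance(items, list):
        raise ValueError('"anomalies" must be a list')

    result = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"anomaly {i}: expected an object")
        for key in ("text", "justification"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"anomaly {i}: missing or empty {key!r}")
        result.append(Anomaly(item["text"], item["justification"]))
    return result


reply = (
    "Here is the result:\n"
    '{"anomalies": [{"text": "purple elephant", '
    '"justification": "unrelated to a consultancy agreement"}]}'
)
parsed = parse_anomalies(reply)
print(parsed)
```

Raising on malformed output, rather than returning an empty list, keeps "the model found nothing" distinguishable from "the reply could not be parsed".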
**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic validation models for structured output

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's response

* 🔍 **Manual reference mapping**: Writing custom logic for mapping references

Anomaly extraction example (LlamaIndex)

   # LlamaIndex implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent

   from llama_index.core.output_parsers import PydanticOutputParser
   from llama_index.core.program import LLMTextCompletionProgram
   from llama_index.llms.openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       """An anomaly found in the document."""

       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       reference: str = Field(
           description="The sentence containing the anomaly"
       )  # LLM reciting a reference is error-prone and unreliable


   class AnomaliesList(BaseModel):
       """List of anomalies found in the document."""

       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   def extract_anomalies_with_llama_index(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using LlamaIndex.

       Args:
           document_text: The text content of the document
           api_key: OpenAI API key (defaults to environment variable)

       Returns:
           List of extracted anomalies with justifications and references
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
           You are an expert document analyzer. Your task is to identify any anomalies in the document.
           Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
           with the rest of the document's context and purpose.

           Document:
           {document_text}

           Identify all anomalies in the document. For each anomaly, provide:
           1. The anomalous text
           2. A brief justification explaining why it's an anomaly
           3. The complete sentence containing the anomaly for reference
           """
       )

       # Use PydanticOutputParser to directly parse the LLM output into our structured format
       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=AnomaliesList),
           prompt_template_str=prompt_template,
           llm=llm,
           verbose=True,
       )

       # Execute the program
       try:
           result = program(document_text=document_text)
           return result.anomalies
       except Exception as e:
           print(f"Error parsing LLM response: {e}")
           return []


   # Example usage
   # Sample document text (shortened for brevity)
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   # Extract anomalies
   anomalies = extract_anomalies_with_llama_index(document_text)

   # Print results
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")

-[ LlamaIndex (RAG) ]-

LlamaIndex with RAG setup is a powerful and sophisticated framework for document retrieval and analysis, offering exceptional capabilities for knowledge-intensive applications.
Its comprehensive architecture excels at handling complex document interactions and information retrieval tasks across large document collections. While it provides robust and versatile capabilities for building advanced document-based applications, it does require more manual configuration and specialized setup for structured extraction tasks compared to ContextGem's streamlined and intuitive approach.

**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic validation models for structured output

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's response

* 🔍 **Complex reference mapping**: Getting precise references correctly requires additional config, such as setting up a sentence splitter, CitationQueryEngine, adjusting chunk sizes, etc.

Anomaly extraction example (LlamaIndex RAG)

   # LlamaIndex (RAG) implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent
   from typing import Any

   from llama_index.core import Document, Settings, VectorStoreIndex
   from llama_index.core.base.response.schema import RESPONSE_TYPE
   from llama_index.core.node_parser import SentenceSplitter
   from llama_index.core.output_parsers import PydanticOutputParser
   from llama_index.core.query_engine import CitationQueryEngine
   from llama_index.core.response_synthesizers.base import BaseSynthesizer
   from llama_index.core.retrievers import VectorIndexRetriever
   from llama_index.llms.openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       # This field will hold the citation info (e.g., node references)
       source_id: str | None = Field(
           description="Automatically added source reference", default=None
       )


   class AnomaliesList(BaseModel):
       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   # Custom synthesizer that instructs the LLM to extract anomalies in JSON format.
   class AnomalyExtractorSynthesizer(BaseSynthesizer):
       def __init__(self, llm=None, nodes=None):
           super().__init__()
           self._llm = llm or Settings.llm
           # Nodes are still provided in case additional context is needed.
           self._nodes = nodes or []

       def _get_prompts(self) -> dict[str, Any]:
           return {}

       def _update_prompts(self, prompts: dict[str, Any]):
           return

       async def aget_response(
           self, query_str: str, text_chunks: list[str], **kwargs: Any
       ) -> RESPONSE_TYPE:
           return self.get_response(query_str, text_chunks, **kwargs)

       def get_response(
           self, query_str: str, text_chunks: list[str], **kwargs: Any
       ) -> str:
           all_text = "\n".join(text_chunks)

           # Prompt must be manually drafted
           # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
           # Note: the JSON sample below has no trailing comma, so the format shown to the LLM is valid JSON.
           prompt_str = dedent(
               """
               You are an expert document analyzer. Your task is to identify anomalies in the document.
               Anomalies are statements or phrases that seem out of place or inconsistent with the document's context.

               Document:
               {all_text}

               For each anomaly, provide:
               1. The anomalous text (only the specific phrase).
               2. A brief justification for why it is an anomaly.

               Format your answer as a JSON object:
               {{
                   "anomalies": [
                       {{
                           "text": "anomalous text",
                           "justification": "reason for anomaly"
                       }}
                   ]
               }}
               """
           )
           print(prompt_str)
           output_parser = PydanticOutputParser(output_cls=AnomaliesList)
           response = self._llm.complete(prompt_str.format(all_text=all_text))

           try:
               parsed_response = output_parser.parse(response.text)
               self._last_anomalies = parsed_response
               return parsed_response.model_dump_json()
           except Exception as e:
               print(f"Error parsing LLM response: {e}")
               print(f"Raw response: {response.text}")
               return "{}"


   def extract_anomalies_with_citations(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using LlamaIndex with citation support.

       Args:
           document_text: The content of the document.
           api_key: OpenAI API key (if not provided, read from environment variable).

       Returns:
           List of extracted anomalies with automatically added source references.
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)
       Settings.llm = llm

       # Create a Document and split it into nodes
       doc = Document(text=document_text)
       splitter = SentenceSplitter(
           paragraph_separator="\n",
           chunk_size=100,
           chunk_overlap=0,
       )
       nodes = splitter.get_nodes_from_documents([doc])
       print(f"Document split into {len(nodes)} nodes")

       # Build a vector index and retriever using all nodes.
       index = VectorStoreIndex(nodes)
       retriever = VectorIndexRetriever(index=index, similarity_top_k=len(nodes))

       # Create a custom synthesizer.
       synthesizer = AnomalyExtractorSynthesizer(llm=llm, nodes=nodes)

       # Initialize CitationQueryEngine by passing the expected components.
       citation_query_engine = CitationQueryEngine(
           retriever=retriever,
           llm=llm,
           response_synthesizer=synthesizer,
           citation_chunk_size=100,  # Adjust as needed
           citation_chunk_overlap=10,  # Adjust as needed
       )

       try:
           response = citation_query_engine.query(
               "Extract all anomalies from this document"
           )
           # If the synthesizer stored the anomalies, attach the citation info
           if hasattr(synthesizer, "_last_anomalies"):
               anomalies = synthesizer._last_anomalies.anomalies
               formatted_citations = (
                   response.get_formatted_sources()
                   if hasattr(response, "get_formatted_sources")
                   else None
               )
               for anomaly in anomalies:
                   anomaly.source_id = formatted_citations
               return anomalies
           return []

       except Exception as e:
           print(f"Error querying document: {e}")
           return []


   # Example usage
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   anomalies = extract_anomalies_with_citations(document_text)
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")

-[ Instructor ]-

Instructor is a popular framework that specializes in structured data extraction with LLMs using Pydantic. It offers excellent type safety and validation capabilities, making it a solid choice for many extraction tasks. While powerful for structured outputs, Instructor requires more manual setup for document analysis workflows.
**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic validation models for structured output

* 🔍 **Manual reference mapping**: Writing custom logic for mapping references

Anomaly extraction example (Instructor)

   # Instructor implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent

   import instructor
   from openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       """An anomaly found in the document."""

       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       source_text: str = Field(
           description="The sentence containing the anomaly"
       )  # LLM reciting a reference is error-prone and unreliable


   class AnomaliesList(BaseModel):
       """List of anomalies found in the document."""

       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   def extract_anomalies_with_instructor(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using Instructor.

       Args:
           document_text: The text content of the document
           api_key: OpenAI API key (defaults to environment variable)

       Returns:
           List of extracted anomalies with justifications and references
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       client = OpenAI(api_key=openai_api_key)
       instructor_client = instructor.from_openai(client)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert document analyzer. Your task is to identify any anomalies in the document.
           Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
           with the rest of the document's context and purpose.

           Document:
           {document_text}

           Identify all anomalies in the document. For each anomaly, provide:
           1. The anomalous text - just the specific anomalous phrase
           2. A brief justification explaining why it's an anomaly
           3. The exact complete sentence containing the anomaly for reference

           Only identify real anomalies that truly don't belong in this type of document.
           """
       )

       # Extract structured data using Instructor
       response = instructor_client.chat.completions.create(
           model="gpt-4o-mini",
           response_model=AnomaliesList,
           messages=[
               {"role": "system", "content": "You are an expert document analyzer."},
               {"role": "user", "content": prompt},
           ],
           temperature=0,
       )

       return response.anomalies


   # Example usage
   # Sample document text (shortened for brevity)
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   # Extract anomalies
   anomalies = extract_anomalies_with_instructor(document_text)

   # Print results
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")


🔬 Advanced Example
===================

As use cases grow more complex, the development overhead of alternative frameworks becomes increasingly evident, while ContextGem's abstractions deliver substantial time savings.
As extraction steps stack up, the implementation with other frameworks quickly becomes *non-scalable*:

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts for each extraction step

* 🔧 **Manual model definition**: Defining Pydantic validation models for each element of extraction

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's response

* 🔍 **Manual reference mapping**: Writing custom logic for mapping references

* 📄 **Complex pipeline configuration**: Writing custom logic for pipeline configuration and extraction components

* 📊 **Implementing usage and cost tracking callbacks**, which quickly increases in complexity when multiple LLMs are used in the pipeline

* 🔄 **Complex concurrency setup**: Implementing complex concurrency logic with asyncio

* 📝 **Embedding examples in prompts**: Writing output examples directly in the custom prompts

* 📋 **Manual result aggregation**: Writing code to collect and organize results

Below is a more advanced example of an extraction workflow - *using an extraction pipeline for multiple documents, with concurrency and cost tracking* - implemented side-by-side in ContextGem and other frameworks. (All implementations are self-contained. Comparison as of 24 March 2025.)

-[ **ContextGem** ]-

⚡ Fastest way

ContextGem is the fastest and easiest way to implement an LLM extraction workflow. All the boilerplate code is handled behind the scenes.
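
For a sense of the boilerplate in question, the concurrency and usage-tracking items from the list above can be sketched by hand roughly as follows. This is a standard-library illustration only: "fake_llm_call" is a hypothetical stand-in for a real provider request, token counts are approximated by whitespace-separated words, and the per-million-token prices passed to "cost" are placeholders, not real pricing.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Usage:
    """Running totals that a hand-written tracking callback would maintain."""

    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def cost(self, input_per_1m: float, output_per_1m: float) -> float:
        """Total cost given prices per one million tokens."""
        total = self.input_tokens * input_per_1m + self.output_tokens * output_per_1m
        return total / 1_000_000


async def fake_llm_call(prompt: str) -> tuple[str, int, int]:
    """Hypothetical LLM request returning (answer, input tokens, output tokens)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    answer = f"summary of: {prompt[:20]}"
    return answer, len(prompt.split()), len(answer.split())


async def run_all(
    prompts: list[str], max_concurrency: int = 2
) -> tuple[list[str], Usage]:
    """Run all prompts concurrently, capped by a semaphore, and track usage."""
    usage = Usage()
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt: str) -> str:
        async with semaphore:  # limit the number of in-flight requests
            answer, tokens_in, tokens_out = await fake_llm_call(prompt)
        usage.calls += 1
        usage.input_tokens += tokens_in
        usage.output_tokens += tokens_out
        return answer

    # gather() preserves input order, which simplifies result aggregation.
    answers = await asyncio.gather(*(run_one(p) for p in prompts))
    return list(answers), usage


answers, usage = asyncio.run(
    run_all(["Extract the contract term", "Extract the parties"])
)
print(usage, usage.cost(input_per_1m=1.0, output_per_1m=2.0))
```

Even this toy version has to decide on a concurrency cap, where the counters live, and how results are ordered. With several LLMs in one pipeline, each would need its own "Usage" instance and price pair.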
**Major time savers:**

* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that requires minimal code
* 🔄 **Automatic model definition**: ContextGem automatically defines the Pydantic model for structured output
* 📝 **Automatic prompt engineering**: ContextGem automatically constructs a prompt tailored to the extraction task
* 🧩 **Automatic output parsing**: ContextGem automatically parses the LLM's response
* 🔍 **Automatic reference tracking**: Precise references are automatically extracted and mapped to the original document
* 📏 **Flexible reference granularity**: References can be tracked at different levels (paragraphs, sentences)
* 📄 **Easy pipeline definition**: Simple, declarative syntax for defining the extraction pipeline involving multiple LLMs, in a few lines of code
* 💰 **Automated usage and cost tracking**: Built-in token counting and cost calculation without additional setup
* 🔄 **Built-in concurrency**: Concurrent execution of extraction steps with a simple switch "use_concurrency=True"
* 📊 **Easy example definition**: Output examples can be easily defined without modifying any prompts
* 📋 **Built-in result aggregation**: Results are automatically collected and organized in a unified storage model (document)

Extraction pipeline example (ContextGem)

   # Advanced Usage Example - analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency and cost tracking

   import os

   from contextgem import (
       Aspect,
       DateConcept,
       Document,
       DocumentLLM,
       DocumentLLMGroup,
       ExtractionPipeline,
       JsonObjectConcept,
       JsonObjectExample,
       LLMPricing,
       NumericalConcept,
       RatingConcept,
       StringConcept,
       StringExample,
   )

   # Construct documents

   # Document 1 - Consultancy Agreement (shortened for brevity)
   doc1 = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "All intellectual property created during the provision of services shall belong to the Customer...\n"
           "This agreement is governed by the laws of Norway...\n"
           "Annex 1: Data processing agreement...\n"
           "Annex 2: Statement of Work...\n"
           "Annex 3: Service Level Agreement...\n"
       ),
   )

   # Document 2 - Service Level Agreement (shortened for brevity)
   doc2 = Document(
       raw_text=(
           "Service Level Agreement\n"
           "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
           "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
           "The Provider shall deliver IT support services as outlined in Schedule A...\n"
           "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
           "The Provider guarantees [99.9%] uptime for all critical systems...\n"
           "Either party may terminate with 60 days written notice...\n"
           "This agreement is governed by the laws of California...\n"
           "Schedule A: Service Descriptions...\n"
           "Schedule B: Response Time Requirements...\n"
       ),
   )

   # Create a reusable extraction pipeline
   contract_pipeline = ExtractionPipeline()

   # Define aspects and aspect-level concepts in the pipeline
   # Concepts in the aspects will be extracted from the extracted aspect context
   contract_pipeline.aspects = [  # or use .add_aspects([...])
       Aspect(
           name="Contract Parties",
           description="Clauses defining the parties to the agreement",
           concepts=[  # define aspect-level concepts, if any
               StringConcept(
                   name="Party names and roles",
                   description="Names of all parties entering into the agreement and their roles",
                   examples=[  # optional
                       StringExample(
                           content="X (Client)",  # guidance regarding the expected output format
                       )
                   ],
               )
           ],
       ),
       Aspect(
           name="Term",
           description="Clauses defining the term of the agreement",
           concepts=[
               NumericalConcept(
                   name="Contract term",
                   description="The term of the agreement in years",
                   numeric_type="int",  # or "float", or "any" for auto-detection
                   add_references=True,  # extract references to the source text
                   reference_depth="paragraphs",
               )
           ],
       ),
   ]

   # Define document-level concepts
   # Concepts in the document will be extracted from the whole document content
   contract_pipeline.concepts = [  # or use .add_concepts()
       DateConcept(
           name="Effective date",
           description="The effective date of the agreement",
       ),
       StringConcept(
           name="Contract type",
           description="The type of agreement",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
       ),
       StringConcept(
           name="Governing law",
           description="The law that governs the agreement",
       ),
       JsonObjectConcept(
           name="Attachments",
           description="The titles and concise descriptions of the attachments to the agreement",
           structure={"title": str, "description": str | None},
           examples=[  # optional
               JsonObjectExample(  # guidance regarding the expected output format
                   content={
                       "title": "Appendix A",
                       "description": "Code of conduct",
                   }
               ),
           ],
       ),
       RatingConcept(
           name="Duration adequacy",
           description="Contract duration adequacy considering the subject matter and best practices.",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
           rating_scale=(1, 10),
           add_justifications=True,  # add justifications for the rating
           justification_depth="balanced",  # provide a balanced justification
           justification_max_sents=3,
       ),
   ]

   # Assign pipeline to the documents
   # You can re-use the same pipeline for multiple documents
   doc1.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document
   doc2.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document

   # Create an LLM group for data extraction and reasoning
   llm_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="extractor_text",  # signifies the LLM is used for data extraction tasks
       pricing_details=LLMPricing(  # optional, for costs calculation
           input_per_1m_tokens=0.150,
           output_per_1m_tokens=0.600,
       ),  # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   llm_reasoner = DocumentLLM(
       model="openai/o3-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="reasoner_text",  # signifies the LLM is used for reasoning tasks
       pricing_details=LLMPricing(  # optional, for costs calculation
           input_per_1m_tokens=1.10,
           output_per_1m_tokens=4.40,
       ),  # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   # The LLM group is used for all extraction tasks within the pipeline
   llm_group = DocumentLLMGroup(llms=[llm_extractor, llm_reasoner])

   # Extract all information from the documents at once
   doc1 = llm_group.extract_all(
       doc1, use_concurrency=True
   )  # use concurrency to speed up extraction
   doc2 = llm_group.extract_all(
       doc2, use_concurrency=True
   )  # use concurrency to speed up extraction
   # Or use async variants .extract_all_async(...)

   # Get the extracted data
   print("Some extracted data from doc 1:")
   print("Contract Parties > Party names and roles:")
   print(
       doc1.get_aspect_by_name("Contract Parties")
       .get_concept_by_name("Party names and roles")
       .extracted_items
   )
   print("Attachments:")
   print(doc1.get_concept_by_name("Attachments").extracted_items)
   # ...

   print("\nSome extracted data from doc 2:")
   print("Term > Contract term:")
   print(
       doc2.get_aspect_by_name("Term")
       .get_concept_by_name("Contract term")
       .extracted_items[0]
       .value
   )
   print("Duration adequacy:")
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].value)
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].justification)
   # ...

   # Output processing costs (requires setting the pricing details for each LLM)
   print("\nProcessing costs:")
   print(llm_group.get_cost())

-[ LangChain ]-

LangChain provides a powerful and flexible framework for building LLM applications with excellent composability and a rich ecosystem of integrations. While it offers great versatility for many use cases, it does require additional manual setup and configuration for complex extraction workflows.

**Development overhead:**

* 📝 **Manual prompt engineering**: Must craft detailed prompts for each extraction step
* 🔧 **Manual model definition**: Need to define Pydantic models and output parsers for structured data
* 🧩 **Complex chain configuration**: Requires manual setup of chains and their connections involving multiple LLMs
* 🔍 **Manual reference mapping**: Must implement custom logic to track source references
* 🔄 **Complex concurrency setup**: Implementing concurrent processing requires additional setup with asyncio
* 💰 **Cost tracking setup**: Requires custom logic for cost tracking for each LLM
* 💾 **No unified storage model**: Need to write additional code to collect and organize results

Extraction pipeline example (LangChain)

   # LangChain implementation of analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency, and cost tracking
   # Jupyter notebook compatible version

   import asyncio
   import os
   import time
   from dataclasses import dataclass, field
   from textwrap import dedent

   import nest_asyncio

   nest_asyncio.apply()

   from langchain.callbacks import get_openai_callback
   from langchain.output_parsers import PydanticOutputParser
   from langchain.prompts import PromptTemplate
   from langchain_core.runnables import (
       RunnableLambda,
       RunnableParallel,
       RunnablePassthrough,
   )
   from langchain_openai import ChatOpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class PartyInfo(BaseModel):
       """Information about contract parties"""

       name: str = Field(description="Name of the party")
       role: str = Field(description="Role of the party (e.g., Client, Provider)")


   class Term(BaseModel):
       """Contract term information"""

       duration_years: int = Field(description="Duration in years")
       reference: str = Field(
           description="Reference text from document"
       )  # LLM reciting a reference is error-prone and unreliable


   class Attachment(BaseModel):
       """Contract attachment information"""

       title: str = Field(description="Title of the attachment")
       description: str | None = Field(description="Brief description of the attachment")


   class ContractRating(BaseModel):
       """Rating with justification"""

       score: int = Field(description="Rating score (1-10)")
       justification: str = Field(description="Justification for the rating")


   class ContractInfo(BaseModel):
       """Complete contract information"""

       contract_type: str = Field(description="Type of contract")
       effective_date: str | None = Field(description="Effective date of the contract")
       governing_law: str | None = Field(description="Governing law of the contract")


   class AspectExtraction(BaseModel):
       """Result of aspect extraction"""

       aspect_text: str = Field(
           description="Extracted text for this aspect"
       )  # this does not provide granular structured content, such as specific paragraphs and sentences


   class PartyExtraction(BaseModel):
       """Party extraction results"""

       parties: list[PartyInfo] = Field(description="List of parties in the contract")


   class TermExtraction(BaseModel):
       """Term extraction results"""

       terms: list[Term] = Field(description="Contract term details")


   class AttachmentExtraction(BaseModel):
       """Attachment extraction results"""

       attachments: list[Attachment] = Field(description="List of contract attachments")


   class DurationRatingExtraction(BaseModel):
       """Duration adequacy rating"""

       rating: ContractRating = Field(description="Rating of contract duration adequacy")


   # Configuration models must be manually defined
   @dataclass
   class ExtractorConfig:
       """Configuration for a specific extractor"""

       name: str
       description: str
       model_name: str = "gpt-4o-mini"  # Default model


   @dataclass
   class PipelineConfig:
       """Complete pipeline configuration"""

       # Aspect extractors
       party_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Parties",
               description="Clauses defining the parties to the agreement",
           )
       )

       term_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Term", description="Clauses defining the term of the agreement"
           )
       )

       # Document-level extractors
       contract_info_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Information",
               description="Basic contract information including type, date, and governing law",
           )
       )

       attachment_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Attachments",
               description="Contract attachments and their descriptions",
           )
       )

       duration_rating_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Duration Rating",
               description="Rating of contract duration adequacy",
               model_name="o3-mini",  # Using a more capable model for judgment
           )
       )


   # LLM configuration
   def get_llm(model_name="gpt-4o-mini", api_key=None):
       """Get a ChatOpenAI instance with the specified configuration"""
       # Skipped temperature etc. for brevity, as e.g. temperature is not supported by o3-mini
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
       return ChatOpenAI(model=model_name, openai_api_key=api_key)


   # Chain components must be manually defined
   def create_aspect_extractor(aspect_name, aspect_description, model_name="gpt-4o-mini"):
       """Create a chain to extract text related to a specific aspect"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=AspectExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
               You are an expert document analyzer.
               Extract the text related to the following aspect from the document.

               Document:
               {document_text}

               Aspect: {aspect_name}
               Description: {aspect_description}

               Extract all text related to this aspect.

               {format_instructions}
               """
           ),
           input_variables=["document_text", "aspect_name", "aspect_description"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )  # this does not provide granular structured content, such as specific paragraphs and sentences

       chain = prompt | llm | parser

       # Return a callable that works with both sync and async code
       def extractor(doc):
           return chain.invoke(
               {
                   "document_text": doc,
                   "aspect_name": aspect_name,
                   "aspect_description": aspect_description,
               }
           )

       # Add an async version that will be used when awaited
       async def async_extractor(doc):
           return await chain.ainvoke(
               {
                   "document_text": doc,
                   "aspect_name": aspect_name,
                   "aspect_description": aspect_description,
               }
           )

       extractor.ainvoke = async_extractor
       return extractor


   def create_party_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract party information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=PartyExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
               You are an expert document analyzer. Extract all party information from the following contract text.

               Contract text:
               {aspect_text}

               For each party, extract their name and role in the agreement.
               {format_instructions}
               """
           ),
           input_variables=["aspect_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_term_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract term information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=TermExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
               You are an expert document analyzer. Extract term information from the following contract text.

               Contract text:
               {aspect_text}

               Extract the contract term duration in years. Include the relevant reference text.

               {format_instructions}
               """
           ),
           input_variables=["aspect_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_contract_info_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract basic contract information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=ContractInfo)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
               You are an expert document analyzer. Extract the following information from the contract document.

               Contract document:
               {document_text}

               Extract the contract type, effective date if mentioned, and governing law if specified.
               {format_instructions}
               """
           ),
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_attachment_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract attachment information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=AttachmentExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
               You are an expert document analyzer. Extract information about all attachments, annexes, schedules, or appendices mentioned in the contract.

               Contract document:
               {document_text}

               For each attachment, extract:
               1. The title/name of the attachment (e.g., "Appendix A", "Schedule 1", "Annex 2")
               2. A brief description of what the attachment contains (if mentioned in the document)

               Example format:
               {{"title": "Appendix A", "description": "Code of conduct"}}

               {format_instructions}
               """
           ),
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_duration_rating_extractor(model_name="o3-mini"):
       """Create a chain to rate contract duration adequacy"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=DurationRatingExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
               You are an expert contract analyst. Evaluate the adequacy of the contract duration considering the subject matter and best practices.

               Contract document:
               {document_text}

               Rate the duration adequacy on a scale of 1-10, where:
               1 = Extremely inadequate duration
               10 = Perfectly adequate duration

               Provide a brief justification for your rating (2-3 sentences).
               {format_instructions}
               """
           ),
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   # Complete pipeline definition
   def create_document_pipeline(config=PipelineConfig()):
       """Create a complete document analysis pipeline and return it along with its components"""

       # Create aspect extractors
       party_aspect_extractor = create_aspect_extractor(
           config.party_extractor.name,
           config.party_extractor.description,
           config.party_extractor.model_name,
       )

       term_aspect_extractor = create_aspect_extractor(
           config.term_extractor.name,
           config.term_extractor.description,
           config.term_extractor.model_name,
       )

       # Create concept extractors for aspects
       party_extractor = create_party_extractor(config.party_extractor.model_name)
       term_extractor = create_term_extractor(config.term_extractor.model_name)

       # Create document-level extractors
       contract_info_extractor = create_contract_info_extractor(
           config.contract_info_extractor.model_name
       )

       attachment_extractor = create_attachment_extractor(
           config.attachment_extractor.model_name
       )

       duration_rating_extractor = create_duration_rating_extractor(
           config.duration_rating_extractor.model_name
       )

       # Create aspect extraction pipeline
       party_pipeline = (
           RunnablePassthrough()
           | party_aspect_extractor
           | RunnableLambda(lambda x: {"aspect_text": x.aspect_text})
           | party_extractor
       )

       term_pipeline = (
           RunnablePassthrough()
           | term_aspect_extractor
           | RunnableLambda(lambda x: {"aspect_text": x.aspect_text})
           | term_extractor
       )

       # Create document-level extraction pipeline
       document_extraction = RunnableParallel(
           contract_info=contract_info_extractor,
           attachments=attachment_extractor,
           duration_rating=duration_rating_extractor,
       )

       # Combine into complete pipeline
       complete_pipeline = RunnableParallel(
           parties=party_pipeline, terms=term_pipeline, document_info=document_extraction
       )

       # Create a components dictionary for easy access
       components = {
           "party_pipeline": party_pipeline,
           "term_pipeline": term_pipeline,
           "contract_info_extractor": contract_info_extractor,
           "attachment_extractor": attachment_extractor,
           "duration_rating_extractor": duration_rating_extractor,
       }

       return complete_pipeline, components


   # Cost tracking
   class CostTracker:
       """Track LLM costs across multiple extractions"""

       def __init__(self):
           self.costs = {
               "gpt-4o-mini": {
                   "input_per_1m": 0.15,
                   "output_per_1m": 0.60,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
               "o3-mini": {
                   "input_per_1m": 1.10,
                   "output_per_1m": 4.40,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
           }
           self.total_cost = 0.0

       def track_usage(self, model_name, input_tokens, output_tokens):
           """Track token usage for a model"""
           # Extract base model name
           base_model = model_name.split("/")[-1] if "/" in model_name else model_name
           if base_model in self.costs:
               self.costs[base_model]["input_tokens"] += input_tokens
               self.costs[base_model]["output_tokens"] += output_tokens

               # Calculate costs separately for input and output tokens
               input_cost = input_tokens * (
                   self.costs[base_model]["input_per_1m"] / 1000000
               )
               output_cost = output_tokens * (
                   self.costs[base_model]["output_per_1m"] / 1000000
               )
               self.total_cost += input_cost + output_cost

       def get_costs(self):
           """Get cost summary"""
           model_costs = {}
           for model, data in self.costs.items():
               if data["input_tokens"] > 0 or data["output_tokens"] > 0:
                   input_cost = data["input_tokens"] * (data["input_per_1m"] / 1000000)
                   output_cost = data["output_tokens"] * (data["output_per_1m"] / 1000000)
                   model_costs[model] = {
                       "input_cost": input_cost,
                       "output_cost": output_cost,
                       "total_cost": input_cost + output_cost,
                       "input_tokens": data["input_tokens"],
                       "output_tokens": data["output_tokens"],
                   }

           return {
               "model_costs": model_costs,
               "total_cost": self.total_cost,
           }


   # Document processing functions
   async def process_document_async(
       document_text, pipeline_and_components, cost_tracker=None, use_concurrency=True
   ):
       """Process a document asynchronously and track costs"""
       pipeline, components = pipeline_and_components  # Unpack the pipeline and components
       results = {}

       # Track tokens used across all calls
       total_tokens = {
           "gpt-4o-mini": {"input": 0, "output": 0},
           "o3-mini": {"input": 0, "output": 0},
       }

       # Use the provided components
       async def process_parties():
           """Process parties using the party pipeline"""
           with get_openai_callback() as cb:
               party_results = await components["party_pipeline"].ainvoke(document_text)
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
               return party_results

       async def process_terms():
           """Process terms using the term pipeline"""
           with get_openai_callback() as cb:
               term_results = await components["term_pipeline"].ainvoke(document_text)
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
               return term_results

       async def process_contract_info():
           """Process contract info"""
           with get_openai_callback() as cb:
               info_results = await components["contract_info_extractor"].ainvoke(
                   document_text
               )
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
               return info_results

       async def process_attachments():
           """Process attachments"""
           with get_openai_callback() as cb:
               attachment_results = await components["attachment_extractor"].ainvoke(
                   document_text
               )
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
               return attachment_results

       async def process_duration_rating():
           """Process duration rating"""
           with get_openai_callback() as cb:
               duration_results = await components["duration_rating_extractor"].ainvoke(
                   document_text
               )
               # Duration rating is done with o3-mini
               total_tokens["o3-mini"]["input"] += cb.prompt_tokens
               total_tokens["o3-mini"]["output"] += cb.completion_tokens
               return duration_results

       # Run extractions based on concurrency preference
       if use_concurrency:
           # Process all extractions concurrently for maximum speed
           (
               parties,
               terms,
               contract_info,
               attachments,
               duration_rating,
           ) = await asyncio.gather(
               process_parties(),
               process_terms(),
               process_contract_info(),
               process_attachments(),
               process_duration_rating(),
           )
       else:
           # Process extractions sequentially
           parties = await process_parties()
           terms = await process_terms()
           contract_info = await process_contract_info()
           attachments = await process_attachments()
           duration_rating = await process_duration_rating()

       # Update cost tracker if provided
       if cost_tracker:
           for model, tokens in total_tokens.items():
               cost_tracker.track_usage(model, tokens["input"], tokens["output"])

       # Structure results in an easy-to-use format
       results["contract_type"] = contract_info.contract_type
       results["governing_law"] = contract_info.governing_law
       results["effective_date"] = contract_info.effective_date
       results["parties"] = parties.parties
       results["term_years"] = terms.terms[0].duration_years if terms.terms else None
       results["term_reference"] = terms.terms[0].reference if terms.terms else None
       results["attachments"] = attachments.attachments
       results["duration_rating"] = duration_rating.rating

       return results


   def process_document(
       document_text, pipeline_and_components, cost_tracker=None, use_concurrency=True
   ):
       """
       Process a document and track costs.
       This is a Jupyter-compatible version that uses the existing event loop
       instead of creating a new one with asyncio.run().
       """
       # Get the current event loop
       loop = asyncio.get_event_loop()

       # Run the async function in the current event loop
       return loop.run_until_complete(
           process_document_async(
               document_text, pipeline_and_components, cost_tracker, use_concurrency
           )
       )


   # Example usage
   # Sample contract texts (shortened for brevity)
   doc1_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "All intellectual property created during the provision of services shall belong to the Customer...\n"
       "This agreement is governed by the laws of Norway...\n"
       "Annex 1: Data processing agreement...\n"
       "Annex 2: Statement of Work...\n"
       "Annex 3: Service Level Agreement...\n"
   )

   doc2_text = (
       "Service Level Agreement\n"
       "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
       "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
       "The Provider shall deliver IT support services as outlined in Schedule A...\n"
       "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
       "The Provider guarantees [99.9%] uptime for all critical systems...\n"
       "Either party may terminate with 60 days written notice...\n"
       "This agreement is governed by the laws of California...\n"
       "Schedule A: Service Descriptions...\n"
       "Schedule B: Response Time Requirements...\n"
   )


   # Function to pretty-print document results
   def print_document_results(doc_name, results):
       print(f"\nResults from {doc_name}:")
       print(f"Contract Type: {results['contract_type']}")
       print(f"Parties: {[f'{p.name} ({p.role})' for p in results['parties']]}")
       print(f"Term: {results['term_years']} years")
       print(
           f"Term Reference: {results['term_reference'] if results['term_reference'] else 'Not specified'}"
       )
       print(f"Governing Law: {results['governing_law']}")
       print(f"Attachments: {[(a.title, a.description) for a in results['attachments']]}")
       print(f"Duration Rating: {results['duration_rating'].score}/10")
       print(f"Rating Justification: {results['duration_rating'].justification}")


   # Create cost tracker
   cost_tracker = CostTracker()

   # Create pipeline with default configuration - returns both pipeline and components
   pipeline, pipeline_components = create_document_pipeline()

   # Process documents
   print("Processing document 1 with concurrency...")
   start_time = time.time()
   doc1_results = process_document(
       doc1_text, (pipeline, pipeline_components), cost_tracker, use_concurrency=True
   )
   print(f"Processing time: {time.time() - start_time:.2f} seconds")

   print("Processing document 2 with concurrency...")
   start_time = time.time()
   doc2_results = process_document(
       doc2_text, (pipeline, pipeline_components), cost_tracker, use_concurrency=True
   )
   print(f"Processing time: {time.time() - start_time:.2f} seconds")

   # Print results
   print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
   print_document_results("Document 2 (Service Level Agreement)", doc2_results)

   # Print cost information
   print("\nProcessing costs:")
   costs = cost_tracker.get_costs()
   for model, model_data in costs["model_costs"].items():
       print(f"\n{model}:")
       print(f" Input cost: ${model_data['input_cost']:.4f}")
       print(f" Output cost: ${model_data['output_cost']:.4f}")
       print(f" Total cost: ${model_data['total_cost']:.4f}")
   print(f"\nTotal across all models: ${costs['total_cost']:.4f}")

-[ LlamaIndex ]-

LlamaIndex provides a robust data framework for LLM applications with excellent capabilities for knowledge retrieval and RAG. It offers powerful tools for working with documents and structured data, though implementing complex extraction workflows may require some additional configuration to fully leverage its capabilities.
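Both the LangChain implementation above and the LlamaIndex implementation below carry a hand-written `CostTracker`. The arithmetic at its core is small: tokens multiplied by the per-1M-token price, divided by one million. The following is a minimal standalone sketch of that calculation, using the prices that appear in the examples in this section. The token counts and the helper name `call_cost` are hypothetical, chosen only for illustration, and `Decimal` is used here (instead of the floats in the listings) to keep the worked numbers exact.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, copied from the examples in this section.
PRICING = {
    "gpt-4o-mini": {"input": Decimal("0.15"), "output": Decimal("0.60")},
    "o3-mini": {"input": Decimal("1.10"), "output": Decimal("4.40")},
}
ONE_MILLION = Decimal(1_000_000)


def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one call: tokens times the per-1M price, divided by one million."""
    prices = PRICING[model]
    weighted = input_tokens * prices["input"] + output_tokens * prices["output"]
    return weighted / ONE_MILLION


# Hypothetical token usage per model: (model, input tokens, output tokens).
usage = [
    ("gpt-4o-mini", 12_000, 1_500),
    ("o3-mini", 3_000, 800),
]

per_model = {model: call_cost(model, i, o) for model, i, o in usage}
total = sum(per_model.values(), Decimal(0))

for model, cost in per_model.items():
    print(f"{model}: ${cost}")
print(f"Total: ${total}")
```

With these assumed counts, the gpt-4o-mini calls cost (12,000 × 0.15 + 1,500 × 0.60) / 1,000,000 = 0.0027 USD and the o3-mini calls cost (3,000 × 1.10 + 800 × 4.40) / 1,000,000 = 0.00682 USD. The formula is simple, but attributing tokens to the right model across concurrent calls is the part that has to be wired up manually in each implementation.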
**Development overhead:**

* 📝 **Manual prompt engineering**: Must craft detailed prompts for each extraction task
* 🔧 **Manual model definition**: Need to define Pydantic models and output parsers for structured data
* 🧩 **Pipeline setup**: Requires manual configuration of extraction pipeline components involving multiple LLMs
* 🔍 **Limited reference tracking**: Basic source tracking, but requires additional work for fine-grained references
* 📊 **Embedding examples in prompts**: Examples must be manually incorporated into prompts
* 🔄 **Complex concurrency setup**: Implementing concurrent processing requires additional setup with asyncio
* 💰 **Cost tracking setup**: Requires custom logic for cost tracking for each LLM
* 💾 **No unified storage model**: Need to write additional code to collect and organize results

Extraction pipeline example (LlamaIndex)

   # LlamaIndex implementation of analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency, and cost tracking
   # Jupyter notebook compatible version

   import asyncio
   import os
   from textwrap import dedent

   import nest_asyncio


   nest_asyncio.apply()

   from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
   from llama_index.core.output_parsers import PydanticOutputParser
   from llama_index.core.program import LLMTextCompletionProgram
   from llama_index.llms.openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class PartyInfo(BaseModel):
       """Information about contract parties"""

       name: str = Field(description="Name of the party")
       role: str = Field(description="Role of the party (e.g., Client, Provider)")


   class Term(BaseModel):
       """Contract term information"""

       duration_years: int = Field(description="Duration in years")
       reference: str = Field(
           description="Reference text from document"
       )  # LLM reciting a reference is error-prone and unreliable


   class Attachment(BaseModel):
       """Contract attachment information"""

       title: str = Field(description="Title of the attachment")
       description: str | None = Field(description="Brief description of the attachment")


   class ContractRating(BaseModel):
       """Rating with justification"""

       score: int = Field(description="Rating score (1-10)")
       justification: str = Field(description="Justification for the rating")


   class ContractInfo(BaseModel):
       """Complete contract information"""

       contract_type: str = Field(description="Type of contract")
       effective_date: str | None = Field(description="Effective date of the contract")
       governing_law: str | None = Field(description="Governing law of the contract")


   class AspectExtraction(BaseModel):
       """Result of aspect extraction"""

       aspect_text: str = Field(
           description="Extracted text for this aspect"
       )  # this does not provide granular structured content, such as specific paragraphs and sentences


   class PartyExtraction(BaseModel):
       """Party extraction results"""

       parties: list[PartyInfo] = Field(description="List of parties in the contract")


   class TermExtraction(BaseModel):
       """Term extraction results"""

       terms: list[Term] = Field(description="Contract term details")


   class AttachmentExtraction(BaseModel):
       """Attachment extraction results"""

       attachments: list[Attachment] = Field(description="List of contract attachments")


   class DurationRatingExtraction(BaseModel):
       """Duration adequacy rating"""

       rating: ContractRating = Field(description="Rating of contract duration adequacy")


   # Cost tracking class
   class CostTracker:
       """Track LLM costs across multiple extractions"""

       def __init__(self):
           self.costs = {
               "gpt-4o-mini": {
                   "input_per_1m": 0.15,
                   "output_per_1m": 0.60,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
               "o3-mini": {
                   "input_per_1m": 1.10,
                   "output_per_1m": 4.40,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
           }
           self.total_cost = 0.0

       def track_usage(self, model_name, input_tokens, output_tokens):
           """Track token usage for a model"""
           # Extract base model name
           base_model = model_name.split("/")[-1] if "/" in model_name else model_name
           if base_model in self.costs:
               self.costs[base_model]["input_tokens"] += input_tokens
               self.costs[base_model]["output_tokens"] += output_tokens

               # Calculate costs separately for input and output tokens
               input_cost = input_tokens * (
                   self.costs[base_model]["input_per_1m"] / 1000000
               )
               output_cost = output_tokens * (
                   self.costs[base_model]["output_per_1m"] / 1000000
               )
               self.total_cost += input_cost + output_cost

       def get_costs(self):
           """Get cost summary"""
           model_costs = {}
           for model, data in self.costs.items():
               if data["input_tokens"] > 0 or data["output_tokens"] > 0:
                   input_cost = data["input_tokens"] * (data["input_per_1m"] / 1000000)
                   output_cost = data["output_tokens"] * (data["output_per_1m"] / 1000000)
                   model_costs[model] = {
                       "input_cost": input_cost,
                       "output_cost": output_cost,
                       "total_cost": input_cost + output_cost,
                       "input_tokens": data["input_tokens"],
                       "output_tokens": data["output_tokens"],
                   }

           return {
               "model_costs": model_costs,
               "total_cost": self.total_cost,
           }


   # Helper functions for extractors
   def get_llm(model_name="gpt-4o-mini", api_key=None, temperature=0, token_counter=None):
       """Get an OpenAI instance with the specified configuration"""
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")

       # Create callback manager with token counter if provided
       callback_manager = None
       if token_counter is not None:
           callback_manager = CallbackManager([token_counter])

       return OpenAI(
           model=model_name,
           api_key=api_key,
           temperature=temperature,
           callback_manager=callback_manager,
       )


   def create_aspect_extractor(
       aspect_name, aspect_description, model_name="gpt-4o-mini", token_counter=None
   ):
       """Create an extractor to extract text related to a specific aspect"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           f"""
           You are an expert document analyzer.
           Extract the text related to the following aspect from the document.

           Document:
           {{document_text}}

           Aspect: {aspect_name}
           Description: {aspect_description}

           Extract all text related to this aspect.
           """
       )  # this does not provide granular structured content, such as specific paragraphs and sentences

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=AspectExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )

       return program


   def create_party_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for party information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
           You are an expert document analyzer.
           Extract all party information from the following contract text.

           Contract text:
           {aspect_text}

           For each party, extract their name and role in the agreement.
           """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=PartyExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )

       return program


   def create_term_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for term information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
           You are an expert document analyzer.
           Extract term information from the following contract text.

           Contract text:
           {aspect_text}

           Extract the contract term duration in years. Include the relevant reference text.
           """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=TermExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )

       return program


   def create_contract_info_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for basic contract information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
           You are an expert document analyzer.
           Extract the following information from the contract document.

           Contract document:
           {document_text}

           Extract the contract type, effective date if mentioned, and governing law if specified.
           """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=ContractInfo),
           prompt_template_str=prompt_template,
           llm=llm,
       )

       return program


   def create_attachment_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for attachment information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
           You are an expert document analyzer.
           Extract information about all attachments, annexes, schedules, or appendices mentioned in the contract.

           Contract document:
           {document_text}

           For each attachment, extract:
           1. The title/name of the attachment (e.g., "Appendix A", "Schedule 1", "Annex 2")
           2. A brief description of what the attachment contains (if mentioned in the document)

           Example format:
           {"title": "Appendix A", "description": "Code of conduct"}
           """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=AttachmentExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )

       return program


   def create_duration_rating_extractor(model_name="o3-mini", token_counter=None):
       """Create an extractor to rate contract duration adequacy"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
           You are an expert contract analyst.
           Evaluate the adequacy of the contract duration considering the subject matter and best practices.

           Contract document:
           {document_text}

           Rate the duration adequacy on a scale of 1-10, where:
           1 = Extremely inadequate duration
           10 = Perfectly adequate duration

           Provide a brief justification for your rating (2-3 sentences).
           """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=DurationRatingExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )

       return program


   # Main document processing functions
   async def process_document_async(
       document_text, cost_tracker=None, use_concurrency=True
   ):
       """Process a document asynchronously and track costs"""
       results = {}

       # Create separate token counting handlers for each model
       gpt4o_token_counter = TokenCountingHandler()
       o3_token_counter = TokenCountingHandler()

       # Create extractors with appropriate token counters
       party_aspect_extractor = create_aspect_extractor(
           "Contract Parties",
           "Clauses defining the parties to the agreement",
           token_counter=gpt4o_token_counter,
       )
       term_aspect_extractor = create_aspect_extractor(
           "Term",
           "Clauses defining the term of the agreement",
           token_counter=gpt4o_token_counter,
       )
       party_extractor = create_party_extractor(token_counter=gpt4o_token_counter)
       term_extractor = create_term_extractor(token_counter=gpt4o_token_counter)
       contract_info_extractor = create_contract_info_extractor(
           token_counter=gpt4o_token_counter
       )
       attachment_extractor = create_attachment_extractor(
           token_counter=gpt4o_token_counter
       )

       # Use separate token counter for o3-mini
       duration_rating_extractor = create_duration_rating_extractor(
           model_name="o3-mini", token_counter=o3_token_counter
       )

       # Define processing functions using native async methods
       async def process_party_aspect():
           response = await party_aspect_extractor.acall(document_text=document_text)
           return response

       async def process_term_aspect():
           response = await term_aspect_extractor.acall(document_text=document_text)
           return response

       # Get aspect texts
       if use_concurrency:
           party_aspect, term_aspect = await asyncio.gather(
               process_party_aspect(), process_term_aspect()
           )
       else:
           party_aspect = await process_party_aspect()
           term_aspect = await process_term_aspect()

       async def process_parties():
           party_results = await party_extractor.acall(
               aspect_text=party_aspect.aspect_text
           )
           return party_results

       async def process_terms():
           term_results = await term_extractor.acall(aspect_text=term_aspect.aspect_text)
           return term_results

       async def process_contract_info():
           contract_info = await contract_info_extractor.acall(document_text=document_text)
           return contract_info

       async def process_attachments():
           attachments = await attachment_extractor.acall(document_text=document_text)
           return attachments

       async def process_duration_rating():
           duration_rating = await duration_rating_extractor.acall(
               document_text=document_text
           )
           return duration_rating

       # Run extractions based on concurrency preference
       if use_concurrency:
           (
               parties,
               terms,
               contract_info,
               attachments,
               duration_rating,
           ) = await asyncio.gather(
               process_parties(),
               process_terms(),
               process_contract_info(),
               process_attachments(),
               process_duration_rating(),
           )
       else:
           parties = await process_parties()
           terms = await process_terms()
           contract_info = await process_contract_info()
           attachments = await process_attachments()
           duration_rating = await process_duration_rating()

       # Get token usage from the token counter and update cost tracker
       if cost_tracker:
           cost_tracker.track_usage(
               "gpt-4o-mini",
               gpt4o_token_counter.prompt_llm_token_count,
               gpt4o_token_counter.completion_llm_token_count,
           )
           cost_tracker.track_usage(
               "o3-mini",
               o3_token_counter.prompt_llm_token_count,
               o3_token_counter.completion_llm_token_count,
           )

       # Structure results in an easy-to-use format
       results["contract_type"] = contract_info.contract_type
       results["governing_law"] = contract_info.governing_law
       results["effective_date"] = contract_info.effective_date
       results["parties"] = parties.parties
       results["term_years"] = terms.terms[0].duration_years if terms.terms else None
       results["term_reference"] = terms.terms[0].reference if terms.terms else None
       results["attachments"] = attachments.attachments
       results["duration_rating"] = duration_rating.rating

       return results


   def process_document(document_text, cost_tracker=None, use_concurrency=True):
       """
       Process a document and track costs.
       This is a Jupyter-compatible version that uses the existing event loop
       instead of creating a new one with asyncio.run().
       """
       loop = asyncio.get_event_loop()
       return loop.run_until_complete(
           process_document_async(document_text, cost_tracker, use_concurrency)
       )


   # Function to pretty-print document results
   def print_document_results(doc_name, results):
       print(f"\nResults from {doc_name}:")
       print(f"Contract Type: {results['contract_type']}")
       print(f"Parties: {[f'{p.name} ({p.role})' for p in results['parties']]}")
       print(f"Term: {results['term_years']} years")
       print(
           f"Term Reference: {results['term_reference'] if results['term_reference'] else 'Not specified'}"
       )
       print(f"Governing Law: {results['governing_law']}")
       print(f"Attachments: {[(a.title, a.description) for a in results['attachments']]}")
       print(f"Duration Rating: {results['duration_rating'].score}/10")
       print(f"Rating Justification: {results['duration_rating'].justification}")


   # Example usage
   # Sample contract texts (shortened for brevity)
   doc1_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "All intellectual property created during the provision of services shall belong to the Customer...\n"
       "This agreement is governed by the laws of Norway...\n"
       "Annex 1: Data processing agreement...\n"
       "Annex 2: Statement of Work...\n"
       "Annex 3: Service Level Agreement...\n"
   )

   doc2_text = (
       "Service Level Agreement\n"
       "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
       "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
       "The Provider shall deliver IT support services as outlined in Schedule A...\n"
       "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
       "The Provider guarantees [99.9%] uptime for all critical systems...\n"
       "Either party may terminate with 60 days written notice...\n"
       "This agreement is governed by the laws of California...\n"
       "Schedule A: Service Descriptions...\n"
       "Schedule B: Response Time Requirements...\n"
   )

   # Create cost tracker
   cost_tracker = CostTracker()

   # Process documents
   print("Processing document 1 with concurrency...")
   doc1_results = process_document(doc1_text, cost_tracker, use_concurrency=True)

   print("Processing document 2 with concurrency...")
   doc2_results = process_document(doc2_text, cost_tracker, use_concurrency=True)

   # Print results
   print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
   print_document_results("Document 2 (Service Level Agreement)", doc2_results)

   # Print cost information
   print("\nProcessing costs:")
   costs = cost_tracker.get_costs()
   for model, model_data in costs["model_costs"].items():
       print(f"\n{model}:")
       print(f" Input cost: ${model_data['input_cost']:.4f}")
       print(f" Output cost: ${model_data['output_cost']:.4f}")
       print(f" Total cost: ${model_data['total_cost']:.4f}")
   print(f"\nTotal across all models: ${costs['total_cost']:.4f}")

-[ Instructor ]-

Instructor is a powerful library focused on structured outputs from LLMs with strong typing support through Pydantic. It excels at extracting structured data with validation, but requires additional work to build complex extraction pipelines.

**Development overhead:**

* 📝 **Manual prompt engineering**: Must craft comprehensive prompts that guide the LLM effectively
* 🔧 **Manual model definition**: Developers must define Pydantic validation models for structured output
* 🧩 **Manual pipeline assembly**: Requires custom code to connect extraction components involving multiple LLMs
* 🔍 **Manual reference mapping**: Must implement custom logic to track source references
* 📊 **Embedding examples in prompts**: Examples must be manually incorporated into prompts
* 🔄 **Complex concurrency setup**: Implementing concurrent processing requires additional setup with asyncio
* 💰 **Cost tracking setup**: Requires custom logic for cost tracking for each LLM

Extraction pipeline example (Instructor)

   # Instructor implementation of analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency, and cost tracking
   # Jupyter notebook compatible version

   import asyncio
   import os
   from dataclasses import dataclass, field
   from textwrap import dedent

   import instructor
   import nest_asyncio
   from openai import AsyncOpenAI, OpenAI
   from pydantic import BaseModel, Field


   nest_asyncio.apply()


   # Pydantic models must be manually defined
   class PartyInfo(BaseModel):
       """Information about contract parties"""

       name: str = Field(description="Name of the party")
       role: str = Field(description="Role of the party (e.g., Client, Provider)")


   class Term(BaseModel):
       """Contract term information"""

       duration_years: int = Field(description="Duration in years")
       reference: str = Field(
           description="Reference text from document"
       )  # LLM reciting a reference is error-prone and unreliable


   class Attachment(BaseModel):
       """Contract attachment information"""

       title: str = Field(description="Title of the attachment")
       description: str | None = Field(description="Brief description of the attachment")


   class ContractRating(BaseModel):
       """Rating with justification"""

       score: int = Field(description="Rating score (1-10)")
       justification: str = Field(description="Justification for the rating")


   class ContractInfo(BaseModel):
       """Complete contract information"""

       contract_type: str = Field(description="Type of contract")
       effective_date: str | None = Field(description="Effective date of the contract")
       governing_law: str | None = Field(description="Governing law of the contract")


   class AspectExtraction(BaseModel):
       """Result of aspect extraction"""

       aspect_text: str = Field(
           description="Extracted text for this aspect"
       )  # this does not provide granular structured content, such as specific paragraphs and sentences


   class PartyExtraction(BaseModel):
       """Party extraction results"""

       parties: list[PartyInfo] = Field(description="List of parties in the contract")


   class TermExtraction(BaseModel):
       """Term extraction results"""

       terms: list[Term] = Field(description="Contract term details")


   class AttachmentExtraction(BaseModel):
       """Attachment extraction results"""

       attachments: list[Attachment] = Field(description="List of contract attachments")


   class DurationRatingExtraction(BaseModel):
       """Duration adequacy rating"""

       rating: ContractRating = Field(description="Rating of contract duration adequacy")


   # Configuration models must be manually defined
   @dataclass
   class ExtractorConfig:
       """Configuration for a specific extractor"""

       name: str
       description: str
       model_name: str = "gpt-4o-mini"  # Default model


   @dataclass
   class PipelineConfig:
       """Complete pipeline configuration"""

       # Aspect extractors
       party_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Parties",
               description="Clauses defining the parties to the agreement",
           )
       )

       term_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Term", description="Clauses defining the term of the agreement"
           )
       )

       # Document-level extractors
       contract_info_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Information",
               description="Basic contract information including type, date, and governing law",
           )
       )

       attachment_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Attachments",
               description="Contract attachments and their descriptions",
           )
       )

       duration_rating_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Duration Rating",
               description="Rating of contract duration adequacy",
               model_name="o3-mini",  # Using a more capable model for judgment
           )
       )


   # LLM client setup
   def get_client(api_key=None):
       """Get an OpenAI client with instructor integrated"""
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
       client = OpenAI(api_key=api_key)
       return instructor.from_openai(client)


   async def get_async_client(api_key=None):
       """Get an AsyncOpenAI client with instructor integrated"""
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
       client = AsyncOpenAI(api_key=api_key)
       return instructor.from_openai(client)


   # Helper function to execute completions with token tracking
   async def execute_with_tracking(model, messages, response_model, cost_tracker=None):
       """
       Execute a completion request with token tracking.
       """
       # Create the Instructor client
       client = await get_async_client()

       # Make a single API call with Instructor
       response = await client.chat.completions.create(
           model=model, response_model=response_model, messages=messages
       )

       # Access the raw response to get token usage
       if cost_tracker and hasattr(response, "_raw_response"):
           raw_response = response._raw_response
           if hasattr(raw_response, "usage"):
               prompt_tokens = raw_response.usage.prompt_tokens
               completion_tokens = raw_response.usage.completion_tokens
               cost_tracker.track_usage(model, prompt_tokens, completion_tokens)

       return response


   def execute_sync(model, messages, response_model):
       """Execute a completion request synchronously"""
       client = get_client()
       return client.chat.completions.create(
           model=model, response_model=response_model, messages=messages
       )


   # Unified extraction functions
   def extract_aspect(
       document_text,
       aspect_name,
       aspect_description,
       model_name="gpt-4o-mini",
       is_async=False,
       cost_tracker=None,
   ):
       """Extract text related to a specific aspect"""
       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert document analyzer.
           Extract the text related to the following aspect from the document.

           Document:
           {document_text}

           Aspect: {aspect_name}
           Description: {aspect_description}

           Extract all text related to this aspect.
           """
       )  # this does not provide granular structured content, such as specific paragraphs and sentences

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, AspectExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, AspectExtraction)


   def extract_parties(
       aspect_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract party information"""
       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert document analyzer.
           Extract all party information from the following contract text.

           Contract text:
           {aspect_text}

           For each party, extract their name and role in the agreement.
           """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, PartyExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, PartyExtraction)


   def extract_terms(
       aspect_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract term information"""
       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert document analyzer.
           Extract term information from the following contract text.

           Contract text:
           {aspect_text}

           Extract the contract term duration in years. Include the relevant reference text.
           """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(model_name, messages, TermExtraction, cost_tracker)
       else:
           return execute_sync(model_name, messages, TermExtraction)


   def extract_contract_info(
       document_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract basic contract information"""
       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert document analyzer.
           Extract the following information from the contract document.

           Contract document:
           {document_text}

           Extract the contract type, effective date if mentioned, and governing law if specified.
           """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(model_name, messages, ContractInfo, cost_tracker)
       else:
           return execute_sync(model_name, messages, ContractInfo)


   def extract_attachments(
       document_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract attachment information"""
       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert document analyzer.
           Extract information about all attachments, annexes, schedules, or appendices mentioned in the contract.

           Contract document:
           {document_text}

           For each attachment, extract:
           1. The title/name of the attachment (e.g., "Appendix A", "Schedule 1", "Annex 2")
           2. A brief description of what the attachment contains (if mentioned in the document)
           """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, AttachmentExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, AttachmentExtraction)


   def extract_duration_rating(
       document_text, model_name="o3-mini", is_async=False, cost_tracker=None
   ):
       """Rate contract duration adequacy"""
       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
           You are an expert contract analyst.
           Evaluate the adequacy of the contract duration considering the subject matter and best practices.

           Contract document:
           {document_text}

           Rate the duration adequacy on a scale of 1-10, where:
           1 = Extremely inadequate duration
           10 = Perfectly adequate duration

           Provide a brief justification for your rating (2-3 sentences).
           """
       )

       messages = [
           {"role": "system", "content": "You are an expert contract analyst."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, DurationRatingExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, DurationRatingExtraction)


   # Cost tracking
   class CostTracker:
       """Track LLM costs across multiple extractions"""

       def __init__(self):
           self.costs = {
               "gpt-4o-mini": {
                   "input_per_1m": 0.15,
                   "output_per_1m": 0.60,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
               "o3-mini": {
                   "input_per_1m": 1.10,
                   "output_per_1m": 4.40,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
           }
           self.total_cost = 0.0

       def track_usage(self, model_name, input_tokens, output_tokens):
           """Track token usage for a model"""
           # Extract base model name
           base_model = model_name.split("/")[-1] if "/" in model_name else model_name
           if base_model in self.costs:
               self.costs[base_model]["input_tokens"] += input_tokens
               self.costs[base_model]["output_tokens"] += output_tokens

               # Calculate costs separately for input and output tokens
               input_cost = input_tokens * (
                   self.costs[base_model]["input_per_1m"] / 1000000
               )
               output_cost = output_tokens * (
                   self.costs[base_model]["output_per_1m"] / 1000000
               )
               self.total_cost += input_cost + output_cost

       def get_costs(self):
           """Get cost summary"""
           model_costs = {}
           for model, data in self.costs.items():
               if data["input_tokens"] > 0 or data["output_tokens"] > 0:
                   input_cost = data["input_tokens"] * (data["input_per_1m"] / 1000000)
                   output_cost = data["output_tokens"] * (data["output_per_1m"] / 1000000)
                   model_costs[model] = {
                       "input_cost": input_cost,
                       "output_cost": output_cost,
                       "total_cost": input_cost + output_cost,
                       "input_tokens": data["input_tokens"],
                       "output_tokens": data["output_tokens"],
                   }

           return {
               "model_costs": model_costs,
               "total_cost": self.total_cost,
           }


   # Document processing functions
   async def process_document_async(
       document_text, config=None, cost_tracker=None, use_concurrency=True
   ):
       """Process a document
       asynchronously and track costs"""
       if config is None:
           config = PipelineConfig()

       results = {}

       # Define processing functions
       async def process_party_pipeline():
           # Extract party aspect
           party_aspect = await extract_aspect(
               document_text,
               config.party_extractor.name,
               config.party_extractor.description,
               model_name=config.party_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

           # Extract parties from the aspect
           parties = await extract_parties(
               party_aspect.aspect_text,
               model_name=config.party_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )
           return parties

       async def process_term_pipeline():
           # Extract term aspect
           term_aspect = await extract_aspect(
               document_text,
               config.term_extractor.name,
               config.term_extractor.description,
               model_name=config.term_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

           # Extract terms from the aspect
           terms = await extract_terms(
               term_aspect.aspect_text,
               model_name=config.term_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )
           return terms

       async def process_contract_info():
           return await extract_contract_info(
               document_text,
               model_name=config.contract_info_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

       async def process_attachments():
           return await extract_attachments(
               document_text,
               model_name=config.attachment_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

       async def process_duration_rating():
           return await extract_duration_rating(
               document_text,
               model_name=config.duration_rating_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

       # Run extractions based on concurrency preference
       if use_concurrency:
           # Process all extractions concurrently for maximum speed
           (
               parties,
               terms,
               contract_info,
               attachments,
               duration_rating,
           ) = await asyncio.gather(
               process_party_pipeline(),
               process_term_pipeline(),
               process_contract_info(),
               process_attachments(),
               process_duration_rating(),
           )
       else:
           # Process extractions sequentially
           parties = await process_party_pipeline()
           terms = await process_term_pipeline()
           contract_info = await process_contract_info()
           attachments = await process_attachments()
           duration_rating = await process_duration_rating()

       # Structure results in the same format as the LangChain implementation
       results["contract_type"] = contract_info.contract_type
       results["governing_law"] = contract_info.governing_law
       results["effective_date"] = contract_info.effective_date
       results["parties"] = parties.parties
       results["term_years"] = terms.terms[0].duration_years if terms.terms else None
       results["term_reference"] = terms.terms[0].reference if terms.terms else None
       results["attachments"] = attachments.attachments
       results["duration_rating"] = duration_rating.rating

       return results


   def process_document(
       document_text, config=None, cost_tracker=None, use_concurrency=True
   ):
       """
       Process a document and track costs.
       """
       # Get the current event loop
       loop = asyncio.get_event_loop()

       # Run the async function in the current event loop
       return loop.run_until_complete(
           process_document_async(document_text, config, cost_tracker, use_concurrency)
       )


   # Example usage
   # Sample contract texts (shortened for brevity)
   doc1_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "All intellectual property created during the provision of services shall belong to the Customer...\n"
       "This agreement is governed by the laws of Norway...\n"
       "Annex 1: Data processing agreement...\n"
       "Annex 2: Statement of Work...\n"
       "Annex 3: Service Level Agreement...\n"
   )

   doc2_text = (
       "Service Level Agreement\n"
       "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
       "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
       "The Provider shall deliver IT support services as outlined in Schedule A...\n"
       "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
       "The Provider guarantees [99.9%] uptime for all critical systems...\n"
       "Either party may terminate with 60 days written notice...\n"
       "This agreement is governed by the laws of California...\n"
       "Schedule A: Service Descriptions...\n"
       "Schedule B: Response Time Requirements...\n"
   )


   # Function to pretty-print document results
   def print_document_results(doc_name, results):
       print(f"\nResults from {doc_name}:")
       print(f"Contract Type: {results['contract_type']}")
       print(f"Parties: {[f'{p.name} ({p.role})' for p in results['parties']]}")
       print(f"Term: {results['term_years']} years")
       print(
           f"Term Reference: {results['term_reference'] if results['term_reference'] else 'Not specified'}"
       )
       print(f"Governing Law: {results['governing_law']}")
       print(f"Attachments: {[(a.title, a.description) for a in results['attachments']]}")
       print(f"Duration Rating: {results['duration_rating'].score}/10")
       print(f"Rating Justification: {results['duration_rating'].justification}")


   # Create cost tracker
   cost_tracker = CostTracker()

   # Create pipeline with default configuration
   config = PipelineConfig()

   # Process documents
   print("Processing document 1 with concurrency...")
   doc1_results = process_document(doc1_text, config, cost_tracker, use_concurrency=True)

   print("Processing document 2 with concurrency...")
   doc2_results = process_document(doc2_text, config, cost_tracker, use_concurrency=True)

   # Print results
   print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
   print_document_results("Document 2 (Service Level Agreement)", doc2_results)

   # Print cost information
   print("\nProcessing costs:")
   costs = cost_tracker.get_costs()
   for model, model_data in costs["model_costs"].items():
       print(f"\n{model}:")
       print(f" Input cost: ${model_data['input_cost']:.4f}")
       print(f" Output cost: ${model_data['output_cost']:.4f}")
       print(f" Total cost: ${model_data['total_cost']:.4f}")

   print(f"\nTotal across all models: ${costs['total_cost']:.4f}")


# ==== how_it_works ====

How it works
************


📏 Leveraging LLM Context Windows
=================================

ContextGem leverages LLMs' long context windows to deliver superior extraction accuracy. Unlike RAG approaches, which often struggle with complex concepts and nuanced insights, ContextGem bets on three trends: continuously expanding context capacity, the evolving capabilities of modern LLMs, and steadily decreasing LLM costs. This approach enables direct information extraction from full documents, eliminating retrieval inconsistencies and capturing the complete context necessary for accurate understanding.


🧩 Core Components
==================

ContextGem's main elements are the Document, Aspect, and Concept models:


📄 **Document**
---------------

The "Document" model contains text and/or visual content representing a specific document. Documents can vary in type and purpose, including but not limited to:

* *Contracts*: legal agreements defining terms and obligations.
* *Invoices*: financial documents detailing transactions and payments.
* *Curricula Vitae (CVs)*: resumes outlining an individual's professional experience and qualifications.
* *General documents*: any other types of documents that may contain text or images.


🔍 **Aspect**
-------------

The "Aspect" model contains text representing a defined area or topic within a document (or another aspect) that requires focused attention. Each aspect reflects a specific subject or theme. For example:

* *Contract aspects*: payment terms, parties involved, or termination clauses.
* *Invoice aspects*: due dates, line-item breakdowns, or tax details.
* *CV aspects*: work experience, education, or skills.

Aspects may have sub-aspects, for more granular extraction with nested context.
This hierarchical structure allows for progressive refinement of focus areas, enabling precise extraction of information from complex documents while maintaining the contextual relationships between different levels of content.


💡 **Concept**
--------------

The "Concept" model contains a unit of information or an entity, derived from an aspect or the broader document context. Concepts represent a wide range of data points and insights, from simple entities (names, dates, monetary values) to complex evaluations, conclusions, and answers to specific questions. Concepts can be:

* *Factual extractions*: such as a penalty clause in a contract, a total amount due in an invoice, or a certification in a CV.
* *Analytical insights*: such as risk assessments, compliance evaluations, or pattern identifications.
* *Reasoned conclusions*: such as determining whether a document meets specific criteria or answers particular questions.
* *Interpretative judgments*: such as ratings, classifications, or qualitative assessments based on document content.

Concepts may be attached to an aspect or a document. The context for the concept extraction will be the aspect or document, respectively. This flexible attachment allows for both targeted extraction from specific document sections and broader analysis across the entire document content.

When attached to aspects, concepts benefit from the focused context, enabling more precise extraction of domain-specific information. When attached to documents, concepts can leverage the complete context to identify patterns, anomalies, or insights that span multiple sections.
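The attachment rule described above can be pictured with a small toy model. This is only an illustrative sketch in plain Python: the "ToyDocument", "ToyAspect", and "ToyConcept" classes and the "extraction_context" helper are hypothetical names invented for this example and are not ContextGem classes. It shows which text would serve as context depending on where a concept is attached.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConcept:
    name: str


@dataclass
class ToyAspect:
    name: str
    text: str  # the part of the document this aspect covers
    concepts: list[ToyConcept] = field(default_factory=list)


@dataclass
class ToyDocument:
    text: str
    aspects: list[ToyAspect] = field(default_factory=list)
    concepts: list[ToyConcept] = field(default_factory=list)


def extraction_context(doc: ToyDocument, concept_name: str) -> str:
    """Return the text that a concept would be extracted from."""
    # Document-level concept: the whole document is the context
    for concept in doc.concepts:
        if concept.name == concept_name:
            return doc.text
    # Aspect-level concept: only the aspect's text is the context
    for aspect in doc.aspects:
        for concept in aspect.concepts:
            if concept.name == concept_name:
                return aspect.text
    raise KeyError(concept_name)


doc = ToyDocument(
    text="Parties: A and B. Payment: 30 days after invoice. Law: Norway.",
    aspects=[
        ToyAspect(
            name="Payment terms",
            text="Payment: 30 days after invoice.",
            concepts=[ToyConcept("Payment deadline")],
        )
    ],
    concepts=[ToyConcept("Contract anomalies")],
)

print(extraction_context(doc, "Payment deadline"))    # focused aspect text
print(extraction_context(doc, "Contract anomalies"))  # full document text
```

In this toy setup, the aspect-level concept only ever sees the payment sentence, while the document-level concept sees everything, which mirrors the trade-off between focused and complete context.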
Multiple concept types are supported: "StringConcept", "BooleanConcept", "NumericalConcept", "DateConcept", "JsonObjectConcept", "RatingConcept", "LabelConcept"


Component Examples
^^^^^^^^^^^^^^^^^^

+-----------------+----------------------+----------------------+----------------------+-----------------------+
|                 | Document             | Aspect               | Sub-aspect           | Concept               |
|=================|======================|======================|======================|=======================|
| **Legal**       | *Software License    | Intellectual         | Patent               | Indemnification       |
|                 | Agreement*           | Property Rights      | Indemnification      | Coverage Scope        |
|                 |                      |                      |                      | ("JsonObjectConcept") |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **Financial**   | *Quarterly Earnings  | Revenue Analysis     | Regional Performance | Year-over-Year        |
|                 | Report*              |                      |                      | Growth Rate           |
|                 |                      |                      |                      | ("NumericalConcept")  |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **Healthcare**  | *Medical Research    | Methodology          | Patient Selection    | Inclusion/Exclusion   |
|                 | Paper*               |                      | Criteria             | Validity              |
|                 |                      |                      |                      | ("BooleanConcept")    |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **Technical**   | *System Architecture | Security Framework   | Authentication       | Implementation Risk   |
|                 | Document*            |                      | Protocols            | Rating                |
|                 |                      |                      |                      | ("RatingConcept")     |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **HR**          | *Employee Handbook*  | Leave Policy         | Parental Leave       | Eligibility Start     |
|                 |                      |                      | Benefits             | Date ("DateConcept")  |
+-----------------+----------------------+----------------------+----------------------+-----------------------+


🔄 Extraction Workflow
======================

ContextGem uses the following models to extract information from documents:


🤖 **DocumentLLM**
------------------

**A
single configurable LLM with a specific role to extract specific information from the document.**

The "role" of an LLM is an abstraction used to assign tasks of different complexity to different LLMs. For example, if an aspect/concept is assigned "llm_role="extractor_text"", this aspect/concept is extracted from the document using the LLM with "role="extractor_text"". This helps to channel different tasks to different LLMs, ensuring that each task is handled by the most appropriate model. Usually, domain expertise is required to determine the most appropriate role for a specific aspect/concept. For simple use cases, however, when working with text-only documents and a single LLM, you can skip the role assignment completely, in which case the role defaults to ""extractor_text"".

An LLM can have a configurable fallback LLM with the same role.

See "DocumentLLM" and 🏷️ LLM Roles for more details.


🤖🤖 **DocumentLLMGroup**
-------------------------

**A group of LLMs with unique roles to extract different information from the document.**

For more complex and granular extraction workflows, an LLM group can be used to extract different information from the same document using different LLMs with different roles. For example, a simpler LLM (e.g. gpt-4o-mini) can be used to extract specific aspects of the document, while a more powerful LLM (e.g. o3-mini) handles the extraction of complex concepts that require reasoning over the aspects' context.

Each LLM can have its own backend and configuration, and one fallback LLM with the same role.

See "DocumentLLMGroup" and 🏷️ LLM Roles for more details.
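The role-based routing described above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not ContextGem's implementation: "ToyLLM", "route", and the "available" flag are hypothetical names made up for this sketch, and the model names are used purely as labels. It shows how a task's role selects the LLM with the matching role, how a missing role falls back to "extractor_text", and how a fallback LLM with the same role takes over when the main one fails.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_ROLE = "extractor_text"


@dataclass
class ToyLLM:
    model: str
    role: str = DEFAULT_ROLE
    fallback: Optional["ToyLLM"] = None
    available: bool = True  # stand-in for "the API call succeeded"


def route(llms: list[ToyLLM], task_role: Optional[str] = None) -> str:
    """Pick the model that should handle a task with the given role."""
    role = task_role or DEFAULT_ROLE  # no role assigned -> default role
    for llm in llms:
        if llm.role != role:
            continue
        if llm.available:
            return llm.model
        # The main LLM failed: use its fallback, which must share the role
        if llm.fallback is not None and llm.fallback.role == role:
            return llm.fallback.model
        raise RuntimeError(f"No working LLM for role {role!r}")
    raise LookupError(f"No LLM with role {role!r} in the group")


group = [
    ToyLLM(
        "gpt-4.1-mini",
        role="extractor_text",
        fallback=ToyLLM("gpt-4o-mini", role="extractor_text"),
    ),
    ToyLLM(
        "o4-mini",
        role="reasoner_text",
        fallback=ToyLLM("o3-mini", role="reasoner_text"),
        available=False,  # simulate a failure of the main reasoner
    ),
]

print(route(group))                   # task without an explicit role
print(route(group, "reasoner_text"))  # main reasoner is down, fallback is used
```

The one-role-per-LLM constraint keeps the lookup unambiguous: for any task there is at most one main LLM and one fallback to consider.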
LLM Group Workflow Example
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-----------------+----------------------+----------------------+----------------------+
|                 | LLM 1                | LLM 2                | LLM 3                |
|                 | ("extractor_text")   | ("reasoner_text")    | ("extractor_vision") |
|=================|======================|======================|======================|
| *Model*         | gpt-4.1-mini         | o4-mini              | gpt-4.1-mini         |
+-----------------+----------------------+----------------------+----------------------+
| *Task*          | Extract payment      | Detect anomalies in  | Extract invoice      |
|                 | terms from a         | the payment terms    | amounts              |
|                 | contract             |                      |                      |
+-----------------+----------------------+----------------------+----------------------+
| *Fallback LLM*  | gpt-4o-mini          | o3-mini              | gpt-4o-mini          |
| (optional)      |                      |                      |                      |
+-----------------+----------------------+----------------------+----------------------+

[image: ContextGem - How it works infographics]


ℹ️ What ContextGem Doesn't Offer (Yet)
======================================

While ContextGem excels at structured data extraction from individual documents, it's important to understand its intentional design boundaries:


**Not a RAG framework**
-----------------------

ContextGem focuses on in-depth single-document analysis, leveraging long context windows of LLMs for maximum accuracy and precision. It does not offer RAG capabilities for cross-document querying or corpus-wide information retrieval. For these use cases, modern RAG frameworks such as LlamaIndex remain more appropriate.


**Not an agent framework**
--------------------------

ContextGem is not designed as an agent framework. It now supports tool calling in chat (function-calling) with optional parallel tool calls and JSON schema validation of tool arguments, which enables building lightweight, task-oriented agents using your own control loop together with "ChatSession".
For full agent orchestration (planning/critique, goal decomposition, long-term memory, schedulers, multi-agent coordination), we recommend frameworks specifically designed for this purpose. ContextGem integrates cleanly as a high-accuracy document extraction tool in larger agent systems thanks to its simple API and structured outputs.


# ==== installation ====

Installation
************


🔧 Prerequisites
================

Before installing ContextGem, ensure you have:

* Python 3.10-3.13
* pip (Python package installer)


📦 Installation Methods
=======================


From PyPI
---------

The simplest way to install ContextGem is via pip:

   pip install -U contextgem

Or using uv (faster alternative):

   uv add contextgem


Development Installation
------------------------

For development, clone the repository and use uv:

   git clone https://github.com/shcherbak-ai/contextgem.git
   cd contextgem

   # Install uv if you don't have it
   pip install uv

   # Install dependencies including development extras
   uv sync --all-groups


✅ Verifying Installation
=========================

To verify that ContextGem is installed correctly, run:

   python -c "import contextgem; print(contextgem.__version__)"


# ==== quickstart ====

Quickstart examples
*******************

This guide helps you get started with ContextGem by walking through basic extraction examples.

Below are complete, self-contained examples showing how to extract data from a document using ContextGem.


🔄 Extraction Process
=====================

ContextGem follows a simple extraction process:

1. Create a "Document" instance with your content

2. Define "Aspect" instances for sections of interest

3. Define concept instances ("StringConcept", "BooleanConcept", "NumericalConcept", "DateConcept", "JsonObjectConcept", "RatingConcept", "LabelConcept") for specific data points to extract, and attach them to "Aspect" (for aspect context) or "Document" (for document context).

4. Use "DocumentLLM" or "DocumentLLMGroup" to perform the extraction

5.
Access the extracted data in the document object 📋 Aspect Extraction from Document ================================== Tip: Aspect extraction is useful for identifying and extracting specific sections or topics from documents. Common use cases include: * Extracting specific clauses from legal contracts * Identifying specific sections from financial reports * Isolating relevant topics from research papers * Extracting product features from technical documentation # Quick Start Example - Extracting aspect from a document import os from contextgem import Aspect, Document, DocumentLLM # Example document instance # Document content is shortened for brevity doc = Document( raw_text=( "Consultancy Agreement\n" "This agreement between Company A (Supplier) and Company B (Customer)...\n" "The term of the agreement is 1 year from the Effective Date...\n" "The Supplier shall provide consultancy services as described in Annex 2...\n" "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n" "This agreement is governed by the laws of Norway...\n" ), ) # Define an aspect with optional concept(s), using natural language doc_aspect = Aspect( name="Governing law", description="Clauses defining the governing law of the agreement", reference_depth="sentences", ) # Add aspects to the document doc.add_aspects([doc_aspect]) # (add more aspects to the document, if needed) # Create an LLM for extraction llm = DocumentLLM( model="openai/gpt-4o-mini", # or any other LLM from e.g. Anthropic, etc. 
api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), # your API key ) # Extract information from the document extracted_aspects = llm.extract_aspects_from_document(doc) # or use async version llm.extract_aspects_from_document_async(doc) # Access extracted information print("Governing law aspect:") print( extracted_aspects[0].extracted_items ) # extracted aspect items with references to sentences # or doc.get_aspect_by_name("Governing law").extracted_items 🌳 Extracting Aspect with Sub-Aspects ===================================== Tip: Sub-aspect extraction helps organize complex topics into logical components. Common use cases include: * Breaking down termination clauses in employment contracts into company rights, employee rights, and severance terms * Dividing financial report sections into revenue streams, expenses, and forecasts * Organizing product specifications into technical details, compatibility, and maintenance requirements # Quick Start Example - Extracting an aspect with sub-aspects import os from contextgem import Aspect, Document, DocumentLLM # Sample document (content shortened for brevity) contract_text = """ EMPLOYMENT AGREEMENT ... 8. TERMINATION 8.1 Termination by the Company. The Company may terminate the Employee's employment for Cause at any time upon written notice. "Cause" shall mean: (i) Employee's material breach of this Agreement; (ii) Employee's conviction of a felony; or (iii) Employee's willful misconduct that causes material harm to the Company. 8.2 Termination by the Employee. The Employee may terminate employment for Good Reason upon 30 days' written notice to the Company. "Good Reason" shall mean a material reduction in Employee's base salary or a material diminution in Employee's duties. 8.3 Severance. If the Employee's employment is terminated by the Company without Cause or by the Employee for Good Reason, the Employee shall be entitled to receive severance pay equal to six (6) months of the Employee's base salary. ... 
""" doc = Document(raw_text=contract_text) # Define termination aspect with practical sub-aspects termination_aspect = Aspect( name="Termination", description="Provisions related to the termination of employment", aspects=[ # assign sub-aspects (optional) Aspect( name="Company Termination Rights", description="Conditions under which the company can terminate employment", ), Aspect( name="Employee Termination Rights", description="Conditions under which the employee can terminate employment", ), Aspect( name="Severance Terms", description="Compensation or benefits provided upon termination", ), ], ) # Add the aspect to the document. Sub-aspects are added with the parent aspect. doc.add_aspects([termination_aspect]) # (add more aspects to the document, if needed) # Create an LLM for extraction llm = DocumentLLM( model="openai/gpt-4o-mini", # or any other LLM from e.g. Anthropic, etc. api_key=os.environ.get( "CONTEXTGEM_OPENAI_API_KEY" ), # your API key of the LLM provider ) # Extract all information from the document doc = llm.extract_all(doc) # Get results with references in the document object print("\nTermination aspect:\n") termination_aspect = doc.get_aspect_by_name("Termination") for sub_aspect in termination_aspect.aspects: print(sub_aspect.name) for item in sub_aspect.extracted_items: print(item.value) print("\n") 🔍 Concept Extraction from Aspect ================================= Tip: Concept extraction from aspects helps identify specific data points within already extracted sections or topics. 
Common use cases include: * Extracting payment amounts from a contract's payment terms * Extracting liability cap from a contract's liability section * Isolating timelines from delivery terms * Extracting a list of features from a product description * Identifying programming languages from a CV's experience section # Quick Start Example - Extracting a concept from an aspect import os from contextgem import Aspect, Document, DocumentLLM, StringConcept, StringExample # Example document instance # Document content is shortened for brevity doc = Document( raw_text=( "Employment Agreement\n" "This agreement between TechCorp Inc. (Employer) and Jane Smith (Employee)...\n" "The employment shall commence on January 15, 2023 and continue until terminated...\n" "The Employee shall work as a Senior Software Engineer reporting to the CTO...\n" "The Employee shall receive an annual salary of $120,000 paid monthly...\n" "The Employee is entitled to 20 days of paid vacation per year...\n" "The Employee agrees to a notice period of 30 days for resignation...\n" "This agreement is governed by the laws of California...\n" ), ) # Define an aspect with a specific concept, using natural language doc_aspect = Aspect( name="Compensation", description="Clauses defining the compensation and benefits for the employee", reference_depth="sentences", ) # Define a concept within the aspect aspect_concept = StringConcept( name="Annual Salary", description="The annual base salary amount specified in the employment agreement", examples=[ # optional StringExample( content="$X per year", # guidance regarding format ) ], add_references=True, reference_depth="sentences", ) # Add the concept to the aspect doc_aspect.add_concepts([aspect_concept]) # (add more concepts to the aspect, if needed) # Add the aspect to the document doc.add_aspects([doc_aspect]) # (add more aspects to the document, if needed) # Create an LLM for extraction llm = DocumentLLM( model="openai/gpt-4o-mini", # or any other LLM from 
e.g. Anthropic, etc. api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), # your API key ) # Extract information from the document doc = llm.extract_all(doc) # or use async version llm.extract_all_async(doc) # Access extracted information in the document object print("Compensation aspect:") print( doc.get_aspect_by_name("Compensation").extracted_items ) # extracted aspect items with references to sentences print("Annual Salary concept:") print( doc.get_aspect_by_name("Compensation") .get_concept_by_name("Annual Salary") .extracted_items ) # extracted concept items with references to sentences 📝 Concept Extraction from Document (text) ========================================== Tip: Concept extraction from text documents locates specific information directly from text. Common use cases include: * Extracting anomalies from entire legal documents * Identifying financial figures across multiple report sections * Extracting citations and references from academic papers * Identifying product specifications from technical manuals * Extracting contact information from business documents # Quick Start Example - Extracting a concept from a document import os from contextgem import Document, DocumentLLM, JsonObjectConcept # Example document instance # Document content is shortened for brevity doc = Document( raw_text=( "Statement of Work\n" "Project: Cloud Migration Initiative\n" "Client: Acme Corporation\n" "Contractor: TechSolutions Inc.\n\n" "Project Timeline:\n" "Start Date: March 1, 2025\n" "End Date: August 31, 2025\n\n" "Deliverables:\n" "1. Infrastructure assessment report (Due: March 15, 2025)\n" "2. Migration strategy document (Due: April 10, 2025)\n" "3. Test environment setup (Due: May 20, 2025)\n" "4. Production migration (Due: July 15, 2025)\n" "5. Post-migration support (Due: August 31, 2025)\n\n" "Budget: $250,000\n" "Payment Schedule: 20% upfront, 30% at midpoint, 50% upon completion\n" ), ) # Define a document-level concept using e.g. 
JsonObjectConcept # This will extract structured data from the entire document doc_concept = JsonObjectConcept( name="Project Details", description="Key project information including timeline, deliverables, and budget", structure={ "project_name": str, "client": str, "contractor": str, "budget": str, "payment_terms": str, }, # simply use a dictionary with type hints (including generic aliases and union types) add_references=True, reference_depth="paragraphs", ) # Add the concept to the document doc.add_concepts([doc_concept]) # (add more concepts to the document, if needed) # Create an LLM for extraction llm = DocumentLLM( model="openai/gpt-4o-mini", # or any other LLM from e.g. Anthropic, etc. api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), # your API key ) # Extract information from the document extracted_concepts = llm.extract_concepts_from_document(doc) # or use async version llm.extract_concepts_from_document_async(doc) # Access extracted information print("Project Details:") print( extracted_concepts[0].extracted_items ) # extracted concept items with references to paragraphs # Or doc.get_concept_by_name("Project Details").extracted_items 🖼️ Concept Extraction from Document (vision) ============================================ Tip: Concept extraction using vision capabilities processes documents with complex layouts or images. 
Common use cases include: * Extracting data from scanned contracts or receipts * Identifying information from charts and graphs in reports * Identifying visual product features from marketing materials # Quick Start Example - Extracting concept from a document with an image import os from pathlib import Path from contextgem import Document, DocumentLLM, NumericalConcept, create_image # Path adapted for testing current_file = Path(__file__).resolve() root_path = current_file.parents[4] image_path = root_path / "tests" / "images" / "invoices" / "invoice.jpg" # Create an image instance using the create_image utility doc_image = create_image(image_path) # Example document instance holding only the image doc = Document( images=[doc_image], # may contain multiple images ) # Define a concept to extract the invoice total amount doc_concept = NumericalConcept( name="Invoice Total", description="The total amount to be paid as shown on the invoice", numeric_type="float", llm_role="extractor_vision", # use vision model ) # Add concept to the document doc.add_concepts([doc_concept]) # (add more concepts to the document, if needed) # Create an LLM for extraction llm = DocumentLLM( model="openai/gpt-4o-mini", # Using a model with vision capabilities api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), # your API key role="extractor_vision", # mark LLM as vision model ) # Extract information from the document extracted_concepts = llm.extract_concepts_from_document(doc) # or use async version: await llm.extract_concepts_from_document_async(doc) # Access extracted information print("Invoice Total:") print(extracted_concepts[0].extracted_items) # extracted concept items # or doc.get_concept_by_name("Invoice Total").extracted_items 💬 Lightweight LLM Chat Interface ================================= Note: While ContextGem is primarily designed for advanced structured data extraction, it also provides a lightweight, unified interface for interacting with LLMs via natural language - 
across both text and vision - with built-in fallback support. Tip: To preserve message history across turns, pass a "ChatSession" instance via "chat_session=..." to "DocumentLLM.chat(...)" (or ".chat_async(...)"). Without a session, each "chat(...)" call is treated as a one-off message → response interaction. # Using LLMs for chat (text + vision), with fallback LLM support import os from contextgem import DocumentLLM from contextgem.public import ChatSession # Initialize main LLM for chat main_model = DocumentLLM( model="openai/gpt-4o", # or another provider/model api_key=os.getenv("CONTEXTGEM_OPENAI_API_KEY"), # your API key for the LLM provider system_message="", # disable default system message for chat, or provide your own ) # Optional: configure fallback LLM for reliability fallback_model = DocumentLLM( model="openai/gpt-4o-mini", # or another provider/model api_key=os.getenv("CONTEXTGEM_OPENAI_API_KEY"), # your API key for the LLM provider is_fallback=True, system_message="", # also disable default system message for fallback, or provide your own ) main_model.fallback_llm = fallback_model # Preserve conversation history across turns with a ChatSession session = ChatSession() first_response = main_model.chat( "Hi there!", # images=[Image(...)] # optional: add images for vision models chat_session=session, ) second_response = main_model.chat( "And what is EBITDA?", chat_session=session, ) # or use async: `response = await main_model.chat_async(...)` # Or send a chat message without a session (one-off message → response) one_off_response = main_model.chat("Test") 🛠️ Chat with Tools ------------------ Tip: Provide OpenAI-compatible tool schemas via "tools=[...]" and register Python handlers with "@register_tool". Tool support is only used in "chat(...)" and "chat_async(...)". Note: Tool handlers must return a string. If you need to return structured data, serialize it (e.g., with "json.dumps") before returning. 
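As a minimal sketch of the note above, a handler that needs to return structured data can serialize it with "json.dumps" and return the resulting string. The "lookup_invoice" function and its hard-coded record are hypothetical and exist only for this illustration; in ContextGem, such a handler would additionally be decorated with "@register_tool" and described by a matching tool schema, which is omitted here so that the snippet runs with the standard library alone.

```python
import json


def lookup_invoice(invoice_id: str) -> str:
    """Hypothetical tool handler that returns structured data as a string."""
    # Placeholder data; a real handler would query a database or an API
    record = {"invoice_id": invoice_id, "total": 10.0, "currency": "USD"}
    # Tool handlers must return a string, so serialize the dict to JSON
    return json.dumps(record)


result = lookup_invoice("INV-001")
print(type(result).__name__)        # the handler returns a string
print(json.loads(result)["total"])  # the caller can parse it back if needed
```

Returning JSON rather than an ad-hoc string keeps the tool output unambiguous for the model and easy to parse in your own control loop.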
import os from contextgem import ChatSession, DocumentLLM, register_tool # Define tool handlers and register them @register_tool def compute_invoice_total(items: list[dict]) -> str: total = 0 for it in items: qty = float(it.get("qty", 0)) price = float(it.get("price", 0)) total += qty * price return str(total) # OpenAI-compatible tool schema passed to the model tools = [ { "type": "function", "function": { "name": "compute_invoice_total", "description": "Compute invoice total as sum(qty*price) over items", "parameters": { "type": "object", "properties": { "items": { "type": "array", "items": { "type": "object", "properties": { "qty": {"type": "number"}, "price": {"type": "number"}, }, "required": ["qty", "price"], }, "minItems": 1, } }, "required": ["items"], }, }, }, ] # Configure an LLM that supports tool use llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), system_message="You are a helpful assistant.", # override default system message for chat tools=tools, ) # Maintain history across turns session = ChatSession() prompt = ( "What's the invoice total for the items " "[{'qty':2.0,'price':3.5},{'qty':1.0,'price':3.0}]? " "Prices are in USD." ) answer = llm.chat(prompt, chat_session=session) print("Answer:", answer) # ==== documents/document_config ==== Creating Documents ****************** This guide explains how to create and configure "Document" instances to process textual and visual content for analysis. Documents serve as the container for the content from which information (aspects and concepts) can be extracted. 
⚙️ Configuration Parameters =========================== The minimum configuration for a document requires either "raw_text", "paragraphs", or "images": Document creation from pathlib import Path from contextgem import Document, Paragraph, create_image # Create a document with raw text content contract_document = Document( raw_text=( "...This agreement is effective as of January 1, 2025.\n\n" "All parties must comply with the terms outlined herein. The terms include " "monthly reporting requirements and quarterly performance reviews.\n\n" "Failure to adhere to these terms may result in termination of the agreement. " "Additionally, any breach of confidentiality will be subject to penalties as " "described in this agreement.\n\n" "This agreement shall remain in force for a period of three (3) years unless " "otherwise terminated according to the provisions stated above..." ), paragraph_segmentation_mode="newlines", # Default mode, splits on newlines ) # Create a document with more advanced paragraph segmentation using a SaT model report_document = Document( raw_text=( "Executive Summary " "This report outlines our quarterly performance. " "Revenue increased by [15%] compared to the previous quarter.\n\n" "Customer satisfaction metrics show positive trends across all regions..." ), paragraph_segmentation_mode="sat", # Use SaT model for intelligent paragraph segmentation sat_model_id="sat-3l-sm", # Specify which SaT model to use ) # Create a document with predefined paragraphs, e.g. when you use a custom # paragraph segmentation tool document_from_paragraphs = Document( paragraphs=[ Paragraph(raw_text="This is the first paragraph."), Paragraph(raw_text="This is the second paragraph with more content."), Paragraph(raw_text="Final paragraph concluding the document."), # ... 
] ) # Create document with images # Path is adapted for doc tests current_file = Path(__file__).resolve() root_path = current_file.parents[4] image_path = root_path / "tests" / "images" / "invoices" / "invoice.png" # Create a document with only images (no text) image_document = Document( images=[ create_image(image_path), # contextgem.Image instance # ... ] ) # Create a document with both text and images mixed_document = Document( raw_text="This document contains both text and visual elements.", images=[ create_image(image_path), # contextgem.Image instance # ... ], ) The "Document" class accepts the following parameters: +---------------------------+-----------------+-----------------+-----------------------------------------------+ | Parameter | Type | Default Value | Description | |===========================|=================|=================|===============================================| | "raw_text" | "str | None" | "None" | The main text of the document as a single | | | | | string. | +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "paragraphs" | "list[Paragrap | "[]" | List of "Paragraph" instances in consecutive | | | h]" | | order as they appear in the document. | | | | | Normally auto-populated from "raw_text". | +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "images" | "list[Image]" | "[]" | List of "Image" instances attached to or | | | | | representing the document. Used for visual | | | | | content analysis. | +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "aspects" | "list[Aspect]" | "[]" | List of "Aspect" instances associated with | | | | | the document for focused analysis. Must have | | | | | unique names and descriptions. See Aspect | | | | | Extraction for more details. 
| +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "concepts" | "list[_Concept | "[]" | List of "_Concept" instances associated with | | | ]" | | the document for information extraction. Must | | | | | have unique names and descriptions. See | | | | | supported concept types in Supported | | | | | Concepts. | +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "paragraph_segmentation_ | "Literal["newl | ""newlines"" | Mode for paragraph segmentation. ""newlines"" | | mode" | ines", "sat"]" | | splits on newline characters, ""sat"" uses a | | | | | SaT (Segment Any Text) model for intelligent | | | | | segmentation. | +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "sat_model_id" | "SaTModelId" | ""sat-3l-sm"" | SaT model ID for paragraph/sentence | | | | | segmentation or a local path to a SaT model. | | | | | See wtpsplit models for available options. | +---------------------------+-----------------+-----------------+-----------------------------------------------+ | "pre_segment_sentences" | "bool" | "False" | Whether to pre-segment sentences during | | | | | Document initialization. When "False", | | | | | sentence segmentation is deferred until | | | | | sentences are actually needed, improving | | | | | initialization performance. | +---------------------------+-----------------+-----------------+-----------------------------------------------+ 🔄 DOCX Document Conversion =========================== ContextGem provides a built-in "DocxConverter" to easily transform DOCX files into LLM-ready "Document" instances. For detailed usage examples and configuration options, see DOCX Converter. 
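As background for what converting a DOCX file involves: a DOCX file is a ZIP archive of XML parts, and the main body text lives in "word/document.xml". The following is a minimal standard-library sketch of reading paragraph text from that part. It is not ContextGem's implementation and does not use "DocxConverter"; the helper names "docx_paragraph_texts" and "build_demo_docx" are hypothetical and exist only for this illustration. It ignores tables, headers, footnotes, comments, images, and formatting, and text inside nested elements such as text boxes may be counted more than once.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_texts(docx_source):
    """Return the text of each non-empty paragraph in the main document part.

    `docx_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Each <w:p> element is a paragraph; its visible text sits in <w:t> elements
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def build_demo_docx(paragraphs):
    """Build a tiny in-memory DOCX-like archive (demonstration only).

    The archive contains just word/document.xml, so Word itself would not
    open it, but it is enough to exercise the reader above.
    """
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = build_demo_docx(["Service Agreement", "Payment is due within 30 days."])
print(docx_paragraph_texts(demo))
```

A reader like this only recovers flat paragraph text. Preserving structure and metadata (headings, lists, table positions, comments, and so on) takes considerably more work on top of it.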
🎯 Adding Aspects and Concepts for Extraction
=============================================

Before extracting information from a document with an LLM, you must define and add **aspects** and **concepts** to your document instance. These components serve as the foundation for targeted analysis and structured information extraction.

**Aspects** define the text segments (sections, topics, themes) to be extracted from the document. They can be combined with concepts for comprehensive analysis.

**Concepts** define specific data points to be extracted or inferred from the document content: entities, insights, structured objects, classifications, numerical calculations, dates, ratings, and assessments.

For detailed guidance on creating and configuring these components, see:

* Aspect Extraction - Complete guide to defining and using aspects
* Supported Concepts - All available concept types and how to use them

# ==== converters/docx ====

DOCX Converter
**************

ContextGem provides a built-in converter to easily transform DOCX files into LLM-ready ContextGem document objects.

* 📑 **Comprehensive extraction of document elements**: paragraphs, headings, lists, tables, comments, footnotes, textboxes, headers/footers, links, embedded images, and inline formatting
* 🧩 **Document structure preservation** with rich metadata for improved LLM analysis
* 🛠️ **Built-in converter** that directly processes Word XML

Note: ✨ **Performance improvement in v0.17.1**: DOCX converter now converts files **~2X faster**.

🚀 Usage
========

   # Using ContextGem's DocxConverter
   from contextgem import DocxConverter

   converter = DocxConverter()

   # Convert a DOCX file to an LLM-ready ContextGem Document
   # from path
   document = converter.convert("path/to/document.docx")
   # or from file object
   with open("path/to/document.docx", "rb") as docx_file_object:
       document = converter.convert(docx_file_object)

   # Perform data extraction on the resulting Document object
   # document.add_aspects(...)
# document.add_concepts(...) # llm.extract_all(document) # You can also use DocxConverter instance as a standalone text extractor docx_text = converter.convert_to_text_format( "path/to/document.docx", output_format="markdown", # or "raw" ) 🔄 Conversion Process ===================== The "DocxConverter" performs the following operations when converting a DOCX file to a ContextGem Document with "convert()" method: +---------------------------+----------------------------------------------------+---------------------------+ | Elements | Extraction Details | Control Parameter | | | | (Default) | |===========================|====================================================|===========================| | **Text** | Extracts the full document text as raw text, and | "apply_markdown=True" | | | optionally applies markdown processing and | | | | formatting while preserving raw text separately | | +---------------------------+----------------------------------------------------+---------------------------+ | **Paragraphs** | Extracts "Paragraph" objects with rich metadata | *Always included* | | | serving as additional context for LLM (e.g., | | | | *"Style: Normal, Table: 3, Row: 1, Column: 3, | | | | Table Cell"*) | | +---------------------------+----------------------------------------------------+---------------------------+ | **Headings** | Preserves heading levels and formats as markdown | *Always included* | | | headings when in markdown mode | | +---------------------------+----------------------------------------------------+---------------------------+ | **Lists** | Maintains list hierarchy, numbering, and | *Always included* | | | formatting with proper indentation and list type | | | | information | | +---------------------------+----------------------------------------------------+---------------------------+ | **Tables** | Preserves table structure and formats tables in | "include_tables=True" | | | markdown mode | | 
+---------------------------+----------------------------------------------------+---------------------------+ | **Headers & Footers** | Captures document headers and footers with | "include_headers=True" / | | | appropriate metadata | "include_footers=True" | +---------------------------+----------------------------------------------------+---------------------------+ | **Footnotes** | Extracts footnotes with references and preserves | "include_footnotes=True" | | | connection to original text | | +---------------------------+----------------------------------------------------+---------------------------+ | **Comments** | Preserves document comments with author | "include_comments=True" | | | information and timestamps | | +---------------------------+----------------------------------------------------+---------------------------+ | **Links** | Processes and formats hyperlinks, preserving both | "include_links=True" | | | link text and target URLs | | +---------------------------+----------------------------------------------------+---------------------------+ | **Text Boxes** | Extracts text from various text box formats | "include_textboxes=True" | +---------------------------+----------------------------------------------------+---------------------------+ | **Inline Formatting** | Applies inline formatting such as bold, italic, | "include_inline_formatti | | | underline, etc. 
when in markdown mode | ng=True" | +---------------------------+----------------------------------------------------+---------------------------+ | **Images** | Extracts embedded images and converts them to | "include_images=True" | | | "Image" objects for further processing with vision | | | | models | | +---------------------------+----------------------------------------------------+---------------------------+ ℹ️ Current Limitations ====================== DocxConverter has the following limitations: * Drawings such as charts are skipped as it is challenging to represent them in text format. * Inline markdown formatting (bold, italic, etc.) and hyperlink formatting are not supported in specially marked sections (headers, footers, footnotes, comments). * Extraction of generated table of contents (ToC) is not supported. (A ToC is an automatically generated list of document headings with page numbers that Word creates based on heading styles.) # ==== aspects/aspects ==== Aspect Extraction ***************** "Aspect" is a fundamental component of ContextGem that represents a defined area or topic within a document that requires focused attention. Aspects help identify and extract specific sections or themes from documents according to predefined criteria. 📝 Overview =========== Aspects serve as containers for organizing and structuring document content extraction. They allow you to: * **Extract document sections**: Identify and extract specific parts of documents (e.g., contract clauses, report sections, policy terms) * **Organize content hierarchically**: Create sub-aspects to break down complex topics into logical components * **Define extraction scope**: Focus on specific areas of interest before applying detailed concept extraction While concepts extract specific data points, aspects extract entire sections or topics from documents, providing context for subsequent detailed analysis. 
⭐ Key Features =============== Hierarchical Organization ------------------------- Aspects support nested structures through sub-aspects, allowing you to break down complex topics: * **Parent aspects** represent broad topics (e.g., *"Termination Clauses"*) * **Sub-aspects** represent specific components (e.g., *"Notice Period"*, *"Severance Terms"*, *"Company Rights"*) Integration with Concepts ------------------------- Aspects can contain "_Concept" instances for detailed data extraction within the identified sections, creating a two-stage extraction workflow. Note: See supported concept types in Supported Concepts. All public concept types inherit from the internal "_Concept" base class. 💻 Basic Usage ============== Simple Aspect Extraction ------------------------ Here's how to extract a specific section from a document: # ContextGem: Aspect Extraction import os from contextgem import Aspect, Document, DocumentLLM # Create a document instance doc = Document( raw_text=( "Software License Agreement\n" "This software license agreement (Agreement) is entered into between Tech Corp (Licensor) and Client Corp (Licensee).\n" "...\n" "2. Term and Termination\n" "This Agreement shall commence on the Effective Date and shall continue for a period of three (3) years, " "unless earlier terminated in accordance with the provisions hereof. Either party may terminate this Agreement " "upon thirty (30) days written notice to the other party.\n" "\n" "3. Payment Terms\n" "Licensee agrees to pay Licensor an annual license fee of $10,000, payable within thirty (30) days of the " "invoice date. 
Late payments shall incur a penalty of 1.5% per month.\n" "...\n" ), ) # Define an aspect to extract the termination clause termination_aspect = Aspect( name="Termination Clauses", description="Sections describing how and when the agreement can be terminated, including notice periods and conditions", ) # Add the aspect to the document doc.add_aspects([termination_aspect]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the aspect from the document termination_aspect = llm.extract_aspects_from_document(doc)[0] # Access the extracted information print("Extracted Termination Clauses:") for item in termination_aspect.extracted_items: print(f"- {item.value}") Aspect with Sub-Aspects ----------------------- Breaking down complex topics into components: # ContextGem: Aspect Extraction with Sub-Aspects import os from contextgem import Aspect, Document, DocumentLLM # Create a document instance doc = Document( raw_text=( "Employment Agreement\n" "This Employment Agreement is entered into between Global Tech Inc. (Company) and John Smith (Employee).\n" "\n" "Section 8: Termination\n" "8.1 Termination by Company\n" "The Company may terminate this agreement at any time with or without cause by providing thirty (30) days " "written notice to the Employee. In case of termination for cause, no notice period is required.\n" "\n" "8.2 Termination by Employee\n" "The Employee may terminate this agreement by providing fourteen (14) days written notice to the Company. 
" "The Employee must complete all pending assignments before the termination date.\n" "\n" "8.3 Severance Benefits\n" "Upon termination without cause, the Employee shall receive severance pay equal to two (2) weeks of base salary " "for each year of service, with a minimum of four (4) weeks and a maximum of twenty-six (26) weeks. " "Severance benefits are contingent upon signing a release agreement.\n" "\n" "8.4 Return of Company Property\n" "Upon termination, the Employee must immediately return all Company property, including laptops, access cards, " "confidential documents, and any other materials belonging to the Company.\n" "\n" "Section 9: Non-Competition\n" "The Employee agrees not to engage in any business that competes with the Company for a period of twelve (12) " "months following termination of employment within a 50-mile radius of the Company's headquarters.\n" ), ) # Define the main termination aspect with sub-aspects termination_aspect = Aspect( name="Termination Provisions", description="All provisions related to employment termination including conditions, procedures, and consequences", aspects=[ Aspect( name="Company Termination Rights", description="Conditions and procedures for the company to terminate the employee, including notice periods and cause requirements", ), Aspect( name="Employee Termination Rights", description="Conditions and procedures for the employee to terminate employment, including notice requirements and obligations", ), Aspect( name="Severance Benefits", description="Compensation and benefits provided to the employee upon termination, including calculation methods and conditions", ), Aspect( name="Post-Termination Obligations", description="Employee obligations that continue after termination, including property return and non-competition requirements", ), ], ) # Add the aspect to the document doc.add_aspects([termination_aspect]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", 
api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract aspects from the document termination_aspect = llm.extract_aspects_from_document(doc)[0] # Access the extracted information print("All Termination Provisions:") for item in termination_aspect.extracted_items: print(f"- {item.value}") print("\nSub-Aspects:") for sub_aspect in termination_aspect.aspects: print(f"\n{sub_aspect.name}:") for item in sub_aspect.extracted_items: print(f"- {item.value}") ⚙️ Parameters ============= When creating an "Aspect", you can configure the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the aspect. Must be | | | | | unique among sibling aspects. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A detailed description of what the aspect | | | | | represents and what content should be extracted. | | | | | Must be unique among sibling aspects. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "aspects" | "list[Aspect]" | "[]" | *Optional*. List of sub-aspects for hierarchical | | | | | organization. Limited to one nesting level. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "concepts" | "list[_Concept | "[]" | *Optional*. List of concepts associated with the | | | ]" | | aspect for detailed data extraction within the | | | | | aspect's scope. See supported concept types in | | | | | Supported Concepts. 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for aspect | | | | xt"" | extraction. Available values: ""extractor_text"", | | | | | ""reasoner_text"". For more details, see 🏷️ LLM | | | | | Roles. Note that aspects only support text-based | | | | | extraction. For this reason, aspects cannot have | | | | | vision LLM roles (i.e. "llm_role" parameter value | | | | | ending with "_vision"). Concepts with vision LLM | | | | | roles cannot be used within aspects. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | The structural depth of references. Available | | | | | values: ""paragraphs"", ""sentences"". Paragraph | | | | | references are always populated for aspect's | | | | | extracted items, as aspect's extracted items | | | | | represent existing text segments. Sentence | | | | | references are only populated when | | | | | "reference_depth="sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether the LLM will output justification for each | | | | | extracted item. Justifications provide valuable | | | | | insights into why specific text segments were | | | | | extracted for the aspect, helping you understand | | | | | the LLM's reasoning, verify extraction accuracy, | | | | | and debug unexpected results. This is particularly | | | | | useful when working with complex aspects. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | The level of detail for justifications. Available | | h" | | | values: ""brief"", ""balanced"", | | | | | ""comprehensive"". 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum number of sentences in a justification. | | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ 📊 Extracted Items ================== When an "Aspect" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items" property. Each item is an instance of the "_StringItem" class with the following attributes: +----------------------+----------------------+--------------------------------------------------------------+ | Attribute | Type | Description | |======================|======================|==============================================================| | "value" | str | The extracted text segment representing the aspect | +----------------------+----------------------+--------------------------------------------------------------+ | "justification" | str | Explanation of why this text segment was identified as | | | | relevant to the aspect (only if "add_justifications=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_paragrap | list["Paragraph"] | List of paragraph objects that contain the extracted aspect | | hs" | | content (always populated for aspect's extracted items) | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_sentence | list["Sentence"] | List of sentence objects that contain the extracted aspect | | s" | | content (only if "reference_depth="sentences"") | +----------------------+----------------------+--------------------------------------------------------------+ 🚀 Advanced Usage ================= Aspects with Concepts --------------------- Combining aspect extraction with detailed concept extraction: # 
ContextGem: Aspect Extraction with Concepts import os from contextgem import Aspect, Document, DocumentLLM, NumericalConcept, StringConcept # Create a document instance doc = Document( raw_text=( "Service Agreement\n" "This Service Agreement is between DataFlow Solutions (Provider) and Enterprise Corp (Client).\n" "\n" "3. Payment Terms\n" "3.1 Service Fees\n" "The Client shall pay the Provider a monthly service fee of $5,000 for basic services. " "Additional premium features are available for an extra $1,200 per month. " "Setup fee is a one-time payment of $2,500.\n" "\n" "3.2 Payment Schedule\n" "All payments are due within 15 business days of invoice receipt. " "Invoices will be sent on the first day of each month for the upcoming service period. " "Late payments will incur a penalty of 2% per month on the outstanding balance.\n" "\n" "3.3 Payment Methods\n" "Payments may be made by bank transfer, corporate check, or ACH. " "Credit card payments are accepted for amounts under $1,000 with a 3% processing fee. " "Wire transfer fees are the responsibility of the Client.\n" "\n" "3.4 Refund Policy\n" "Services are non-refundable once delivered. 
However, if services are terminated " "with 30 days notice, any prepaid fees for future periods will be refunded on a pro-rata basis.\n" ), ) # Define an aspect with associated concepts payment_aspect = Aspect( name="Payment Terms", description="All clauses and provisions related to payment, including fees, schedules, methods, and policies", concepts=[ NumericalConcept( name="Monthly Service Fee", description="The regular monthly fee for basic services", numeric_type="float", ), NumericalConcept( name="Premium Features Fee", description="Additional monthly fee for premium features", numeric_type="float", ), NumericalConcept( name="Setup Fee", description="One-time initial setup or onboarding fee", numeric_type="float", ), NumericalConcept( name="Payment Due Days", description="Number of days the client has to make payment after receiving invoice", numeric_type="int", ), NumericalConcept( name="Late Payment Penalty Rate", description="Percentage penalty charged per month for late payments", numeric_type="float", ), StringConcept( name="Accepted Payment Methods", description="List of payment methods that are accepted by the provider", ), StringConcept( name="Refund Policy", description="Conditions and procedures for refunds or credits", ), ], ) # Add the aspect to the document doc.add_aspects([payment_aspect]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract aspects and their concepts from the document doc = llm.extract_all(doc) # Access the extracted payment terms aspect and concepts payment_terms_aspect = doc.get_aspect_by_name("Payment Terms") print("Extracted Payment Terms Section:") for item in payment_terms_aspect.extracted_items: print(f"- {item.value}") print("\nExtracted Payment Details:") for concept in 
payment_terms_aspect.concepts: print(f"\n{concept.name}:") for item in concept.extracted_items: print(f"- {item.value}") # Access specific extracted values monthly_fee = payment_terms_aspect.get_concept_by_name("Monthly Service Fee") print(f"\nMonthly Service Fee: ${monthly_fee.extracted_items[0].value}") Complex Hierarchical Structure ------------------------------ Creating a comprehensive document analysis structure with aspects, sub-aspects and concepts: # ContextGem: Complex Hierarchical Aspect Extraction with Sub-Aspects and Concepts import os from contextgem import ( Aspect, BooleanConcept, Document, DocumentLLM, NumericalConcept, StringConcept, ) # Create a document instance doc = Document( raw_text=( "Software Development and Licensing Agreement\n" "\n" "1. Intellectual Property Rights\n" "1.1 Ownership of Developed Software\n" "All software developed under this Agreement shall remain the exclusive property of the Developer. " "The Client receives a non-exclusive license to use the software as specified in Section 2.\n" "\n" "1.2 Client Data and Content\n" "The Client retains all rights to data and content provided to the Developer. " "The Developer may not use Client data for any purpose other than fulfilling this Agreement.\n" "\n" "1.3 Third-Party Components\n" "The software may include third-party open-source components. The Client agrees to comply " "with all applicable open-source licenses.\n" "\n" "2. License Terms\n" "2.1 Grant of License\n" "Developer grants Client a perpetual, non-transferable license to use the software " "for internal business purposes only, limited to 100 concurrent users.\n" "\n" "2.2 License Restrictions\n" "Client may not redistribute, sublicense, or create derivative works. " "Reverse engineering is prohibited except as required by law.\n" "\n" "3. 
Payment and Financial Terms\n" "3.1 Development Fees\n" "Total development fee is $150,000, payable in three installments: " "$50,000 upon signing, $50,000 at 50% completion, and $50,000 upon delivery.\n" "\n" "3.2 Ongoing License Fees\n" "Annual license fee of $12,000 is due each year starting from the first anniversary. " "Fees may increase by up to 5% annually with 60 days notice.\n" "\n" "3.3 Payment Terms\n" "All payments due within 30 days of invoice. Late payments incur 1.5% monthly penalty.\n" "\n" "4. Liability and Risk Allocation\n" "4.1 Limitation of Liability\n" "Developer's total liability shall not exceed the total amount paid under this Agreement. " "Neither party shall be liable for indirect, consequential, or punitive damages.\n" "\n" "4.2 Indemnification\n" "Client agrees to indemnify Developer against third-party claims arising from Client's use " "of the software, except for claims related to Developer's IP infringement.\n" "\n" "4.3 Insurance Requirements\n" "Developer shall maintain professional liability insurance of at least $1,000,000. 
" "Client shall maintain general liability insurance of at least $2,000,000.\n" ), ) # Define a complex hierarchical structure contract_aspects = [ Aspect( name="Intellectual Property Provisions", description="All provisions related to intellectual property rights, ownership, and usage", aspects=[ Aspect( name="Software Ownership", description="Clauses defining who owns the developed software and related IP rights", concepts=[ StringConcept( name="Software Owner", description="The party that owns the developed software", ), BooleanConcept( name="Exclusive Ownership", description="Whether the ownership is exclusive to one party", ), ], ), Aspect( name="Client Data Rights", description="Provisions about client data ownership and developer's permitted use", concepts=[ StringConcept( name="Data Usage Restrictions", description="Limitations on how developer can use client data", ), ], ), Aspect( name="Third-Party Components", description="Terms regarding use of third-party or open-source components", concepts=[ BooleanConcept( name="Open Source Included", description="Whether the software includes open-source components", ), ], ), ], ), Aspect( name="License Grant and Restrictions", description="Terms defining the software license granted to the client and any restrictions", aspects=[ Aspect( name="License Scope", description="The extent and limitations of the license granted", concepts=[ StringConcept( name="License Type", description="The type of license granted (exclusive, non-exclusive, etc.)", ), NumericalConcept( name="User Limit", description="Maximum number of concurrent users allowed", numeric_type="int", ), BooleanConcept( name="Perpetual License", description="Whether the license is perpetual or time-limited", ), ], ), Aspect( name="Usage Restrictions", description="Prohibited uses and activities under the license", concepts=[ BooleanConcept( name="Redistribution Allowed", description="Whether client can redistribute the software", ), BooleanConcept( 
name="Derivative Works Allowed", description="Whether client can create derivative works", ), ], ), ], ), Aspect( name="Financial Terms", description="All payment-related provisions including fees, schedules, and penalties", concepts=[ NumericalConcept( name="Total Development Fee", description="The total amount for software development", numeric_type="float", ), NumericalConcept( name="Annual License Fee", description="Yearly fee for using the software", numeric_type="float", ), NumericalConcept( name="Payment Due Days", description="Number of days to make payment after invoice", numeric_type="int", ), ], ), Aspect( name="Risk and Liability Management", description="Provisions for managing risks, liability limitations, and insurance requirements", aspects=[ Aspect( name="Liability Limitations", description="Caps and exclusions on each party's liability", concepts=[ StringConcept( name="Liability Cap", description="Maximum amount of liability for each party", ), StringConcept( name="Excluded Damages", description="Types of damages that are excluded from liability", ), ], ), Aspect( name="Insurance Requirements", description="Required insurance coverage for each party", concepts=[ NumericalConcept( name="Developer Insurance Amount", description="Minimum professional liability insurance for developer", numeric_type="float", ), NumericalConcept( name="Client Insurance Amount", description="Minimum general liability insurance for client", numeric_type="float", ), ], ), ], ), ] # Add all aspects to the document doc.add_aspects(contract_aspects) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract aspects and concepts doc = llm.extract_all(doc) # Access the hierarchical extraction results print("=== CONTRACT ANALYSIS RESULTS ===\n") for main_aspect in 
doc.aspects: print(f"{main_aspect.name.upper()}") for item in main_aspect.extracted_items: print(f"- {item.value}") # Access main aspect concepts if main_aspect.concepts: print(" Main Aspect Concepts:") for concept in main_aspect.concepts: print(f" • {concept.name}:") for item in concept.extracted_items: print(f" - {item.value}") # Access sub-aspects if main_aspect.aspects: print(" Sub-Aspects:") for sub_aspect in main_aspect.aspects: print(f" {sub_aspect.name}") for item in sub_aspect.extracted_items: print(f" - {item.value}") # Access sub-aspect concepts if sub_aspect.concepts: print(" Sub-Aspect Concepts:") for concept in sub_aspect.concepts: print(f" • {concept.name}:") for item in concept.extracted_items: print(f" - {item.value}") print() Justifications for Extraction ----------------------------- Justifications provide explanations for why specific text segments were identified as relevant to an aspect. Justifications help users understand the reasoning behind extractions and evaluate their relevance. When enabled, each extracted item includes a generated explanation of why that text segment was considered part of the aspect. Example: # ContextGem: Aspect Extraction with Justifications import os from contextgem import Aspect, Document, DocumentLLM # Create a document instance doc = Document( raw_text=( "NON-DISCLOSURE AGREEMENT\n" "\n" 'This Non-Disclosure Agreement ("Agreement") is entered into between TechCorp Inc. 
' '("Disclosing Party") and Innovation Labs LLC ("Receiving Party") on January 15, 2024.\n' "...\n" ), ) # Define a single aspect focused on NDA direction with justifications nda_direction_aspect = Aspect( name="NDA Direction", description="Provisions informing the NDA direction (whether mutual or one-way) and information flow between parties", add_justifications=True, justification_depth="balanced", justification_max_sents=4, ) # Add the aspect to the document doc.aspects = [nda_direction_aspect] # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the aspect with justifications nda_direction_aspect = llm.extract_aspects_from_document(doc)[0] for i, item in enumerate(nda_direction_aspect.extracted_items, 1): print(f"- {i}. {item.value}") print(f" Justification: {item.justification}") print() Note: References are always included for aspects. The "reference_paragraphs" field is automatically populated in extracted items of aspects, as they represent existing text segments in the document. The "reference_sentences" field is only populated when "reference_depth" is set to ""sentences"". 
You can access these references as follows: # Always available for aspects aspect.extracted_items[0].reference_paragraphs # Only populated if reference_depth="sentences" aspect.extracted_items[0].reference_sentences 💡 Best Practices ================= Aspect Definition ----------------- * **Be specific**: Provide clear, detailed descriptions that help the LLM understand exactly what content constitutes the aspect * **Use domain terminology**: Include relevant domain-specific terms that help identify the target content * **Define scope clearly**: Specify what should and shouldn't be included in the aspect Structuring Complex Content --------------------------- * **Logical decomposition**: Break down complex topics into logical, non-overlapping components * **Meaningful relationships**: Ensure sub-aspects and/or concepts genuinely belong to their parent aspect Integration Strategy -------------------- * **Two-stage extraction**: Use aspects to identify relevant sections first, then apply sub-aspects and/or concepts for detailed data extraction * **Scope alignment**: Ensure sub-aspects and/or concepts are relevant to their containing aspects * **Reference tracking**: Enable references when you need to trace extracted data back to source locations 🎯 Example Use Cases ==================== These are examples of how aspects may be used in different domains: Contract Analysis ----------------- * **Termination Clauses**: Extract and analyze termination conditions, notice periods, and severance terms * **Payment Terms**: Identify payment schedules, amounts, and conditions * **Liability Sections**: Extract liability caps, limitations, and indemnification clauses * **Intellectual Property**: Identify IP ownership, licensing, and usage rights Financial Reports ----------------- * **Revenue Sections**: Extract revenue recognition, breakdown by segments, and growth analysis * **Compliance Sections**: Identify regulatory compliance statements and audit findings * **Key Performance 
Indicators**: Extract precise numerical metrics like EBITDA margins, debt-to-equity ratios, and year-over-year percentage changes Technical Documentation ----------------------- * **Product Specifications**: Extract technical requirements, features, and performance criteria * **Installation Procedures**: Identify setup steps, configuration requirements, and dependencies * **Troubleshooting Sections**: Extract problem descriptions, diagnostic steps, and solutions * **API Documentation**: Identify endpoints, parameters, and usage examples Research Papers --------------- * **Methodology Sections**: Extract research methods, data collection, and analysis approaches * **Results Sections**: Identify findings, statistical outcomes, and experimental results * **Discussion Sections**: Extract interpretation, implications, and future research directions # ==== concepts/supported_concepts ==== Supported Concepts ****************** In ContextGem, Concepts are building blocks for defining the structured data you want to extract from documents. Each concept type is designed for different kinds of information, allowing you to build complex extraction schemas. Available Concept Types ======================= ContextGem provides several types of concepts, each tailored for specific extraction needs: * 📝 StringConcept: For extracting text values * ✅ BooleanConcept: For extracting boolean (True/False) values * 🔢 NumericalConcept: For extracting numerical values (integers or floats) * 📅 DateConcept: For extracting date objects * ⭐ RatingConcept: For extracting numerical ratings within a defined scale * 📊 JsonObjectConcept: For extracting structured data with multiple fields * 🏷️ LabelConcept: For classification using predefined labels (multi-class or multi-label) This section provides detailed documentation for each concept type, including usage examples and best practices.
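Whichever concept type is used, the examples in this documentation read results the same way: through each concept's "extracted_items" and each item's "value". The following is a minimal sketch of that access pattern for downstream code. It does not import ContextGem: "fake_concept" and "collect_values" are hypothetical helpers, and the stand-in objects only mimic the documented "name", "extracted_items", and "value" attributes.

```python
from types import SimpleNamespace


def collect_values(concepts):
    """Map each concept's name to the list of its extracted item values."""
    return {c.name: [item.value for item in c.extracted_items] for c in concepts}


def fake_concept(name, values):
    """Build a stand-in object (not a ContextGem class) with the
    `name`, `extracted_items`, and `value` attributes used above."""
    items = [SimpleNamespace(value=v) for v in values]
    return SimpleNamespace(name=name, extracted_items=items)


# Hypothetical results, as if three different concept types had been extracted.
concepts = [
    fake_concept("Person name", ["John Smith"]),
    fake_concept("Is confidential", [True]),
    fake_concept("Product price", [899.99]),
]

results = collect_values(concepts)
for name, values in results.items():
    print(f"{name}: {values}")
```

With real extraction results, the same function could presumably be called on "doc.concepts" after extraction, since it relies only on those three attributes.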
# ==== concepts/string_concept ==== StringConcept ************* "StringConcept" is a versatile concept type in ContextGem that extracts text-based information from documents, ranging from simple data fields to complex analytical insights. 📝 Overview =========== "StringConcept" is used when you need to extract text values from documents, including: * **Simple fields**: names, titles, descriptions, identifiers * **Complex analyses**: conclusions, assessments, recommendations, summaries * **Detected elements**: anomalies, patterns, key findings, critical insights This concept type offers flexibility to extract both factual information and interpretive content that requires advanced understanding. 💻 Usage Example ================ Here's a simple example of how to use "StringConcept" to extract a person's name from a document: # ContextGem: StringConcept Extraction import os from contextgem import Document, DocumentLLM, StringConcept # Create a Document object from text doc = Document(raw_text="My name is John Smith and I am 30 years old.") # Define a StringConcept to extract a person's name name_concept = StringConcept( name="Person name", description="Full name of the person", ) # Attach the concept to the document doc.add_concepts([name_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document name_concept = llm.extract_concepts_from_document(doc)[0] # Get the extracted value print(name_concept.extracted_items[0].value) # Output: "John Smith" # Or access the extracted value from the document object print(doc.concepts[0].extracted_items[0].value) # Output: "John Smith" ⚙️ Parameters ============= When creating a "StringConcept", you can specify the following parameters: 
+----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the concept | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A clear description of what the concept represents | | | | | and what should be extracted | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "examples" | "list[StringEx | "[]" | Optional. Example values that help the LLM better | | | ample]" | | understand what to extract and the expected format | | | | | (e.g., *"Party Name (Role)"* format for contract | | | | | parties). This additional guidance helps improve | | | | | extraction accuracy and consistency. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for extracting the | | | | xt"" | concept. Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. Justifications provide explanations of why | | | | | the LLM extracted specific values and the | | | | | reasoning behind the extraction, which is | | | | | especially useful for complex extractions or when | | | | | debugging results. 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | Justification detail level. Available values: | | h" | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. | | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document where the information was either | | | | | directly found or from which it was inferred, | | | | | helping to trace back extracted values to their | | | | | source content even when the extraction involves | | | | | reasoning or interpretation. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single | | | | | extracted item will be extracted. This is | | | | | particularly relevant when it might be unclear for | | | | | the LLM whether to focus on the concept as a | | | | | single item or extract multiple items. For | | | | | example, when extracting the total amount of | | | | | payments in a contract, where payments might be | | | | | mentioned in different parts of the document but | | | | | you only want the final total. 
Note that with | | | | | advanced LLMs, this constraint may not be strictly | | | | | required as they can often infer the appropriate | | | | | number of items to extract from the concept's | | | | | name, description, and type (e.g., "document | | | | | title" vs "key findings"). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. | +----------------------+-----------------+-----------------+----------------------------------------------------+ 🚀 Advanced Usage ================= ✏️ Adding Examples ------------------ You can add examples to improve the extraction accuracy and set the expected format for a "StringConcept": # ContextGem: StringConcept Extraction with Examples import os from contextgem import Document, DocumentLLM, StringConcept, StringExample # Create a Document object from text contract_text = """ SERVICE AGREEMENT This Service Agreement (the "Agreement") is entered into as of January 15, 2025 by and between: XYZ Innovations Inc., a Delaware corporation with offices at 123 Tech Avenue, San Francisco, CA ("Provider"), and Omega Enterprises LLC, a New York limited liability company with offices at 456 Business Plaza, New York, NY ("Customer"). """ doc = Document(raw_text=contract_text) # Create a StringConcept for extracting parties and their roles parties_concept = StringConcept( name="Contract parties", description="Names of parties and their roles in the contract", examples=[ StringExample(content="Acme Corporation (Supplier)"), StringExample(content="TechGroup Inc. 
(Client)"), ], # add examples providing additional guidance to the LLM ) # Attach the concept to the document doc.add_concepts([parties_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document parties_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted parties and their roles print("Extracted parties and roles:") for item in parties_concept.extracted_items: print(f"- {item.value}") # Expected output: # - XYZ Innovations Inc. (Provider) # - Omega Enterprises LLC (Customer) 🔍 References and Justifications for Extraction ----------------------------------------------- You can configure a "StringConcept" to include justifications and references. Justifications help explain the reasoning behind extracted values, especially for complex or inferred information like conclusions or assessments, while references point to the specific parts of the document that informed the extraction: # ContextGem: StringConcept Extraction with References and Justifications import os from contextgem import Document, DocumentLLM, StringConcept # Sample document text containing financial information financial_text = """ 2024 Financial Performance Summary Revenue increased to $120 million in fiscal year 2024, representing 15% growth compared to the previous year. This growth was primarily driven by the expansion of our enterprise client base and the successful launch of our premium service tier. The Board has recommended a dividend of $1.25 per share, which will be payable to shareholders of record as of March 15, 2025. 
""" # Create a Document from the text doc = Document(raw_text=financial_text) # Create a StringConcept with justifications and references enabled key_figures_concept = StringConcept( name="Financial key figures", description="Important financial metrics and figures mentioned in the report", add_justifications=True, # enable justifications to understand extraction reasoning justification_depth="balanced", justification_max_sents=3, # allow up to 3 sentences for each justification add_references=True, # include references to source text reference_depth="sentences", # reference specific sentences rather than paragraphs ) # Attach the concept to the document doc.add_concepts([key_figures_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4o-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept key_figures_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted items with justifications and references print("Extracted financial key figures:") for item in key_figures_concept.extracted_items: print(f"\nFigure: {item.value}") print(f"Justification: {item.justification}") print("Source references:") for sent in item.reference_sentences: print(f"- {sent.raw_text}") 📊 Extracted Items ================== When a "StringConcept" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items" property. 
Each item is an instance of the "_StringItem" class with the following attributes: +----------------------+----------------------+--------------------------------------------------------------+ | Attribute | Type | Description | |======================|======================|==============================================================| | "value" | str | The extracted text string | +----------------------+----------------------+--------------------------------------------------------------+ | "justification" | str | Explanation of why this string was extracted (only if | | | | "add_justifications=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_paragrap | list["Paragraph"] | List of paragraph objects that informed the extraction (only | | hs" | | if "add_references=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_sentence | list["Sentence"] | List of sentence objects that informed the extraction (only | | s" | | if "add_references=True" and "reference_depth="sentences"") | +----------------------+----------------------+--------------------------------------------------------------+ 💡 Best Practices ================= Here are some best practices to optimize your use of "StringConcept": * Provide a clear and specific description that helps the LLM understand exactly what to extract. * Include examples (using "StringExample") to improve extraction accuracy and demonstrate the expected format (e.g., *"Party Name (Role)"* for contract parties or *"Revenue: $X million"* for financial figures). * Enable justifications (using "add_justifications=True") when you need to see why the LLM extracted certain values. 
* Enable references (using "add_references=True") when you need to trace back to where in the document the information was found or understand what evidence informed extracted values (especially for inferred information). * When relevant, restrict extraction to a single item (using "singular_occurrence=True"). This is particularly relevant when it might be unclear to the LLM whether to treat the concept as a single item or extract multiple items. For example, use it when extracting the total amount of payments in a contract, where payments may be mentioned in several parts of the document but you only want the final total. # ==== concepts/boolean_concept ==== BooleanConcept ************** "BooleanConcept" is a specialized concept type that evaluates document content and produces True/False assessments based on specific criteria, conditions, or properties you define. 📝 Overview =========== "BooleanConcept" is used when you need to determine if a document contains or satisfies specific attributes, properties, or conditions that can be represented as True or False values, such as: * **Presence checks**: contains confidential information, includes specific clauses, mentions certain topics * **Compliance assessments**: meets regulatory requirements, follows specific formatting standards * **Binary classifications**: is favorable/unfavorable, is complete/incomplete, is approved/rejected 💻 Usage Example ================ Here's a simple example of how to use "BooleanConcept" to determine if a document mentions confidential information: # ContextGem: BooleanConcept Extraction import os from contextgem import BooleanConcept, Document, DocumentLLM # Create a Document object from text doc = Document( raw_text="This document contains confidential information and should not be shared publicly."
) # Define a BooleanConcept to detect confidential content confidentiality_concept = BooleanConcept( name="Is confidential", description="Whether the document contains confidential information", ) # Attach the concept to the document doc.add_concepts([confidentiality_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document confidentiality_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted value print(confidentiality_concept.extracted_items[0].value) # Output: True # Or access the extracted value from the document object print(doc.concepts[0].extracted_items[0].value) # Output: True ⚙️ Parameters ============= When creating a "BooleanConcept", you can specify the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the concept | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A clear description of what condition or property | | | | | the concept evaluates and the criteria for | | | | | determining true or false values | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for extracting the | | | | xt"" | concept. 
Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. Justifications provide explanations of why | | | | | the LLM extracted specific values and the | | | | | reasoning behind the extraction, which is | | | | | especially useful for complex extractions or when | | | | | debugging results. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | Justification detail level. Available values: | | h" | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. | | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document where evidence supporting the | | | | | boolean determination was found, helping to trace | | | | | back the true/false value to relevant content that | | | | | influenced the decision. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single | | | | | extracted item will be extracted. For boolean | | | | | concepts, this parameter is particularly useful | | | | | when you want to make a single true/false | | | | | determination about the entire document (e.g., | | | | | "contains confidential information") or a unique | | | | | determination about a specific aspect (e.g., "is | | | | | the payment schedule finalized"). This helps | | | | | distinguish between evaluating overall document | | | | | properties versus identifying multiple instances | | | | | where a condition might be true/false. Note that | | | | | with advanced LLMs, this constraint may not be | | | | | required as they can often infer the appropriate | | | | | number of items to extract from the concept's | | | | | name, description, and type (e.g., "contains | | | | | confidential information" vs "compliance | | | | | violations"). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. | +----------------------+-----------------+-----------------+----------------------------------------------------+ 🚀 Advanced Usage ================= 🔍 References and Justifications for Extraction ----------------------------------------------- You can configure a "BooleanConcept" to include justifications and references. 
Justifications help explain the reasoning behind true/false determinations, while references point to the specific parts of the document that influenced the decision: # ContextGem: BooleanConcept Extraction with References and Justifications import os from contextgem import BooleanConcept, Document, DocumentLLM # Sample document text containing policy information policy_text = """ Company Data Retention Policy (Updated 2024) All customer data must be encrypted at rest and in transit using industry-standard encryption protocols. Personal information should be retained for no longer than 3 years after the customer relationship ends. Employees are required to complete data privacy training annually. """ # Create a Document from the text doc = Document(raw_text=policy_text) # Create a BooleanConcept with justifications and references enabled compliance_concept = BooleanConcept( name="Has encryption requirement", description="Whether the document specifies that data must be encrypted", add_justifications=True, # Enable justifications to understand reasoning justification_depth="brief", justification_max_sents=1, # Allow at most 1 sentence per justification add_references=True, # Include references to source text reference_depth="sentences", # Reference specific sentences rather than paragraphs ) # Attach the concept to the document doc.add_concepts([compliance_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4o-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept compliance_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted value with justification and references print(f"Has encryption requirement: {compliance_concept.extracted_items[0].value}") print(f"\nJustification: {compliance_concept.extracted_items[0].justification}") print("\nSource 
references:") for sent in compliance_concept.extracted_items[0].reference_sentences: print(f"- {sent.raw_text}") 📊 Extracted Items ================== When a "BooleanConcept" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items" property. Each item is an instance of the "_BooleanItem" class with the following attributes: +----------------------+----------------------+--------------------------------------------------------------+ | Attribute | Type | Description | |======================|======================|==============================================================| | "value" | bool | The extracted boolean value (True or False) | +----------------------+----------------------+--------------------------------------------------------------+ | "justification" | str | Explanation of why this boolean value was determined (only | | | | if "add_justifications=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_paragrap | list["Paragraph"] | List of paragraph objects that influenced the boolean | | hs" | | determination (only if "add_references=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_sentence | list["Sentence"] | List of sentence objects that influenced the boolean | | s" | | determination (only if "add_references=True" and | | | | "reference_depth="sentences"") | +----------------------+----------------------+--------------------------------------------------------------+ 💡 Best Practices ================= Here are some best practices to optimize your use of "BooleanConcept": * Provide a clear and specific description that helps the LLM understand exactly what condition to evaluate, using precise and unambiguous language in your concept names and descriptions. 
Since boolean concepts yield true/false values, focus on describing what criteria should be used to make the determination (e.g., *"whether the document mentions specific compliance requirements"* rather than just *"compliance requirements"*). Avoid vague terms that could be interpreted multiple ways—for example, use *"contains legally binding obligations"* instead of *"contains important content"* to ensure consistent and accurate determinations. * Break down complex conditions into multiple simpler boolean concepts when appropriate. Instead of one concept checking *"document is complete and compliant and approved,"* consider separate concepts for each condition. This provides more granular insights and makes it easier to identify specific issues when any condition fails. * Enable justifications (using "add_justifications=True") when you need to understand the reasoning behind the LLM's true/false determination. * Enable references (using "add_references=True") when you need to trace back to specific parts of the document that influenced the boolean decision or verify the evidence used to make the determination. * Use "singular_occurrence=True" to enforce only a single boolean determination for the entire document. This is particularly useful for concepts that should yield a single true/false answer, such as *"contains confidential information"* or *"is compliant with regulations,"* rather than identifying multiple instances where the condition might be true or false throughout the document. # ==== concepts/numerical_concept ==== NumericalConcept **************** "NumericalConcept" is a specialized concept type that extracts, calculates, or derives numerical values (integers, floats, or both) from document content. 
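Because a "NumericalConcept" may yield integers, floats, or both, downstream code sometimes needs to confirm which kind of number it received. The sketch below is one possible strict check, keyed on the "numeric_type" values ("int", "float", "any") listed in the Parameters table on this page. The function "matches_numeric_type" is a hypothetical helper written for illustration; it is not part of ContextGem and does not describe how the library validates values internally.

```python
def matches_numeric_type(value, numeric_type="any"):
    """Hypothetical downstream check: does `value` fit the requested type?"""
    # bool is a subclass of int in Python, so exclude it explicitly;
    # otherwise True/False would pass as integers.
    if isinstance(value, bool):
        return False
    if numeric_type == "int":
        return isinstance(value, int)
    if numeric_type == "float":
        return isinstance(value, float)
    if numeric_type == "any":
        return isinstance(value, (int, float))
    raise ValueError(f"Unknown numeric_type: {numeric_type!r}")


checks = [
    matches_numeric_type(30, "int"),        # integer accepted as "int"
    matches_numeric_type(899.99, "float"),  # float accepted as "float"
    matches_numeric_type(30, "float"),      # this strict check rejects an int here
    matches_numeric_type(True, "any"),      # booleans are never treated as numbers
]
print(checks)
```

The strict handling of "float" (rejecting the integer 30) is a design choice of this sketch; a more lenient check could accept integers wherever floats are expected.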
📝 Overview =========== "NumericalConcept" enables powerful numerical data extraction and analysis from documents, such as: * **Direct extraction**: retrieving explicitly stated values like prices, percentages, dates, or measurements * **Calculated values**: computing sums, averages, growth rates, or other derived metrics * **Quantitative assessments**: determining counts, frequencies, totals, or numerical scores The concept can work with integers, floating-point numbers, or both types based on your configuration. 💻 Usage Example ================ Here's a simple example of how to use "NumericalConcept" to extract a price from a document: # ContextGem: NumericalConcept Extraction import os from contextgem import Document, DocumentLLM, NumericalConcept # Create a Document object from text doc = Document( raw_text="The latest smartphone model costs $899.99 and will be available next week." ) # Define a NumericalConcept to extract the price price_concept = NumericalConcept( name="Product price", description="The price of the product", numeric_type="float", # We expect a decimal price ) # Attach the concept to the document doc.add_concepts([price_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document price_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted value print(price_concept.extracted_items[0].value) # Output: 899.99 # Or access the extracted value from the document object print(doc.concepts[0].extracted_items[0].value) # Output: 899.99 ⚙️ Parameters ============= When creating a "NumericalConcept", you can specify the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default 
Value | Description | |======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the concept | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A clear description of what numerical value to | | | | | extract, which can include explicit values to | | | | | find, calculations to perform, or quantitative | | | | | assessments to derive from the document content | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "numeric_type" | "str" | ""any"" | The type of numerical values to extract. Available | | | | | values: ""int"", ""float"", ""any"". When ""any"" | | | | | is specified, the system will automatically | | | | | determine whether to use an integer or floating- | | | | | point representation based on the extracted value, | | | | | choosing the most appropriate type for each | | | | | numerical item. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for extracting the | | | | xt"" | concept. Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. 
Justifications provide explanations of why | | | | | the LLM extracted specific numerical values and | | | | | the reasoning behind the extraction, which is | | | | | especially useful for complex calculations, | | | | | inferred values, or when debugging results. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | Justification detail level. Available values: | | h" | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. | | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document where the numerical values were | | | | | either directly found or from which they were | | | | | calculated or inferred, helping to trace back | | | | | extracted values to their source content even when | | | | | the extraction involves complex calculations or | | | | | mathematical reasoning. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single | | | | | numerical value will be extracted. 
For numerical | | | | | concepts, this parameter is particularly useful | | | | | when you want to extract a single specific value | | | | | rather than identifying multiple numerical values | | | | | throughout the document. This helps distinguish | | | | | between single-value concepts and multi-value | | | | | concepts (e.g., *"total contract value"* vs *"all | | | | | payment amounts"*). Note that with advanced LLMs, | | | | | this constraint may not be required as they can | | | | | often infer the appropriate number of items to | | | | | extract from the concept's name, description, and | | | | | type. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. | +----------------------+-----------------+-----------------+----------------------------------------------------+ 🚀 Advanced Usage ================= 🔍 References and Justifications for Extraction ----------------------------------------------- You can configure a "NumericalConcept" to include justifications and references.
Justifications explain the reasoning behind the extracted values, while references point to the parts of the document where the numerical values were found directly or from which they were calculated or inferred. This makes it possible to trace extracted values back to their source content, even when the extraction involves complex calculations or mathematical reasoning: # ContextGem: NumericalConcept Extraction with References and Justifications import os from contextgem import Document, DocumentLLM, NumericalConcept # Document with values that require calculation/inference report_text = """ Quarterly Sales Report - Q2 2023 Product A: Sold 450 units at $75 each Product B: Sold 320 units at $125 each Product C: Sold 180 units at $95 each Marketing expenses: $28,500 Operating costs: $42,700 """ # Create a Document from the text doc = Document(raw_text=report_text) # Create a NumericalConcept for total revenue total_revenue_concept = NumericalConcept( name="Total quarterly revenue", description="The total revenue calculated by multiplying units sold by their price", add_justifications=True, justification_depth="comprehensive", # Detailed justification to show calculation steps justification_max_sents=4, # Maximum number of sentences for justification add_references=True, reference_depth="paragraphs", # Reference specific paragraphs singular_occurrence=True, # Ensure that the data is merged into a single item ) # Attach the concept to the document doc.add_concepts([total_revenue_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/o4-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept total_revenue_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted inferred value with justification print("Calculated total quarterly revenue:") for item in
total_revenue_concept.extracted_items: print(f"\nTotal Revenue: {item.value}") print(f"Calculation Justification: {item.justification}") print("Source references:") for para in item.reference_paragraphs: print(f"- {para.raw_text}") 📊 Extracted Items ================== When a "NumericalConcept" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items" property. Each item is an instance of the "_NumericalItem" class with the following attributes: +----------------------+----------------------+--------------------------------------------------------------+ | Attribute | Type | Description | |======================|======================|==============================================================| | "value" | int or float | The extracted numerical value, either an integer or | | | | floating-point number depending on the "numeric_type" | | | | setting | +----------------------+----------------------+--------------------------------------------------------------+ | "justification" | str | Explanation of why this numerical value was extracted (only | | | | if "add_justifications=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_paragrap | list["Paragraph"] | List of paragraph objects where the numerical value was | | hs" | | found or from which it was calculated or inferred (only if | | | | "add_references=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_sentence | list["Sentence"] | List of sentence objects where the numerical value was found | | s" | | or from which it was calculated or inferred (only if | | | | "add_references=True" and "reference_depth="sentences"") | +----------------------+----------------------+--------------------------------------------------------------+ 💡 Best Practices ================= Here are some best practices to optimize 
your use of "NumericalConcept": * Provide a clear and specific description that helps the LLM understand exactly what numerical values to extract, using precise and unambiguous language in your concept names and descriptions. For numerical concepts, be explicit about the exact values you're seeking (e.g., *"the total contract value in USD"* rather than just *"contract value"*). Avoid vague terms that could lead to incorrect extractions—for example, use *"quarterly revenue figures in millions"* instead of *"revenue numbers"* to ensure consistent and accurate extractions. * Use the appropriate "numeric_type" based on what you expect to extract or calculate: * Use ""int"" for counts, quantities, or whole numbers * Use ""float"" for prices, measurements, or values that may have decimal points * Use ""any"" when you're not sure or need to extract both types * Break down complex numerical extractions into multiple simpler numerical concepts when appropriate. Instead of one concept extracting *"all financial metrics,"* consider separate concepts for *"revenue figures,"* *"expense amounts,"* and *"profit margins."* This provides more structured data and makes it easier to process the results for specific purposes. * Enable justifications (using "add_justifications=True") when you need to understand the reasoning behind the LLM's numerical extractions, especially when calculations or conversions are involved. * Enable references (using "add_references=True") when you need to trace back to specific parts of the document that contained the numerical values or were used to calculate derived values. * Use "singular_occurrence=True" to enforce only a single numerical value extraction. This is particularly useful for concepts that should yield a unique value, such as *"total contract value"* or *"effective interest rate,"* rather than identifying multiple numerical values throughout the document. 
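When a concept yields a calculated value, it can help to cross-check the result against an independent computation. The following is a minimal sketch in plain Python (no ContextGem calls) that recomputes the total revenue from the figures in the quarterly sales report used in the advanced usage example. The `sales` list and the `expected_total_revenue` helper are illustrative names introduced here, not part of the ContextGem API.

```python
# Units sold and unit price per product, copied from the sample sales report.
# (Illustrative data structure; not part of the ContextGem API.)
sales = [
    ("Product A", 450, 75),
    ("Product B", 320, 125),
    ("Product C", 180, 95),
]


def expected_total_revenue(rows):
    """Sum units * unit price across all products."""
    return sum(units * price for _, units, price in rows)


total = expected_total_revenue(sales)
print(total)  # 90850
```

A value extracted by the LLM can then be compared with this figure, for example with `math.isclose` when the concept returns a float.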
# ==== concepts/date_concept ==== DateConcept *********** "DateConcept" is a specialized concept type that extracts, interprets, and processes date information from documents, returning standardized "datetime.date" objects. 📝 Overview =========== "DateConcept" is used when you need to extract date information from documents, allowing you to: * **Extract explicit dates**: Identify dates that are directly mentioned in various formats (e.g., "January 15, 2025", "15/01/2025", "2025-01-15") * **Infer implicit dates**: Deduce dates from contextual information (e.g., "next Monday", "two weeks from signing", "the following quarter") * **Calculate derived dates**: Determine dates based on other temporal references (e.g., "30 days after delivery", "the fiscal year ending") * **Normalize date representations**: Convert various date formats into standardized Python "datetime.date" objects for consistent processing This concept type is particularly valuable for extracting temporal information from documents such as: * Contract effective dates, expiration dates, and renewal periods * Report publication dates and data collection periods * Event scheduling information and deadline specifications * Historical dates and chronological sequences 💻 Usage Example ================ Here's a simple example of how to use "DateConcept" to extract a publication date from a document: # ContextGem: DateConcept Extraction import os from contextgem import DateConcept, Document, DocumentLLM # Create a Document object from text doc = Document( raw_text="The research paper was published on March 15, 2025 and has been cited 42 times since." 
) # Define a DateConcept to extract the publication date date_concept = DateConcept( name="Publication date", description="The date when the paper was published", ) # Attach the concept to the document doc.add_concepts([date_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document date_concept = llm.extract_concepts_from_document(doc)[0] # Print the type and value of the extracted item print( type(date_concept.extracted_items[0].value), date_concept.extracted_items[0].value ) # Output: <class 'datetime.date'> 2025-03-15 # Or access the extracted value from the document object print( type(doc.concepts[0].extracted_items[0].value), doc.concepts[0].extracted_items[0].value, ) # Output: <class 'datetime.date'> 2025-03-15 ⚙️ Parameters ============= When creating a "DateConcept", you can specify the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the concept | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A clear description of what date information to | | | | | extract, which can include explicit dates to find, | | | | | implicit dates to infer, or temporal relationships | | | | | to identify. For date concepts, be specific about | | | | | the exact date information sought (e.g., *"the | | | | | contract signing date"* rather than just *"dates | | | | | in the document"*) to ensure consistent and | | | | | accurate extractions.
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for extracting the | | | | xt"" | concept. Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. Justifications provide explanations of why | | | | | specific dates were extracted, which is especially | | | | | valuable when dates are inferred from contextual | | | | | clues (e.g., *"next quarter"* or *"30 days after | | | | | signing"*) or when resolving ambiguous date | | | | | references in the document. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | Justification detail level. Available values: | | h" | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. | | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document where date information was found, | | | | | derived, or inferred from. 
This is particularly | | | | | useful for tracing dates back to their original | | | | | context, understanding how relative dates were | | | | | calculated (e.g., *"30 days after delivery"*), or | | | | | verifying how the system resolved ambiguous | | | | | temporal references (e.g., *"next fiscal year"*). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single date | | | | | will be extracted. For date concepts, this | | | | | parameter is particularly useful when you want to | | | | | extract a specific, unique date in the document | | | | | (e.g., *"publication date"* or *"contract signing | | | | | date"*) rather than identifying multiple dates | | | | | throughout the document. Note that with advanced | | | | | LLMs, this constraint may not be required as they | | | | | can often infer the appropriate cardinality from | | | | | the concept's name, description, and type. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ 🚀 Advanced Usage ================= 🔍 References and Justifications for Extraction ----------------------------------------------- You can configure a "DateConcept" to include justifications and references. Justifications help explain the reasoning behind extracted dates, especially for complex or inferred temporal information (like dates derived from expressions such as *"30 days after delivery"* or *"next fiscal year"*), while references point to the specific parts of the document that contained the date information or from which it was inferred: # ContextGem: DateConcept Extraction with References and Justifications import os from contextgem import DateConcept, Document, DocumentLLM # Sample document text containing project timeline information project_text = """ Project Timeline: Website Redesign The website redesign project officially kicked off on March 1, 2024. The development team has estimated the project will take 4 months to complete. Key milestones: - Design phase: 1 month - Development phase: 2 months - Testing and deployment: 1 month The marketing team needs the final completion date to plan the launch campaign.
""" # Create a Document from the text doc = Document(raw_text=project_text) # Create a DateConcept to calculate the project completion date completion_date_concept = DateConcept( name="Project completion date", description="The final completion date for the website redesign project", add_justifications=True, # enable justifications to understand extraction logic justification_depth="balanced", justification_max_sents=3, # allow up to 3 sentences for the calculation justification add_references=True, # include references to source text reference_depth="sentences", # reference specific sentences rather than paragraphs singular_occurrence=True, # extract only one calculated date ) # Attach the concept to the document doc.add_concepts([completion_date_concept]) # Configure DocumentLLM llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept completion_date_concept = llm.extract_concepts_from_document(doc)[0] # Print the calculated completion date with justification and references print("Calculated project completion date:") extracted_item = completion_date_concept.extracted_items[ 0 ] # get the single calculated date print(f"\nCompletion Date: {extracted_item.value}") # expected output: 2024-07-01 print(f"Calculation Justification: {extracted_item.justification}") print("Source references used for calculation:") for sent in extracted_item.reference_sentences: print(f"- {sent.raw_text}") 📊 Extracted Items ================== When a "DateConcept" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items" property. 
Each item is an instance of the "_DateItem" class with the following attributes: +----------------------+----------------------+--------------------------------------------------------------+ | Attribute | Type | Description | |======================|======================|==============================================================| | "value" | datetime.date | The extracted date as a Python "datetime.date" object | +----------------------+----------------------+--------------------------------------------------------------+ | "justification" | str | Explanation of why this date was extracted (only if | | | | "add_justifications=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_paragrap | list["Paragraph"] | List of paragraph objects where the date was found or from | | hs" | | which it was calculated, derived, or inferred (only if | | | | "add_references=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_sentence | list["Sentence"] | List of sentence objects where the date was found or from | | s" | | which it was calculated, derived, or inferred (only if | | | | "add_references=True" and "reference_depth="sentences"") | +----------------------+----------------------+--------------------------------------------------------------+ 💡 Best Practices ================= Here are some best practices to optimize your use of "DateConcept": * Provide a clear and specific description that helps the LLM understand exactly what date to extract, using precise and unambiguous language (e.g., *"contract signing date"* rather than just *"date"*). * For dates that require interpretation or calculation (like *"30 days after delivery"* or *"end of next fiscal year"*), include these requirements explicitly in your description to ensure the LLM performs the necessary temporal reasoning. 
* Break down complex date extractions into multiple simpler date concepts when appropriate. Instead of one concept extracting *"all contract dates,"* consider separate concepts for *"contract signing date,"* *"effective date,"* and *"termination date."* * Enable justifications (using "add_justifications=True") when you need to understand the reasoning behind date calculations or extractions, especially for relative or inferred dates. * Enable references (using "add_references=True") when you need to trace back to specific parts of the document that contained the original date information or from which dates were calculated (e.g., deriving a project completion date from a start date plus duration information). * Use "singular_occurrence=True" to enforce only a single date extraction. This is particularly useful for concepts that should yield a unique calculated date, such as *"project completion deadline"* where multiple timeline elements need to be synthesized into a single target date, or when multiple date mentions actually refer to the same event. * Leverage the returned Python "datetime.date" objects for direct integration with date-based calculations, comparisons, or formatting in your application logic. # ==== concepts/rating_concept ==== RatingConcept ************* "RatingConcept" is a specialized concept type that calculates, infers, and derives rating values from documents within a clearly defined numerical scale.
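To make the idea of a clearly defined numerical scale concrete, here is a minimal plain-Python sketch of checking that a rating falls within a `(start, end)` scale of the kind passed as `rating_scale`. The `is_within_scale` helper is hypothetical and not part of the ContextGem API, and treating both bounds as inclusive is an assumption made for illustration.

```python
def is_within_scale(value: int, rating_scale: tuple[int, int]) -> bool:
    """Return True if value lies within the (start, end) scale.

    Assumption: both bounds are inclusive.
    """
    start, end = rating_scale
    return start <= value <= end


print(is_within_scale(8, (1, 10)))   # True
print(is_within_scale(11, (1, 10)))  # False
```

A check like this can serve as a lightweight guard in downstream code that consumes rating values.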
📝 Overview =========== "RatingConcept" enables sophisticated rating analysis from documents, allowing you to: * **Derive implicit ratings**: Calculate ratings based on sentiment analysis, key criteria, or contextual evaluation * **Generate evaluative scores**: Produce numerical assessments that quantify quality, relevance, or performance * **Normalize diverse signals**: Convert qualitative assessments into consistent numerical ratings within your defined scale * **Synthesize overall scores**: Combine multiple factors or opinions into comprehensive rating assessments This concept type is particularly valuable for generating evaluative information from documents such as: * Product and service reviews where sentiment must be quantified on a standardized scale * Performance assessments requiring numerical quality or satisfaction scoring * Risk evaluations needing severity or probability measurements * Content analyses where subjective characteristics must be rated objectively 💻 Usage Example ================ Here's a simple example of how to use "RatingConcept" to extract a product rating: # ContextGem: RatingConcept Extraction import os from contextgem import Document, DocumentLLM, RatingConcept # Create a Document object from text describing a product without an explicit rating smartphone_description = ( "This smartphone features a 5000mAh battery that lasts all day with heavy use. " "The display is 6.7 inch AMOLED with 120Hz refresh rate. " "Camera system includes a 50MP main sensor, 12MP ultrawide, and 8MP telephoto lens. " "The phone runs on the latest processor with 8GB RAM and 256GB storage. " "It has IP68 water resistance and Gorilla Glass Victus protection." 
) doc = Document(raw_text=smartphone_description) # Define a RatingConcept that requires analysis to determine a rating product_quality = RatingConcept( name="Product Quality Rating", description=( "Evaluate the overall quality of the smartphone based on its specifications, " "features, and adherence to industry best practices" ), rating_scale=(1, 10), add_justifications=True, # include justification for the rating justification_depth="balanced", justification_max_sents=5, ) # Attach the concept to the document doc.add_concepts([product_quality]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document - the LLM will analyze and assign a rating product_quality = llm.extract_concepts_from_document(doc)[0] # Print the calculated rating print(f"Quality Rating: {product_quality.extracted_items[0].value}") # Print the justification print(f"Justification: {product_quality.extracted_items[0].justification}") ⚙️ Parameters ============= When creating a "RatingConcept", you can specify the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the concept | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A clear description of what should be evaluated | | | | | and rated, including the criteria for assigning | | | | | different values within the rating scale (e.g., | | | | | "Evaluate product quality based on features, | | | | 
| durability, and performance where 1 represents | | | | | poor quality and 10 represents exceptional | | | | | quality"). The more specific the description, the | | | | | more consistent and accurate the ratings will be. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "rating_scale" | "tuple[int, | (Required) | Defines the boundaries for valid ratings as a | | | int]" | | tuple of (start, end) values (e.g., "(1, 5)" for a | | | | | 1-5 star rating, or "(0, 100)" for a percentage- | | | | | based evaluation). This parameter establishes the | | | | | numerical range within which all ratings must | | | | | fall, ensuring consistency across evaluations. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for extracting the | | | | xt"" | concept. Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. Justifications provide explanations of why | | | | | the LLM assigned specific rating values and the | | | | | reasoning behind the evaluation, which is | | | | | especially useful for understanding the factors | | | | | that influenced the rating. For example, a | | | | | justification might explain that a smartphone | | | | | received an 8/10 quality rating based on its | | | | | premium build materials, advanced camera system, | | | | | and long battery life, despite lacking expandable | | | | | storage. 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | Justification detail level. Available values: | | h" | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. | | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document that provided information or | | | | | evidence used to determine the rating. This is | | | | | particularly useful for understanding which parts | | | | | of the document influenced the rating assessment, | | | | | allowing you to trace evaluations back to relevant | | | | | content that supports the numerical value | | | | | assigned. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single | | | | | rating will be extracted. For rating concepts, | | | | | this parameter is particularly useful when you | | | | | want to extract a single overall score (e.g., | | | | | *"overall product quality"*) rather than | | | | | identifying multiple ratings throughout the | | | | | document for different aspects or features.
This | | | | | helps distinguish between a global evaluation | | | | | and component-specific ratings. Note that with | | | | | advanced LLMs, this constraint may not be required | | | | | as they can often infer the appropriate number of | | | | | ratings to extract from the concept's name, | | | | | description, and rating context. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. | +----------------------+-----------------+-----------------+----------------------------------------------------+ 🚀 Advanced Usage ================= 🔍 References and Justifications for Extraction ----------------------------------------------- When extracting a "RatingConcept", it's often useful to include justifications to understand the reasoning behind the score, and references to see which parts of the document informed it: # ContextGem: RatingConcept Extraction with References and Justifications import os from contextgem import Document, DocumentLLM, RatingConcept # Sample document text about a software product with various aspects software_review = """ Software Review: ProjectManager Pro 5.0 User Interface: The interface is clean and modern, with intuitive navigation. New users can quickly find what they need without extensive training. The dashboard provides a comprehensive overview of project status. Performance: The application loads quickly even with large projects. Resource-intensive operations like generating reports occasionally cause minor lag on older systems. The mobile app performs exceptionally well, even on limited bandwidth. Features: Project templates are well-designed and cover most common project types.
Task dependencies are easily managed, and the Gantt chart visualization is excellent. However, the software lacks advanced risk management tools that competitors offer. Support: The documentation is comprehensive and well-organized. Customer service response time averages 4 hours, which is acceptable but not industry-leading. The knowledge base needs more video tutorials. """ # Create a Document from the text doc = Document(raw_text=software_review) # Create a RatingConcept with justifications and references enabled usability_rating_concept = RatingConcept( name="Software usability rating", description="Evaluate the overall usability of the software on a scale of 1-10 based on UI design, intuitiveness, and learning curve", rating_scale=(1, 10), add_justifications=True, # enable justifications to explain the rating justification_depth="comprehensive", # provide detailed reasoning justification_max_sents=5, # allow up to 5 sentences for justification add_references=True, # include references to source text reference_depth="sentences", # reference specific sentences rather than paragraphs ) # Attach the concept to the document doc.add_concepts([usability_rating_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept usability_rating_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted rating item with justification and references extracted_item = usability_rating_concept.extracted_items[0] print(f"Software Usability Rating: {extracted_item.value}/10") print(f"\nJustification: {extracted_item.justification}") print("\nSource references:") for sent in extracted_item.reference_sentences: print(f"- {sent.raw_text}") ⭐⭐ Multiple Rating Categories ------------------------------- You can extract multiple rating 
categories from a document by creating separate rating concepts:

   # ContextGem: Multiple RatingConcept Extraction

   import os

   from contextgem import Document, DocumentLLM, RatingConcept

   # Sample document text about a restaurant review with multiple quality aspects to rate
   restaurant_review = """
   Restaurant Review: Bella Cucina

   Atmosphere: The restaurant has a warm, inviting ambiance with soft lighting and comfortable seating. The décor is elegant without being pretentious, and the noise level allows for easy conversation.

   Food Quality: The ingredients were fresh and high-quality. The pasta was perfectly cooked al dente, and the sauces were flavorful and well-balanced. The seafood dish had slightly overcooked shrimp, but the fish was excellent.

   Service: Our server was knowledgeable about the menu and wine list. Water glasses were kept filled, and plates were cleared promptly. However, there was a noticeable delay between appetizers and main courses.

   Value: Portion sizes were generous for the price point. The wine list offers selections at various price points, though markup is slightly higher than average for comparable restaurants in the area.
   """

   # Create a Document from the text
   doc = Document(raw_text=restaurant_review)

   # Define a consistent rating scale to be used across all rating categories
   restaurant_rating_scale = (1, 5)

   # Define multiple rating concepts for different quality aspects of the restaurant
   atmosphere_rating = RatingConcept(
       name="Atmosphere Rating",
       description="Rate the restaurant's atmosphere and ambiance",
       rating_scale=restaurant_rating_scale,
   )

   food_rating = RatingConcept(
       name="Food Quality Rating",
       description="Rate the quality, preparation, and taste of the food",
       rating_scale=restaurant_rating_scale,
   )

   service_rating = RatingConcept(
       name="Service Rating",
       description="Rate the efficiency, knowledge, and attentiveness of the service",
       rating_scale=restaurant_rating_scale,
   )

   value_rating = RatingConcept(
       name="Value Rating",
       description="Rate the value for money considering portion sizes and pricing",
       rating_scale=restaurant_rating_scale,
   )

   # Attach all concepts to the document
   doc.add_concepts([atmosphere_rating, food_rating, service_rating, value_rating])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract all concepts from the document
   extracted_concepts = llm.extract_concepts_from_document(doc)

   # Print all ratings
   print("Restaurant Ratings (1-5 scale):")
   for concept in extracted_concepts:
       if concept.extracted_items:
           print(f"{concept.name}: {concept.extracted_items[0].value}/5")

   # Calculate and print the overall average rating, skipping any concept
   # for which no rating was extracted (indexing an empty list would fail)
   rated_values = [
       concept.extracted_items[0].value
       for concept in extracted_concepts
       if concept.extracted_items
   ]
   if rated_values:
       average_rating = sum(rated_values) / len(rated_values)
       print(f"\nOverall Rating: {average_rating:.1f}/5")

📊 Extracted Items
==================

When a "RatingConcept" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items"
property. Each item is an instance of the "_IntegerItem" class with the following attributes:

+------------------------+----------------------+--------------------------------------------------------------+
| Attribute              | Type                 | Description                                                  |
|========================|======================|==============================================================|
| "value"                | int                  | The extracted rating value as an integer within the defined  |
|                        |                      | rating scale                                                 |
+------------------------+----------------------+--------------------------------------------------------------+
| "justification"        | str                  | Explanation of why this rating was extracted (only if        |
|                        |                      | "add_justifications=True")                                   |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_paragraphs" | list["Paragraph"]    | List of paragraph objects that influenced the rating         |
|                        |                      | determination (only if "add_references=True")                |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_sentences"  | list["Sentence"]     | List of sentence objects that influenced the rating          |
|                        |                      | determination (only if "add_references=True" and             |
|                        |                      | "reference_depth="sentences"")                               |
+------------------------+----------------------+--------------------------------------------------------------+

💡 Best Practices
=================

* Create descriptive names for your rating concepts that clearly indicate what aspect is being evaluated (e.g., *"Product Usability Rating"* rather than just *"Rating"*).

* Enhance extraction quality by including clear definitions of what each point on the scale represents in your concept description (e.g., *"1 = poor, 3 = average, 5 = excellent"*).

* Provide specific evaluation criteria in your concept description to guide the LLM's assessment process.
  For example, when rating software usability, specify that factors like interface intuitiveness, learning curve, and navigation efficiency should be considered.

* Enable justifications (using "add_justifications=True") when you need to understand the reasoning behind a rating, which is particularly valuable for evaluations that involve complex criteria where the rationale may not be immediately obvious from the score alone.

* Enable references (using "add_references=True") to trace ratings back to specific evidence in the document that informed the evaluation.

* Apply "singular_occurrence=True" for concepts that should yield a single comprehensive rating (like an overall product score) rather than multiple ratings throughout the document.

# ==== concepts/json_object_concept ====

JsonObjectConcept
*****************

"JsonObjectConcept" is a powerful concept type that extracts structured data in the form of JSON objects from documents, enabling sophisticated information organization and retrieval.

📝 Overview
===========

"JsonObjectConcept" is used when you need to extract complex, structured information from unstructured text, including:

* **Nested data structures**: objects with multiple fields, hierarchical information, and related attributes

* **Standardized formats**: consistent data extraction following predefined schemas for reliable downstream processing

* **Complex entity extraction**: comprehensive extraction of entities with multiple attributes and relationships

This concept type offers the flexibility to define precise schemas that match your data requirements, ensuring that extracted information maintains structural integrity and relationships between different data elements.
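To make the idea of "structural integrity" concrete, here is a minimal, standard-library-only sketch of what it means for an extracted object to conform to a dictionary-based schema. The "conforms" helper, the sample schema, and the sample items are hypothetical illustrations and are not part of ContextGem, which performs its own validation internally. The sketch covers only plain types, "list[...]", "Literal[...]", and nested dictionaries.

```python
from typing import Any, Literal, get_args, get_origin


def conforms(value: Any, spec: Any) -> bool:
    """Check `value` against a dict-based schema (hypothetical helper, not a ContextGem API)."""
    if isinstance(spec, dict):
        # Nested object: identical keys, and every field must conform
        return (
            isinstance(value, dict)
            and value.keys() == spec.keys()
            and all(conforms(value[key], spec[key]) for key in spec)
        )
    origin = get_origin(spec)
    if origin is list:
        # e.g. list[str]: every element must conform to the item type
        (item_spec,) = get_args(spec)
        return isinstance(value, list) and all(conforms(v, item_spec) for v in value)
    if origin is Literal:
        # The value must be one of the allowed literal options
        return value in get_args(spec)
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so match it only against bool
        return spec is bool
    if spec is float:
        # Accept whole numbers where a float is expected
        return isinstance(value, (int, float))
    return isinstance(value, spec)


# Illustrative schema and items (assumed for this sketch)
structure = {
    "name": str,
    "price": float,
    "features": list[str],
    "specifications": {"water_resistance": Literal["IP67", "IP68"]},
}
valid_item = {
    "name": "Smart Fitness Watch X7",
    "price": 199.99,
    "features": ["Heart rate monitoring", "GPS tracking"],
    "specifications": {"water_resistance": "IP68"},
}
invalid_item = {**valid_item, "price": "199.99"}  # price is a string, not a number

print(conforms(valid_item, structure))  # True
print(conforms(invalid_item, structure))  # False
```

A check of this kind is what lets downstream code rely on every extracted object having the same fields and types, which is the practical benefit of defining a schema up front.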
💻 Usage Example ================ Here's a simple example of how to use "JsonObjectConcept" to extract product information: # ContextGem: JsonObjectConcept Extraction import os from pprint import pprint from typing import Literal from contextgem import Document, DocumentLLM, JsonObjectConcept # Define product information text product_text = """ Product: Smart Fitness Watch X7 Price: $199.99 Features: Heart rate monitoring, GPS tracking, Sleep analysis Battery Life: 5 days Water Resistance: IP68 Available Colors: Black, Silver, Blue Customer Rating: 4.5/5 """ # Create a Document object from text doc = Document(raw_text=product_text) # Define a JsonObjectConcept with a structure for product information product_concept = JsonObjectConcept( name="Product Information", description="Extract detailed product information including name, price, features, and specifications", structure={ "name": str, "price": float, "features": list[str], "specifications": { "battery_life": str, "water_resistance": Literal["IP67", "IP68", "IPX7", "Not water resistant"], }, "available_colors": list[str], "customer_rating": float, }, ) # Attach the concept to the document doc.add_concepts([product_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document product_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted structured data extracted_product = product_concept.extracted_items[0].value pprint(extracted_product) ⚙️ Parameters ============= When creating a "JsonObjectConcept", you can specify the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | 
|======================|=================|=================|====================================================| | "name" | "str" | (Required) | A unique name identifier for the concept | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "description" | "str" | (Required) | A clear description of what the concept represents | | | | | and what should be extracted | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "structure" | "type | | (Required) | JSON object schema defining the data structure to | | | dict[str, Any]" | | be extracted. Can be specified as a Python class | | | | | with type annotations or a dictionary with field | | | | | names as keys and their corresponding types as | | | | | values. This schema can represent simple flat | | | | | structures or complex nested hierarchies with | | | | | multiple levels of organization. The LLM will | | | | | attempt to extract data that conforms to this | | | | | structure, enabling precise and consistent | | | | | extraction of complex information patterns. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "examples" | "list[JsonObje | "[]" | Optional. Example JSON objects illustrating the | | | ctExample]" | | concept usage. Such examples must conform to the | | | | | "structure" schema. Examples significantly improve | | | | | extraction accuracy by showing the LLM concrete | | | | | instances of the expected output format and | | | | | content patterns. This is particularly valuable | | | | | for complex schemas with nested structures or when | | | | | there are specific formatting conventions that | | | | | should be followed (e.g., how dates, identifiers, | | | | | or specialized fields should be represented). 
| | | | | Examples also help clarify how to handle edge | | | | | cases or ambiguous information in the source | | | | | document. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_te | The role of the LLM responsible for extracting the | | | | xt"" | concept. Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. Justifications provide explanations of why | | | | | the LLM extracted specific JSON structures and the | | | | | reasoning behind field values. This is especially | | | | | valuable for complex structures where the | | | | | extraction process involves inference or when | | | | | multiple data points must be synthesized. For | | | | | example, a justification might explain how the LLM | | | | | determined a product's category based on various | | | | | features mentioned across different paragraphs, or | | | | | why certain optional fields were populated or left | | | | | empty based on available information in the | | | | | document. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_dept | "str" | ""brief"" | Justification detail level. Available values: | | h" | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. 
| | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document that informed the extraction of | | | | | the JSON structure. This is particularly valuable | | | | | for complex objects where field values may be | | | | | calculated or inferred from multiple scattered | | | | | pieces of information throughout the document. | | | | | References help trace back extracted values to | | | | | their source evidence, validate the extraction | | | | | reasoning, and understand which parts of the | | | | | document contributed to the synthesis of | | | | | structured data, especially for fields requiring | | | | | interpretation, not only direct extraction. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single JSON | | | | | object will be extracted. For JSON object | | | | | concepts, this parameter is particularly useful | | | | | when you want to extract a comprehensive | | | | | structured representation of a single entity | | | | | (e.g., "product specifications" or "company | | | | | profile") rather than multiple instances of | | | | | structured data scattered throughout the document. 
| | | | | This is especially valuable when extracting | | | | | complex nested objects that aggregate information | | | | | from different parts of the document into a | | | | | cohesive whole. Note that with advanced LLMs, this | | | | | constraint may not be required as they can often | | | | | infer the appropriate number of objects to extract | | | | | from the concept's name, description, and schema | | | | | structure. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. | +----------------------+-----------------+-----------------+----------------------------------------------------+ 🏗️ Defining Structure ===================== The "structure" parameter defines the schema for the data you want to extract. JsonObjectConcept uses Pydantic models internally to validate all structures, ensuring type safety and data integrity. You can define this structure using either dictionaries or classes. Dictionary-based definitions provide a simpler abstraction for defining JSON object structures, while still benefiting from Pydantic's robust validation system under the hood. You can define the structure in several ways: 1. **Using a dictionary with type annotations:** from contextgem import JsonObjectConcept product_info_concept = JsonObjectConcept( name="Product Information", description="Product details", structure={ "name": str, "price": float, "is_available": bool, "ratings": list[float], }, ) 2. 
**Using nested dictionaries for complex structures:** from contextgem import JsonObjectConcept device_config_concept = JsonObjectConcept( name="Device Configuration", description="Configuration details for a networked device", structure={ "device": {"id": str, "type": str, "model": str}, "network": {"ip_address": str, "subnet_mask": str, "gateway": str}, "settings": {"enabled": bool, "mode": str}, }, ) 3. **Using a Python class with type annotations:** While dictionary structures provide the simplest way to define JSON schemas, you may prefer to use class definitions if that better fits your codebase style. You can define your structure using a Python class with type annotations: from pydantic import BaseModel from contextgem import JsonObjectConcept # Use a Pydantic model to define the structure of the JSON object class ProductSpec(BaseModel): name: str version: str features: list[str] product_spec_concept = JsonObjectConcept( name="Product Specification", description="Technical specifications for a product", structure=ProductSpec, ) 4. **Using nested classes for complex structures:** If you prefer to use class definitions for hierarchical data structures (already supported by dictionary structures), you can use nested class definitions. This approach offers a more object-oriented style that may better align with your existing codebase, especially when working with dataclasses or Pydantic models in your application code. 
When using nested class definitions, all classes in the structure must inherit from the "JsonObjectClassStruct" utility class to enable automatic conversion of the whole class hierarchy to a dictionary structure: from dataclasses import dataclass from contextgem import JsonObjectConcept from contextgem.public.utils import JsonObjectClassStruct # Use dataclasses to define the structure of the JSON object # All classes in the nested class structure must inherit from JsonObjectClassStruct # to enable automatic conversion of the class hierarchy to a dictionary structure # for JsonObjectConcept @dataclass class Location(JsonObjectClassStruct): latitude: float longitude: float altitude: float @dataclass class Sensor(JsonObjectClassStruct): id: str type: str location: Location # reference to another class active: bool @dataclass class SensorNetwork(JsonObjectClassStruct): network_id: str primary_sensor: Sensor # reference to another class backup_sensors: list[Sensor] # list of another class sensor_network_concept = JsonObjectConcept( name="IoT Sensor Network", description="Configuration for a network of IoT sensors", structure=SensorNetwork, # nested class structure ) 🚀 Advanced Usage ================= ✏️ Adding Examples ------------------ You can provide examples of structured JSON objects to improve extraction accuracy, especially for complex schemas or when there might be ambiguity in how to organize or format the extracted information: # ContextGem: JsonObjectConcept Extraction with Examples import os from pprint import pprint from contextgem import Document, DocumentLLM, JsonObjectConcept, JsonObjectExample # Document object with ambiguous medical report text medical_report = """ PATIENT ASSESSMENT Date: March 15, 2023 Patient: John Doe (ID: 12345) Vital Signs: BP: 125/82 mmHg HR: 72 bpm Temp: 98.6°F SpO2: 98% Chief Complaint: Patient presents with persistent cough for 2 weeks, mild fever in evenings (up to 100.4°F), and fatigue. No shortness of breath. 
Patient reports recent travel to Southeast Asia 3 weeks ago. Assessment: Physical examination shows slight wheezing in upper right lung. No signs of pneumonia on chest X-ray. WBC slightly elevated at 11,500. Patient appears in stable condition but fatigued. Impression: 1. Acute bronchitis, likely viral 2. Rule out early TB given travel history 3. Fatigue, likely secondary to infection Plan: - Rest for 5 days - Symptomatic treatment with over-the-counter cough suppressant - Follow-up in 1 week - TB test ordered Dr. Sarah Johnson, MD """ doc = Document(raw_text=medical_report) # Create a JsonObjectConcept for extracting medical assessment data # Without examples, the LLM might struggle with ambiguous fields or formatting variations medical_assessment_concept = JsonObjectConcept( name="Medical Assessment", description="Key information from a patient medical assessment", structure={ "patient": { "id": str, "vital_signs": { "blood_pressure": str, "heart_rate": int, "temperature": float, "oxygen_saturation": int, }, }, "clinical": { "symptoms": list[str], "diagnosis": list[str], "travel_history": bool, }, "treatment": {"recommendations": list[str], "follow_up_days": int}, }, # Examples provide helpful guidance on how to: # 1. Map data from unstructured text to structured fields # 2. Handle formatting variations (BP as "120/80" vs separate systolic/diastolic) # 3. 
Extract implicit information (converting "SpO2: 98%" to just 98) examples=[ JsonObjectExample( content={ "patient": { "id": "87654", "vital_signs": { "blood_pressure": "130/85", "heart_rate": 68, "temperature": 98.2, "oxygen_saturation": 99, }, }, "clinical": { "symptoms": ["headache", "dizziness", "nausea"], "diagnosis": ["Migraine", "Dehydration"], "travel_history": False, }, "treatment": { "recommendations": [ "Hydration", "Pain medication", "Dark room rest", ], "follow_up_days": 14, }, } ), JsonObjectExample( content={ "patient": { "id": "23456", "vital_signs": { "blood_pressure": "145/92", "heart_rate": 88, "temperature": 100.8, "oxygen_saturation": 96, }, }, "clinical": { "symptoms": ["sore throat", "cough", "fever"], "diagnosis": ["Strep throat", "Pharyngitis"], "travel_history": True, }, "treatment": { "recommendations": ["Antibiotics", "Throat lozenges", "Rest"], "follow_up_days": 7, }, } ), ], ) # Attach the concept to the document doc.add_concepts([medical_assessment_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document medical_assessment_concept = llm.extract_concepts_from_document(doc)[0] # Print the extracted medical assessment print("Extracted medical assessment:") assessment = medical_assessment_concept.extracted_items[0].value pprint(assessment) 🔍 References and Justifications for Extraction ----------------------------------------------- You can configure a "JsonObjectConcept" to include justifications and references, which provide transparency into the extraction process. 
Justifications explain the reasoning behind the extracted values, while references point to the specific parts of the document that were used as sources for the extraction: # ContextGem: JsonObjectConcept Extraction with References and Justifications import os from pprint import pprint from typing import Literal from contextgem import Document, DocumentLLM, JsonObjectConcept # Sample document text containing a customer complaint customer_complaint = """ CUSTOMER COMPLAINT #CR-2023-0472 Date: November 15, 2023 Customer: Sarah Johnson Description: I purchased the Ultra Premium Blender (Model XJ-5000) from your online store on October 3, 2023. The product was delivered on October 10, 2023. After using it only 5 times, the motor started making loud grinding noises and then completely stopped working on November 12. I've tried troubleshooting using the manual, including checking for obstructions and resetting the device, but nothing has resolved the issue. I expected much better quality given the premium price point ($249.99) and the 5-year warranty advertised. I've been a loyal customer for over 7 years and have purchased several kitchen appliances from your company. This is the first time I've experienced such a significant quality issue. I would like a replacement unit or a full refund. 
Previous interactions: - Spoke with customer service representative Alex on Nov 13 (Ref #CS-98721) - Was told to submit this formal complaint after troubleshooting was unsuccessful - No resolution offered during initial call Contact: sarah.johnson@example.com | (555) 123-4567 """ # Create a Document from the text doc = Document(raw_text=customer_complaint) # Create a JsonObjectConcept with justifications and references enabled complaint_analysis_concept = JsonObjectConcept( name="Complaint analysis", description="Detailed analysis of a customer complaint", structure={ "issue_type": Literal[ "product defect", "delivery problem", "billing error", "service issue", "other", ], "warranty_applicable": bool, "severity": Literal["low", "medium", "high", "critical"], "customer_loyalty_status": Literal["new", "regular", "loyal", "premium"], "recommended_resolution": Literal[ "replacement", "refund", "repair", "partial refund", "other" ], "priority_level": Literal["low", "standard", "high", "urgent"], "expected_business_impact": Literal["minimal", "moderate", "significant"], }, add_justifications=True, justification_depth="comprehensive", # provide detailed justifications justification_max_sents=10, # provide up to 10 sentences for each justification add_references=True, reference_depth="sentences", # provide references to the sentences in the document ) # Attach the concept to the document doc.add_concepts([complaint_analysis_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept complaint_analysis_concept = llm.extract_concepts_from_document(doc)[0] # Get the extracted complaint analysis complaint_analysis_item = complaint_analysis_concept.extracted_items[0] # Print the structured analysis print("Complaint Analysis\n") 
pprint(complaint_analysis_item.value) print("\nJustification:") print(complaint_analysis_item.justification) # Print key source references print("\nReferences:") for sent in complaint_analysis_item.reference_sentences: print(f"- {sent.raw_text}") 💡 Best Practices ================= * Keep your JSON structures simple yet comprehensive, focusing on the essential fields needed for your use case to avoid LLM prompt overloading. * Include realistic examples (using "JsonObjectExample") that precisely match your schema to guide extraction, especially for ambiguous or specialized data formats. * Provide detailed descriptions for your JsonObjectConcept that specify exactly what structured data to extract and how fields should be interpreted. * For complex JSON objects, use nested dictionaries or class hierarchies to organize related fields logically. * Enable justifications (using "add_justifications=True") when interpretation rationale is important, especially for extractions that involve judgment or qualitative assessment, such as sentiment analysis (positive/negative), priority assignment (high/medium/low), or data categorization where the LLM must make interpretive decisions rather than extract explicit facts. * Enable references (using "add_references=True") when you need to verify the document source of extracted values for compliance or verification purposes. This is especially valuable when the LLM is not just directly extracting explicit text, but also interpreting or inferring information from context. For example, in legal document analysis where traceability of information is essential for auditing or validation, references help track both explicit statements and the implicit information the model has derived from them. * Use "singular_occurrence=True" when you expect exactly one instance of the structured data in the document (e.g., a single product specification, one patient medical record, or a unique customer complaint). 
This is useful for documents with a clear singular focus. Conversely, omit this parameter ("False" is the default) when you need to extract multiple instances of the same structure from a document, such as multiple product listings in a catalog, several patient records in a hospital report, or various customer complaints in a feedback compilation. # ==== concepts/label_concept ==== LabelConcept ************ "LabelConcept" is a classification concept type in ContextGem that categorizes documents or content using predefined labels, supporting both single-label and multi-label classification approaches. 🏷️ Overview =========== "LabelConcept" is used when you need to classify content into predefined categories, including: * **Document classification**: contract types, document categories, legal classifications * **Content categorization**: topics, themes, subjects, areas of focus * **Quality assessment**: compliance levels, risk categories, priority levels * **Multi-faceted tagging**: multiple applicable labels for comprehensive classification This concept type supports two classification modes: * **Multi-class**: Always selects exactly one label from the predefined set (mutually exclusive labels) - used for classifying the content into a single type or category. A label is always returned, even if none perfectly fit the content. * **Multi-label**: Selects zero, one, or multiple labels from the predefined set (non-exclusive labels) - used when multiple topics or attributes can apply simultaneously. Returns only applicable labels, or no labels if none apply. Note: **For multi-label classification**: When none of the predefined labels apply to the content being classified, no extracted items will be returned for the concept (empty "extracted_items" list). 
This ensures that only applicable labels are selected.

**For multi-class classification**: A label is always returned, as this classification type requires selecting the best-fitting option from the predefined set, even if none perfectly match the content.

Important: **For multi-class classification**: Because multi-class classification always returns exactly one label, consider including a general "other" label (such as "N/A", "misc", or "unspecified") to handle content that none of the specific labels fit, unless your labels are broad enough to cover all cases or you know that the classified content always falls under one of the predefined labels. This ensures appropriate classification even when the content doesn't clearly fit into any of the predefined specific categories.

💻 Usage Example
================

Here's a basic example of how to use "LabelConcept" for document type classification:

   # ContextGem: Contract Type Classification using LabelConcept

   import os

   from contextgem import Document, DocumentLLM, LabelConcept

   # Create a Document object from legal document text
   legal_doc_text = """
   NON-DISCLOSURE AGREEMENT

   This Non-Disclosure Agreement ("Agreement") is entered into as of January 15, 2025, by and between TechCorp Inc., a Delaware corporation ("Disclosing Party"), and DataSystems LLC, a California limited liability company ("Receiving Party").

   WHEREAS, Disclosing Party possesses certain confidential information relating to its proprietary technology and business operations;

   NOW, THEREFORE, in consideration of the mutual covenants contained herein, the parties agree as follows:

   1. CONFIDENTIAL INFORMATION
   The term "Confidential Information" shall mean any and all non-public information...

   2. OBLIGATIONS OF RECEIVING PARTY
   Receiving Party agrees to hold all Confidential Information in strict confidence...
   """

   doc = Document(raw_text=legal_doc_text)

   # Define a LabelConcept for contract type classification
   contract_type_concept = LabelConcept(
       name="Contract Type",
       description="Classify the type of contract",
       labels=["NDA", "Consultancy Agreement", "Privacy Policy", "Other"],
       classification_type="multi_class",  # only one label can be selected (mutually exclusive labels)
       singular_occurrence=True,  # expect only one classification result
   )

   # Attach the concept to the document
   doc.add_concepts([contract_type_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   contract_type_concept = llm.extract_concepts_from_document(doc)[0]

   # Check if any labels were extracted
   if contract_type_concept.extracted_items:
       # Get the classified document type
       classified_type = contract_type_concept.extracted_items[0].value
       print(f"Document classified as: {classified_type}")  # Output: ['NDA']
   else:
       print("No applicable labels found for this document")

⚙️ Parameters
=============

When creating a "LabelConcept", you can specify the following parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what the concept represents |
|                      |                 |                 | and how classification should be performed         |
+----------------------+-----------------+-----------------+----------------------------------------------------+ | "labels" | "list[str]" | (Required) | List of predefined labels for classification. Must | | | | | contain at least 2 unique labels | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "classification_type" | "str" | ""multi_class"" | Classification mode. Available values: | | | | | ""multi_class"" (select exactly one label), | | | | | ""multi_label"" (select zero, one, or multiple | | | | | labels). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "llm_role" | "str" | ""extractor_text"" | The role of the LLM responsible for extracting the | | | | | concept. Available values: ""extractor_text"", | | | | | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". For more details, see 🏷️ | | | | | LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_justifications" | "bool" | "False" | Whether to include justifications for extracted | | | | | items. Justifications provide explanations of why | | | | | specific labels were selected and the reasoning | | | | | behind the classification decision. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_depth" | "str" | ""brief"" | Justification detail level. Available values: | | | | | ""brief"", ""balanced"", ""comprehensive"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "justification_max_ | "int" | "2" | Maximum sentences in a justification. 
| | sents" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "add_references" | "bool" | "False" | Whether to include source references for extracted | | | | | items. References indicate the specific locations | | | | | in the document that informed the classification | | | | | decision. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reference_depth" | "str" | ""paragraphs"" | Source reference granularity. Available values: | | | | | ""paragraphs"", ""sentences"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "singular_occurrenc | "bool" | "False" | Whether this concept is restricted to having only | | e" | | | one extracted item. If "True", only a single | | | | | extracted item will be extracted. This is | | | | | particularly useful for global document | | | | | classifications where only one classification | | | | | result is expected. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "custom_data" | "dict" | "{}" | Optional. Dictionary for storing any additional | | | | | data that you want to associate with the concept. | | | | | This data must be JSON-serializable. This data is | | | | | not used for extraction but can be useful for | | | | | custom processing or downstream tasks. 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ 🚀 Advanced Usage ================= 🏷️ Multi-Class vs Multi-Label Classification -------------------------------------------- Choose the appropriate classification type based on your use case: **Multi-Class Classification** ("classification_type="multi_class""): * Always selects exactly one label from the predefined set (mutually exclusive labels) * A label is always returned, even if none perfectly fit the content * Ideal for: document types, priority levels, status categories * Example: A document must be classified as one type: "NDA", "Consultancy Agreement", or "Privacy Policy" (or "Other" if none apply) **Multi-Label Classification** ("classification_type="multi_label""): * Selects zero, one, or multiple labels from the predefined set (non- exclusive labels) * Returns only applicable labels; can return no labels if none apply * Ideal for: content topics, applicable regulations, feature tags * Example: A document can cover multiple topics: "Finance", "Legal", "Technology", or none of these topics Here's an example demonstrating multi-label classification for content topic identification: # ContextGem: Multi-Label Classification with LabelConcept import os from contextgem import Document, DocumentLLM, LabelConcept # Create a Document object with business document text covering multiple topics business_doc_text = """ QUARTERLY BUSINESS REVIEW - Q4 2024 FINANCIAL PERFORMANCE Revenue for Q4 2024 reached $2.8 million, exceeding our target by 12%. The finance team has prepared detailed budget projections for 2025, with anticipated growth of 18% across all divisions. TECHNOLOGY INITIATIVES Our development team has successfully implemented the new cloud infrastructure, reducing operational costs by 25%. The IT department is now focusing on cybersecurity enhancements and data analytics platform upgrades. 
HUMAN RESOURCES UPDATE We welcomed 15 new employees this quarter, bringing our total headcount to 145. The HR team has launched a comprehensive employee wellness program and updated our remote work policies. LEGAL AND COMPLIANCE All regulatory compliance requirements have been met for Q4. The legal department has reviewed and updated our data privacy policies in accordance with recent legislation changes. MARKETING STRATEGY The marketing team launched three successful campaigns this quarter, resulting in a 40% increase in lead generation. Our digital marketing efforts have expanded to include LinkedIn advertising and content marketing. """ doc = Document(raw_text=business_doc_text) # Define a LabelConcept for topic classification allowing multiple topics content_topics_concept = LabelConcept( name="Document Topics", description="Identify all relevant business topics covered in this document", labels=[ "Finance", "Technology", "HR", "Legal", "Marketing", "Operations", "Sales", "Strategy", ], classification_type="multi_label", # multiple labels can be selected (non-exclusive labels) ) # Attach the concept to the document doc.add_concepts([content_topics_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document content_topics_concept = llm.extract_concepts_from_document(doc)[0] # Check if any labels were extracted if content_topics_concept.extracted_items: # Get all identified topics identified_topics = content_topics_concept.extracted_items[0].value print(f"Document covers the following topics: {', '.join(identified_topics)}") # Expected output might include: Finance, Technology, HR, Legal, Marketing else: print("No applicable topic labels found for this document") 🔍 References and Justifications for 
Classification --------------------------------------------------- You can configure a "LabelConcept" to include justifications and references to understand classification decisions. This is particularly valuable when dealing with complex documents that might contain elements of multiple document types: # ContextGem: LabelConcept with References and Justifications import os from contextgem import Document, DocumentLLM, LabelConcept # Create a Document with content that might be challenging to classify mixed_content_text = """ QUARTERLY BUSINESS REVIEW AND POLICY UPDATES GlobalTech Solutions Inc. - February 2025 EMPLOYMENT AGREEMENT AND CONFIDENTIALITY PROVISIONS This Employment Agreement ("Agreement") is entered into between GlobalTech Solutions Inc. ("Company") and Sarah Johnson ("Employee") as of February 1, 2025. EMPLOYMENT TERMS Employee shall serve as Senior Software Engineer with responsibilities including software development, code review, and technical leadership. The position is full-time with an annual salary of $125,000. CONFIDENTIALITY OBLIGATIONS Employee acknowledges that during employment, they may have access to confidential information including proprietary algorithms, customer data, and business strategies. Employee agrees to maintain strict confidentiality of such information both during and after employment. NON-COMPETE PROVISIONS For a period of 12 months following termination, Employee agrees not to engage in any business activities that directly compete with Company's core services within the same geographic market. INTELLECTUAL PROPERTY All work products, inventions, and discoveries made during employment shall be the exclusive property of the Company. ADDITIONAL INFORMATION: FINANCIAL PERFORMANCE SUMMARY Q4 2024 revenue exceeded projections by 12%, reaching $3.2M. Cost optimization initiatives reduced operational expenses by 8%. The board approved a $500K investment in new data analytics infrastructure for 2025. 
PRODUCT LAUNCH TIMELINE The AI-powered customer analytics platform will launch Q2 2025. Marketing budget allocated: $200K for digital campaigns. Expected customer acquisition target: 150 new enterprise clients in the first quarter post-launch. """ doc = Document(raw_text=mixed_content_text) # Define a LabelConcept with justifications and references enabled document_classification_concept = LabelConcept( name="Document Classification with Evidence", description="Classify this document type and provide reasoning for the classification", labels=[ "Employment Contract", "NDA", "Consulting Agreement", "Service Agreement", "Partnership Agreement", "Other", ], classification_type="multi_class", # a single label is always returned add_justifications=True, # enable justifications to understand classification reasoning justification_depth="comprehensive", # provide detailed reasoning justification_max_sents=5, # allow up to 5 sentences for justification add_references=True, # include references to source text reference_depth="paragraphs", # reference specific paragraphs that informed classification singular_occurrence=True, # expect only one classification result ) # Attach the concept to the document doc.add_concepts([document_classification_concept]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract the concept from the document document_classification_concept = llm.extract_concepts_from_document(doc)[0] # Display the classification results with evidence if document_classification_concept.extracted_items: item = document_classification_concept.extracted_items[0] print("=== DOCUMENT CLASSIFICATION RESULTS ===") print(f"Classification: {item.value[0]}") print("\nJustification:") print(f"{item.justification}") print("\nEvidence from document:") for i, 
paragraph in enumerate(item.reference_paragraphs, 1): print(f"{i}. {paragraph.raw_text}") else: print("No classification could be determined - none of the predefined labels apply") # This example demonstrates how justifications help explain why the LLM # chose a specific classification and how references show which parts # of the document informed that decision 🎯 Document Aspect Analysis --------------------------- "LabelConcept" can be used to classify extracted "Aspect" instances, providing a powerful way to analyze and categorize specific information that has been extracted from documents. This approach allows you to first extract relevant content using aspects, then apply classification logic to those extracted items. Here's an example that demonstrates using "LabelConcept" to classify the financial risk level of extracted financial obligations from legal contracts: # ContextGem: Aspect Analysis with LabelConcept import os from contextgem import Aspect, Document, DocumentLLM, LabelConcept # Create a Document object from contract text contract_text = """ SOFTWARE DEVELOPMENT AGREEMENT ... SECTION 5. PAYMENT TERMS Client shall pay Developer a total fee of $150,000 for the complete software development project, payable in three installments: $50,000 upon signing, $50,000 at milestone completion, and $50,000 upon final delivery. ... SECTION 8. MAINTENANCE AND SUPPORT Following project completion, Developer shall provide 12 months of maintenance and support services at a rate of $5,000 per month, totaling $60,000 annually. ... SECTION 12. PENALTY CLAUSES In the event of project delay beyond the agreed timeline, Developer shall pay liquidated damages of $2,000 per day of delay, with a maximum penalty cap of $50,000. ... SECTION 15. INTELLECTUAL PROPERTY LICENSING Client agrees to pay ongoing licensing fees of $10,000 annually for the use of Developer's proprietary frameworks and libraries integrated into the software solution. ... SECTION 18. 
TERMINATION COSTS Should Client terminate this agreement without cause, Client shall pay Developer 75% of all remaining unpaid fees, estimated at approximately $100,000 based on current project status. ... """ doc = Document(raw_text=contract_text) # Define a LabelConcept to classify the financial risk level of the obligations risk_classification_concept = LabelConcept( name="Client Financial Risk Level", description=( "Classify the financial risk level for the Client's financial obligations based on:\n" "- Amount size and impact on Client's cash flow\n" "- Payment timing and predictability for the Client\n" "- Penalty or liability exposure for the Client\n" "- Ongoing vs. one-time obligations for the Client" ), labels=["Low Risk", "Moderate Risk", "High Risk", "Critical Risk"], classification_type="multi_class", add_justifications=True, justification_depth="comprehensive", # provide a comprehensive justification justification_max_sents=10, # set an adequate justification length singular_occurrence=True, # global risk level for the client's financial obligations ) # Define Aspect containing the concept financial_obligations_aspect = Aspect( name="Client Financial Obligations", description="Financial obligations that the Client must fulfill under the contract", concepts=[risk_classification_concept], ) # Attach the aspect to the document doc.add_aspects([financial_obligations_aspect]) # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1-mini", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # Extract all data from the document doc = llm.extract_all(doc) # Get the extracted aspect and concept financial_obligations_aspect = doc.get_aspect_by_name( "Client Financial Obligations" ) # or `doc.aspects[0]` risk_classification_concept = financial_obligations_aspect.get_concept_by_name( "Client Financial Risk Level" 
) # or `financial_obligations_aspect.concepts[0]` # Display the extracted information print("Extracted Client Financial Obligations:") for extracted_item in financial_obligations_aspect.extracted_items: print(f"- {extracted_item.value}") if risk_classification_concept.extracted_items: assert ( len(risk_classification_concept.extracted_items) == 1 ) # as we have set `singular_occurrence=True` on the concept risk_item = risk_classification_concept.extracted_items[0] print(f"\nClient Financial Risk Level: {risk_item.value[0]}") print(f"Justification: {risk_item.justification}") else: print("\nRisk level could not be determined") 📊 Extracted Items ================== When a "LabelConcept" is extracted, it is populated with **a list of extracted items** accessible through the ".extracted_items" property. Each item is an instance of the "_LabelItem" class with the following attributes: +----------------------+----------------------+--------------------------------------------------------------+ | Attribute | Type | Description | |======================|======================|==============================================================| | "value" | list[str] | List of selected labels (always a list for API consistency, | | | | even for multi-class with single selection) | +----------------------+----------------------+--------------------------------------------------------------+ | "justification" | str | Explanation of why these labels were selected (only if | | | | "add_justifications=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_paragrap | list["Paragraph"] | List of paragraph objects that informed the classification | | hs" | | (only if "add_references=True") | +----------------------+----------------------+--------------------------------------------------------------+ | "reference_sentence | list["Sentence"] | List of sentence objects that informed the classification | | s" | | 
(only if "add_references=True" and | | | | "reference_depth="sentences"") | +----------------------+----------------------+--------------------------------------------------------------+ 💡 Best Practices ================= Here are some best practices to optimize your use of "LabelConcept": * **Choose meaningful labels**: Use clear, distinct labels that cover your classification needs without overlap. * **Provide clear descriptions**: Explain what each classification represents and when each label should be applied. * **Consider label granularity**: Balance between too few labels (insufficient precision) and too many labels (classification complexity). * **For multi-class classification**: Consider including a general "other" label (like "Other", "N/A", "Mixed", etc.) since a label is always returned, even when none of the specific labels perfectly fit the content, unless your labels are broad enough to cover all cases, or you know that the classified content always falls under one of the predefined labels without edge cases. * **For multi-label classification**: Design your workflow to handle cases where none of the predefined labels apply (resulting in empty "extracted_items"), as this classification type can return zero labels. * **Use appropriate classification type**: Set "classification_type="multi_class"" for mutually exclusive categories where exactly one choice is required, "classification_type="multi_label"" for potentially overlapping attributes where zero, one, or multiple labels can apply. * **Enable justifications**: Use "add_justifications=True" to understand and validate classification decisions, especially for complex or ambiguous content. # ==== pipelines/extraction_pipelines ==== Extraction Pipelines ******************** "ExtractionPipeline" is a powerful component that enables you to create reusable collections of predefined aspects and concepts for consistent document analysis. 
Pipelines serve as templates that can be applied to multiple documents, ensuring standardized data extraction across your application. 📝 Overview =========== Extraction pipelines package common extraction patterns into reusable units, allowing you to: * **Standardize document processing**: Define a consistent set of aspects and concepts once, then apply them to multiple documents * **Create reusable templates**: Build domain-specific pipelines (e.g., contract analysis, invoice processing, report analysis) * **Ensure consistent analysis**: Maintain uniform extraction criteria across document batches * **Simplify workflow management**: Organize complex extraction workflows into manageable, reusable components Pipelines are particularly valuable when processing multiple documents of the same type, where you need to extract the same categories of information consistently. ⭐ Key Features =============== Template-Based Extraction ------------------------- Pipelines act as extraction templates that define what information to extract from documents. Once created, a pipeline can be assigned to any number of documents, ensuring consistent analysis criteria. Aspect and Concept Organization ------------------------------- Pipelines can contain both: * **Aspects**: For extracting document sections and organizing content hierarchically * **Concepts**: For extracting specific data points with intelligent inference This allows you to create comprehensive extraction workflows that combine broad content organization with detailed data extraction. Reusability and Scalability --------------------------- A single pipeline can be applied to multiple documents, making it ideal for batch processing, automated workflows, and applications that need to process similar document types repeatedly. 
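The template idea described above can be pictured with a small plain-Python model. This is only a conceptual sketch: "ToyPipeline", "ToyDocument", and "assign" are hypothetical names invented for illustration and are not part of ContextGem's API, and the copy-on-assign behavior is an assumption of the toy model rather than a statement about how ContextGem works internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyPipeline:
    """Hypothetical stand-in for a reusable extraction template."""

    aspects: tuple[str, ...] = ()
    concepts: tuple[str, ...] = ()


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document awaiting extraction."""

    raw_text: str
    aspects: list[str] = field(default_factory=list)
    concepts: list[str] = field(default_factory=list)

    def assign(self, pipeline: ToyPipeline) -> None:
        # Copy the template's definitions so each document owns its own lists
        # (an assumption of this toy model).
        self.aspects = list(pipeline.aspects)
        self.concepts = list(pipeline.concepts)


# Define the extraction criteria once...
nda_template = ToyPipeline(
    aspects=("Obligations", "Liability"),
    concepts=("Effective date", "Governing law"),
)

# ...then apply the same template to any number of documents.
docs = [ToyDocument("First NDA text..."), ToyDocument("Second NDA text...")]
for doc in docs:
    doc.assign(nda_template)

# Check that every document carries the same extraction criteria.
same_config = all(
    (d.aspects, d.concepts) == (docs[0].aspects, docs[0].concepts) for d in docs
)
print(same_config)
```

The point of the sketch is the shape of the workflow: the criteria live in one place, and each document receives an identical configuration, which is what makes batch results comparable.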
💻 Basic Usage ============== Simple Pipeline Creation ------------------------ Here's how to create and use a basic extraction pipeline: from contextgem import ( Aspect, BooleanConcept, DateConcept, Document, ExtractionPipeline, StringConcept, ) # Create a pipeline for NDA (Non-Disclosure Agreement) review nda_pipeline = ExtractionPipeline( aspects=[ Aspect( name="Confidential information", description="Clauses defining the confidential information", ), Aspect( name="Exclusions", description="Clauses defining exclusions from confidential information", ), Aspect( name="Obligations", description="Clauses defining confidentiality obligations", ), Aspect( name="Liability", description="Clauses defining liability for breach of the agreement", ), # ... Add more aspects as needed ], concepts=[ StringConcept( name="Anomaly", description="Anomaly in the contract, e.g. out-of-context or nonsensical clauses", llm_role="reasoner_text", add_references=True, # Add references to the source text reference_depth="sentences", # Reference at the sentence level add_justifications=True, # Add justifications for the anomaly justification_depth="balanced", # Balanced level of justification detail justification_max_sents=5, # Maximum number of sentences in the justification ), BooleanConcept( name="Is mutual", description="Whether the NDA is mutual (bidirectional) or one-way", singular_occurrence=True, llm_role="reasoner_text", # Use the reasoner role for this concept ), DateConcept( name="Effective date", description="The date when the NDA agreement becomes effective", singular_occurrence=True, ), StringConcept( name="Term", description="The term of the NDA", ), StringConcept( name="Governing law", description="The governing law of the agreement", singular_occurrence=True, ), # ... 
Add more concepts as needed ], ) # Assign the pipeline to the NDA document nda_document = Document(raw_text="[NDA text]") nda_document.assign_pipeline(nda_pipeline) # Now the document is ready for processing with the NDA review pipeline! # The document can be processed to extract the defined aspects and concepts # Extract all aspects and concepts from the NDA using an LLM group # with LLMs with roles "extractor_text" and "reasoner_text". # llm_group.extract_all(nda_document) Pipeline Assignment to Documents -------------------------------- Once created, pipelines can be easily assigned to documents: from contextgem import Document, ExtractionPipeline # Create your pipeline my_pipeline = ExtractionPipeline(aspects=[...], concepts=[...]) # Create documents doc1 = Document(raw_text="First document content...") doc2 = Document(raw_text="Second document content...") # Assign the same pipeline to multiple documents doc1.assign_pipeline(my_pipeline) doc2.assign_pipeline(my_pipeline) # Now both documents have the same extraction configuration ⚙️ Parameters ============= When creating an "ExtractionPipeline", you can configure the following parameters: +----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "aspects" | "list[Aspect]" | "[]" | *Optional*. List of "Aspect" instances to extract | | | | | from documents. Aspects represent structural | | | | | categories of information and can contain their | | | | | own sub-aspects and concepts for detailed | | | | | analysis. See Aspect Extraction for more | | | | | information. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "concepts" | "list[_Concept | "[]" | *Optional*. 
List of "_Concept" instances to | | | ]" | | identify within or infer from documents. These are | | | | | document-level concepts that apply to the entire | | | | | document content. See supported concept types in | | | | | Supported Concepts. | +----------------------+-----------------+-----------------+----------------------------------------------------+ 📊 Pipeline Assignment ====================== The "assign_pipeline()" method is used to apply a pipeline to a document. This method: * **Assigns aspects and concepts**: Transfers the pipeline's aspects and concepts to the document * **Validates compatibility**: Ensures no conflicts with existing document configuration Assignment Options ------------------ # Basic assignment (will raise error if document already has aspects/concepts) document.assign_pipeline(my_pipeline) # Overwrite existing configuration document.assign_pipeline(my_pipeline, overwrite_existing=True) 🚀 Advanced Usage ================= Multi-Document Processing ------------------------- Pipelines excel at processing multiple documents of the same type. 
Here's a comprehensive example: # Advanced Usage Example - analyzing multiple documents with a single pipeline, # with different LLMs, concurrency and cost tracking import os from contextgem import ( Aspect, DateConcept, Document, DocumentLLM, DocumentLLMGroup, ExtractionPipeline, JsonObjectConcept, JsonObjectExample, LLMPricing, NumericalConcept, RatingConcept, StringConcept, StringExample, ) # Construct documents # Document 1 - Consultancy Agreement (shortened for brevity) doc1 = Document( raw_text=( "Consultancy Agreement\n" "This agreement between Company A (Supplier) and Company B (Customer)...\n" "The term of the agreement is 1 year from the Effective Date...\n" "The Supplier shall provide consultancy services as described in Annex 2...\n" "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n" "All intellectual property created during the provision of services shall belong to the Customer...\n" "This agreement is governed by the laws of Norway...\n" "Annex 1: Data processing agreement...\n" "Annex 2: Statement of Work...\n" "Annex 3: Service Level Agreement...\n" ), ) # Document 2 - Service Level Agreement (shortened for brevity) doc2 = Document( raw_text=( "Service Level Agreement\n" "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n" "The agreement shall commence on January 1, 2023 and continue for 2 years...\n" "The Provider shall deliver IT support services as outlined in Schedule A...\n" "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n" "The Provider guarantees [99.9%] uptime for all critical systems...\n" "Either party may terminate with 60 days written notice...\n" "This agreement is governed by the laws of California...\n" "Schedule A: Service Descriptions...\n" "Schedule B: Response Time Requirements...\n" ), ) # Create a reusable extraction pipeline contract_pipeline = ExtractionPipeline() # Define aspects and aspect-level concepts in the pipeline # 
Concepts in the aspects will be extracted from the extracted aspect context contract_pipeline.aspects = [ # or use .add_aspects([...]) Aspect( name="Contract Parties", description="Clauses defining the parties to the agreement", concepts=[ # define aspect-level concepts, if any StringConcept( name="Party names and roles", description="Names of all parties entering into the agreement and their roles", examples=[ # optional StringExample( content="X (Client)", # guidance regarding the expected output format ) ], ) ], ), Aspect( name="Term", description="Clauses defining the term of the agreement", concepts=[ NumericalConcept( name="Contract term", description="The term of the agreement in years", numeric_type="int", # or "float", or "any" for auto-detection add_references=True, # extract references to the source text reference_depth="paragraphs", ) ], ), ] # Define document-level concepts # Concepts in the document will be extracted from the whole document content contract_pipeline.concepts = [ # or use .add_concepts() DateConcept( name="Effective date", description="The effective date of the agreement", ), StringConcept( name="Contract type", description="The type of agreement", llm_role="reasoner_text", # for this concept, we use a more advanced LLM for reasoning ), StringConcept( name="Governing law", description="The law that governs the agreement", ), JsonObjectConcept( name="Attachments", description="The titles and concise descriptions of the attachments to the agreement", structure={"title": str, "description": str | None}, examples=[ # optional JsonObjectExample( # guidance regarding the expected output format content={ "title": "Appendix A", "description": "Code of conduct", } ), ], ), RatingConcept( name="Duration adequacy", description="Contract duration adequacy considering the subject matter and best practices.", llm_role="reasoner_text", # for this concept, we use a more advanced LLM for reasoning rating_scale=(1, 10), add_justifications=True, # add 
justifications for the rating justification_depth="balanced", # provide a balanced justification justification_max_sents=3, ), ] # Assign pipeline to the documents # You can re-use the same pipeline for multiple documents doc1.assign_pipeline( contract_pipeline ) # assigns pipeline aspects and concepts to the document doc2.assign_pipeline( contract_pipeline ) # assigns pipeline aspects and concepts to the document # Create an LLM group for data extraction and reasoning llm_extractor = DocumentLLM( model="openai/gpt-4o-mini", # or any other LLM from e.g. Anthropic, etc. api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"], # your API key role="extractor_text", # signifies the LLM is used for data extraction tasks pricing_details=LLMPricing( # optional, for costs calculation input_per_1m_tokens=0.150, output_per_1m_tokens=0.600, ), # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider ) llm_reasoner = DocumentLLM( model="openai/o3-mini", # or any other LLM from e.g. Anthropic, etc. api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"], # your API key role="reasoner_text", # signifies the LLM is used for reasoning tasks pricing_details=LLMPricing( # optional, for costs calculation input_per_1m_tokens=1.10, output_per_1m_tokens=4.40, ), # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider ) # The LLM group is used for all extraction tasks within the pipeline llm_group = DocumentLLMGroup(llms=[llm_extractor, llm_reasoner]) # Extract all information from the documents at once doc1 = llm_group.extract_all( doc1, use_concurrency=True ) # use concurrency to speed up extraction doc2 = llm_group.extract_all( doc2, use_concurrency=True ) # use concurrency to speed up extraction # Or use async variants .extract_all_async(...) 
# Get the extracted data print("Some extracted data from doc 1:") print("Contract Parties > Party names and roles:") print( doc1.get_aspect_by_name("Contract Parties") .get_concept_by_name("Party names and roles") .extracted_items ) print("Attachments:") print(doc1.get_concept_by_name("Attachments").extracted_items) # ... print("\nSome extracted data from doc 2:") print("Term > Contract term:") print( doc2.get_aspect_by_name("Term") .get_concept_by_name("Contract term") .extracted_items[0] .value ) print("Duration adequacy:") print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].value) print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].justification) # ... # Output processing costs (requires setting the pricing details for each LLM) print("\nProcessing costs:") print(llm_group.get_cost()) Pipeline Serialization ---------------------- Pipelines can be serialized for storage and later reuse: # Serialize the pipeline pipeline_json = pipeline.to_json() # or to_dict() / to_disk() # Deserialize the pipeline pipeline_deserialized = ExtractionPipeline.from_json( pipeline_json ) # or from_dict() / from_disk() 💡 Best Practices ================= Pipeline Design --------------- * **Domain-specific organization**: Create pipelines tailored to specific document types (contracts, invoices, reports, etc.) 
* **Logical grouping**: Group related aspects and concepts together for coherent analysis * **Reusable templates**: Design pipelines to be generic enough for reuse across similar documents Concept Placement Strategy -------------------------- * **Document-level concepts**: Place concepts that apply to the entire document in the pipeline's "concepts" list * **Aspect-level concepts**: Place concepts that are specific to particular document sections within the relevant aspects * **Avoid duplication**: Don't create similar concepts at both document and aspect levels 🎯 Example Use Cases ==================== Invoice Processing Pipeline --------------------------- invoice_pipeline = ExtractionPipeline( concepts=[ StringConcept(name="Vendor Name", description="Name of the vendor/supplier"), StringConcept(name="Invoice Number", description="Unique invoice identifier"), DateConcept(name="Invoice Date", description="Date the invoice was issued"), DateConcept(name="Due Date", description="Payment due date"), NumericalConcept(name="Total Amount", description="Total invoice amount"), StringConcept(name="Currency", description="Currency of the invoice"), ] ) Research Paper Analysis Pipeline -------------------------------- research_pipeline = ExtractionPipeline( aspects=[ Aspect(name="Abstract", description="Paper abstract and summary"), Aspect(name="Methodology", description="Research methods and approach"), Aspect(name="Results", description="Findings and outcomes"), Aspect(name="Conclusions", description="Conclusions and implications"), ], concepts=[ StringConcept(name="Research Field", description="Primary research domain"), StringConcept(name="Keywords", description="Paper keywords and topics"), DateConcept(name="Publication Date", description="When the paper was published"), RatingConcept(name="Novelty Score", description="Novelty of the research", rating_scale=(1, 10)), ] ) ⚡ Pipeline Reuse Benefits ========================== * **Consistency**: Ensures all documents are 
processed with identical extraction criteria * **Efficiency**: Eliminates the need to recreate aspects and concepts for each document * **Maintainability**: Changes to extraction logic only need to be made in one place 📚 Related Documentation ======================== * Aspect Extraction - Learn about aspect extraction * Supported Concepts - Explore available concept types and how to use them * Advanced usage examples - See advanced pipeline usage examples * Extraction Methods - Understand LLM extraction methods * Serializing objects and results - Learn about pipeline serialization and storage # ==== llms/supported_llms ==== Supported LLMs ************** ContextGem supports all LLM providers and models available through the LiteLLM integration. This means you can use models from major cloud providers like OpenAI, Anthropic, Google, Azure, and xAI, as well as run local models through providers like Ollama and LM Studio. ContextGem works with both types of LLM architectures: * Reasoning/CoT-capable models (e.g., "openai/o4-mini", "ollama_chat/deepseek-r1:32b") * Non-reasoning models (e.g., "openai/gpt-4.1", "ollama_chat/llama3.3:70b") For a complete list of supported providers, see the LiteLLM Providers documentation. 
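The model identifiers above follow a provider-prefixed pattern (provider, a slash, then the model name). A minimal sketch of how such an identifier can be taken apart is shown below. The helper names "split_model_id" and "is_local" are hypothetical and are not part of ContextGem or LiteLLM; the set of local prefixes is an assumption based only on the local providers named in this documentation.

```python
# Hypothetical helpers for illustration only (not a ContextGem or LiteLLM API).

# Assumption: prefixes that this documentation refers to as local providers.
LOCAL_PROVIDERS = {"ollama", "ollama_chat", "lm_studio"}


def split_model_id(model: str) -> tuple[str, str]:
    """Split a provider-prefixed identifier at the first slash."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


def is_local(model: str) -> bool:
    """Return True if the provider prefix is in the assumed local set."""
    return split_model_id(model)[0] in LOCAL_PROVIDERS


for model_id in (
    "openai/gpt-4.1",
    "ollama_chat/llama3.3:70b",
    "lm_studio/mistralai/mistral-small-3.2",
):
    provider, name = split_model_id(model_id)
    print(f"{model_id}: provider={provider}, model={name}, local={is_local(model_id)}")
```

Splitting at the first slash only keeps model names that themselves contain a slash (such as the LM Studio example) intact.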
☁️ Cloud-based LLMs
===================

You can initialize cloud-based LLMs by specifying the provider and model name in the format "<provider>/<model_name>":

Using cloud LLM providers

   from contextgem import DocumentLLM

   # Pattern for using any cloud LLM provider
   llm = DocumentLLM(
       model="<provider>/<model_name>",
       api_key="",
   )

   # Example - Using OpenAI LLM
   llm_openai = DocumentLLM(
       model="openai/gpt-4.1-mini",
       api_key="",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using Azure OpenAI LLM
   llm_azure_openai = DocumentLLM(
       model="azure/o4-mini",
       api_key="",
       api_version="",
       api_base="",
       # see DocumentLLM API reference for all configuration options
   )

💻 Local LLMs
=============

For local LLMs, you'll need to specify the provider, model name, and the appropriate API base URL:

Using local LLM providers

   from contextgem import DocumentLLM

   local_llm = DocumentLLM(
       model="ollama_chat/<model_name>",
       api_base="http://localhost:11434",  # Default Ollama endpoint
   )

   # Example - Using Llama 3.3 LLM via Ollama
   llm_llama = DocumentLLM(
       model="ollama_chat/llama3.3:70b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using DeepSeek R1 reasoning model via Ollama
   llm_deepseek = DocumentLLM(
       model="ollama_chat/deepseek-r1:32b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

Note: **Vision Models with Ollama**: For local vision models that process images, use the "ollama/" prefix instead of "ollama_chat/", as the latter does not yet support image inputs. For more details, see the relevant Ollama GitHub issue and LiteLLM GitHub issue.

Note: **LM Studio Connection Error**: If you encounter a connection error ("litellm.APIError: APIError: Lm_studioException - Connection error") when using LM Studio, check that you have provided a dummy API key.
While API keys are usually not expected for local models, this is a specific case where LM Studio requires one:

LM Studio with dummy API key

   from contextgem import DocumentLLM

   llm = DocumentLLM(
       model="lm_studio/mistralai/mistral-small-3.2",
       api_base="http://localhost:1234/v1",
       api_key="dummy-key",  # dummy key to avoid connection error
   )

   # This is a known issue with calling LM Studio API in litellm:
   # https://github.com/openai/openai-python/issues/961

For a complete list of configuration options available when initializing DocumentLLM instances, see the next section Configuring LLM(s).

# ==== llms/llm_config ====

Configuring LLM(s)
******************

This guide explains how to configure "DocumentLLM" instances to process documents using various LLM providers. ContextGem uses LiteLLM under the hood, providing uniform access to a wide range of models. For more information on supported LLMs, see Supported LLMs.
🚀 Basic Configuration
======================

The minimum configuration for a cloud-based LLM requires the "model" parameter and an "api_key":

Using a cloud-based LLM

   from contextgem import DocumentLLM

   # Pattern for using any cloud LLM provider
   llm = DocumentLLM(
       model="<provider>/<model_name>",
       api_key="",
   )

   # Example - Using OpenAI LLM
   llm_openai = DocumentLLM(
       model="openai/gpt-4.1-mini",
       api_key="",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using Azure OpenAI LLM
   llm_azure_openai = DocumentLLM(
       model="azure/o4-mini",
       api_key="",
       api_version="",
       api_base="",
       # see DocumentLLM API reference for all configuration options
   )

For local models, you usually need to specify the "api_base" instead of the API key:

Using a local LLM

   from contextgem import DocumentLLM

   local_llm = DocumentLLM(
       model="ollama_chat/<model_name>",
       api_base="http://localhost:11434",  # Default Ollama endpoint
   )

   # Example - Using Llama 3.3 LLM via Ollama
   llm_llama = DocumentLLM(
       model="ollama_chat/llama3.3:70b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using DeepSeek R1 reasoning model via Ollama
   llm_deepseek = DocumentLLM(
       model="ollama_chat/deepseek-r1:32b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

Note: **LM Studio Connection Error**: If you encounter a connection error ("litellm.APIError: APIError: Lm_studioException - Connection error") when using LM Studio, check that you have provided a dummy API key.
While API keys are usually not expected for local models, this is a specific case where LM Studio requires one:

LM Studio with dummy API key

   from contextgem import DocumentLLM

   llm = DocumentLLM(
       model="lm_studio/mistralai/mistral-small-3.2",
       api_base="http://localhost:1234/v1",
       api_key="dummy-key",  # dummy key to avoid connection error
   )

   # This is a known issue with calling LM Studio API in litellm:
   # https://github.com/openai/openai-python/issues/961

📝 Configuration Parameters
===========================

The "DocumentLLM" class accepts the following parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+ | Parameter | Type | Default Value | Description | |======================|=================|=================|====================================================| | "model" | "str" | (Required) | Model identifier in format | | | | | "<provider>/<model_name>". See LiteLLM Providers | | | | | for all supported providers. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "api_key" | "str | None" | "None" | API key for authentication. Required for most | | | | | cloud providers but not for local models. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "api_base" | "str | None" | "None" | Base URL of the API endpoint. Required for local | | | | | models and some cloud providers (e.g. Azure | | | | | OpenAI). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "deployment_id" | "str | None" | "None" | Deployment ID for the model. Primarily used with | | | | | Azure OpenAI.
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "api_version" | "str | None" | "None" | API version. Primarily used with Azure OpenAI. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "role" | "str" | ""extractor_te | Role type for the LLM. Values: ""extractor_text"", | | | | xt"" | ""reasoner_text"", ""extractor_vision"", | | | | | ""reasoner_vision"", ""extractor_multimodal"", | | | | | ""reasoner_multimodal"". The role parameter is an | | | | | abstraction that can be explicitly assigned to | | | | | extraction components (aspects and concepts) in | | | | | the pipeline. ContextGem then routes extraction | | | | | tasks based on these assigned roles, matching | | | | | components with LLMs of the same role. This allows | | | | | you to structure your pipeline with different | | | | | models for different tasks (e.g., using simpler | | | | | models for basic extractions and more powerful | | | | | models for complex reasoning). For more details, | | | | | see 🏷️ LLM Roles. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "system_message" | "str | None" | "None" | If not provided (or set to None), ContextGem | | | | | automatically sets a default system message | | | | | optimized for extraction tasks, rendered based on | | | | | the configured "output_language". This default | | | | | system message template can be found here in the | | | | | source code. Note that for certain models (such as | | | | | OpenAI's o1-preview), system messages are not | | | | | supported and will be ignored. Overriding this is | | | | | typically only necessary for advanced use cases, | | | | | such as custom priming or non- extraction tasks. 
| | | | | For simple chat interactions, consider setting | | | | | "system_message=''" to disable the default | | | | | entirely (meaning no system message will be sent). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "max_tokens" | "int" | "4096" | Maximum tokens in the generated response | | | | | (applicable to most models). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "max_completion_tok | "int" | "16000" | Maximum tokens for output completions in reasoning | | ens" | | | (CoT-capable) models. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "reasoning_effort" | "str | None" | "None" | Reasoning effort for reasoning (CoT-capable) | | | | | models. Values: ""minimal"" (gpt-5 models only), | | | | | ""low"", ""medium"", ""high"". | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "timeout" | "int" | "120" | Timeout in seconds for LLM API calls. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "num_retries_failed | "int" | "3" | Number of retries when LLM request fails. | | _request" | | | | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "max_retries_failed | "int" | "0" | LLM provider-specific retry count for failed | | _request" | | | requests. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "max_retries_invali | "int" | "3" | Number of retries when LLM request succeeds but | | d_data" | | | returns invalid data (JSON parsing and validation | | | | | fails). 
| +----------------------+-----------------+-----------------+----------------------------------------------------+ | "pricing_details" | "LLMPricing | | "None" | "LLMPricing" object with pricing details for cost | | | None" | | calculation. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "auto_pricing" | "bool" | "False" | Enable automatic cost calculation using "genai- | | | | | prices" based on the configured "model". Mutually | | | | | exclusive with "pricing_details". Not supported | | | | | for local models (e.g., "ollama/", "ollama_chat/", | | | | | "lm_studio/"). | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "auto_pricing_refre | "bool" | "False" | When "auto_pricing" is enabled, allow "genai- | | sh" | | | prices" to auto-refresh its cached pricing data at | | | | | runtime. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "is_fallback" | "bool" | "False" | Indicates whether the LLM is a fallback model. | | | | | Fallback LLMs are optionally assigned to the | | | | | primary LLM instance and are used when the primary | | | | | LLM fails. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "fallback_llm" | "DocumentLLM | | "None" | "DocumentLLM" to use as fallback if current one | | | None" | | fails. Must have the same role as the primary LLM. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "output_language" | "str" | ""en"" | Language for output text. Values: ""en"" or | | | | | ""adapt"" (adapts to document language). Setting | | | | | value to ""adapt"" ensures that the text output | | | | | (e.g. justifications, conclusions, explanations) | | | | | is in the same language as the document. 
This is | | | | | particularly useful when working with non-English | | | | | documents. For example, if you're extracting | | | | | anomalies from a contract in Spanish, setting | | | | | "output_language="adapt"" ensures that anomaly | | | | | justifications are also in Spanish, making them | | | | | immediately understandable by local end users | | | | | reviewing the document. This parameter applies | | | | | only when the default system message is used. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "temperature" | "float | None" | "0.3" | Sampling temperature (0.0 to 1.0) controlling | | | | | response creativity. Lower values produce more | | | | | predictable outputs, higher values generate more | | | | | varied responses. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "top_p" | "float | None" | "0.3" | Nucleus sampling value (0.0 to 1.0) controlling | | | | | output focus/randomness. Lower values make output | | | | | more deterministic, higher values produce more | | | | | diverse outputs. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "seed" | "int | None" | "None" | Seed for random number generation to help produce | | | | | more consistent outputs across multiple runs. When | | | | | set to a specific integer value, the LLM will | | | | | attempt to use this seed for sampling operations. | | | | | However, deterministic output is still not | | | | | guaranteed even with the same seed, as other | | | | | factors may influence the model's response. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "tools" | "list[dict] | | "None" | OpenAI-compatible tool schema used only for chat | | | None" | | via "DocumentLLM.chat(...)"/".chat_async(...)". 
| | | | | Each tool must have a registered Python handler | | | | | decorated with "@register_tool" and available in | | | | | scope when creating the LLM. Handlers must return | | | | | a string; for structured data, serialize it (e.g., | | | | | with "json.dumps") before returning. Ignored by | | | | | extraction methods. For more details, see 🛠️ Chat | | | | | with Tools. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "tool_choice" | "str | dict | | "None" | Tool choice control passed through to the provider | | | None" | | during chat. Ignored by extraction methods. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "parallel_tool_call | "bool | None" | "None" | Enable parallel tool calls during chat tool usage, | | s" | | | if supported by the model/provider. Ignored by | | | | | extraction methods. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "tool_max_rounds" | "int" | "10" | Safety limit on the number of tool-execution | | | | | rounds per chat request to prevent infinite loops. | +----------------------+-----------------+-----------------+----------------------------------------------------+ | "async_limiter" | "AsyncLimiter" | "AsyncLimiter( | Relevant when concurrency is enabled for | | | | 3, 10)" | extraction tasks. Controls frequency of async LLM | | | | | API requests for concurrent tasks. Defaults to | | | | | allowing 3 acquisitions per 10-second period to | | | | | prevent rate limit issues. See aiolimiter | | | | | documentation for AsyncLimiter configuration | | | | | details. See Optimizing for Speed for an example | | | | | of how to easily set up concurrency for | | | | | extraction. 
| +----------------------+-----------------+-----------------+----------------------------------------------------+

Warning: **Auto-pricing accuracy**: When using "auto_pricing=True", cost estimates are approximate and will not be 100% accurate, because model providers do not publish exact price information for their APIs in a format that can be reliably processed. See Pydantic's genai-prices for more details.

💡 Advanced Configuration Examples
==================================

🔄 Configuring a Fallback LLM
-----------------------------

You can set up a fallback LLM that will be used if the primary LLM fails:

Configuring a fallback LLM

   from contextgem import DocumentLLM

   # Primary LLM
   primary_llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="",
       role="extractor_text",  # default role
   )

   # Fallback LLM
   fallback_llm = DocumentLLM(
       model="anthropic/claude-3-5-haiku",
       api_key="",
       role="extractor_text",  # Must match the primary LLM's role
       is_fallback=True,
   )

   # Assign fallback LLM to primary
   primary_llm.fallback_llm = fallback_llm

   # Then use the primary LLM as usual
   # document = primary_llm.extract_all(document)

💰 Setting Up Cost Tracking
---------------------------

You can configure pricing parameters to track costs:

Setting up LLM cost tracking

   from contextgem import DocumentLLM, LLMPricing

   # Option 1: Set up an LLM with pricing details
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="",
       pricing_details=LLMPricing(
           input_per_1m_tokens=0.150,  # Cost per 1M input tokens
           output_per_1m_tokens=0.600,  # Cost per 1M output tokens
       ),
   )

   # Option 2: Set up an LLM with auto-pricing
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="",
       auto_pricing=True,
   )

   # Perform some extraction tasks

   # Later, you can check the cost
   cost_info = llm.get_cost()

🧠 Using Model-Specific Parameters
----------------------------------

For reasoning (CoT-capable) models (such as OpenAI's o1/o3/o4), you can set reasoning-specific parameters:
Using model-specific parameters

   from contextgem import DocumentLLM

   llm = DocumentLLM(
       model="openai/o3-mini",
       api_key="",
       max_completion_tokens=8000,  # Specific to reasoning (CoT-capable) models
       reasoning_effort="medium",  # Optional
   )

⚙️ Explicit Capability Declaration
----------------------------------

Model vision capabilities are automatically detected using "litellm.supports_vision()". If this function does not correctly identify your model's capabilities, ContextGem will typically issue a warning, and you can explicitly declare the capability by setting "_supports_vision=True" on the LLM instance:

   from contextgem import DocumentLLM

   # Example: Explicitly declare vision capability
   # Warning will be issued if automatic vision capability detection fails
   llm = DocumentLLM(
       model="some_provider/custom_vision_model",
       api_base="http://localhost:3456/v1",
       role="extractor_vision",
   )

   # Declare capability if automatic detection fails (warning was issued)
   llm._supports_vision = True

Warning: Explicit capability declarations should only be used when automatic capability detection fails. Incorrectly setting this flag may lead to unexpected behavior or API errors.
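As a supplement to the cost-tracking example in this section, the arithmetic implied by per-1M-token pricing can be sketched as follows. This is only an illustration of the formula, not ContextGem's internal implementation: the function "estimate_cost" is hypothetical, the token counts are made up, and how the library itself rounds or aggregates costs is not covered here. The two prices are the ones used in the pricing example of this guide.

```python
from decimal import Decimal

# Hypothetical helper for illustration; not part of the ContextGem API.


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_1m_tokens: float,
    output_per_1m_tokens: float,
) -> Decimal:
    """Estimate cost as tokens / 1,000,000 * price per 1M tokens, per direction."""
    million = Decimal(1_000_000)
    # Convert floats via str() to avoid binary floating-point noise in the result.
    input_cost = Decimal(input_tokens) / million * Decimal(str(input_per_1m_tokens))
    output_cost = Decimal(output_tokens) / million * Decimal(str(output_per_1m_tokens))
    return input_cost + output_cost


# Made-up token counts, with the prices from the pricing example (0.150 and 0.600).
cost = estimate_cost(
    input_tokens=200_000,
    output_tokens=50_000,
    input_per_1m_tokens=0.150,
    output_per_1m_tokens=0.600,
)
print(f"Estimated cost: {cost}")
```

With these numbers, the input side contributes 0.2 × 0.150 = 0.03 and the output side 0.05 × 0.600 = 0.03, for a total of 0.06 in whatever currency the prices are quoted.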
🤖🤖 LLM Groups
===============

For complex document processing, you can organize multiple LLMs with different roles into a group:

Using LLM group

   from contextgem import DocumentLLM, DocumentLLMGroup

   # Create LLMs with different roles
   text_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="",
       role="extractor_text",
       output_language="adapt",
   )

   text_reasoner = DocumentLLM(
       model="openai/o3-mini",
       api_key="",
       role="reasoner_text",
       max_completion_tokens=16000,
       reasoning_effort="high",
       output_language="adapt",
   )

   # Create a group
   llm_group = DocumentLLMGroup(
       llms=[text_extractor, text_reasoner],
       output_language="adapt",  # All LLMs in the group must share the same output language setting
   )

   # Then use the group as usual
   # document = llm_group.extract_all(document)

See a practical example of using an LLM group in 🔄 Using a Multi-LLM Pipeline to Extract Data from Several Documents.

📊 Accessing Usage and Cost Statistics
======================================

You can track input/output token usage and costs:

Tracking usage and cost

   from contextgem import DocumentLLM

   llm = DocumentLLM(
       model="anthropic/claude-3-5-haiku",
       api_key="",
       auto_pricing=True,  # or set `pricing_details=LLMPricing(...)` manually
   )

   # Perform some extraction tasks

   # Get usage statistics
   usage_info = llm.get_usage()

   # Get cost statistics
   cost_info = llm.get_cost()

   # Reset usage and cost statistics
   llm.reset_usage_and_cost()

   # The same methods are available for LLM groups, with optional filtering by LLM role
   # usage_info = llm_group.get_usage(llm_role="extractor_text")
   # cost_info = llm_group.get_cost(llm_role="extractor_text")
   # llm_group.reset_usage_and_cost(llm_role="extractor_text")

The usage statistics include not only token counts but also detailed information about each individual call made to the LLM.
You can access the call history, including prompts, responses, and timestamps:

Accessing detailed usage information

   from contextgem import DocumentLLM

   llm = DocumentLLM(
       model="openai/gpt-4.1",
       api_key="",
   )

   # Perform some extraction tasks

   usage_info = llm.get_usage()

   # Access the first usage container in the list (for the primary LLM)
   llm_usage = usage_info[0]

   # Get detailed call information
   for call in llm_usage.usage.calls:
       print(f"Prompt: {call.prompt}")
       print(f"Response: {call.response}")  # original, unprocessed response
       print(f"Sent at: {call.timestamp_sent}")
       print(f"Received at: {call.timestamp_received}")

# ==== llms/llm_extraction_methods ====

Extraction Methods
******************

This guide documents the extraction methods provided by the "DocumentLLM" and "DocumentLLMGroup" classes for extracting aspects and concepts from documents using large language models.

📄🧠 Complete Document Processing
=================================

"extract_all()"
---------------

Performs comprehensive extraction by processing a "Document" for all "Aspect" and "_Concept" instances. This is the most commonly used method for complete document analysis.

Note: See supported concept types in Supported Concepts. All public concept types inherit from the internal "_Concept" base class.

**Method Signature:**

   def extract_all(
       self,
       document: Document,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
       max_images_to_analyze_per_call: int = 0,
       raise_exception_on_extraction_error: bool = True,
   ) -> Document

Note: An async equivalent "extract_all_async()" is also available.
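The "max_items_per_call" and "max_paragraphs_to_analyze_per_call" parameters in the signature above both use "0" to mean "everything in a single LLM call" and a positive number to cap how much goes into each call. A minimal sketch of that chunking behavior is shown below. The function "plan_calls" is hypothetical and simplified: the real grouping also depends on which aspects and concepts share the same extraction parameters, which this sketch ignores. "Contract Parties" and "Term" come from the pipeline example in this documentation; the other aspect names are made up.

```python
# Hypothetical illustration of "0 means all in one call"; not ContextGem's implementation.


def plan_calls(items: list[str], max_per_call: int = 0) -> list[list[str]]:
    """Group items into per-call batches; 0 puts all items into one batch."""
    if max_per_call < 0:
        raise ValueError("max_per_call must be 0 or a positive integer")
    if not items:
        return []
    if max_per_call == 0:
        return [list(items)]
    return [
        items[start : start + max_per_call]
        for start in range(0, len(items), max_per_call)
    ]


aspect_names = ["Contract Parties", "Term", "Payment", "Termination", "Liability"]

# Default: one LLM call covering every item.
print(plan_calls(aspect_names))

# A limit of 2 yields three smaller calls, which keeps each prompt focused
# at the price of more LLM calls overall.
print(plan_calls(aspect_names, max_per_call=2))
```

The same trade-off applies in both directions: smaller batches keep each prompt focused and within the context window, while larger batches reduce the number of calls.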
**Parameters:** +-----------------+-----------------+------------+--------------------------------------------------------------+ | Parameter | Type | Default | Description | |=================|=================|============|==============================================================| | "document" | "Document" | (Required) | The document with attached "Aspect" and/or "_Concept" | | | | | instances to extract. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "overwrite_exi | "bool" | "False" | Whether to overwrite already processed "Aspect" and | | sting" | | | "_Concept" instances with newly extracted information. This | | | | | is particularly useful when reprocessing documents with | | | | | updated LLMs or extraction parameters. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "max_items_per | "int" | "0" | Maximum number of "Aspect" and/or "_Concept" instances with | | _call" | | | the same extraction parameters to process in a single LLM | | | | | call (single LLM prompt). "0" means all aspect and/or | | | | | concept instances with the same extraction params in one | | | | | call. Setting a limit is particularly useful for complex | | | | | tasks or long documents to prevent prompt overloading and | | | | | allow the LLM to focus on a smaller set of extraction tasks | | | | | at once. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "use_concurren | "bool" | "False" | Enable concurrent processing of multiple "Aspect" and/or | | cy" | | | "_Concept" instances. Can significantly reduce processing | | | | | time by executing multiple extraction tasks in parallel, | | | | | especially beneficial for documents with many aspects and | | | | | concepts. However, it might cause rate limit errors with LLM | | | | | providers.
When enabled, adjust the "async_limiter" on your | | | | | "DocumentLLM" to control request frequency (default is 3 | | | | | acquisitions per 10 seconds). For optimal results, combine | | | | | with "max_items_per_call=1" to maximize concurrency, | | | | | although this increases LLM API costs, as each | | | | | aspect/concept will be processed in a separate LLM call (LLM | | | | | prompt). See Optimizing for Speed for examples of | | | | | concurrency configuration. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "max_paragraph | "int" | "0" | Maximum paragraphs to include in a single LLM call (single | | s_to_analyze_p | | | LLM prompt). "0" means all paragraphs. This parameter is | | er_call" | | | crucial when working with long documents that exceed the | | | | | LLM's context window. By limiting the number of paragraphs | | | | | per call, you can ensure the LLM processes the document in | | | | | manageable segments while maintaining semantic coherence. | | | | | This prevents token limit errors and often improves | | | | | extraction quality by allowing the model to focus on smaller | | | | | portions of text at a time. For more details on handling | | | | | long documents, see Dealing with Long Documents. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "max_images_to | "int" | "0" | Maximum "Image" instances to analyze in a single LLM call | | _analyze_per_c | | | (single LLM prompt). "0" means all images. This parameter is | | all" | | | crucial when working with documents containing multiple | | | | | images that might exceed the LLM's context window. By | | | | | limiting the number of images per call, you can ensure the | | | | | LLM processes the document's visual content in manageable | | | | | batches. Relevant only when extracting document-level | | | | | concepts from document images.
See 🖼️ Concept Extraction |
|                 |                 |            | from Document (vision) for an example of extracting concepts |
|                 |                 |            | from document images.                                        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+

**Return Value:**

Returns the same "Document" instance passed as input, but with all
attached "Aspect" and "_Concept" instances populated with their
extracted items. The document's aspects and concepts will have their
"extracted_items" field populated with the extracted information, and
if applicable, "reference_paragraphs"/"reference_sentences" will be
set based on the extraction parameters. The exact structure of
references depends on the "reference_depth" setting of each aspect and
concept.

**Example Usage:**

Extracting all aspects and concepts from a document

   # ContextGem: Extracting All Aspects and Concepts from Document

   import os

   from contextgem import Aspect, Document, DocumentLLM, StringConcept


   # Sample text content
   text_content = """
   John Smith is a 30-year-old software engineer working at TechCorp.
   He has 5 years of experience in Python development and leads a team of 8 developers.
   His annual salary is $95,000 and he graduated from MIT with a Computer Science degree.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define aspects and concepts directly on the document
   doc.aspects = [
       Aspect(
           name="Professional Information",
           description="Information about the person's career, job, and work experience",
       )
   ]
   doc.concepts = [
       StringConcept(
           name="Person name",
           description="Full name of the person",
       )
   ]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract all aspects and concepts from the document
   processed_doc = llm.extract_all(doc)

   # Access extracted aspect information
   aspect = processed_doc.aspects[0]
   print(f"Aspect: {aspect.name}")
   print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")

   # Access extracted concept information
   concept = processed_doc.concepts[0]
   print(f"Concept: {concept.name}")
   print(f"Extracted value: {concept.extracted_items[0].value}")


📄 Aspect Extraction Methods
============================


"extract_aspects_from_document()"
---------------------------------

Extracts "Aspect" instances from a "Document".

**Method Signature:**

   def extract_aspects_from_document(
       self,
       document: Document,
       from_aspects: list[Aspect] | None = None,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
   ) -> list[Aspect]

Note:

  An async equivalent "extract_aspects_from_document_async()" is also
  available.
**Parameters:**

+-----------------+-----------------+------------+--------------------------------------------------------------+
| Parameter       | Type            | Default    | Description                                                  |
|=================|=================|============|==============================================================|
| "document"      | "Document"      | (Required) | The document with attached "Aspect" instances to be          |
|                 |                 |            | extracted.                                                   |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "from_aspects"  | "list[Aspect] | | "None"     | Specific aspects to extract from the document. If "None",    |
|                 | None"           |            | extracts all aspects attached to the document. This allows   |
|                 |                 |            | you to selectively process only certain aspects rather than  |
|                 |                 |            | the entire set.                                              |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "overwrite_exi  | "bool"          | "False"    | Whether to overwrite already processed aspects with newly    |
| sting"          |                 |            | extracted information. This is particularly useful when      |
|                 |                 |            | reprocessing documents with updated LLMs or extraction       |
|                 |                 |            | parameters.                                                  |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_items_per  | "int"           | "0"        | Maximum number of "Aspect" instances with the same           |
| _call"          |                 |            | extraction parameters to process in a single LLM call        |
|                 |                 |            | (single LLM prompt). "0" means all aspect instances with     |
|                 |                 |            | same extraction params in one call. This is particularly     |
|                 |                 |            | useful for complex tasks or long documents to prevent prompt |
|                 |                 |            | overloading and allow the LLM to focus on a smaller set of   |
|                 |                 |            | extraction tasks at once.                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "use_concurren  | "bool"          | "False"    | Enable concurrent processing of multiple "Aspect" instances. |
| cy"             |                 |            | Can significantly reduce processing time by executing        |
|                 |                 |            | multiple extraction tasks concurrently, especially           |
|                 |                 |            | beneficial for documents with many aspects. However, it      |
|                 |                 |            | might cause rate limit errors with LLM providers. When       |
|                 |                 |            | enabled, adjust the "async_limiter" on your "DocumentLLM" to |
|                 |                 |            | control request frequency (default is 3 acquisitions per 10  |
|                 |                 |            | seconds). For optimal results, combine with                  |
|                 |                 |            | "max_items_per_call=1" to maximize concurrency, although     |
|                 |                 |            | this would increase LLM API costs, as each aspect            |
|                 |                 |            | will be processed in a separate LLM call (LLM prompt). See   |
|                 |                 |            | Optimizing for Speed for examples of concurrency             |
|                 |                 |            | configuration.                                               |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_paragraph  | "int"           | "0"        | Maximum paragraphs to include in a single LLM call (single   |
| s_to_analyze_p  |                 |            | LLM prompt). "0" means all paragraphs. This parameter is     |
| er_call"        |                 |            | crucial when working with long documents that exceed the     |
|                 |                 |            | LLM's context window. By limiting the number of paragraphs   |
|                 |                 |            | per call, you can ensure the LLM processes the document in   |
|                 |                 |            | manageable segments while maintaining semantic coherence.    |
|                 |                 |            | This prevents token limit errors and often improves          |
|                 |                 |            | extraction quality by allowing the model to focus on smaller |
|                 |                 |            | portions of text at a time. For more details on handling     |
|                 |                 |            | long documents, see Dealing with Long Documents.             |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+

**Return Value:**

Returns a list of "Aspect" instances that were processed during
extraction. If "from_aspects" was specified, returns only those
aspects; otherwise returns all aspects attached to the document. Each
aspect in the returned list will have its "extracted_items" field
populated with the extracted information, and its
"reference_paragraphs" field will always be set. The
"reference_sentences" field will only be populated when the aspect's
"reference_depth" is set to ""sentences"".

**Example Usage:**

Extracting aspects from a document

   # ContextGem: Extracting Aspects from Documents

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Sample text content
   text_content = """
   TechCorp is a leading software development company founded in 2015 with headquarters in San Francisco.
   The company specializes in cloud-based solutions and has grown to 500 employees across 12 countries.
   Their flagship product, CloudManager Pro, serves over 10,000 enterprise clients worldwide.
   TechCorp reported $50 million in revenue for 2023, representing a 25% growth from the previous year.
   The company is known for its innovative AI-powered analytics platform and excellent customer support.
   They recently expanded into the European market and plan to launch three new products in 2024.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define aspects to extract from the document
   doc.aspects = [
       Aspect(
           name="Company Overview",
           description="Basic information about the company, founding, location, and size",
       ),
       Aspect(
           name="Financial Performance",
           description="Revenue, growth metrics, and financial indicators",
       ),
       Aspect(
           name="Products and Services",
           description="Information about the company's products, services, and offerings",
       ),
   ]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract aspects from the document
   extracted_aspects = llm.extract_aspects_from_document(doc)

   # Access extracted aspect information
   for aspect in extracted_aspects:
       print(f"Aspect: {aspect.name}")
       print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")
       print("---")


🧠 Concept Extraction Methods
=============================


"extract_concepts_from_document()"
----------------------------------

Extracts "_Concept" instances from a "Document" object.

Note:

  See supported concept types in Supported Concepts. All public
  concept types inherit from the internal "_Concept" base class.

**Method Signature:**

   def extract_concepts_from_document(
       self,
       document: Document,
       from_concepts: list[_Concept] | None = None,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
       max_images_to_analyze_per_call: int = 0,
   ) -> list[_Concept]

Note:

  An async equivalent "extract_concepts_from_document_async()" is also
  available.
**Parameters:**

+-----------------+-----------------+------------+--------------------------------------------------------------+
| Parameter       | Type            | Default    | Description                                                  |
|=================|=================|============|==============================================================|
| "document"      | "Document"      | (Required) | The document from which concepts are to be extracted.        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "from_concepts" | "list[_Concept] | "None"     | Specific concepts to extract from the document. If "None",   |
|                 | | None"         |            | extracts all concepts attached to the document. This allows  |
|                 |                 |            | you to selectively process only certain concepts rather than |
|                 |                 |            | the entire set.                                              |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "overwrite_exi  | "bool"          | "False"    | Whether to overwrite already processed concepts with newly   |
| sting"          |                 |            | extracted information. This is particularly useful when      |
|                 |                 |            | reprocessing documents with updated LLMs or extraction       |
|                 |                 |            | parameters.                                                  |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_items_per  | "int"           | "0"        | Maximum number of "_Concept" instances with the same         |
| _call"          |                 |            | extraction parameters to process in a single LLM call        |
|                 |                 |            | (single LLM prompt). "0" means all concept instances with    |
|                 |                 |            | same extraction params in one call. This is particularly     |
|                 |                 |            | useful for complex tasks or long documents to prevent prompt |
|                 |                 |            | overloading and allow the LLM to focus on a smaller set of   |
|                 |                 |            | extraction tasks at once.                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "use_concurren  | "bool"          | "False"    | Enable concurrent processing of multiple "_Concept"          |
| cy"             |                 |            | instances. Can significantly reduce processing time by       |
|                 |                 |            | executing multiple extraction tasks concurrently, especially |
|                 |                 |            | beneficial for documents with many concepts. However, it     |
|                 |                 |            | might cause rate limit errors with LLM providers. When       |
|                 |                 |            | enabled, adjust the "async_limiter" on your "DocumentLLM" to |
|                 |                 |            | control request frequency (default is 3 acquisitions per 10  |
|                 |                 |            | seconds). For optimal results, combine with                  |
|                 |                 |            | "max_items_per_call=1" to maximize concurrency, although     |
|                 |                 |            | this would increase LLM API costs, as each concept           |
|                 |                 |            | will be processed in a separate LLM call (LLM prompt). See   |
|                 |                 |            | Optimizing for Speed for examples of concurrency             |
|                 |                 |            | configuration.                                               |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_paragraph  | "int"           | "0"        | Maximum paragraphs to include in a single LLM call (single   |
| s_to_analyze_p  |                 |            | LLM prompt). "0" means all paragraphs. This parameter is     |
| er_call"        |                 |            | crucial when working with long documents that exceed the     |
|                 |                 |            | LLM's context window. By limiting the number of paragraphs   |
|                 |                 |            | per call, you can ensure the LLM processes the document in   |
|                 |                 |            | manageable segments while maintaining semantic coherence.    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_images_to  | "int"           | "0"        | Maximum images to include in a single LLM call (single LLM   |
| _analyze_per_c  |                 |            | prompt). "0" means all images. This parameter is crucial     |
| all"            |                 |            | when extracting concepts from documents with multiple images |
|                 |                 |            | using vision-capable LLMs. It helps prevent overwhelming the |
|                 |                 |            | model with too many visual inputs at once, manages token     |
|                 |                 |            | usage more effectively, and enables more focused concept     |
|                 |                 |            | extraction from visual content. See 🖼️ Concept Extraction    |
|                 |                 |            | from Document (vision) for an example of extracting concepts |
|                 |                 |            | from document images.                                        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+

**Return Value:**

Returns a list of "_Concept" instances that were processed during
extraction. If "from_concepts" was specified, returns only those
concepts; otherwise returns all concepts attached to the document.
Each concept in the returned list will have its "extracted_items"
field populated with the extracted information, and if applicable,
"reference_paragraphs"/"reference_sentences" will be set based on the
extraction parameters.

**Example Usage:**

Extracting concepts from a document

   # ContextGem: Extracting Concepts Directly from Documents

   import os

   from contextgem import Document, DocumentLLM, NumericalConcept, StringConcept


   # Sample text content
   text_content = """
   GreenTech Solutions is an environmental technology company founded in 2018 in Portland, Oregon.
   The company develops sustainable energy solutions and has 75 employees working remotely across the United States.
   Their primary product, EcoMonitor, helps businesses track carbon emissions and has been adopted by 2,500 organizations.
   GreenTech Solutions reported strong financial performance with $8.5 million in revenue for 2024.
   The company's CEO, Sarah Johnson, announced plans to achieve carbon neutrality by 2025.
   They recently opened a new research facility in Seattle and hired 20 additional engineers.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define concepts to extract from the document
   doc.concepts = [
       StringConcept(
           name="Company Name",
           description="Full name of the company",
       ),
       StringConcept(
           name="CEO Name",
           description="Full name of the company's CEO",
       ),
       NumericalConcept(
           name="Employee Count",
           description="Total number of employees at the company",
           numeric_type="int",
       ),
       StringConcept(
           name="Annual Revenue",
           description="Company's total revenue for the year",
       ),
   ]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract concepts from the document
   extracted_concepts = llm.extract_concepts_from_document(doc)

   # Access extracted concept information
   print("Concepts extracted from document:")
   for concept in extracted_concepts:
       print(f" {concept.name}: {[item.value for item in concept.extracted_items]}")


"extract_concepts_from_aspect()"
--------------------------------

Extracts "_Concept" instances associated with a given "Aspect" in a
"Document".

The aspect must be processed before concept extraction can occur.
This means that the aspect should have already gone through
extraction, which identifies the relevant context (text segments) in
the document that match the aspect's description. This extracted
context is then used as the foundation for concept extraction,
allowing concepts to be identified specifically within the scope of
the aspect.

Note:

  See supported concept types in Supported Concepts. All public
  concept types inherit from the internal "_Concept" base class.
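
Because concept extraction runs over the context already attributed to
the aspect, "max_paragraphs_to_analyze_per_call" batches the aspect's
paragraphs rather than the whole document. A minimal, library-free
sketch of such batching follows. "chunk_paragraphs" is a hypothetical
helper, and ContextGem's internal batching may differ in detail.

```python
def chunk_paragraphs(paragraphs: list[str], max_per_call: int = 0) -> list[list[str]]:
    """Split paragraphs into consecutive batches (hypothetical helper).

    max_per_call=0 mirrors the documented default: everything in one batch.
    """
    if max_per_call < 0:
        raise ValueError("max_per_call must be non-negative")
    if not paragraphs:
        return []
    if max_per_call == 0:
        return [list(paragraphs)]
    return [
        paragraphs[start : start + max_per_call]
        for start in range(0, len(paragraphs), max_per_call)
    ]


# Sample paragraphs standing in for an aspect's extracted context
aspect_paragraphs = [
    "DataFlow Systems announced $12 million in annual revenue for 2024.",
    "This represents an impressive 40% increase compared to their 2023 performance.",
    "The company has secured $25 million in Series B funding.",
]

for batch_number, batch in enumerate(chunk_paragraphs(aspect_paragraphs, 2), start=1):
    print(f"LLM call {batch_number}: {len(batch)} paragraph(s)")
```

Smaller batches keep each prompt within the context window, at the
price of more LLM calls per concept.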
**Method Signature:** def extract_concepts_from_aspect( self, aspect: Aspect, document: Document, from_concepts: list[_Concept] | None = None, overwrite_existing: bool = False, max_items_per_call: int = 0, use_concurrency: bool = False, max_paragraphs_to_analyze_per_call: int = 0, ) -> list[_Concept] Note: An async equivalent "extract_concepts_from_aspect_async()" is also available. **Parameters:** +-----------------+-----------------+------------+--------------------------------------------------------------+ | Parameter | Type | Default | Description | |=================|=================|============|==============================================================| | "aspect" | "Aspect" | (Required) | The aspect from which to extract concepts. Must be | | | | | previously processed through aspect extraction before | | | | | concepts can be extracted. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "document" | "Document" | (Required) | The document that contains the aspect with the attached | | | | | concepts to be extracted. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "from_concepts" | "list[_Concept] | "None" | Specific concepts to extract from the aspect. If "None", | | | | None" | | extracts all concepts attached to the aspect. This allows | | | | | you to selectively process only certain concepts rather than | | | | | the entire set. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "overwrite_exi | "bool" | "False" | Whether to overwrite already processed concepts with newly | | sting" | | | extracted information. This is particularly useful when | | | | | reprocessing documents with updated LLMs or extraction | | | | | parameters. 
| +-----------------+-----------------+------------+--------------------------------------------------------------+ | "max_items_per | "int" | "0" | Maximum number of "_Concept" instances with the same | | _call" | | | extraction parameters to process in a single LLM call | | | | | (single LLM prompt). "0" means all concept instances with | | | | | same extraction params in one call. This is particularly | | | | | useful for complex tasks to prevent prompt overloading and | | | | | allow the LLM to focus on a smaller set of extraction tasks | | | | | at once. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "use_concurren | "bool" | "False" | Enable concurrent processing of multiple "_Concept" | | cy" | | | instances. Can significantly reduce processing time by | | | | | executing multiple extraction tasks concurrently, especially | | | | | beneficial for aspects with many concepts. However, it might | | | | | cause rate limit errors with LLM providers. When enabled, | | | | | adjust the "async_limiter" on your "DocumentLLM" to control | | | | | request frequency (default is 3 acquisitions per 10 | | | | | seconds). For optimal results, combine with | | | | | "max_items_per_call=1" to maximize concurrency, although | | | | | this would cause increase in LLM API costs as each concept | | | | | will be processed in a separate LLM call (LLM prompt). See | | | | | Optimizing for Speed for examples of concurrency | | | | | configuration. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "max_paragraph | "int" | "0" | Maximum number of the aspect's paragraphs to analyze in a | | s_to_analyze_p | | | single LLM call (single LLM prompt). "0" means all the | | er_call" | | | aspect's paragraphs. 
This parameter is crucial when working | | | | | with long documents or aspects that cover extensive portions | | | | | of text that might exceed the LLM's context window. By | | | | | limiting the number of paragraphs per call, you can break | | | | | down analysis into manageable chunks or allow the LLM to | | | | | focus more deeply on smaller sections of text at a time. For | | | | | more details on handling long documents, see Dealing with | | | | | Long Documents. | +-----------------+-----------------+------------+--------------------------------------------------------------+ | "raise_excepti | "bool" | "True" | Whether to raise an exception if the extraction fails due to | | on_on_extracti | | | invalid data returned by an LLM or an error in the LLM API. | | on_error" | | | If True (default): if the LLM returns invalid data, | | | | | "LLMExtractionError" will be raised, and if the LLM API call | | | | | fails, "LLMAPIError" will be raised. If False, a warning | | | | | will be issued instead, and no extracted items will be | | | | | returned. | +-----------------+-----------------+------------+--------------------------------------------------------------+ **Return Value:** Returns a list of "_Concept" instances that were processed during extraction from the specified aspect. If "from_concepts" was specified, returns only those concepts; otherwise returns all concepts attached to the aspect. Each concept in the returned list will have its "extracted_items" field populated with the extracted information, and if applicable, "reference_paragraphs"/ "reference_sentences" will be set based on the extraction parameters. **Example Usage:** Extracting concepts from an aspect # ContextGem: Extracting Concepts from Specific Aspects import os from contextgem import Aspect, Document, DocumentLLM, NumericalConcept, StringConcept # Sample text content text_content = """ DataFlow Systems is an innovative fintech startup that was established in 2020 in Austin, Texas. 
The company has rapidly grown to 150 employees and operates in 8 major cities across North America. DataFlow's core platform, FinanceStream, is used by more than 5,000 small businesses for automated accounting. In their latest financial report, DataFlow Systems announced $12 million in annual revenue for 2024. This represents an impressive 40% increase compared to their 2023 performance. The company has secured $25 million in Series B funding and plans to expand internationally next year. """ # Create a Document object from text doc = Document(raw_text=text_content) # Define an aspect to extract from the document financial_aspect = Aspect( name="Financial Performance", description="Revenue, growth metrics, and financial indicators", ) # Add concepts to the aspect financial_aspect.concepts = [ StringConcept( name="Annual Revenue", description="Total revenue reported for the year", ), NumericalConcept( name="Growth Rate", description="Percentage growth rate compared to previous year", numeric_type="float", ), NumericalConcept( name="Revenue Year", description="The year for which revenue is reported", ), ] # Attach the aspect to the document doc.aspects = [financial_aspect] # Configure DocumentLLM with your API parameters llm = DocumentLLM( model="azure/gpt-4.1", api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"), api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"), api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"), ) # First, extract the aspect from the document (required before concept extraction) extracted_aspects = llm.extract_aspects_from_document(doc) financial_aspect = extracted_aspects[0] # Extract concepts from the specific aspect extracted_concepts = llm.extract_concepts_from_aspect(financial_aspect, doc) # Access extracted concepts for the aspect print(f"Aspect: {financial_aspect.name}") print(f"Extracted items: {[item.value for item in financial_aspect.extracted_items]}") print("\nConcepts extracted from this aspect:") for concept in 
extracted_concepts: print(f" {concept.name}: {[item.value for item in concept.extracted_items]}") # ==== advanced_usage ==== Advanced usage examples *********************** Below are complete, self-contained examples demonstrating advanced usage of ContextGem. 🔍 Extracting Aspects Containing Concepts ========================================= Tip: Concept extraction is useful for extracting specific data points from a document or an aspect. For example, a "Payment terms" aspect in a contract may have multiple concepts: * "Payment amount" * "Payment due date" * "Payment method" # Advanced Usage Example - extracting a single aspect with inner concepts from a legal document import os from contextgem import Aspect, Document, DocumentLLM, StringConcept, StringExample # Create a document instance with e.g. a legal contract text # The text is shortened for brevity doc = Document( raw_text=( "EMPLOYMENT AGREEMENT\n\n" 'This Employment Agreement (the "Agreement") is made and entered into as of January 15, 2023 (the "Effective Date"), ' 'by and between ABC Corporation, a Delaware corporation (the "Company"), and Jane Smith, an individual (the "Employee").\n\n' "1. EMPLOYMENT TERM\n" "The Company hereby employs the Employee, and the Employee hereby accepts employment with the Company, upon the terms and " "conditions set forth in this Agreement. The term of this Agreement shall commence on the Effective Date and shall continue " 'for a period of two (2) years, unless earlier terminated in accordance with Section 8 (the "Term").\n\n' "2. POSITION AND DUTIES\n" "During the Term, the Employee shall serve as Chief Technology Officer of the Company, with such duties and responsibilities " "as are commensurate with such position.\n\n" "8. TERMINATION\n" "8.1 Termination by the Company. The Company may terminate the Employee's employment for Cause at any time upon written notice. 
" "\"Cause\" shall mean: (i) Employee's material breach of this Agreement; (ii) Employee's conviction of a felony; or " "(iii) Employee's willful misconduct that causes material harm to the Company.\n" "8.2 Termination by the Employee. The Employee may terminate employment for Good Reason upon 30 days' written notice to the Company. " "\"Good Reason\" shall mean a material reduction in Employee's base salary or a material diminution in Employee's duties.\n" "8.3 Severance. If the Employee's employment is terminated by the Company without Cause or by the Employee for Good Reason, " "the Employee shall be entitled to receive severance pay equal to six (6) months of the Employee's base salary.\n\n" "IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first written above.\n\n" "ABC CORPORATION\n\n" "By: ______________________\n" "Name: John Johnson\n" "Title: CEO\n\n" "EMPLOYEE\n\n" "______________________\n" "Jane Smith" ) ) # Define an aspect focused on termination clauses termination_aspect = Aspect( name="Termination Provisions", description="Analysis of contract termination conditions, notice requirements, and severance terms.", reference_depth="paragraphs", ) # Define concepts for the termination aspect termination_for_cause = StringConcept( name="Termination for Cause", description="Conditions under which the company can terminate the employee for cause.", examples=[ # optional, examples help the LLM to understand the concept better StringExample(content="Employee may be terminated for misconduct"), StringExample(content="Termination for breach of contract"), ], add_references=True, reference_depth="sentences", ) notice_period = StringConcept( name="Notice Period", description="Required notification period before employment termination.", add_references=True, reference_depth="sentences", ) severance_terms = StringConcept( name="Severance Package", description="Compensation and benefits provided upon termination.", add_references=True, 
       reference_depth="sentences",
   )

   # Add concepts to the aspect
   termination_aspect.add_concepts([termination_for_cause, notice_period, severance_terms])

   # Add the aspect to the document
   doc.add_aspects([termination_aspect])

   # Create an LLM for extracting data from the document
   llm = DocumentLLM(
       model="openai/gpt-4o",  # You can use models from other providers as well, e.g. "anthropic/claude-3-5-sonnet"
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key for OpenAI or another LLM provider
   )

   # Extract all information from the document
   doc = llm.extract_all(doc)

   # Access the extracted information in the document object
   print("=== Termination Provisions Analysis ===")
   print(f"Extracted {len(doc.aspects[0].extracted_items)} items from the aspect")

   # Access extracted aspect concepts in the document object
   for concept in doc.aspects[0].concepts:
       print(f"--- {concept.name} ---")
       for item in concept.extracted_items:
           print(f"• {item.value}")
           print(f" Reference sentences: {len(item.reference_sentences)}")


📊 Extracting Aspects and Concepts from a Document
==================================================

Tip:

  This example demonstrates how to extract both document-level
  concepts and aspect-specific concepts from a document with
  references. Using concurrency can significantly speed up extraction
  when working with multiple aspects and concepts.

  Document-level concepts apply to the entire document (like "Is
  Privacy Policy" or "Last Updated Date"), while aspect-specific
  concepts are tied to particular sections or themes within the
  document.

   # Advanced Usage Example - Extracting aspects and concepts from a document, with references,
   # using concurrency

   import os

   from aiolimiter import AsyncLimiter

   from contextgem import (
       Aspect,
       BooleanConcept,
       DateConcept,
       Document,
       DocumentLLM,
       JsonObjectConcept,
       StringConcept,
   )


   # Example privacy policy document (shortened for brevity)
   doc = Document(
       raw_text=(
           "Privacy Policy\n\n"
           "Last Updated: March 15, 2024\n\n"
           "1.
Data Collection\n" "We collect various types of information from our users, including:\n" "- Personal information (name, email address, phone number)\n" "- Device information (IP address, browser type, operating system)\n" "- Usage data (pages visited, time spent on site)\n" "- Location data (with your consent)\n\n" "2. Data Usage\n" "We use your information to:\n" "- Provide and improve our services\n" "- Send you marketing communications (if you opt-in)\n" "- Analyze website performance\n" "- Comply with legal obligations\n\n" "3. Data Sharing\n" "We may share your information with:\n" "- Service providers (for processing payments and analytics)\n" "- Law enforcement (when legally required)\n" "- Business partners (with your explicit consent)\n\n" "4. Data Retention\n" "We retain personal data for 24 months after your last interaction with our services. " "Analytics data is kept for 36 months.\n\n" "5. User Rights\n" "You have the right to:\n" "- Access your personal data\n" "- Request data deletion\n" "- Opt-out of marketing communications\n" "- Lodge a complaint with supervisory authorities\n\n" "6. 
Contact Information\n" "For privacy-related inquiries, contact our Data Protection Officer at privacy@example.com\n" ), ) # Define all document-level concepts in a single declaration document_concepts = [ BooleanConcept( name="Is Privacy Policy", description="Verify if this document is a privacy policy", singular_occurrence=True, # explicitly enforce singular extracted item (optional) ), DateConcept( name="Last Updated Date", description="The date when the privacy policy was last updated", singular_occurrence=True, # explicitly enforce singular extracted item (optional) ), StringConcept( name="Contact Information", description="Contact details for privacy-related inquiries", add_references=True, reference_depth="sentences", ), ] # Define all aspects with their concepts in a single declaration aspects = [ Aspect( name="Data Collection", description="Information about what types of data are collected from users", concepts=[ JsonObjectConcept( name="Collected Data Types", description="List of different types of data collected from users", structure={ "personal_info": list[str], "technical_info": list[str], "usage_info": list[str], }, # simply use a dictionary with type hints (including generic aliases and union types) add_references=True, reference_depth="sentences", ) ], ), Aspect( name="Data Retention", description="Information about how long different types of data are retained", concepts=[ JsonObjectConcept( name="Retention Periods", description="The durations for which different types of data are retained", structure={ "personal_info": str | None, "technical_info": str | None, "usage_info": str | None, }, # use `str | None` type hints to allow for None values if not specified add_references=True, reference_depth="sentences", singular_occurrence=True, # explicitly enforce singular extracted item (optional) ) ], ), Aspect( name="Data Subject Rights", description="Information about the rights users have regarding their data", concepts=[ StringConcept( name="Data 
Subject Rights",
                   description="Rights available to users regarding their personal data",
                   add_references=True,
                   reference_depth="sentences",
               )
           ],
       ),
   ]

   # Add aspects and concepts to the document
   doc.add_aspects(aspects)
   doc.add_concepts(document_concepts)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o",  # or another LLM from e.g. Anthropic, Ollama, etc.
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key for the applicable LLM provider
   )
   llm.async_limiter = AsyncLimiter(
       3, 3
   )  # customize async limiter for concurrency (optional)

   # Extract all information from the document, using concurrency
   doc = llm.extract_all(doc, use_concurrency=True)

   # Access / print extracted information on the document object
   print("Document Concepts:")
   for concept in doc.concepts:
       print(f"{concept.name}:")
       for item in concept.extracted_items:
           print(f"• {item.value}")
       print()

   print("Aspects and Concepts:")
   for aspect in doc.aspects:
       print(f"[{aspect.name}]")
       for item in aspect.extracted_items:
           print(f"• {item.value}")
       print()
       for concept in aspect.concepts:
           print(f"{concept.name}:")
           for item in concept.extracted_items:
               print(f"• {item.value}")
           print()

🔄 Using a Multi-LLM Pipeline to Extract Data from Several Documents
====================================================================

Tip:

  A pipeline is a reusable configuration of extraction steps. You can use the same pipeline to extract data from multiple documents. For example, if your app extracts data from invoices, you can configure a pipeline once and then reuse it for each incoming invoice.
# Advanced Usage Example - analyzing multiple documents with a single pipeline, # with different LLMs, concurrency and cost tracking import os from contextgem import ( Aspect, DateConcept, Document, DocumentLLM, DocumentLLMGroup, ExtractionPipeline, JsonObjectConcept, JsonObjectExample, LLMPricing, NumericalConcept, RatingConcept, StringConcept, StringExample, ) # Construct documents # Document 1 - Consultancy Agreement (shortened for brevity) doc1 = Document( raw_text=( "Consultancy Agreement\n" "This agreement between Company A (Supplier) and Company B (Customer)...\n" "The term of the agreement is 1 year from the Effective Date...\n" "The Supplier shall provide consultancy services as described in Annex 2...\n" "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n" "All intellectual property created during the provision of services shall belong to the Customer...\n" "This agreement is governed by the laws of Norway...\n" "Annex 1: Data processing agreement...\n" "Annex 2: Statement of Work...\n" "Annex 3: Service Level Agreement...\n" ), ) # Document 2 - Service Level Agreement (shortened for brevity) doc2 = Document( raw_text=( "Service Level Agreement\n" "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n" "The agreement shall commence on January 1, 2023 and continue for 2 years...\n" "The Provider shall deliver IT support services as outlined in Schedule A...\n" "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n" "The Provider guarantees [99.9%] uptime for all critical systems...\n" "Either party may terminate with 60 days written notice...\n" "This agreement is governed by the laws of California...\n" "Schedule A: Service Descriptions...\n" "Schedule B: Response Time Requirements...\n" ), ) # Create a reusable extraction pipeline contract_pipeline = ExtractionPipeline() # Define aspects and aspect-level concepts in the pipeline # Concepts in the aspects will be 
extracted from the extracted aspect context contract_pipeline.aspects = [ # or use .add_aspects([...]) Aspect( name="Contract Parties", description="Clauses defining the parties to the agreement", concepts=[ # define aspect-level concepts, if any StringConcept( name="Party names and roles", description="Names of all parties entering into the agreement and their roles", examples=[ # optional StringExample( content="X (Client)", # guidance regarding the expected output format ) ], ) ], ), Aspect( name="Term", description="Clauses defining the term of the agreement", concepts=[ NumericalConcept( name="Contract term", description="The term of the agreement in years", numeric_type="int", # or "float", or "any" for auto-detection add_references=True, # extract references to the source text reference_depth="paragraphs", ) ], ), ] # Define document-level concepts # Concepts in the document will be extracted from the whole document content contract_pipeline.concepts = [ # or use .add_concepts() DateConcept( name="Effective date", description="The effective date of the agreement", ), StringConcept( name="Contract type", description="The type of agreement", llm_role="reasoner_text", # for this concept, we use a more advanced LLM for reasoning ), StringConcept( name="Governing law", description="The law that governs the agreement", ), JsonObjectConcept( name="Attachments", description="The titles and concise descriptions of the attachments to the agreement", structure={"title": str, "description": str | None}, examples=[ # optional JsonObjectExample( # guidance regarding the expected output format content={ "title": "Appendix A", "description": "Code of conduct", } ), ], ), RatingConcept( name="Duration adequacy", description="Contract duration adequacy considering the subject matter and best practices.", llm_role="reasoner_text", # for this concept, we use a more advanced LLM for reasoning rating_scale=(1, 10), add_justifications=True, # add justifications for the rating 
justification_depth="balanced", # provide a balanced justification justification_max_sents=3, ), ] # Assign pipeline to the documents # You can re-use the same pipeline for multiple documents doc1.assign_pipeline( contract_pipeline ) # assigns pipeline aspects and concepts to the document doc2.assign_pipeline( contract_pipeline ) # assigns pipeline aspects and concepts to the document # Create an LLM group for data extraction and reasoning llm_extractor = DocumentLLM( model="openai/gpt-4o-mini", # or any other LLM from e.g. Anthropic, etc. api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"], # your API key role="extractor_text", # signifies the LLM is used for data extraction tasks pricing_details=LLMPricing( # optional, for costs calculation input_per_1m_tokens=0.150, output_per_1m_tokens=0.600, ), # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider ) llm_reasoner = DocumentLLM( model="openai/o3-mini", # or any other LLM from e.g. Anthropic, etc. api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"], # your API key role="reasoner_text", # signifies the LLM is used for reasoning tasks pricing_details=LLMPricing( # optional, for costs calculation input_per_1m_tokens=1.10, output_per_1m_tokens=4.40, ), # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider ) # The LLM group is used for all extraction tasks within the pipeline llm_group = DocumentLLMGroup(llms=[llm_extractor, llm_reasoner]) # Extract all information from the documents at once doc1 = llm_group.extract_all( doc1, use_concurrency=True ) # use concurrency to speed up extraction doc2 = llm_group.extract_all( doc2, use_concurrency=True ) # use concurrency to speed up extraction # Or use async variants .extract_all_async(...) 
# Get the extracted data print("Some extracted data from doc 1:") print("Contract Parties > Party names and roles:") print( doc1.get_aspect_by_name("Contract Parties") .get_concept_by_name("Party names and roles") .extracted_items ) print("Attachments:") print(doc1.get_concept_by_name("Attachments").extracted_items) # ... print("\nSome extracted data from doc 2:") print("Term > Contract term:") print( doc2.get_aspect_by_name("Term") .get_concept_by_name("Contract term") .extracted_items[0] .value ) print("Duration adequacy:") print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].value) print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].justification) # ... # Output processing costs (requires setting the pricing details for each LLM) print("\nProcessing costs:") print(llm_group.get_cost()) # ==== logging_config ==== Logging Configuration ********************* ContextGem provides comprehensive logging to help you monitor and debug the extraction process. You can control logging behavior using environment variables. ContextGem uses a **namespaced logger** under the name "contextgem". ⚙️ Environment Variables ======================== ContextGem uses a single environment variable for logging configuration: **CONTEXTGEM_LOGGER_LEVEL** Sets the logging level. Valid values are: * "TRACE" - Most verbose, shows all log messages * "DEBUG" - Shows debug information and above * "INFO" - Shows informational messages and above (default) * "SUCCESS" - Shows success messages and above * "WARNING" - Shows warnings and errors only * "ERROR" - Shows errors and critical messages only * "CRITICAL" - Shows only critical messages * "OFF" - Completely disables logging **Default:** "INFO" Warning: **Not recommended:** Setting the level to "OFF" or above "INFO" (such as "WARNING" or "ERROR") may cause you to miss helpful messages, guidance, recommendations, and important information about the extraction process. 
The default "INFO" level provides a good balance of useful information without being too verbose. 🔧 Setting Environment Variables ================================ **Before importing ContextGem:** # Set logging level to WARNING export CONTEXTGEM_LOGGER_LEVEL=WARNING # Disable logging completely export CONTEXTGEM_LOGGER_LEVEL=OFF **In Python before import:** import os # Set logging level to DEBUG os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "DEBUG" # Import ContextGem after setting environment variables import contextgem 🔄 Changing Settings at Runtime =============================== If you need to change logging settings after importing ContextGem, use the "reload_logger_settings()" function: Changing logger settings at runtime import os from contextgem import reload_logger_settings # Initial logger settings are loaded from environment variables at import time # Change logger level to WARNING os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "WARNING" print("Setting logger level to WARNING") reload_logger_settings() # Now the logger will only show WARNING level and above messages # Disable the logger completely os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "OFF" print("Disabling the logger") reload_logger_settings() # Now the logger is disabled and won't show any messages # You can re-enable the logger by setting it back to a valid level # os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "INFO" # reload_logger_settings() 📋 Log Format ============= ContextGem logs use the following format: [contextgem] 2025-01-11 15:30:45.123 | INFO | Your log message here Each log entry includes: * Timestamp (with milliseconds) * Log level * Log message # ==== optimizations/optimization_choosing_llm ==== Choosing the Right LLM(s) ************************* 🧭 General Guidance =================== Your choice of LLM directly affects the accuracy, speed, and cost of your extraction pipeline. ContextGem integrates with various LLM providers (via LiteLLM), enabling you to select models that best fit your needs. 
Since ContextGem specializes in deep single-document analysis, models with large context windows are recommended. While each use case has unique requirements, our experience suggests the following practical guidelines. However, please note that for sensitive applications (e.g., contract review) where accuracy is paramount and speed/cost are secondary concerns, using the most capable model available for all tasks is often the safest approach. Choosing LLMs - General Guidance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +----------------------------------------------------+----------------------------------------------------+ | Aspect Extraction | Concept Extraction | |====================================================|====================================================| | A **smaller/distilled non-reasoning model** | For *basic concepts* (e.g., titles, payment | | capable of identifying relevant document sections | amounts, dates), the same **smaller/distilled non- | | (e.g., "gpt-4o-mini"). This extraction resembles | reasoning model** is often sufficient (e.g., "gpt- | | multi-label classification. Complex aspects may | 4o-mini"). For *complex concepts* requiring | | occasionally require larger or reasoning models. | nuanced understanding within specific aspects or | | | the entire document, consider a **larger non- | | | reasoning model** (e.g., "gpt-4o"). For concepts | | | requiring advanced understanding or complex | | | reasoning (e.g., logical deductions, evaluation), | | | a **reasoning model** like "o3-mini" may be | | | appropriate. | +----------------------------------------------------+----------------------------------------------------+ See also: **Small Model Issues?** If you're experiencing issues with smaller models (e.g. 8B parameter models), such as JSON validation errors or inconsistent results, see our troubleshooting guide for specific solutions and workarounds. 
🏷️ LLM Roles ============ The "role" of an LLM is an abstraction used to assign various LLMs tasks of different complexity. For example, if an aspect/concept is assigned "llm_role="extractor_text"", this aspect/concept is extracted from the document using the LLM with "role="extractor_text"". This helps to channel different tasks to different LLMs, ensuring that the task is handled by the most appropriate model. Usually, domain expertise is required to determine the most appropriate role for a specific aspect/concept. In LLM groups, unique role assignments are especially important: each model in the group must have a distinct role so routing can unambiguously send each aspect/concept to the intended model. For simple use cases, when working with text-only documents and a single LLM, you can skip the role assignments completely, in which case the roles will default to ""extractor_text"". Available LLM roles ^^^^^^^^^^^^^^^^^^^ +----------------------+----------------------+----------------------+----------------------+ | Role | Extraction Context | Extracted Item Types | Required LLM | | | | | Capabilities | |======================|======================|======================|======================| | ""extractor_text"" | Text | Aspects and concepts | No reasoning | | | | (aspect- and | required | | | | document-level) | | +----------------------+----------------------+----------------------+----------------------+ | ""reasoner_text"" | Text | Aspects and concepts | Reasoning-capable | | | | (aspect- and | model | | | | document-level) | | +----------------------+----------------------+----------------------+----------------------+ | ""extractor_vision"" | Images | Document-level | Vision-capable model | | | | concepts | | +----------------------+----------------------+----------------------+----------------------+ | ""reasoner_vision"" | Images | Document-level | Vision-capable and | | | | concepts | reasoning-capable | | | | | model | 
+----------------------+----------------------+----------------------+----------------------+ | ""extractor_multimo | Text and/or images | Document-level | Multimodal model | | dal"" | | concepts | supporting text and | | | | | image inputs | +----------------------+----------------------+----------------------+----------------------+ | ""reasoner_multimod | Text and/or images | Document-level | Reasoning-capable | | al"" | | concepts | multimodal model | | | | | supporting text and | | | | | image inputs | +----------------------+----------------------+----------------------+----------------------+ Note: 🧠 Only LLMs that support reasoning (chain of thought) should be assigned reasoning roles (""reasoner_text"", ""reasoner_vision""). For such models, internal prompts include reasoning-specific instructions intended for these models to produce higher-quality responses. Note: 👁️ Only LLMs that support vision can be assigned vision roles (""extractor_vision"", ""reasoner_vision""). Note: 🔀 Multimodal roles (""extractor_multimodal"", ""reasoner_multimodal"") reuse the existing text and vision extraction paths. If text exists, the text path runs first; if images exist, the vision path runs next. References are only supported for multimodal concepts when text is used. 
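The routing idea behind roles can be sketched in plain Python. This is only an illustration of the concept, not ContextGem's implementation: "ToyLLM", "ToyItem", "build_role_index" and "route" are hypothetical names, and only the role strings are taken from the table above.

```python
from dataclasses import dataclass

# Role string taken from the roles table; all class/function names are hypothetical.
DEFAULT_ROLE = "extractor_text"


@dataclass(frozen=True)
class ToyLLM:
    model: str
    role: str = DEFAULT_ROLE


@dataclass(frozen=True)
class ToyItem:
    name: str
    llm_role: str = DEFAULT_ROLE


def build_role_index(llms):
    """Map each role to its LLM, rejecting duplicate roles within a group."""
    index = {}
    for llm in llms:
        if llm.role in index:
            raise ValueError(f"Duplicate role in group: {llm.role!r}")
        index[llm.role] = llm
    return index


def route(items, llms):
    """Return {item name: model name} by matching item.llm_role to llm.role."""
    index = build_role_index(llms)
    routed = {}
    for item in items:
        if item.llm_role not in index:
            raise LookupError(f"No LLM with role {item.llm_role!r} for {item.name!r}")
        routed[item.name] = index[item.llm_role].model
    return routed


group = [
    ToyLLM(model="openai/gpt-4o-mini"),  # role defaults to "extractor_text"
    ToyLLM(model="openai/o3-mini", role="reasoner_text"),
]
items = [
    ToyItem(name="Governing law"),  # llm_role defaults to "extractor_text"
    ToyItem(name="Duration adequacy", llm_role="reasoner_text"),
]
routing = route(items, group)
print(routing)
```

The duplicate-role check mirrors the requirement that each model in a group has a distinct role: with two models sharing a role, there would be no unambiguous target for an aspect or concept assigned to that role.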
Example of selecting different LLMs for different tasks

   # Example of selecting different LLMs for different tasks

   import os

   from contextgem import Aspect, Document, DocumentLLM, DocumentLLMGroup, StringConcept

   # Define LLMs
   base_llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       role="extractor_text",  # default
   )

   # Optional - attach a fallback LLM
   base_llm_fallback = DocumentLLM(
       model="openai/gpt-3.5-turbo",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       role="extractor_text",  # must have the same role as the parent LLM
       is_fallback=True,
   )
   base_llm.fallback_llm = base_llm_fallback

   advanced_llm = DocumentLLM(
       model="openai/o3-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       role="reasoner_text",
   )

   # You can organize LLMs in a group to use them in a pipeline
   llm_group = DocumentLLMGroup(
       llms=[base_llm, advanced_llm],
   )

   # Assign the existing LLMs to aspects/concepts
   document = Document(
       raw_text="document_text",
       aspects=[
           Aspect(
               name="aspect_name",
               description="aspect_description",
               llm_role="extractor_text",
               concepts=[
                   StringConcept(
                       name="concept_name",
                       description="concept_description",
                       llm_role="reasoner_text",
                   )
               ],
           )
       ],
   )

   # Then use the LLM group to extract all information from the document
   # This will use different LLMs for different aspects/concepts under the hood
   # document = llm_group.extract_all(document)

# ==== optimizations/optimization_accuracy ====

Optimizing for Accuracy
***********************

When accuracy is paramount, ContextGem offers several techniques to improve extraction quality:

* **🚀 Use a Capable LLM**: Choose a capable LLM for extraction.

* **🪄 Use Larger Segmentation Models**: Select a larger SaT model for intelligent segmentation of paragraphs or sentences, to ensure the highest segmentation accuracy in complex documents (e.g. contracts).
* **💡 Provide Examples**: For most complex concepts, add examples to guide the LLM's extraction format and style. * **🧠 Request Justifications**: For most complex aspects/concepts, enable justifications to understand the LLM's reasoning and instruct the LLM to "think" when giving an answer. * **📏 Limit Paragraphs Per Call**: This will reduce each prompt's length and ensure a more focused analysis. * **🔢 Limit Aspects/Concepts Per Call**: Process a smaller number of aspects or concepts in each LLM call, preventing prompt overloading. * **🔄 Use a Fallback LLM**: Configure a fallback LLM to retry failed extractions with a different model. Example of optimizing extraction for accuracy # Example of optimizing extraction for accuracy import os from contextgem import Document, DocumentLLM, StringConcept, StringExample # Define document doc = Document( raw_text="Non-Disclosure Agreement...", sat_model_id="sat-6l-sm", # default is "sat-3l-sm" paragraph_segmentation_mode="sat", # default is "newlines" # sentence segmentation mode is always "sat", as other approaches proved to be less accurate ) # Define document concepts doc.concepts = [ StringConcept( name="Title", # A very simple concept, just an example for testing purposes description="Title of the document", add_justifications=True, # enable justifications justification_depth="brief", # default examples=[ StringExample( content="Supplier Agreement", ) ], ), # ... add other concepts ... ] # ... attach other aspects/concepts to the document ... 
# Define and configure LLM llm = DocumentLLM( model="openai/gpt-4o", api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), fallback_llm=DocumentLLM( model="openai/gpt-4-turbo", api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), is_fallback=True, ), # configure a fallback LLM ) # Extract data from document with specific configuration options doc = llm.extract_all( doc, max_paragraphs_to_analyze_per_call=30, # limit the number of paragraphs to analyze in an individual LLM call max_items_per_call=1, # limit the number of aspects/concepts to analyze in an individual LLM call use_concurrency=True, # optional: enable concurrent extractions ) # ... use the extracted data ... # ==== optimizations/optimization_speed ==== Optimizing for Speed ******************** For large-scale processing or time-sensitive applications, optimize your pipeline for speed: * **🚀 Enable and Configure Concurrency**: Process multiple extractions concurrently. Adjust the async limiter to adapt to your LLM API setup. * **📦 Use Smaller Models**: Select smaller/distilled LLMs that perform faster. (See Choosing the Right LLM(s) for guidance on choosing the right model.) * **🔄 Use a Fallback LLM**: Configure a fallback LLM to retry extractions that failed due to rate limits. * **⚙️ Use Default Parameters**: All the extractions will be processed in as few LLM calls as possible. * **📉 Enable Justifications Only When Necessary**: Do not use justifications for simple aspects or concepts. This will reduce the number of tokens generated. * **⚠️ Use Sentence-Level Reference Depth Sparingly**: Only use sentence-level reference depth for aspects or concepts when absolutely necessary, as it requires loading a SaT model and running sentence segmentation on text, which can be slow for long documents. 
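When tuning the async limiter mentioned in the concurrency bullet, a back-of-the-envelope estimate of the rate-limit wait can help. The sketch below assumes a leaky-bucket limiter that allows a burst of "max_rate" acquisitions and then drains evenly over "time_period" seconds, which is our reading of how "AsyncLimiter(max_rate, time_period)" behaves; treat that as an assumption. "min_wait_seconds" is a hypothetical helper, and the estimate ignores LLM latency and retries.

```python
def min_wait_seconds(n_calls: int, max_rate: float, time_period: float) -> float:
    """Estimate the time before the last of n_calls can start.

    Assumes a leaky bucket: the first `max_rate` calls start immediately,
    and each further call waits for `time_period / max_rate` seconds of drain.
    """
    if n_calls < 0 or max_rate <= 0 or time_period <= 0:
        raise ValueError("n_calls must be >= 0; max_rate and time_period must be > 0")
    excess_calls = max(0, n_calls - max_rate)
    return excess_calls * time_period / max_rate


# 40 LLM calls under a limiter of 10 acquisitions per 5-second period
burst_limited = min_wait_seconds(40, max_rate=10, time_period=5)
# 5 calls fit entirely in the initial burst
within_burst = min_wait_seconds(5, max_rate=10, time_period=5)
print(burst_limited, within_burst)
```

If the estimated wait dominates your total processing time, raising the limiter (within your provider's rate limits) is likely to help more than switching to a faster model.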
Example of optimizing extraction for speed # Example of optimizing extraction for speed import os from aiolimiter import AsyncLimiter from contextgem import Document, DocumentLLM # Define document document = Document( raw_text="document_text", # aspects=[Aspect(...), ...], # concepts=[Concept(...), ...], ) # Define LLM with a fallback model llm = DocumentLLM( model="openai/gpt-4o-mini", api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), fallback_llm=DocumentLLM( model="openai/gpt-3.5-turbo", api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), is_fallback=True, ), ) llm.async_limiter = AsyncLimiter( 10, 5 ) # e.g. 10 acquisitions per 5-second period; adjust to your LLM API setup llm.fallback_llm.async_limiter = AsyncLimiter( # type: ignore 20, 5 ) # e.g. 20 acquisitions per 5-second period; adjust to your LLM API setup # Use the LLM for extraction with concurrency enabled llm.extract_all(document, use_concurrency=True) # ... use the extracted data ... # ==== optimizations/optimization_cost ==== Optimizing for Cost ******************* ContextGem offers several strategies to optimize for cost efficiency while maintaining extraction quality: * **💸 Select Cost-Efficient Models**: Use smaller/distilled non- reasoning LLMs for extracting aspects and basic concepts (e.g. titles, payment amounts, dates). * **⚙️ Use Default Parameters**: All the extractions will be processed in as few LLM calls as possible. * **📉 Enable Justifications Only When Necessary**: Do not use justifications for simple aspects or concepts. This will reduce the number of tokens generated. * **📊 Monitor Usage and Cost**: Track LLM calls, token consumption, and cost to identify optimization opportunities. 
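To make the per-1M-token pricing concrete, the following sketch does the arithmetic by hand. It is not how "get_cost()" is implemented; "estimate_cost" is a hypothetical helper, the token counts are made-up inputs, and the prices are the example values used elsewhere in this documentation.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, input_per_1m_tokens, output_per_1m_tokens):
    """Return input, output, and total cost for the given token counts.

    Prices are passed as strings so Decimal parses them exactly.
    """
    input_cost = Decimal(input_tokens) / TOKENS_PER_MILLION * Decimal(input_per_1m_tokens)
    output_cost = Decimal(output_tokens) / TOKENS_PER_MILLION * Decimal(output_per_1m_tokens)
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens
cost = estimate_cost(
    input_tokens=200_000,
    output_tokens=50_000,
    input_per_1m_tokens="0.150",
    output_per_1m_tokens="0.600",
)
print(cost)
```

With these inputs, the input and output costs are each 0.03 and the total is 0.06, in whatever currency the prices are quoted. Output tokens are priced four times higher here, which is why disabling unnecessary justifications can have a noticeable effect.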
Example of optimizing extraction for cost # Example of optimizing extraction for cost import os from contextgem import DocumentLLM, LLMPricing llm = DocumentLLM( model="openai/gpt-4o-mini", api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), pricing_details=LLMPricing( input_per_1m_tokens=0.150, output_per_1m_tokens=0.600, ), # add pricing details to track costs # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider ) # ... use the LLM for extraction ... # ... monitor usage and cost ... usage = llm.get_usage() # get the usage details, including tokens and calls' details. cost = llm.get_cost() # get the cost details, including input, output, and total costs. print(usage) print(cost) # ==== optimizations/optimization_long_docs ==== Dealing with Long Documents *************************** ContextGem offers specialized configuration options for efficiently processing lengthy documents. ✂️ Segmentation Approach ======================== Unlike many systems that rely on chunking (e.g. RAG), ContextGem intelligently segments documents into natural semantic units like paragraphs and sentences. This preserves the contextual integrity of the content while allowing you to configure: * Maximum number of paragraphs per LLM call * Maximum number of aspects/concepts to analyze per LLM call * Maximum number of images per LLM call (if the document contains images) ⚙️ Effective Optimization Strategies ==================================== * **🔄 Use Long-Context Models**: Select models with large context windows. (See Choosing the Right LLM(s) for guidance on choosing the right model.) * **📏 Limit Paragraphs Per Call**: This will reduce each prompt's length and ensure a more focused analysis. * **🔢 Limit Aspects/Concepts Per Call**: Process a smaller number of aspects or concepts in each LLM call, preventing prompt overloading. 
* **⚠️ Use Sentence-Level Reference Depth Sparingly**: Only use sentence-level reference depth for aspects or concepts when absolutely necessary, as it requires loading a SaT model and running sentence segmentation on text, which can be slow for long documents. * **⚡ Optional: Enable Concurrency**: Enable running extractions concurrently if your API setup permits. This will reduce the overall processing time. (See Optimizing for Speed for guidance on configuring concurrency.) Since each use case has unique requirements, experiment with different configurations to find your optimal setup. Example of configuring LLM extraction for long documents # Example of configuring LLM extraction to process long documents import os from contextgem import Document, DocumentLLM # Define document long_doc = Document( raw_text="long_document_text", ) # ... attach aspects/concepts to the document ... # Define and configure LLM llm = DocumentLLM( model="openai/gpt-4o-mini", api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"), ) # Extract data from document with specific configuration options long_doc = llm.extract_all( long_doc, max_paragraphs_to_analyze_per_call=50, # limit the number of paragraphs to analyze in an individual LLM call max_items_per_call=2, # limit the number of aspects/concepts to analyze in an individual LLM call use_concurrency=True, # optional: enable concurrent extractions ) # ... use the extracted data ... # ==== optimizations/optimization_small_llm_troubleshooting ==== Troubleshooting Issues with Small Models **************************************** Small language models (e.g. 8B parameter models) often struggle with ContextGem's structured extraction tasks. This guide addresses common issues and provides practical solutions. See also: For general guidance on selecting appropriate models for your use case, see Choosing the Right LLM(s). 
⚠️ Common Issues with Small Models
==================================

**"LLM did not return valid JSON" Error**

Small models frequently fail to follow the precise JSON schema required by ContextGem's internal prompts. This manifests as:

* "Error when validating parsed JSON: parsed_json is None"

* "LLM did not return valid JSON"

**Inconsistent Results**

Small models may produce:

* Empty extraction results

* Incomplete or partial extractions

* Inconsistent formatting across multiple calls

🎯 Model Capability Requirements
================================

**Minimum Recommended Performance**

All ContextGem tests use models with performance equivalent to or exceeding "openai/gpt-4o-mini". For reliable structured extraction, your model should:

* Be able to follow detailed JSON schema instructions consistently

* Have a sufficient context window to ingest the detailed prompt and the document content

* Maintain attention across long prompts with complex instructions

🛠️ Mitigation Strategies for Small Models
=========================================

Important:

  **The most effective solution is usually to upgrade to a larger, more capable model** (such as "gpt-4o-mini" or larger). The strategies below are workarounds for situations where upgrading isn't possible.

If you must use a smaller model, try these approaches individually or in combination:

**1. Reduce Task Complexity**

   # Extract one aspect/concept at a time instead of all at once
   results = llm.extract_all(
       document,
       max_items_per_call=1  # Analyze aspects/concepts individually
   )

**2. Limit Document Scope**

   # Process fewer document paragraphs per call
   results = llm.extract_all(
       document,
       max_paragraphs_to_analyze_per_call=50  # Default is 0 (all paragraphs)
   )

**3. Use More Specific Aspects/Concepts**

Instead of generic aspects/concepts:

   # ❌ Too generic - may confuse small models
   Aspect(
       name="Contract Terms",
       description="Contractual/legal details"
   )

Use targeted aspects/concepts:

   # ✅ More specific - easier for small models
   Aspect(
       name="Termination Terms",
       description="Provisions on contract termination"
   ),
   Aspect(
       name="Payment Terms",
       description="Provisions on payment schedules and amounts"
   )

**4. Choose the Right API**

For extracting document sections by topic, use the **Aspects API** instead of the **Concepts API**:

   # ✅ Aspects API is designed specifically for extracting document sections by topic,
   # while Concepts API is designed for extracting/inferring specific values or entities
   # from a document or a specific section.
   from contextgem import Aspect

   project_scope = Aspect(
       name="Project Scope",
       description="Details about the scope of work"
   )
   # Paragraph references are automatically added to the extracted aspects
   results = llm.extract_aspects_from_document(document)

Instead of:

   # ❌ Concepts API's core purpose is to extract/infer specific values or entities
   # from a document or a specific section, rather than extracting document sections
   # by topic.
from contextgem import StringConcept project_scope = StringConcept( name="Project Scope", description="Details about the scope of work", add_references=True ) 🔍 Debugging LLM Responses ========================== To see what your LLM is supposed to return, you can inspect the prompt and the model's response: # Make an extraction call results = llm.extract_aspects_from_document(document) # Inspect the actual prompt sent to the LLM prompt = llm.get_usage()[-1].usage.calls[-1].prompt print("Prompt sent to LLM:") print(prompt) # Check the raw response (if available) response = llm.get_usage()[-1].usage.calls[-1].response print("LLM response:") print(response) 📊 Testing Local Models ======================= Before committing to a local model for production, test it on extraction tasks in the documentation, such as: * Aspect Extraction from Document * Extracting Aspect with Sub-Aspects * Concept Extraction from Aspect * Concept Extraction from Document Important: **Production Applications**: For production applications, especially those requiring high accuracy (like legal document analysis), using appropriately capable models is crucial. The cost of model inference is typically far outweighed by the cost of incorrect extractions or failed processing. # ==== serialization ==== Serializing objects and results ******************************* ContextGem provides multiple serialization methods to preserve your document processing pipeline components and results. These methods enable you to save your work, transfer data between systems, or integrate with other applications. When using serialization, all extracted data is preserved in the serialized objects. 
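As a toy illustration of what "preserved" means for a serialization round trip, the sketch below uses only the standard library. "ToyDocument" is a hypothetical stand-in, not a ContextGem class; it only mimics the method-naming pattern (a "to_*" method paired with a "from_*" class method) and assumes nothing about how ContextGem serializes its own objects.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for an object holding content and extracted data."""

    raw_text: str
    extracted_items: list[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, data: dict) -> "ToyDocument":
        return cls(**data)

    @classmethod
    def from_json(cls, json_string: str) -> "ToyDocument":
        return cls.from_dict(json.loads(json_string))


original = ToyDocument(
    raw_text="Non-Disclosure Agreement...",
    extracted_items=["Mutual NDA"],
)
restored = ToyDocument.from_json(original.to_json())

# The round trip yields an equal object, including the extracted data
assert restored == original
print(restored.extracted_items)
```

The same check (serialize, deserialize, compare) is a cheap way to confirm that cached results are complete before you skip a repeated LLM call.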


💾 Serialization Methods
========================

The following ContextGem objects support serialization:

* "Document" - Contains document content and extracted information

* "ExtractionPipeline" - Defines extraction structure and logic

* "DocumentLLM" - Stores LLM configuration for document processing

Each object supports three serialization methods:

* "to_json()" - Converts the object to a JSON string for cross-platform compatibility

* "to_dict()" - Converts the object to a Python dictionary for in-memory operations

* "to_disk(file_path)" - Saves the object directly to disk at the specified path


🔄 Deserialization Methods
==========================

To reconstruct objects from their serialized forms, use the corresponding class methods:

* "from_json(json_string)" - Creates an object from a JSON string

* "from_dict(dict_object)" - Creates an object from a Python dictionary

* "from_disk(file_path)" - Loads an object from a file on disk


📝 Example Usage
================

   # Example of serializing and deserializing ContextGem document,
   # extraction pipeline, and LLM config.

   import os
   from pathlib import Path

   from contextgem import (
       Aspect,
       BooleanConcept,
       Document,
       DocumentLLM,
       DocxConverter,
       ExtractionPipeline,
       StringConcept,
   )


   # Create a document object
   converter = DocxConverter()
   docx_path = str(
       Path(__file__).resolve().parents[4]
       / "tests"
       / "docx_files"
       / "en_nda_with_anomalies.docx"
   )  # your file path here (Path adapted for testing)
   doc = converter.convert(docx_path, strict_mode=True)

   # Create an extraction pipeline
   extraction_pipeline = ExtractionPipeline(
       aspects=[
           Aspect(
               name="Categories of confidential information",
               description="Clauses describing confidential information covered by the NDA",
               concepts=[
                   StringConcept(
                       name="Types of disclosure",
                       description="Types of disclosure of confidential information",
                   ),
                   # ...
               ],
           ),
           # ...
       ],
       concepts=[
           BooleanConcept(
               name="Is mutual",
               description="Whether the NDA is mutual (both parties act as discloser/recipient)",
               add_justifications=True,
           ),
           # ...
       ],
   )

   # Attach the pipeline to the document
   doc.assign_pipeline(extraction_pipeline)

   # Configure a document LLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract data from the document
   doc = llm.extract_all(doc)

   # Serialize the LLM config, pipeline and document
   llm_config_json = llm.to_json()  # or to_dict() / to_disk()
   extraction_pipeline_json = extraction_pipeline.to_json()  # or to_dict() / to_disk()
   processed_doc_json = doc.to_json()  # or to_dict() / to_disk()

   # Deserialize the LLM config, pipeline and document
   llm_deserialized = DocumentLLM.from_json(
       llm_config_json
   )  # or from_dict() / from_disk()
   extraction_pipeline_deserialized = ExtractionPipeline.from_json(
       extraction_pipeline_json
   )  # or from_dict() / from_disk()
   processed_doc_deserialized = Document.from_json(
       processed_doc_json
   )  # or from_dict() / from_disk()

   # All extracted data is preserved!
   assert processed_doc_deserialized.aspects[0].concepts[0].extracted_items


🚀 Use Cases
============

* **Caching Results**: Save processed documents to avoid repeating expensive LLM calls

* **Transfer Between Systems**: Export results from one environment and import in another

* **API Integration**: Convert objects to JSON for API responses

* **Workflow Persistence**: Save pipeline configurations for later reuse


# ==== api/documents ====

Documents
*********

Module for handling documents.

This module provides the Document class, which represents a structured or unstructured file containing written or visual content. Documents can be processed to extract information, analyze content, and organize data into paragraphs, sentences, aspects, and concepts.

class contextgem.public.documents.Document(**data)

   Bases: "_Document"

   Represents a document containing textual and visual content for analysis.

   A document serves as the primary container for content analysis within the ContextGem framework, enabling complex document understanding and information extraction workflows.

   Variables:
      * **raw_text** (*str** | **None*) -- The main text of the document as a single string. Defaults to None.

      * **paragraphs** (*list**[**Paragraph**]*) -- List of Paragraph instances in consecutive order as they appear in the document. Defaults to an empty list.

      * **images** (*list**[**Image**]*) -- List of Image instances attached to or representing the document. Defaults to an empty list.

      * **aspects** (*list**[**Aspect**]*) -- List of aspects associated with the document for focused analysis. Validated to ensure unique names and descriptions. Defaults to an empty list.

      * **concepts** (*list**[**_Concept**]*) -- List of concepts associated with the document for information extraction. Validated to ensure unique names and descriptions. Defaults to an empty list.

      * **paragraph_segmentation_mode** (*Literal**[**"newlines"**, **"sat"**]*) -- Mode for paragraph segmentation. When set to "sat", uses a SaT (Segment Any Text https://arxiv.org/abs/2406.16678) model. Defaults to "newlines".

      * **sat_model_id** (*SaTModelId*) -- SaT model ID for paragraph/sentence segmentation or a local path to a SaT model. For model IDs, defaults to "sat-3l-sm". See https://github.com/segment-any-text/wtpsplit for the list of available models. For local paths, provide either a string path or a Path object pointing to the directory containing the SaT model.

      * **pre_segment_sentences** (*bool*) -- Whether to pre-segment sentences during Document initialization. When False (default), sentence segmentation is deferred until sentences are actually needed, improving initialization performance. When True, sentences are segmented immediately during Document creation using the SaT model.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*)

      * **raw_text** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**] **| **None*)

      * **paragraphs** (*list**[**_Paragraph**]*)

      * **images** (*list**[**_Image**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**, **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**, **json_schema_input_type=PydanticUndefined**)**]*)

      * **paragraph_segmentation_mode** (*Literal**[**'newlines'**, **'sat'**]*)

      * **sat_model_id** (*Literal**[**'sat-1l'**, **'sat-1l-sm'**, **'sat-3l'**, **'sat-3l-sm'**, **'sat-6l'**, **'sat-6l-sm'**, **'sat-9l'**, **'sat-12l'**, **'sat-12l-sm'**] **| **str** | **~pathlib._local.Path*)

      * **pre_segment_sentences** (*bool*)

   Note:

     Normally, you do not need to construct/populate paragraphs manually, as they are populated automatically from document's "raw_text" attribute. Only use this constructor for advanced use cases, such as when you have a custom paragraph segmentation tool.

   Example:

      Document definition

         from pathlib import Path

         from contextgem import Document, Paragraph, create_image


         # Create a document with raw text content
         contract_document = Document(
             raw_text=(
                 "...This agreement is effective as of January 1, 2025.\n\n"
                 "All parties must comply with the terms outlined herein. The terms include "
                 "monthly reporting requirements and quarterly performance reviews.\n\n"
                 "Failure to adhere to these terms may result in termination of the agreement. "
                 "Additionally, any breach of confidentiality will be subject to penalties as "
                 "described in this agreement.\n\n"
                 "This agreement shall remain in force for a period of three (3) years unless "
                 "otherwise terminated according to the provisions stated above..."
             ),
             paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
         )

         # Create a document with more advanced paragraph segmentation using a SaT model
         report_document = Document(
             raw_text=(
                 "Executive Summary "
                 "This report outlines our quarterly performance. "
                 "Revenue increased by [15%] compared to the previous quarter.\n\n"
                 "Customer satisfaction metrics show positive trends across all regions..."
             ),
             paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
             sat_model_id="sat-3l-sm",  # Specify which SaT model to use
         )

         # Create a document with predefined paragraphs, e.g. when you use a custom
         # paragraph segmentation tool
         document_from_paragraphs = Document(
             paragraphs=[
                 Paragraph(raw_text="This is the first paragraph."),
                 Paragraph(raw_text="This is the second paragraph with more content."),
                 Paragraph(raw_text="Final paragraph concluding the document."),
                 # ...
             ]
         )

         # Create document with images

         # Path is adapted for doc tests
         current_file = Path(__file__).resolve()
         root_path = current_file.parents[4]
         image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

         # Create a document with only images (no text)
         image_document = Document(
             images=[
                 create_image(image_path),  # contextgem.Image instance
                 # ...
             ]
         )

         # Create a document with both text and images
         mixed_document = Document(
             raw_text="This document contains both text and visual elements.",
             images=[
                 create_image(image_path),  # contextgem.Image instance
                 # ...
             ],
         )

   Create a new model by parsing and validating input data from keyword arguments.

   Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field name.

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and returns the updated instance. This method ensures that the provided aspects are deeply copied to avoid any unintended state modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to be added. Each aspect is deeply copied to ensure the original list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts* attribute of the instance. This method ensures that the provided list of concepts is deep-copied to prevent unintended side effects from modifying the input list outside of this method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts to be added. It will be deep-copied before being added to the instance's *concepts* attribute.

      Returns:
         Returns the instance itself after the modification.

      Return type:
         Self

   assign_pipeline(pipeline, overwrite_existing=False)

      Assigns a given pipeline to the document. The method deep-copies the input pipeline to prevent any modifications to the state of aspects or concepts in the original pipeline. If the aspects or concepts are already associated with the document, an error is raised unless the *overwrite_existing* parameter is explicitly set to *True*.

      Parameters:
         * **pipeline** (*_ExtractionPipeline** | **_DocumentPipeline*) -- The ExtractionPipeline (or deprecated DocumentPipeline) object to attach to the document.

         * **overwrite_existing** (*bool*) -- A boolean flag. If set to True, any existing aspects and concepts assigned to the document will be overwritten by the new pipeline.
Defaults to False. Return type: "typing.Self" Returns: Returns the current instance of the document after assigning the pipeline. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. 
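
The file-suffix and class-name checks described for "from_disk" and "from_json" can be pictured with a small stand-in class. This is a minimal sketch using only the standard library. "Record", its "title" field, the "class"/"data" keys, and "round_trip" are hypothetical and do not reflect ContextGem's actual serialized layout or implementation.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Record:
    """Hypothetical stand-in for a serializable object (not a ContextGem class)."""

    title: str

    def to_dict(self) -> dict:
        # Store the class name next to the data so it can be checked on load.
        return {"class": type(self).__name__, "data": asdict(self)}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    def to_disk(self, file_path) -> None:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("file path must end with '.json'")
        path.write_text(json.dumps(self.to_dict(), indent=2), encoding="utf-8")

    @classmethod
    def from_dict(cls, obj_dict: dict) -> "Record":
        return cls(**obj_dict["data"])

    @classmethod
    def from_json(cls, json_string: str) -> "Record":
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class.
        if obj_dict.get("class") != cls.__name__:
            raise TypeError("serialized class name does not match")
        return cls.from_dict(obj_dict)

    @classmethod
    def from_disk(cls, file_path) -> "Record":
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("file path must end with '.json'")
        return cls.from_json(path.read_text(encoding="utf-8"))


def round_trip(record: Record) -> Record:
    """Save a record to a temporary '.json' file and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "record.json"
        record.to_disk(target)
        return Record.from_disk(target)
```

Chaining "from_disk" to "from_json" to "from_dict" mirrors the delegation described in the method entries above, so each validation step lives in one place.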
get_aspect_by_name(name) Finds and returns an aspect with the specified name from the list of available aspects, if the instance has *aspects* attribute. Parameters: **name** (*str*) -- The name of the aspect to find. Returns: The aspect with the specified name. Return type: _Aspect Raises: **ValueError** -- If no aspect with the specified name is found. get_aspects_by_names(names) Retrieve a list of _Aspect objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of aspect names to retrieve. The names must be provided as a list of strings. Returns: A list of _Aspect objects corresponding to provided names. Return type: list[_Aspect] get_concept_by_name(name) Retrieves a concept from the list of concepts based on the provided name, if the instance has *concepts* attribute. Parameters: **name** (*str*) -- The name of the concept to search for. Returns: The *_Concept* object with the specified name. Return type: _Concept Raises: **ValueError** -- If no concept with the specified name is found. get_concepts_by_names(names) Retrieve a list of _Concept objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of concept names to retrieve. The names must be provided as a list of strings. Returns: A list of _Concept objects corresponding to provided names. Return type: list[_Concept] property llm_roles: set[str] A set of LLM roles associated with the object's aspects and concepts. Returns: A set containing unique LLM roles gathered from aspects and concepts. Return type: set[str] remove_all_aspects() Removes all aspects from the instance and returns the updated instance. This method clears the *aspects* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. 
Return type: "typing.Self" Returns: The updated instance with all aspects removed remove_all_concepts() Removes all concepts from the instance and returns the updated instance. This method clears the *concepts* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all concepts removed remove_all_instances() Removes all assigned instances from the object and resets them as empty lists. Returns the modified instance. Returns: The modified object with all assigned instances removed. Return type: Self remove_aspect_by_name(name) Removes an aspect from the assigned aspects by its name. Parameters: **name** (*str*) -- The name of the aspect to be removed Returns: Updated instance with the aspect removed. Return type: Self remove_aspects_by_names(names) Removes multiple aspects from an object based on the provided list of names. Parameters: **names** (*list**[**str**]*) -- A list of names identifying the aspects to be removed. Returns: The updated object after the specified aspects have been removed. Return type: Self remove_concept_by_name(name) Removes a concept from the assigned concepts by its name. Parameters: **name** (*str*) -- The name of the concept to be removed Returns: Updated instance with the concept removed. Return type: Self remove_concepts_by_names(names) Removes concepts from the object by their names. Parameters: **names** (*list**[**str**]*) -- A list of concept names to be removed. Returns: Returns the updated instance after removing the specified concepts. Return type: Self property sentences: list[_Sentence] Provides access to all sentences within the paragraphs of the document by flattening and combining sentences from each paragraph into a single list. Returns: A list of _Sentence objects that are contained within all paragraphs. 
      Return type:
         list[_Sentence]

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      - All public attributes

      - Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified path.

      This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with '.json'.

         * **RuntimeError** -- If there's an error during the file writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   raw_text: NonEmptyStr | None

   paragraphs: list[_Paragraph]

   images: list[_Image]

   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   paragraph_segmentation_mode: Literal['newlines', 'sat']

   sat_model_id: SaTModelId

   pre_segment_sentences: bool

   custom_data: JSONDictField


# ==== api/converters ====

Converters
**********

class contextgem.public.converters.DocxConverter

   Bases: "_DocxConverterBase"

   Converter for DOCX files into ContextGem documents.

   This class handles extraction of text, formatting, tables, images, footnotes, comments, and other elements from DOCX files by directly parsing Word XML.

   The converter is read-only and does not modify the source DOCX file in any way. It only extracts content for conversion to ContextGem document object or text formats.

   The resulting ContextGem document is populated with the following:

   * Raw text: The raw text of the DOCX file.

   * Paragraphs: Paragraph objects with the following metadata:

     * Raw text: The raw text of the paragraph.

     * Additional context: Metadata about the paragraph's style, list level, table cell position, being part of a footnote or comment, etc. This context provides additional information that is useful for LLM analysis and extraction.

   * Images: Image objects constructed from embedded images in the DOCX file.

   Example:

      DocxConverter usage example

         # Using ContextGem's DocxConverter

         from contextgem import DocxConverter


         converter = DocxConverter()

         # Convert a DOCX file to an LLM-ready ContextGem Document
         # from path
         document = converter.convert("path/to/document.docx")
         # or from file object
         with open("path/to/document.docx", "rb") as docx_file_object:
             document = converter.convert(docx_file_object)

         # Perform data extraction on the resulting Document object
         # document.add_aspects(...)
         # document.add_concepts(...)
         # llm.extract_all(document)

         # You can also use DocxConverter instance as a standalone text extractor
         docx_text = converter.convert_to_text_format(
             "path/to/document.docx",
             output_format="markdown",  # or "raw"
         )

   convert_to_text_format(docx_path_or_file, output_format='markdown', include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_links=True, include_inline_formatting=True, strict_mode=False)

      Converts a DOCX file directly to text without creating a ContextGem Document.
Parameters: * **docx_path_or_file** ("str" | "pathlib._local.Path" | "typing.BinaryIO") -- Path to the DOCX file (as string or Path object) or a file-like object * **output_format** ("typing.Literal"["'raw'", "'markdown'"]) -- Output format ("markdown" or "raw") (default: "markdown") * **include_tables** ("bool") -- If True, include tables in the output (default: True) * **include_comments** ("bool") -- If True, include comments in the output (default: True) * **include_footnotes** ("bool") -- If True, include footnotes in the output (default: True) * **include_headers** ("bool") -- If True, include headers in the output (default: True) * **include_footers** ("bool") -- If True, include footers in the output (default: True) * **include_textboxes** ("bool") -- If True, include textbox content (default: True) * **include_links** ("bool") -- If True, process and format hyperlinks (default: True) * **include_inline_formatting** ("bool") -- If True, apply inline formatting (bold, italic, etc.) in markdown mode (default: True) * **strict_mode** ("bool") -- If True, raise exceptions for any processing error instead of skipping problematic elements (default: False) Return type: "str" Returns: Text in the specified format Note: When using markdown output format, the following conditions apply: * Document structure elements (headings, lists, tables) are preserved * Headings are converted to markdown heading syntax (# Heading 1, ## Heading 2, etc.) 
* Lists are converted to markdown list syntax, preserving numbering and hierarchy * Tables are formatted using markdown table syntax * Footnotes, comments, headers, and footers are included as specially marked sections convert(docx_path_or_file, apply_markdown=True, raw_text_to_md=None, include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_images=True, include_links=True, include_inline_formatting=True, strict_mode=False) Converts a DOCX file into a ContextGem Document object. Parameters: * **docx_path_or_file** ("str" | "pathlib._local.Path" | "typing.BinaryIO") -- Path to the DOCX file (as string or Path object) or a file-like object * **apply_markdown** ("bool") -- If True, applies markdown processing and formatting to the document content while preserving raw text separately (default: True) * **raw_text_to_md** ("bool" | "None") -- [DEPRECATED] Use apply_markdown instead. Will be removed in v1.0.0. Note: This parameter previously controlled whether raw_text would contain raw or markdown text. The new apply_markdown parameter instead controls whether to apply markdown processing while keeping raw text and processed text separate. 
         * **include_tables** ("bool") -- If True, include tables in the output (default: True)

         * **include_comments** ("bool") -- If True, include comments in the output (default: True)

         * **include_footnotes** ("bool") -- If True, include footnotes in the output (default: True)

         * **include_headers** ("bool") -- If True, include headers in the output (default: True)

         * **include_footers** ("bool") -- If True, include footers in the output (default: True)

         * **include_textboxes** ("bool") -- If True, include textbox content (default: True)

         * **include_images** ("bool") -- If True, extract and include images (default: True)

         * **include_links** ("bool") -- If True, process and format hyperlinks (default: True)

         * **include_inline_formatting** ("bool") -- If True, apply inline formatting (bold, italic, etc.) in markdown mode (default: True)

         * **strict_mode** ("bool") -- If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)

      Return type:
         "contextgem.public.documents.Document"

      Returns:
         A populated Document object


# ==== api/aspects ====

Aspects
*******

Module for handling document aspects.

This module provides the Aspect class, which represents a defined area or topic within a document that requires focused attention. Aspects are used to identify and extract specific subjects or themes from documents according to predefined criteria.

class contextgem.public.aspects.Aspect(**data)

   Bases: "_Aspect"

   Represents an aspect with associated metadata, sub-aspects, concepts, and logic for validation.

   An aspect is a defined area or topic within a document that requires focused attention. Each aspect corresponds to a specific subject or theme described in the task.

   Variables:
      * **name** (*str*) -- The name of the aspect. Required, non-empty string.

      * **description** (*str*) -- A detailed description of the aspect. Required, non-empty string.

      * **concepts** (*list**[**_Concept**]*) -- A list of concepts associated with the aspect. These concepts must be unique in both name and description and cannot include concepts with vision LLM roles.

      * **llm_role** (*LLMRoleAspect*) -- The role of the LLM responsible for aspect extraction. Default is "extractor_text". Valid roles are "extractor_text" and "reasoner_text".

      * **reference_depth** (*ReferenceDepth*) -- The structural depth of references (paragraphs or sentences). Defaults to "paragraphs". Affects the structure of "extracted_items".

      * **add_justifications** (*bool*) -- Whether the LLM will output justification for each extracted item. Inherited from base class. Defaults to False.

      * **justification_depth** (*JustificationDepth*) -- The level of detail for justifications. Inherited from base class. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum number of sentences in a justification. Inherited from base class. Defaults to 2.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**, **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**, **json_schema_input_type=PydanticUndefined**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*)

   Example:

      Aspect definition

         from contextgem import Aspect


         # Define an aspect focused on termination clauses
         termination_aspect = Aspect(
             name="Termination provisions",
             description="Contract termination conditions, notice requirements, and severance terms.",
             reference_depth="sentences",
             add_justifications=True,
             justification_depth="comprehensive",
         )

   Create a new model by parsing and validating input data from keyword arguments.

   Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field name.

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and returns the updated instance. This method ensures that the provided aspects are deeply copied to avoid any unintended state modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to be added. Each aspect is deeply copied to ensure the original list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts* attribute of the instance. This method ensures that the provided list of concepts is deep-copied to prevent unintended side effects from modifying the input list outside of this method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts to be added. It will be deep-copied before being added to the instance's *concepts* attribute.
Returns: Returns the instance itself after the modification. Return type: Self clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_ExtractedItem] Provides access to extracted items. Returns: A list containing the extracted items as *_ExtractedItem* objects. Return type: list[_ExtractedItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. 
Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. get_aspect_by_name(name) Finds and returns an aspect with the specified name from the list of available aspects, if the instance has *aspects* attribute. Parameters: **name** (*str*) -- The name of the aspect to find. Returns: The aspect with the specified name. Return type: _Aspect Raises: **ValueError** -- If no aspect with the specified name is found. get_aspects_by_names(names) Retrieve a list of _Aspect objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of aspect names to retrieve. The names must be provided as a list of strings. Returns: A list of _Aspect objects corresponding to provided names. Return type: list[_Aspect] get_concept_by_name(name) Retrieves a concept from the list of concepts based on the provided name, if the instance has *concepts* attribute. Parameters: **name** (*str*) -- The name of the concept to search for. Returns: The *_Concept* object with the specified name. Return type: _Concept Raises: **ValueError** -- If no concept with the specified name is found. get_concepts_by_names(names) Retrieve a list of _Concept objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of concept names to retrieve. The names must be provided as a list of strings. Returns: A list of _Concept objects corresponding to provided names. Return type: list[_Concept] property llm_roles: set[str] A set of LLM roles associated with the object's aspects and concepts. Returns: A set containing unique LLM roles gathered from aspects and concepts. Return type: set[str] property reference_paragraphs: list[_Paragraph] Provides access to the instance's reference paragraphs, assigned during extraction. Returns: A list containing the paragraphs as *_Paragraph* objects. 
Return type: list[_Paragraph] property reference_sentences: list[_Sentence] Provides access to the instance's reference sentences, assigned during extraction. Returns: A list containing the sentences as *_Sentence* objects. Return type: list[_Sentence] remove_all_aspects() Removes all aspects from the instance and returns the updated instance. This method clears the *aspects* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all aspects removed remove_all_concepts() Removes all concepts from the instance and returns the updated instance. This method clears the *concepts* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all concepts removed remove_all_instances() Removes all assigned instances from the object and resets them as empty lists. Returns the modified instance. Returns: The modified object with all assigned instances removed. Return type: Self remove_aspect_by_name(name) Removes an aspect from the assigned aspects by its name. Parameters: **name** (*str*) -- The name of the aspect to be removed Returns: Updated instance with the aspect removed. Return type: Self remove_aspects_by_names(names) Removes multiple aspects from an object based on the provided list of names. Parameters: **names** (*list**[**str**]*) -- A list of names identifying the aspects to be removed. Returns: The updated object after the specified aspects have been removed. Return type: Self remove_concept_by_name(name) Removes a concept from the assigned concepts by its name. Parameters: **name** (*str*) -- The name of the concept to be removed Returns: Updated instance with the concept removed. Return type: Self remove_concepts_by_names(names) Removes concepts from the object by their names. 
Parameters: **names** (*list**[**str**]*) -- A list of concept names to be removed. Returns: Returns the updated instance after removing the specified concepts. Return type: Self to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. name: NonEmptyStr description: NonEmptyStr aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)] concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)] llm_role: LLMRoleAspect reference_depth: ReferenceDepth add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField # ==== api/concepts ==== Concepts ******** Module for handling concepts at aspect and document levels. 
This module provides classes for defining different types of concepts that can be extracted from documents and aspects. Concepts represent specific pieces of information to be identified and extracted by LLMs, such as strings, numbers, boolean values, JSON objects, and ratings. Each concept type has specific properties and behaviors tailored to the kind of data it represents, including validation rules, extraction methods, and reference handling. Concepts can be attached to documents or aspects and can include examples, justifications, and references to the source text. class contextgem.public.concepts.StringConcept(**data) Bases: "_StringConcept" A concept model for string-based information extraction from documents and aspects. This class provides functionality for defining, extracting, and managing string data as conceptual entities within documents or aspects. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **examples** (*list**[**StringExample**]*) -- Example strings illustrating the concept usage. * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. * **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". * **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. 
Affects the structure of "extracted_items". * **singular_occurrence** (*StrictBool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "document title" vs "key findings"). Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **examples** (*list**[**_StringExample**]*) Example: String concept definition from contextgem import StringConcept, StringExample #
Define a string concept for identifying contract party names # and their roles in the contract party_names_and_roles_concept = StringConcept( name="Party names and roles", description=( "Names of all parties entering into the agreement and their contractual roles" ), examples=[ StringExample( content="X (Client)", # guidance regarding format ) ], ) Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_ExtractedItem] Provides access to extracted items. Returns: A list containing the extracted items as *_ExtractedItem* objects. Return type: list[_ExtractedItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. 
Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. 
Return type: str property unique_id: str Returns the ULID of the instance. examples: list[_StringExample] name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField class contextgem.public.concepts.BooleanConcept(**data) Bases: "_BooleanConcept" A concept model for boolean (True/False) information extraction from documents and aspects. This class handles identification and extraction of boolean values that represent conceptual properties or attributes within content. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. * **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". * **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. Affects the structure of "extracted_items". * **singular_occurrence** (*StrictBool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). 
Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "contains confidential information" vs "compliance violations"). Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) Example: Boolean concept definition from contextgem import BooleanConcept # Create the concept with specific configuration has_confidentiality = BooleanConcept( name="Contains confidentiality clause", description="Determines whether the contract includes provisions requiring parties to maintain confidentiality", llm_role="reasoner_text", singular_occurrence=True,
add_justifications=True, justification_depth="brief", ) Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_ExtractedItem] Provides access to extracted items. Returns: A list containing the extracted items as *_ExtractedItem* objects. Return type: list[_ExtractedItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. 
This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. 
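The serialization contract described above (a dictionary form, a JSON string built from it, a class-name check on load, and a '.json' suffix check on disk paths) can be illustrated with a small standard-library sketch. This is not ContextGem's implementation: "SketchConcept" is a hypothetical stand-in class, and the "__class__" key used to store the class name is an assumption made only for this example.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any


@dataclass
class SketchConcept:
    """Hypothetical stand-in for a concept; not a ContextGem class."""

    name: str
    description: str
    singular_occurrence: bool = False

    def to_dict(self) -> dict[str, Any]:
        # Store the class name next to the fields so loading can verify it.
        # The "__class__" key is an assumption made for this sketch.
        return {"__class__": type(self).__name__, **asdict(self)}

    def to_json(self) -> str:
        # The JSON string is built from the dictionary representation.
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, obj_dict: dict[str, Any]) -> SketchConcept:
        fields = {k: v for k, v in obj_dict.items() if k != "__class__"}
        return cls(**fields)

    @classmethod
    def from_json(cls, json_string: str) -> SketchConcept:
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class.
        if obj_dict.get("__class__") != cls.__name__:
            raise TypeError("Serialized class name does not match.")
        return cls.from_dict(obj_dict)

    def to_disk(self, file_path: str | Path) -> None:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("File path must end with '.json'.")
        try:
            # Formatted JSON, written with UTF-8 encoding.
            path.write_text(
                json.dumps(self.to_dict(), indent=4), encoding="utf-8"
            )
        except OSError as exc:
            raise RuntimeError(f"Could not write {path}") from exc

    @classmethod
    def from_disk(cls, file_path: str | Path) -> SketchConcept:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("File path must end with '.json'.")
        return cls.from_json(path.read_text(encoding="utf-8"))


concept = SketchConcept(
    name="Contains confidentiality clause",
    description="Whether the contract requires confidentiality",
    singular_occurrence=True,
)

# In-memory round trip through a JSON string.
restored = SketchConcept.from_json(concept.to_json())
assert restored == concept

# On-disk round trip through a temporary '.json' file.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "concept.json"
    concept.to_disk(target)
    assert SketchConcept.from_disk(target) == concept
```

With the real classes, the same round trip would presumably go through the documented "to_json()" / "from_json()" and "to_disk()" / "from_disk()" methods. How the class name is actually stored in the serialized data is not stated here, so treat the key name above as illustrative only.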
name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField class contextgem.public.concepts.NumericalConcept(**data) Bases: "_NumericalConcept" A concept model for numerical information extraction from documents and aspects. This class handles identification and extraction of numeric values (integers, floats, or both) that represent conceptual measurements or quantities within content. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **numeric_type** (*Literal**[**"int"**, **"float"**, **"any"**]*) -- Type constraint for extracted numbers ("int", "float", or "any"). Defaults to "any" for auto-detection. * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. * **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". * **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. Affects the structure of "extracted_items". * **singular_occurrence** (*StrictBool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. 
Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "total revenue" vs "monthly sales figures"). Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **numeric_type** (*Literal**[**'int'**, **'float'**, **'any'**]*) Example: Numerical concept definition from contextgem import NumericalConcept # Create concepts for different numerical values in the contract payment_amount = NumericalConcept( name="Payment amount", description="The monetary value to be paid according to the contract terms",
numeric_type="float", llm_role="extractor_text", add_references=True, reference_depth="sentences", ) payment_days = NumericalConcept( name="Payment term days", description="The number of days within which payment must be made", numeric_type="int", llm_role="extractor_text", add_justifications=True, justification_depth="balanced", ) Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_ExtractedItem] Provides access to extracted items. Returns: A list containing the extracted items as *_ExtractedItem* objects. Return type: list[_ExtractedItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. 
Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. 
Return type: str property unique_id: str Returns the ULID of the instance. numeric_type: Literal['int', 'float', 'any'] name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField class contextgem.public.concepts.RatingConcept(**data) Bases: "_RatingConcept" A concept model for rating-based information extraction with defined scale boundaries. This class handles identification and extraction of integer ratings that must fall within the boundaries of a specified rating scale. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **rating_scale** (*RatingScale** | **tuple**[**int**, **int**]*) -- The rating scale defining valid value boundaries. Can be either a RatingScale object (deprecated, will be removed in v1.0.0) or a tuple of (start, end) integers. * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. * **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". * **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. Affects the structure of "extracted_items". 
* **singular_occurrence** (*StrictBool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "product rating score" vs "customer satisfaction ratings"). Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **rating_scale** (*_RatingScale** | **tuple**[**Annotated**[**int**, **Strict**(**strict=True**)**]**, **Annotated**[**int**,
**Strict**(**strict=True**)**]**]*) Example: Rating concept definition from contextgem import RatingConcept # Create a concept to rate the fairness of contract terms fairness_rating = RatingConcept( name="Contract fairness rating", description="Evaluation of how balanced and fair the contract terms are for all parties", rating_scale=(1, 5), llm_role="reasoner_text", add_justifications=True, justification_depth="comprehensive", justification_max_sents=10, ) # Create a concept to rate the clarity of contract language clarity_rating = RatingConcept( name="Language clarity rating", description="Assessment of how clear and unambiguous the contract language is", rating_scale=(1, 10), llm_role="reasoner_text", add_justifications=True, justification_depth="balanced", justification_max_sents=3, ) Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_IntegerItem] Gets the list of extracted rating items. Returns: List of extracted integer items representing ratings. Return type: list[_IntegerItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. 
Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. 
Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. rating_scale: _RatingScale | tuple[StrictInt, StrictInt] name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField class contextgem.public.concepts.JsonObjectConcept(**data) Bases: "_JsonObjectConcept" A concept model for structured JSON object extraction from documents and aspects. This class handles identification and extraction of structured data in JSON format, with validation against a predefined schema structure. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **structure** (*type** | **dict**[**str**, **Any**]*) -- JSON object schema as a class with type annotations or dictionary where keys are field names and values are type annotations. All dictionary keys must be strings. Supports generic aliases, union types, nested dictionaries for complex hierarchical structures, lists of dictionaries for array items, Literal types, and classes with type annotations (Pydantic models, dataclasses, etc.) for nested structures. All annotated types must be JSON-serializable. 
Examples: * Simple structure: "{"item": str, "amount": int | float}" * Nested structure: "{"item": str, "details": {"price": float, "quantity": int}}" * List of objects: "{"items": [{"name": str, "price": float}]}" * List of primitives: "{"names": [str], "scores": [int | float], "statuses": [Literal["active", "inactive"]]}" * List of classes: "{"addresses": [AddressModel], "users": [UserModel]}" * Literal values: "{"status": Literal["pending", "completed", "failed"]}" * With type annotated classes: "{"address": AddressModel}" where AddressModel can be a Pydantic model, dataclass, or any class with type annotations **Note**: For lists, you can use either generic syntax ("list[str]") or literal syntax ("[str]"). List instances support primitive types, unions, literals, and typed classes. Both "{"items": [ClassName]}" and "{"items": list[ClassName]}" are equivalent. **Note**: Class types cannot be used as dictionary keys or values. For example, "dict[str, Address]" is not allowed. Use alternative structures like nested objects or lists of objects instead. **Note**: When using classes that contain other classes as type hints, inherit from "JsonObjectClassStruct" in all parts of the class hierarchy, to ensure proper conversion of nested class hierarchies to dictionary representations for serialization. **Tip**: do not overcomplicate the structure to avoid prompt overloading. * **examples** (*list**[**JsonObjectExample**]*) -- Example JSON objects illustrating the concept usage. * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. * **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". 
* **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. Affects the structure of "extracted_items". * **singular_occurrence** (*StrictBool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "product specifications" vs "customer order details"). Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**,
**'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **structure** (*type** | **dict**[**Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]**, **Any**]*) * **examples** (*list**[**_JsonObjectExample**]*) Example: JSON object concept definition

    from typing import Literal

    from contextgem import JsonObjectConcept

    # Define a JSON object concept for capturing address information
    address_info_concept = JsonObjectConcept(
        name="Address information",
        description=(
            "Structured address data from text including street, "
            "city, state, postal code, and country."
        ),
        structure={
            "street": str | None,
            "city": str | None,
            "state": str | None,
            "postal_code": str | None,
            "country": str | None,
            "address_type": Literal["residential", "business"] | None,
        },
    )

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_ExtractedItem] Provides access to extracted items. Returns: A list containing the extracted items as *_ExtractedItem* objects. Return type: list[_ExtractedItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class.
It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. 
This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. structure: type | dict[NonEmptyStr, Any] examples: list[_JsonObjectExample] name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField class contextgem.public.concepts.DateConcept(**data) Bases: "_DateConcept" A concept model for date object extraction from documents and aspects. This class handles identification and extraction of dates, with support for parsing string representations in a specified format into Python date objects. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. 
* **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". * **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. Affects the structure of "extracted_items". * **singular_occurrence** (*StrictBool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "contract signing date" vs "meeting dates"). Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**,
**'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) Example: Date concept definition

    from contextgem import DateConcept

    # Create a date concept to extract the effective date of the contract
    effective_date = DateConcept(
        name="Effective date",
        description="The effective date as specified in the contract",
        add_references=True,  # Include references to where dates were found
        singular_occurrence=True,  # Only extract one effective date per document
    )

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_ExtractedItem] Provides access to extracted items. Returns: A list containing the extracted items as *_ExtractedItem* objects. Return type: list[_ExtractedItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk.
This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. 
* **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField class contextgem.public.concepts.LabelConcept(**data) Bases: "_LabelConcept" A concept model for label-based classification of documents and aspects. This class handles identification and classification using predefined labels, supporting both multi-class (single label selection) and multi-label (multiple label selection) classification approaches. **Note**: Behavior depends on "classification_type": * "multi_class": exactly one label is always returned for each extracted item. If none of the specific labels apply, include a catch-all label (e.g., ""other"", ""N/A"") among "labels" so the model can select it. * "multi_label": when none of the predefined labels apply, no extracted items may be returned (empty "extracted_items" list). This prevents forced classification when no appropriate label exists. Variables: * **name** (*str*) -- The name of the concept (non-empty string, stripped). * **description** (*str*) -- A brief description of the concept (non-empty string, stripped). * **labels** (*list**[**str**]*) -- List of predefined labels (non-empty strings, stripped) for classification. Must contain at least 2 unique labels. * **classification_type** (*ClassificationType*) -- Classification mode - "multi_class" for single label selection, "multi_label" for multiple label selection. 
Defaults to "multi_class". * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible for extracting the concept ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **add_justifications** (*bool*) -- Whether to include justifications for extracted items. * **justification_depth** (*JustificationDepth*) -- Justification detail level. Defaults to "brief". * **justification_max_sents** (*int*) -- Maximum sentences in justification. Defaults to 2. * **add_references** (*bool*) -- Whether to include source references for extracted items. * **reference_depth** (*ReferenceDepth*) -- Source reference granularity ("paragraphs" or "sentences"). Defaults to "paragraphs". Only relevant when references are added to extracted items. Affects the structure of "extracted_items". * **singular_occurrence** (*bool*) -- Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept's name, description, and type (e.g., "document type" vs "content topics"). 
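The label rules described above can be illustrated with a small stand-alone sketch. It does not use ContextGem: "validate_labels" and "is_valid_result" are hypothetical helpers written only to mirror the documented constraints, and the library's own validation may differ. Two assumptions are made here: "at least 2 unique labels" is read as "at least two distinct values", and each extracted item in "multi_label" mode is assumed to carry at least one label.

```python
from typing import List


def validate_labels(labels: List[str]) -> List[str]:
    """Strip labels and require at least 2 distinct, non-empty values (assumed reading)."""
    cleaned = [label.strip() for label in labels]
    if any(not label for label in cleaned):
        raise ValueError("labels must be non-empty strings")
    if len(set(cleaned)) < 2:
        raise ValueError("at least 2 unique labels are required")
    return cleaned


def is_valid_result(
    items: List[List[str]],
    labels: List[str],
    classification_type: str = "multi_class",
) -> bool:
    """Check a list of extracted items, each given as the list of labels it selected."""
    if classification_type not in ("multi_class", "multi_label"):
        raise ValueError(f"unknown classification type: {classification_type}")
    for selected in items:
        # Only predefined labels may be selected.
        if any(label not in labels for label in selected):
            return False
        # multi_class: exactly one label per extracted item.
        if classification_type == "multi_class" and len(selected) != 1:
            return False
        # multi_label: one or more labels per item (assumption); when no
        # label applies, the items list itself is simply empty.
        if classification_type == "multi_label" and len(selected) < 1:
            return False
    return True
```

In this sketch, a "multi_class" label set without a catch-all such as "Other" leaves no valid choice for a document that fits none of the specific labels, which is why the note above recommends including one.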
Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **add_justifications** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **justification_depth** (*Literal**[**'brief'**, **'balanced'**, **'comprehensive'**]*) * **justification_max_sents** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **description** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **llm_role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **add_references** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **reference_depth** (*Literal**[**'paragraphs'**, **'sentences'**]*) * **singular_occurrence** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **labels** (*list**[**Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]**]*) * **classification_type** (*Literal**[**'multi_class'**, **'multi_label'**]*) Example: Label concept definition

    from contextgem import LabelConcept

    # Multi-class classification: single label selection
    document_type_concept = LabelConcept(
        name="Document Type",
        description="Classify the type of legal document",
        labels=["NDA", "Consultancy Agreement", "Privacy Policy", "Other"],
        classification_type="multi_class",
        singular_occurrence=True,
    )

    # Multi-label classification: multiple label selection
    content_topics_concept = LabelConcept(
        name="Content Topics",
        description="Identify all relevant topics covered in the document",
        labels=["Finance", "Legal", "Technology", "HR", "Operations", "Marketing"],
        classification_type="multi_label",
        add_justifications=True,
        justification_depth="brief",  # add justifications for the selected labels
    )

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. property extracted_items: list[_LabelItem] Gets the list of extracted label items. Returns: List of extracted label items. Return type: list[_LabelItem] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object.
Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. 
Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. labels: list[NonEmptyStr] classification_type: ClassificationType name: NonEmptyStr description: NonEmptyStr llm_role: LLMRoleAny add_references: StrictBool reference_depth: ReferenceDepth singular_occurrence: StrictBool add_justifications: StrictBool justification_depth: JustificationDepth justification_max_sents: StrictInt custom_data: JSONDictField

# ==== api/examples ====

Examples
********

Module for handling example data in document processing. This module provides classes for defining examples that can be used to guide LLM extraction tasks. Examples serve as reference points for the model to understand the expected format and content of extracted information. The module supports different types of examples including string-based examples and structured JSON object examples. Examples can be attached to concepts to provide concrete illustrations of the kind of information to be extracted, improving the accuracy and consistency of LLM-based extraction processes. class contextgem.public.examples.StringExample(**data) Bases: "_StringExample" Represents a string example that can be provided by users for certain extraction tasks. Variables: **content** (*str*) -- A non-empty string that holds the text content of the example. Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **content** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) Note: Examples are optional and can be used to guide LLM extraction tasks.
They serve as reference points for the model to understand the expected format and content of extracted information. StringExample can be attached to a "StringConcept". Example: String example definition

    from contextgem import StringConcept, StringExample

    # Create string examples
    string_examples = [
        StringExample(content="X (Client)"),
        StringExample(content="Y (Supplier)"),
    ]

    # Attach string examples to a StringConcept
    string_concept = StringConcept(
        name="Contract party name and role",
        description="The name and role of the contract party",
        examples=string_examples,  # Attach the example to the concept (optional)
    )

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object.
Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. 
Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. content: NonEmptyStr custom_data: JSONDictField class contextgem.public.examples.JsonObjectExample(**data) Bases: "_JsonObjectExample" Represents a JSON object example that can be provided by users for certain extraction tasks. Variables: **content** (*dict**[**str**, **Any**]*) -- A JSON-serializable dict with the minimum length of 1 that holds the content of the example. Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **content** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) Note: Examples are optional and can be used to guide LLM extraction tasks. They serve as reference points for the model to understand the expected format and content of extracted information. JsonObjectExample can be attached to a "JsonObjectConcept". Example: JSON object example definition

    from contextgem import JsonObjectConcept, JsonObjectExample

    # Create a JSON object example
    json_example = JsonObjectExample(
        content={
            "name": "John Doe",
            "education": "Bachelor's degree in Computer Science",
            "skills": ["Python", "Machine Learning", "Data Analysis"],
            "hobbies": ["Reading", "Traveling", "Gaming"],
        }
    )

    # Define a structure for JSON object concept
    class PersonInfo:
        name: str
        education: str
        skills: list[str]
        hobbies: list[str]

    # Also works as a dict with type hints, e.g.
    # PersonInfo = {
    #     "name": str,
    #     "education": str,
    #     "skills": list[str],
    #     "hobbies": list[str],
    # }

    # Attach JSON example to a JsonObjectConcept
    json_concept = JsonObjectConcept(
        name="Candidate info",
        description="Structured information about a job candidate",
        structure=PersonInfo,  # Define the expected structure
        examples=[json_example],  # Attach the example to the concept (optional)
    )

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails.
classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. 
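The serialization methods described above follow a common round-trip pattern: "to_disk" and "from_disk" check the '.json' suffix, and "from_json" checks the stored class name before rebuilding the object. The sketch below illustrates that pattern with the standard library only. It is not ContextGem's implementation: the class "SketchExample" and the "class_name" key are hypothetical names chosen for illustration, and the real serialized layout may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


class SketchExample:
    """Hypothetical stand-in for a serializable example; not a ContextGem class."""

    def __init__(self, content: dict):
        self.content = content

    def to_dict(self) -> dict:
        # Assumed layout: store the class name next to the public attributes.
        return {"class_name": type(self).__name__, "content": self.content}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    def to_disk(self, file_path: Union[str, Path]) -> None:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("File path must end with '.json'")
        path.write_text(json.dumps(self.to_dict(), indent=4), encoding="utf-8")

    @classmethod
    def from_dict(cls, obj_dict: dict) -> "SketchExample":
        return cls(content=obj_dict["content"])

    @classmethod
    def from_json(cls, json_string: str) -> "SketchExample":
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class.
        if obj_dict.get("class_name") != cls.__name__:
            raise TypeError("Serialized class name does not match")
        return cls.from_dict(obj_dict)

    @classmethod
    def from_disk(cls, file_path: Union[str, Path]) -> "SketchExample":
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("File path must end with '.json'")
        return cls.from_json(path.read_text(encoding="utf-8"))


original = SketchExample(content={"name": "John Doe"})
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.json"
    original.to_disk(target)
    restored = SketchExample.from_disk(target)
print(restored.content == original.content)  # → True
```

With the real classes, the equivalent calls would be the documented "to_json()", "from_json()", "to_disk()" and "from_disk()" methods on the example instance or class.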
content: JSONDictField custom_data: JSONDictField # ==== api/llms ==== LLMs **** Module for handling processing logic using LLMs. This module provides classes and utilities for interacting with LLMs in document processing workflows. It includes functionality for managing LLM configurations, handling API calls, processing text and image inputs, tracking token usage and costs, and managing rate limits for LLM requests. The module supports various LLM providers through the litellm library, enabling both text-only and multimodal (vision) capabilities. It implements efficient asynchronous processing patterns and provides detailed usage statistics for monitoring and cost management. class contextgem.public.llms.DocumentLLMGroup(**data) Bases: "_DocumentLLMGroup" Represents a group of DocumentLLMs with unique roles for processing document content. This class manages multiple LLMs assigned to specific roles for text and vision processing. It ensures role compliance and facilitates extraction of aspects and concepts from documents. Variables: * **llms** (*list**[**DocumentLLM**]*) -- A list of DocumentLLM instances, each with a unique role (e.g., *extractor_text*, *reasoner_text*, *extractor_vision*, *reasoner_vision*). At least 2 instances with distinct roles are required. * **output_language** (*LanguageRequirement*) -- Language for produced output text (justifications, explanations). Values: "en" (always English) or "adapt" (matches document/image language). All LLMs in the group must share the same output_language setting. Defaults to "en". Applies only when DocumentLLMs' default system messages are used. Parameters: * **llms** (*list**[**_DocumentLLM**]*) * **output_language** (*Literal**[**'en'**, **'adapt'**]*) Note: Refer to the "DocumentLLM" class for more information on constructing LLMs for the group. 
Example: LLM group definition

    from contextgem import DocumentLLM, DocumentLLMGroup

    # Create a text extractor LLM with a fallback
    text_extractor = DocumentLLM(
        model="openai/gpt-4o-mini",
        api_key="your-openai-api-key",  # Replace with your actual API key
        role="extractor_text",
    )

    # Create a fallback LLM for the text extractor
    text_extractor_fallback = DocumentLLM(
        model="anthropic/claude-3-5-haiku",
        api_key="your-anthropic-api-key",  # Replace with your actual API key
        role="extractor_text",  # Must have the same role as the primary LLM
        is_fallback=True,
    )

    # Assign the fallback LLM to the primary text extractor
    text_extractor.fallback_llm = text_extractor_fallback

    # Create a text reasoner LLM
    text_reasoner = DocumentLLM(
        model="openai/o3-mini",
        api_key="your-openai-api-key",  # Replace with your actual API key
        role="reasoner_text",  # For more complex tasks that require reasoning
    )

    # Create a vision extractor LLM
    vision_extractor = DocumentLLM(
        model="openai/gpt-4o-mini",
        api_key="your-openai-api-key",  # Replace with your actual API key
        role="extractor_vision",  # For handling images
    )

    # Create a vision reasoner LLM
    vision_reasoner = DocumentLLM(
        model="openai/gpt-5-mini",
        api_key="your-openai-api-key",
        role="reasoner_vision",  # For more complex vision tasks that require reasoning
    )

    # Create a DocumentLLMGroup with all four LLMs
    llm_group = DocumentLLMGroup(
        llms=[text_extractor, text_reasoner, vision_extractor, vision_reasoner],
        output_language="en",  # All LLMs must have the same output language ("en" is default)
    )
    # This group will have 5 LLMs: four main ones, with different roles,
    # and one fallback LLM for a specific LLM. Each LLM can have a fallback LLM.

    # Get usage statistics for the whole group or for a specific role
    group_usage = llm_group.get_usage()
    text_extractor_usage = llm_group.get_usage(llm_role="extractor_text")

    # Get cost statistics for the whole group or for a specific role
    all_costs = llm_group.get_cost()
    text_extractor_cost = llm_group.get_cost(llm_role="extractor_text")

    # Reset usage and cost statistics for the whole group or for a specific role
    llm_group.reset_usage_and_cost()
    llm_group.reset_usage_and_cost(llm_role="extractor_text")

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. _eq_deserialized_llm_config(other) Custom config equality method to compare this _DocumentLLMGroup with a deserialized instance. Uses the *_eq_deserialized_llm_config* method of the _DocumentLLM class to compare each LLM in the group, including fallbacks, if any. Parameters: **other** (*_DocumentLLMGroup*) -- Another _DocumentLLMGroup instance to compare with Returns: True if the instances are equal, False otherwise Return type: bool extract_all(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts all aspects and concepts from a document and its aspects. This method performs comprehensive extraction by processing the document for aspects and concepts, then extracting concepts from each aspect. The operation can be configured for concurrent processing and customized extraction parameters. This is the synchronous version of *extract_all_async()*. Parameters: * **document** (*_Document*) -- The document to analyze. * **overwrite_existing** (*bool**, **optional*) -- Whether to overwrite already processed aspects and concepts with newly extracted information.
Defaults to False. * **max_items_per_call** (*int**, **optional*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool**, **optional*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int**, **optional*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). Relevant only for document-level concepts. * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: The document with extracted aspects and concepts. Return type: _Document async extract_all_async(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Asynchronously extracts all aspects and concepts from a document and its aspects. This method performs comprehensive extraction by processing the document for aspects and concepts, then extracting concepts from each aspect. The operation can be configured for concurrent processing and customized extraction parameters. Parameters: * **document** (*_Document*) -- The document to analyze. 
* **overwrite_existing** (*bool**, **optional*) -- Whether to overwrite already processed aspects and concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int**, **optional*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool**, **optional*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int**, **optional*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). Relevant only for document-level concepts. * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: The document with extracted aspects and concepts. Return type: _Document extract_aspects_from_document(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts aspects from the provided document using predefined LLMs. If an aspect instance has "extracted_items" populated, the "reference_paragraphs" field will be automatically populated from these items. This is the synchronous version of *extract_aspects_from_document_async()*. 
Parameters: * **document** (*_Document*) -- The document from which aspects are to be extracted. * **from_aspects** (*list**[**_Aspect**] **| **None*) -- Existing aspects to use as a base for extraction. If None, uses all document's aspects. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed aspects with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum items with the same extraction params to process per LLM call. Defaults to 0 (all items in a single call). For complex tasks, you should not set a high value, to avoid prompt overloading. If concurrency is enabled, defaults to 1 (each item processed separately). * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed _Aspect objects with extracted items. Return type: list[_Aspect] async extract_aspects_from_document_async(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts aspects from the provided document using predefined LLMs asynchronously. If an aspect instance has "extracted_items" populated, the "reference_paragraphs" field will be automatically populated from these items.
Parameters: * **document** (*_Document*) -- The document from which aspects are to be extracted. * **from_aspects** (*list**[**_Aspect**] **| **None*) -- Existing aspects to use as a base for extraction. If None, uses all document's aspects. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed aspects with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process per LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed _Aspect objects with extracted items. Return type: list[_Aspect] extract_concepts_from_aspect(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts concepts associated with a given aspect in a document. This method processes an aspect to extract related concepts using LLMs. If the aspect has not been previously processed, a ValueError is raised. This is the synchronous version of *extract_concepts_from_aspect_async()*. 
Parameters: * **aspect** (*_Aspect*) -- The aspect from which to extract concepts. * **document** (*_Document*) -- The document that contains the aspect. * **from_concepts** (*list**[**_Concept**] **| **None*) -- List of existing concepts to process. Defaults to None. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed concept objects. Return type: list[_Concept] async extract_concepts_from_aspect_async(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Asynchronously extracts concepts from a specified aspect using LLMs. This method processes an aspect to extract related concepts using LLMs. If the aspect has not been previously processed, a ValueError is raised. 
Parameters: * **aspect** (*_Aspect*) -- The aspect from which to extract concepts. * **document** (*_Document*) -- The document that contains the aspect. * **from_concepts** (*list**[**_Concept**] **| **None*) -- List of existing concepts to process. Defaults to None. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed concept objects. Return type: list[_Concept] extract_concepts_from_document(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts concepts from the provided document using predefined LLMs. This is the synchronous version of *extract_concepts_from_document_async()*. Parameters: * **document** (*_Document*) -- The document from which concepts are to be extracted. 
* **from_concepts** (*list**[**_Concept**] **| **None*) -- Existing concepts to use as a base for extraction. If None, uses all document's concepts. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum items with the same extraction params to process per LLM call. Defaults to 0 (all items in a single call). For complex tasks, you should not set a high value, to avoid prompt overloading. If concurrency is enabled, defaults to 1 (each item processed separately). * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed Concept objects with extracted items. Return type: list[_Concept] async extract_concepts_from_document_async(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts concepts from the provided document using predefined LLMs asynchronously. This method processes a document to extract concepts using configured LLMs.
Parameters: * **document** (*_Document*) -- The document from which concepts are to be extracted. * **from_concepts** (*list**[**_Concept**] **| **None*) -- Existing concepts to use as a base for extraction. If None, uses all document's concepts. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process per LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed Concept objects with extracted items. Return type: list[_Concept] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.
Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. get_cost(llm_role=None) Retrieves the accumulated cost information of the LLMs in the group, filtered by the specified LLM role if provided. Parameters: **llm_role** (*str** | **None*) -- Optional; A string representing the role of the LLM to filter the cost data. If None, returns cost for all LLMs in the group. Returns: A list of cost statistics containers for the specified LLMs and their fallbacks. Return type: list[_LLMCostOutputContainer] Raises: **ValueError** -- If no LLM with the specified role exists in the group. 
get_usage(llm_role=None) Retrieves the usage information of the LLMs in the group, filtered by the specified LLM role if provided. Parameters: **llm_role** (*str** | **None*) -- Optional; A string representing the role of the LLM to filter the usage data. If None, returns usage for all LLMs in the group. Returns: A list of usage statistics containers for the specified LLMs and their fallbacks. Return type: list[_LLMUsageOutputContainer] Raises: **ValueError** -- If no LLM with the specified role exists in the group. group_update_output_language(output_language) Updates the output language for all LLMs in the group. Parameters: **output_language** (*LanguageRequirement*) -- The new output language to set for all LLMs Returns: None Return type: None property is_group: bool Returns True indicating this is a group of LLMs. Returns: Always True for DocumentLLMGroup instances. Return type: bool property list_roles: list[Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision', 'extractor_multimodal', 'reasoner_multimodal']] Returns a list of all roles assigned to the LLMs in this group. Returns: A list of LLM role identifiers Return type: list[LLMRoleAny] reset_usage_and_cost(llm_role=None) Resets the usage and cost statistics for LLMs in the group. This method clears accumulated usage and cost data, which is useful when processing multiple documents sequentially and tracking metrics for each document separately. Parameters: **llm_role** (*str** | **None*) -- Optional; A string representing the role of the LLM to reset statistics for. If None, resets statistics for all LLMs in the group. Returns: None Return type: None Raises: **ValueError** -- If no LLM with the specified role exists in the group. to_dict() Transforms the current object into a dictionary representation. 
Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str llms: list[_DocumentLLM] output_language: LanguageRequirement class contextgem.public.llms.DocumentLLM(**data) Bases: "_DocumentLLM" Handles processing documents with a specific LLM. This class serves as an abstraction for interacting with an LLM. It provides functionality for querying the LLM with text or image inputs, and manages prompt preparation and token usage tracking. The class can be configured with different roles based on the document processing task. Variables: * **model** (*str*) -- Model identifier in the format {model_provider}/{model_name}. See https://docs.litellm.ai/docs/providers for supported providers. * **deployment_id** (*str** | **None*) -- Deployment ID for the LLM. Primarily used with Azure OpenAI. * **api_key** (*str** | **None*) -- API key for LLM authentication.
Not required for local models (e.g., Ollama). * **api_base** (*str** | **None*) -- Base URL of the API endpoint. * **api_version** (*str** | **None*) -- API version. Primarily used with Azure OpenAI. * **role** (*LLMRoleAny*) -- Role type for the LLM ("extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision", "extractor_multimodal", "reasoner_multimodal"). Defaults to "extractor_text". * **system_message** (*str** | **None*) -- Preparatory system-level message to set context for LLM responses. * **temperature** (*float** | **None*) -- Sampling temperature (0.0 to 1.0) controlling response creativity. Lower values produce more predictable outputs, higher values generate more varied responses. Defaults to 0.3. * **max_tokens** (*int*) -- Maximum tokens allowed in the generated response. Defaults to 4096. * **max_completion_tokens** (*int*) -- Maximum token size for output completions in reasoning (CoT-capable) models. Defaults to 16000. * **reasoning_effort** (*ReasoningEffort** | **None*) -- The effort level for the LLM to reason about the input. Can be set to "minimal" (gpt-5 models only), "low", "medium", or "high". Relevant for reasoning (CoT-capable) models. Defaults to None. * **top_p** (*float** | **None*) -- Nucleus sampling value (0.0 to 1.0) controlling output focus/randomness. Lower values make output more deterministic, higher values produce more diverse outputs. Defaults to 0.3. * **num_retries_failed_request** (*int*) -- Number of retries when LLM request fails. Defaults to 3. * **max_retries_failed_request** (*int*) -- LLM provider-specific retry count for failed requests. Defaults to 0. * **max_retries_invalid_data** (*int*) -- Number of retries when LLM returns invalid data. Defaults to 3. * **timeout** (*int*) -- Timeout in seconds for LLM API calls. Defaults to 120 seconds. * **pricing_details** (*LLMPricing** | **None*) -- LLMPricing object with pricing details for cost calculation. Defaults to None.
* **auto_pricing** (*bool*) -- Enable automatic LLM cost calculation using genai-prices. Ignored when "pricing_details" is provided. Defaults to "False". * **auto_pricing_refresh** (*bool*) -- Whether genai-prices should auto-refresh its cached pricing data. Defaults to "False". * **is_fallback** (*bool*) -- Indicates whether the LLM is a fallback model. Defaults to False. * **fallback_llm** (*DocumentLLM** | **None*) -- DocumentLLM to use as fallback if current one fails. Must have the same role as the current LLM. Defaults to None. * **output_language** (*LanguageRequirement*) -- Language for produced output text (justifications, explanations). Can be "en" (English) or "adapt" (adapts to document/image language). Defaults to "en". Applies only when DocumentLLM's default system message is used. * **async_limiter** (*AsyncLimiter*) -- Controls frequency of async LLM API requests for concurrent tasks. Defaults to allowing 3 acquisitions per 10-second period to prevent rate limit issues. See https://github.com/mjpieters/aiolimiter for configuration details. * **seed** (*int** | **None*) -- Seed for random number generation to help produce more consistent outputs across multiple runs. When set to a specific integer value, the LLM will attempt to use this seed for sampling operations. However, deterministic output is still not guaranteed even with the same seed, as other factors may influence the model's response. Defaults to None. 
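To make the per-1M-token pricing fields concrete, the arithmetic behind a cost figure can be sketched as below. This is a minimal illustration only, not ContextGem's internal cost calculation: the helper name "estimate_cost" and the token counts are hypothetical, and the parameter names simply mirror the "input_per_1m_tokens" and "output_per_1m_tokens" fields of "LLMPricing" used in this section's example.

```python
from decimal import Decimal

# Prices are quoted per one million tokens.
TOKENS_PER_PRICING_UNIT = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, input_per_1m_tokens, output_per_1m_tokens):
    """Hypothetical helper: estimate cost from token counts and per-1M-token prices."""
    # Convert prices via str() so that 0.150 becomes Decimal("0.15") exactly.
    input_price = Decimal(str(input_per_1m_tokens))
    output_price = Decimal(str(output_per_1m_tokens))
    input_cost = Decimal(input_tokens) / TOKENS_PER_PRICING_UNIT * input_price
    output_cost = Decimal(output_tokens) / TOKENS_PER_PRICING_UNIT * output_price
    return input_cost + output_cost


# Illustrative numbers: 20,000 prompt tokens and 4,000 completion tokens.
cost = estimate_cost(20_000, 4_000, input_per_1m_tokens=0.150, output_per_1m_tokens=0.600)
print(f"Estimated cost: {cost}")
```

Decimal arithmetic is used here so that small per-token amounts do not pick up binary floating-point noise; actual costs tracked by the framework are retrieved with "get_cost()".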
Parameters: * **model** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **deployment_id** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**] **| **None*) * **api_key** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**] **| **None*) * **api_base** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**] **| **None*) * **api_version** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**] **| **None*) * **role** (*Literal**[**'extractor_text'**, **'reasoner_text'**, **'extractor_vision'**, **'reasoner_vision'**, **'extractor_multimodal'**, **'reasoner_multimodal'**]*) * **system_message** (*str** | **None*) * **max_tokens** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **max_completion_tokens** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **reasoning_effort** (*Literal**[**'minimal'**, **'low'**, **'medium'**, **'high'**] **| **None*) * **num_retries_failed_request** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **max_retries_failed_request** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **max_retries_invalid_data** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **timeout** 
(*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **pricing_details** (*_LLMPricing** | **None*) * **is_fallback** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **fallback_llm** (*_DocumentLLM** | **None*) * **output_language** (*Literal**[**'en'**, **'adapt'**]*) * **temperature** (*Annotated**[**float**, **Strict**(**strict=True**)**] **| **None*) * **top_p** (*Annotated**[**float**, **Strict**(**strict=True**)**] **| **None*) * **seed** (*Annotated**[**int**, **Strict**(**strict=True**)**] **| **None*) * **tools** (*list**[**Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]**] **| **None*) * **tool_choice** (*str** | **Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**] **| **None*) * **parallel_tool_calls** (*bool** | **None*) * **tool_max_rounds** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **auto_pricing** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) * **auto_pricing_refresh** (*Annotated**[**bool**, **Strict**(**strict=True**)**]*) Note: * LLM groups Refer to the "DocumentLLMGroup" class for more information on constructing LLM groups, which are a collection of LLMs with unique roles, used for complex document processing tasks. * LLM role The "role" of an LLM is an abstraction to differentiate between tasks of different complexity. For example, if an aspect/concept is assigned "llm_role="extractor_text"", it means that the aspect/concept is extracted from the document using the LLM with the "role" set to "extractor_text". This helps to channel different tasks to different LLMs, ensuring that the task is handled by the most appropriate model. Usually, domain expertise is required to determine the most appropriate role for a specific aspect/concept.
But for simple use cases, you can skip the role assignment completely, in which case the "role" will default to "extractor_text". * Explicit capability declaration Model vision capabilities are automatically detected using "litellm.supports_vision()". If this function does not correctly identify your model's capabilities, ContextGem will typically issue a warning, and you can explicitly declare the capability by setting "_supports_vision=True" on the LLM instance. Example: LLM definition

   from contextgem import DocumentLLM, LLMPricing

   # Create a single LLM for text extraction
   text_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="your-api-key",  # Replace with your actual API key
       role="extractor_text",  # Role for text extraction
       pricing_details=LLMPricing(  # optional
           input_per_1m_tokens=0.150,
           output_per_1m_tokens=0.600
       ),  # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )

   # Create a fallback LLM in case the primary model fails
   fallback_text_extractor = DocumentLLM(
       model="anthropic/claude-3-7-sonnet",
       api_key="your-anthropic-api-key",  # Replace with your actual API key
       role="extractor_text",  # must be the same as the role of the primary LLM
       is_fallback=True,
       pricing_details=LLMPricing(  # optional
           input_per_1m_tokens=3.00,
           output_per_1m_tokens=15.00
       ),  # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )

   # Assign the fallback LLM to the primary LLM
   text_extractor.fallback_llm = fallback_text_extractor

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. _eq_deserialized_llm_config(other) Custom config equality method to compare this _DocumentLLM with a deserialized instance.
Compares the __dict__ of both instances and performs specific checks for certain attributes that require special handling. Note that, by default, the reconstructed deserialized _DocumentLLM will be only partially equal (==) to the original one, as the API credentials are redacted, and the attached prompt templates, async limiter, and async lock are not serialized and point to different objects in memory post-initialization. Also, usage and cost are reset by default pre-serialization. Parameters: **other** (*_DocumentLLM*) -- Another _DocumentLLM instance to compare with Returns: True if the instances are equal, False otherwise Return type: bool _update_default_prompt(prompt_path, prompt_type) For advanced users only! Update the default Jinja2 prompt template for the LLM. This method allows you to replace the built-in prompt templates with custom ones for specific extraction types. The framework uses these templates to guide the LLM in extracting structured information from documents. The custom prompt must be a valid Jinja2 template and include all the necessary variables that are present in the default prompt. Otherwise, the extraction may fail. Default prompts are located under "contextgem/internal/prompts/" IMPORTANT NOTES: The default prompts are complex and specifically designed for various steps of LLM extraction with the framework. Such prompts include the necessary instructions, template variables, nested structures and loops, etc. Only use custom prompts if you MUST have a deeper customization and adaptation of the default prompts to your specific use case. Otherwise, the default prompts should be sufficient for most use cases. Use at your own risk!
Parameters: * **prompt_path** (*str** | **Path*) -- Path to the Jinja2 template file (.j2 extension required) * **prompt_type** (*DefaultPromptType*) -- Type of prompt to update ("aspect" or "concept") Returns: None Return type: None property async_limiter: AsyncLimiter Gets the async rate limiter for this LLM. Returns: The AsyncLimiter instance controlling request rate limits. Return type: AsyncLimiter chat(prompt, *, images=None, chat_session=None) Synchronously sends a prompt to the LLM and gets a response. For models supporting vision, attach images to the prompt if needed. This method allows direct interaction with the LLM by submitting your own prompt. Parameters: * **prompt** (*str*) -- The input prompt to send to the LLM * **images** (*list**[**Image**] **| **None*) -- Optional list of Image instances for vision queries * **chat_session** (*_ChatSession** | **None*) -- Optional stateful chat session to preserve and use history. Returns: The LLM's response Return type: str Raises: * **ValueError** -- If the prompt is empty or not a string * **ValueError** -- If images parameter is not a list of Image instances * **ValueError** -- If images are provided but the model doesn't support vision * **RuntimeError** -- If the LLM call fails and no fallback is available async chat_async(prompt, *, images=None, chat_session=None) Asynchronously sends a prompt to the LLM and gets a response. For models supporting vision, attach images to the prompt if needed. This method allows direct interaction with the LLM by submitting your own prompt. Parameters: * **prompt** (*str*) -- The input prompt to send to the LLM * **images** (*list**[**Image**] **| **None*) -- Optional list of Image instances for vision queries * **chat_session** (*_ChatSession** | **None*) -- Optional stateful chat session to preserve and use history. 
Returns: The LLM's response Return type: str Raises: * **ValueError** -- If the prompt is empty or not a string * **ValueError** -- If images parameter is not a list of Image instances * **ValueError** -- If images are provided but the model doesn't support vision * **RuntimeError** -- If the LLM call fails and no fallback is available extract_all(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts all aspects and concepts from a document and its aspects. This method performs comprehensive extraction by processing the document for aspects and concepts, then extracting concepts from each aspect. The operation can be configured for concurrent processing and customized extraction parameters. This is the synchronous version of *extract_all_async()*. Parameters: * **document** (*_Document*) -- The document to analyze. * **overwrite_existing** (*bool**, **optional*) -- Whether to overwrite already processed aspects and concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int**, **optional*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool**, **optional*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int**, **optional*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). 
* **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). Relevant only for document-level concepts. * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: The document with extracted aspects and concepts. Return type: _Document async extract_all_async(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Asynchronously extracts all aspects and concepts from a document and its aspects. This method performs comprehensive extraction by processing the document for aspects and concepts, then extracting concepts from each aspect. The operation can be configured for concurrent processing and customized extraction parameters. Parameters: * **document** (*_Document*) -- The document to analyze. * **overwrite_existing** (*bool**, **optional*) -- Whether to overwrite already processed aspects and concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int**, **optional*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool**, **optional*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. 
* **max_paragraphs_to_analyze_per_call** (*int**, **optional*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). Relevant only for document-level concepts. * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: The document with extracted aspects and concepts. Return type: _Document extract_aspects_from_document(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts aspects from the provided document using predefined LLMs. If an aspect instance has "extracted_items" populated, the "reference_paragraphs" field will be automatically populated from these items. This is the synchronous version of *extract_aspects_from_document_async()*. Parameters: * **document** (*_Document*) -- The document from which aspects are to be extracted. * **from_aspects** (*list**[**_Aspect**] **| **None*) -- Existing aspects to use as a base for extraction. If None, uses all document's aspects. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed aspects with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum items with the same extraction params to process per LLM call. Defaults to 0 (all items in a single call). For complex tasks, you should not set a high value, to avoid prompt overloading. If concurrency is enabled, defaults to 1 (each item processed separately). * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items.
Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed _Aspect objects with extracted items. Return type: list[_Aspect] async extract_aspects_from_document_async(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts aspects from the provided document using predefined LLMs asynchronously. If an aspect instance has "extracted_items" populated, the "reference_paragraphs" field will be automatically populated from these items. Parameters: * **document** (*_Document*) -- The document from which aspects are to be extracted. * **from_aspects** (*list**[**_Aspect**] **| **None*) -- Existing aspects to use as a base for extraction. If None, uses all document's aspects. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed aspects with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process per LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. 
Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed _Aspect objects with extracted items. Return type: list[_Aspect] extract_concepts_from_aspect(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts concepts associated with a given aspect in a document. This method processes an aspect to extract related concepts using LLMs. If the aspect has not been previously processed, a ValueError is raised. This is the synchronous version of *extract_concepts_from_aspect_async()*. Parameters: * **aspect** (*_Aspect*) -- The aspect from which to extract concepts. * **document** (*_Document*) -- The document that contains the aspect. * **from_concepts** (*list**[**_Concept**] **| **None*) -- List of existing concepts to process. Defaults to None. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. 
* **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed concept objects. Return type: list[_Concept] async extract_concepts_from_aspect_async(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Asynchronously extracts concepts from a specified aspect using LLMs. This method processes an aspect to extract related concepts using LLMs. If the aspect has not been previously processed, a ValueError is raised. Parameters: * **aspect** (*_Aspect*) -- The aspect from which to extract concepts. * **document** (*_Document*) -- The document that contains the aspect. * **from_concepts** (*list**[**_Concept**] **| **None*) -- List of existing concepts to process. Defaults to None. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process in each LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading. 
* **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to include in a single LLM prompt. Defaults to 0 (all paragraphs). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed concept objects. Return type: list[_Concept] extract_concepts_from_document(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts concepts from the provided document using predefined LLMs. This is the synchronous version of *extract_concepts_from_document_async()*. Parameters: * **document** (*_Document*) -- The document from which concepts are to be extracted. * **from_concepts** (*list**[**_Concept**] **| **None*) -- Existing concepts to use as a base for extraction. If None, uses all document's concepts. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum items with the same extraction params to process per LLM call. Defaults to 0 (all items in a single call). For complex tasks, you should not set a high value, to avoid prompt overloading. If concurrency is enabled, defaults to 1 (each item processed separately). * **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items.
Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed Concept objects with extracted items. Return type: list[_Concept] async extract_concepts_from_document_async(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True) Extracts concepts from the provided document using predefined LLMs asynchronously. This method processes a document to extract concepts using configured LLMs. Parameters: * **document** (*_Document*) -- The document from which concepts are to be extracted. * **from_concepts** (*list**[**_Concept**] **| **None*) -- Existing concepts to use as a base for extraction. If None, uses all document's concepts. * **overwrite_existing** (*bool*) -- Whether to overwrite already processed concepts with newly extracted information. Defaults to False. * **max_items_per_call** (*int*) -- Maximum number of items with the same extraction params to process per LLM call. Defaults to 0 (all items in one call). If concurrency is enabled, defaults to 1. For complex tasks, you should not set a high value, in order to avoid prompt overloading.
* **use_concurrency** (*bool*) -- If True, enables concurrent processing of multiple items. Concurrency can considerably reduce processing time, but may cause rate limit errors with LLM providers. Use this option when API rate limits allow for multiple concurrent requests. Defaults to False. * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum paragraphs to analyze in a single LLM prompt. Defaults to 0 (all paragraphs). * **max_images_to_analyze_per_call** (*int**, **optional*) -- Maximum images to include in a single LLM prompt. Defaults to 0 (all images). * **raise_exception_on_extraction_error** (*bool**, **optional*) -- Whether to raise an exception if the extraction fails due to invalid data returned by an LLM or an error in the LLM API. If False, a warning will be issued instead, and no extracted items will be returned. Defaults to True. Returns: List of processed Concept objects with extracted items. Return type: list[_Concept] classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. 
Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. get_cost() Retrieves the accumulated cost information of the LLM and its fallback LLM if configured. This method collects cost statistics for the current LLM instance and its fallback LLM (if configured), providing insights into API usage expenses. Returns: A list of cost statistics containers for the LLM and its fallback. Return type: list[_LLMCostOutputContainer] get_usage() Retrieves the usage information of the LLM and its fallback LLM if configured. This method collects token usage statistics for the current LLM instance and its fallback LLM (if configured), providing insights into API consumption. Returns: A list of usage statistics containers for the LLM and its fallback. Return type: list[_LLMUsageOutputContainer] property is_group: bool Returns False indicating this is a single LLM, not a group. Returns: Always False for DocumentLLM instances. Return type: bool property list_roles: list[Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision', 'extractor_multimodal', 'reasoner_multimodal']] Returns a list containing the role of this LLM. (For a single LLM, this returns a list with just one element - the LLM's role. For LLM groups, the method implementation returns roles of all LLMs in the group.) 
Returns: A list containing the role of this LLM. Return type: list[LLMRoleAny] reset_usage_and_cost() Resets the usage and cost statistics for the LLM and its fallback LLM (if configured). This method clears accumulated usage and cost data, which is useful when processing multiple documents sequentially and tracking metrics for each document separately. Returns: None Return type: None to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. 
Return type: str model: NonEmptyStr deployment_id: NonEmptyStr | None api_key: NonEmptyStr | None api_base: NonEmptyStr | None api_version: NonEmptyStr | None role: LLMRoleAny system_message: str | None max_tokens: StrictInt max_completion_tokens: StrictInt reasoning_effort: ReasoningEffort | None num_retries_failed_request: StrictInt max_retries_failed_request: StrictInt max_retries_invalid_data: StrictInt timeout: StrictInt pricing_details: _LLMPricing | None is_fallback: StrictBool fallback_llm: _DocumentLLM | None output_language: LanguageRequirement temperature: StrictFloat | None top_p: StrictFloat | None seed: StrictInt | None tools: list[JSONDictField] | None tool_choice: str | JSONDictField | None parallel_tool_calls: bool | None tool_max_rounds: StrictInt auto_pricing: StrictBool auto_pricing_refresh: StrictBool class contextgem.public.llms.ChatSession(**data) Bases: "_ChatSession" Stateful chat session that preserves message history across turns. To be used as "chat_session=..." parameter for "DocumentLLM.chat(...)" or "DocumentLLM.chat_async(...)". Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. Parameters: **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class.
It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. property messages: list[_Message] Returns the list of messages in the session. Returns: The list of messages in the session. Return type: list[_Message] reset() Clears conversation history by removing all messages. Returns: None Return type: None to_dict() Transforms the current object into a dictionary representation. 
Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. custom_data: JSONDictField # ==== api/data_models ==== Data models *********** Module defining public data validation models. class contextgem.public.data_models.LLMPricing(**data) Bases: "_LLMPricing" Represents the pricing details for an LLM. Defines the cost structure for processing input tokens and generating output tokens, with prices specified per million tokens. Variables: * **input_per_1m_tokens** (*StrictFloat*) -- The cost in currency units for processing 1M input tokens. * **output_per_1m_tokens** (*StrictFloat*) -- The cost in currency units for generating 1M output tokens. 
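To make the per-million-token pricing concrete, the following minimal sketch works through the arithmetic in plain Python. The helper name "estimate_cost", the token counts, and the prices are hypothetical illustrations; this is not ContextGem's internal cost calculation.

```python
from decimal import Decimal


def estimate_cost(input_tokens, output_tokens, input_per_1m_tokens, output_per_1m_tokens):
    """Hypothetical helper (not part of ContextGem).

    cost = tokens / 1,000,000 * price per 1M tokens, summed for input and output.
    Decimal is used to avoid binary floating-point rounding in currency math.
    """
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) / million * Decimal(str(input_per_1m_tokens))
    output_cost = Decimal(output_tokens) / million * Decimal(str(output_per_1m_tokens))
    return input_cost + output_cost


# Assumed figures: 250k input tokens and 50k output tokens,
# priced at 1.10 and 4.40 currency units per 1M tokens.
total = estimate_cost(250_000, 50_000, 1.10, 4.40)
print(total)
```

Prices are expressed in currency units per 1M tokens, so a request's cost scales linearly with its token counts.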
Parameters: * **input_per_1m_tokens** (*Annotated**[**float**, **Strict**(**strict=True**)**]*) * **output_per_1m_tokens** (*Annotated**[**float**, **Strict**(**strict=True**)**]*)

Example: LLM pricing definition

   from contextgem import LLMPricing

   # Create a pricing model for an LLM (openai/o3-mini example)
   pricing = LLMPricing(
       input_per_1m_tokens=1.10,  # $1.10 per million input tokens
       output_per_1m_tokens=4.40,  # $4.40 per million output tokens
   )

   # LLMPricing objects are immutable
   try:
       pricing.input_per_1m_tokens = 0.7
   except ValueError as e:
       print(f"Error when trying to modify pricing: {e}")

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails.
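The disk serialization contract described for these methods (a path that must end with '.json', UTF-8 formatted JSON, "ValueError" for a wrong suffix, "RuntimeError" for a failed load) can be sketched with the standard library alone. The functions "save_json" and "load_json" below are hypothetical stand-ins written for illustration; they are not ContextGem's implementation, which additionally restores typed objects.

```python
import json
import tempfile
from pathlib import Path


def save_json(data: dict, file_path) -> None:
    """Hypothetical stand-in for a to_disk()-style method."""
    path = Path(file_path)
    if not str(path).endswith(".json"):
        raise ValueError("File path must end with '.json'")
    # Formatted JSON, written with UTF-8 encoding
    path.write_text(json.dumps(data, indent=4), encoding="utf-8")


def load_json(file_path) -> dict:
    """Hypothetical stand-in for a from_disk()-style method."""
    path = Path(file_path)
    if not str(path).endswith(".json"):
        raise ValueError("File path must end with '.json'")
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"Deserialization failed: {exc}") from exc


rejected = None
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "pricing.json"
    save_json({"input_per_1m_tokens": 1.10, "output_per_1m_tokens": 4.40}, target)
    restored = load_json(target)

    # A path with the wrong suffix is rejected before anything is written
    try:
        save_json({}, Path(tmp_dir) / "pricing.txt")
    except ValueError as exc:
        rejected = str(exc)
```

Validating the suffix up front keeps a mistyped path from silently producing a file that the loading counterpart would later refuse.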
classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. 
Return type: str input_per_1m_tokens: StrictFloat output_per_1m_tokens: StrictFloat class contextgem.public.data_models.RatingScale(*, start=0, end=10) Bases: "_RatingScale" Represents a rating scale with defined minimum and maximum values. Deprecated since version 0.10.0: RatingScale is deprecated and will be removed in v1.0.0. Use a tuple of (start, end) integers instead, e.g. (1, 5) instead of RatingScale(start=1, end=5). This class defines a numerical scale for rating concepts, with configurable start and end values that determine the valid range for ratings. Variables: * **start** (*StrictInt*) -- The minimum value of the rating scale (inclusive). Must be greater than or equal to 0. * **end** (*StrictInt*) -- The maximum value of the rating scale (inclusive). Must be greater than 0. Parameters: * **start** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) * **end** (*Annotated**[**int**, **Strict**(**strict=True**)**]*) Initialize RatingScale with deprecation warning. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. 
Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. 
Return type: str start: StrictInt end: StrictInt

# ==== api/utils ====

Utility functions and classes
*****************************

Module defining public utility functions and classes of the framework. contextgem.public.utils.image_to_base64(source) Converts an image to its Base64 encoded string representation. Helper function that can be used when constructing "Image" objects. Parameters: **source** (*str** | **Path** | **BinaryIO** | **bytes*) -- The image source - can be a file path (str or Path), file-like object (BytesIO, file handle, etc.), or raw bytes data. Returns: A Base64 encoded string representation of the image. Return type: str Raises: * **FileNotFoundError** -- If the image file path does not exist. * **OSError** -- If the image cannot be read.

Example:

   >>> from pathlib import Path
   >>> import io
   >>> from contextgem import image_to_base64
   >>>
   >>> # From file path
   >>> base64_str = image_to_base64("path/to/image.jpg")
   >>>
   >>> # From file handle
   >>> with open("image.png", "rb") as f:
   ...     base64_str = image_to_base64(f)
   >>>
   >>> # From bytes data
   >>> with open("image.webp", "rb") as f:
   ...     image_bytes = f.read()
   >>> base64_str = image_to_base64(image_bytes)
   >>>
   >>> # From BytesIO
   >>> buffer = io.BytesIO(image_bytes)
   >>> base64_str = image_to_base64(buffer)

contextgem.public.utils.create_image(source) Creates an Image instance from various image sources. This function automatically determines the MIME type and converts the image to base64 format using Pillow functionality. It supports common image formats including JPEG, PNG, and WebP. Parameters: **source** (*str** | **Path** | **PILImage.Image** | **BinaryIO** | **bytes*) -- The image source - can be a file path (str or Path), PIL Image object, file-like object (BytesIO, file handle, etc.), or raw bytes data. Returns: An Image instance with the appropriate MIME type and base64 data. Return type: Image Raises: * **ValueError** -- If the image format is not supported or cannot be determined.
* **FileNotFoundError** -- If the image file path does not exist. * **OSError** -- If the image cannot be opened or processed.

Example:

   >>> from pathlib import Path
   >>> from PIL import Image as PILImage
   >>> import io
   >>> from contextgem import create_image
   >>>
   >>> # From file path
   >>> img = create_image("path/to/image.jpg")
   >>>
   >>> # From PIL Image object
   >>> pil_img = PILImage.open("path/to/image.png")
   >>> img = create_image(pil_img)
   >>>
   >>> # From file-like object
   >>> with open("image.jpg", "rb") as f:
   ...     img = create_image(f)
   >>>
   >>> # From bytes data
   >>> with open("image.png", "rb") as f:
   ...     image_bytes = f.read()
   >>> img = create_image(image_bytes)
   >>>
   >>> # From BytesIO
   >>> buffer = io.BytesIO(image_bytes)
   >>> img = create_image(buffer)

contextgem.public.utils.reload_logger_settings() Reloads logger settings from environment variables. Call this function when logging-related environment variables have been changed after the module was imported. It re-reads the environment variables and reconfigures the logger accordingly. Returns: None

Example: Reload logger settings

   import os

   from contextgem import reload_logger_settings

   # Initial logger settings are loaded from environment variables at import time

   # Change logger level to WARNING
   os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "WARNING"
   print("Setting logger level to WARNING")
   reload_logger_settings()
   # Now the logger will only show WARNING level and above messages

   # Disable the logger completely
   os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "OFF"
   print("Disabling the logger")
   reload_logger_settings()
   # Now the logger is disabled and won't show any messages

   # You can re-enable the logger by setting it back to a valid level
   # os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "INFO"
   # reload_logger_settings()

class contextgem.public.utils.JsonObjectClassStruct(*args, **kwargs) Bases: "_JsonObjectClassStruct" A base class that automatically converts class hierarchies to dictionary representations.
This class enables the use of existing class hierarchies (such as dataclasses or Pydantic models) with nested type hints as a structure definition for JsonObjectConcept. When you need to use typed class hierarchies with JsonObjectConcept, inherit from this class in all parts of your class structure.

Example: Using JsonObjectClassStruct for class hierarchies

   from dataclasses import dataclass

   from contextgem import JsonObjectClassStruct, JsonObjectConcept


   @dataclass
   class Address(JsonObjectClassStruct):
       street: str
       city: str
       country: str


   @dataclass
   class Contact(JsonObjectClassStruct):
       email: str
       phone: str
       address: Address


   @dataclass
   class Person(JsonObjectClassStruct):
       name: str
       age: int
       contact: Contact


   # Use the class structure with JsonObjectConcept
   # JsonObjectClassStruct enables automatic conversion of typed class hierarchies
   # into the dictionary structure required by JsonObjectConcept, preserving the
   # type information and nested relationships between classes.
   JsonObjectConcept(name="person", description="Person information", structure=Person)

Replacement for "__new__" that blocks direct instantiation of the decorated class while allowing subclasses to instantiate normally. If invoked for the exact decorated class, an error is logged and "TypeError" is raised. For subclasses, the call is forwarded to the next "__new__" in the MRO, preserving base-class behavior (e.g., Pydantic's "BaseModel.__new__"). Parameters: * **inner_cls** -- The class being instantiated (decorated class or its subclass). * **args** -- Positional constructor arguments. * **kwargs** -- Keyword constructor arguments. Returns: A new instance when called for a subclass. Raises: **TypeError** -- When attempting to instantiate the decorated class directly. Return type: Any

# ==== api/images ====

Images
******

Module for handling document images. This module provides the Image class, which represents visual content that can be attached to or fully represent a document.
Images are stored in base64-encoded format with specified MIME types to ensure proper handling. class contextgem.public.images.Image(**data) Bases: "_Image" Represents an image with specified MIME type and base64-encoded data. An image is typically attached to a document, or fully represents a document. The utility function "create_image()" from "contextgem.public.utils" can be used to create an Image instance from various sources: file paths, PIL Image objects, file-like objects, or raw bytes data. Variables: * **mime_type** (*Literal**[**"image/jpg"**, **"image/jpeg"**, **"image/png"**, **"image/webp"**]*) -- The MIME type of the image. This must be one of the predefined valid types ("image/jpg", "image/jpeg", "image/png", "image/webp"). * **base64_data** (*str*) -- The base64-encoded data of the image. The utility function "image_to_base64()" from "contextgem.public.utils" can be used to encode images to base64. Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **mime_type** (*Literal**[**'image/jpg'**, **'image/jpeg'**, **'image/png'**, **'image/webp'**]*) * **base64_data** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) Note: * Attached to documents: An image must be attached to a document. A document can have multiple images. * Extraction types: Only document-level concept extraction is supported for images. Use an LLM with the role "extractor_vision", "reasoner_vision", "extractor_multimodal", or "reasoner_multimodal" to extract concepts from images.
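To illustrate what "a MIME type plus base64-encoded data" amounts to, here is a standard-library sketch. It is not ContextGem's implementation: the documentation states that "create_image()" relies on Pillow, whereas this sketch swaps in a simple check of leading magic bytes. The names "sniff_mime_type" and "encode_image" are hypothetical.

```python
import base64

# The MIME types listed as valid for Image.mime_type
ALLOWED_MIME_TYPES = ("image/jpg", "image/jpeg", "image/png", "image/webp")


def sniff_mime_type(data: bytes) -> str:
    """Hypothetical helper: guess the MIME type from leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("Unsupported or unrecognized image format")


def encode_image(data: bytes) -> dict:
    """Hypothetical helper: build the two fields an Image carries."""
    return {
        "mime_type": sniff_mime_type(data),
        "base64_data": base64.b64encode(data).decode("ascii"),
    }


# A fake payload that only carries the PNG signature (not a decodable image)
png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
payload = encode_image(png_header)
print(payload["mime_type"])
```

Base64 encoding is lossless, so decoding "base64_data" yields the original bytes again.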
Example: Image definition

   from pathlib import Path

   from contextgem import Document, Image, create_image, image_to_base64

   # Path is adapted for doc tests
   current_file = Path(__file__).resolve()
   root_path = current_file.parents[4]

   # Using the create_image utility function (recommended approach)
   # Automatically detects MIME type and converts to base64
   image_path = root_path / "tests" / "images" / "invoices" / "invoice.jpg"
   jpg_image = create_image(image_path)

   # Using pre-encoded base64 data directly
   png_image = Image(
       mime_type="image/png",
       base64_data="base64-string",  # image as a base64 string
   )

   # Using a different supported image format with create_image
   webp_image = create_image(root_path / "tests" / "images" / "invoices" / "invoice.webp")

   # Alternative: Manual approach using image_to_base64 (when you need specific control)
   manual_image = Image(mime_type="image/jpeg", base64_data=image_to_base64(image_path))

   # Attaching an image to a document
   # Documents can contain both text and multiple images, or just images

   # Create a document with text content
   text_document = Document(
       raw_text="This is a document with an attached image that shows an invoice.",
       images=[jpg_image],
   )

   # Create a document with only image content (no text)
   image_only_document = Document(images=[jpg_image])

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class.
It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. 
This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. mime_type: Literal['image/jpg', 'image/jpeg', 'image/png', 'image/webp'] base64_data: NonEmptyStr custom_data: JSONDictField # ==== api/paragraphs ==== Paragraphs ********** Module for handling document paragraphs. This module provides the Paragraph class, which represents a structured segment of text within a document. Paragraphs serve as containers for sentences and maintain the raw text content of the segment they represent. The module supports validation to ensure data integrity and provides mechanisms to prevent inconsistencies during document analysis by restricting certain attribute modifications after initial assignment. class contextgem.public.paragraphs.Paragraph(**data) Bases: "_Paragraph" Represents a paragraph of a document with its raw text content and constituent sentences. Paragraphs are immutable text segments that can contain multiple sentences. Once sentences are assigned to a paragraph, they cannot be changed to maintain data integrity during analysis. Variables: * **raw_text** (*str*) -- The complete text content of the paragraph. This value is frozen after initialization. 
* **sentences** (*list**[**Sentence**]*) -- The individual sentences contained within the paragraph. Defaults to an empty list. Cannot be reassigned once populated. Parameters: * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**, **json_schema_input_type=PydanticUndefined**)**]*) * **additional_context** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**] **| **None*) * **raw_text** (*Annotated**[**str**, **Strict**(**strict=True**)**, **StringConstraints**(**strip_whitespace=True**, **to_upper=None**, **to_lower=None**, **strict=None**, **min_length=1**, **max_length=None**, **pattern=None**)**]*) * **sentences** (*list**[**_Sentence**]*) Note: Normally, you do not need to construct paragraphs manually, as they are populated automatically from document's "raw_text" attribute. Only use this constructor for advanced use cases, such as when you have a custom paragraph segmentation tool.

Example: Paragraph definition

   from contextgem import Paragraph

   # Create a paragraph with raw text content
   contract_paragraph = Paragraph(
       raw_text=(
           "This agreement is effective as of January 1, 2025. "
           "All parties must comply with the terms outlined herein. "
           "Failure to adhere to these terms may result in termination of the agreement."
       )
   )

Create a new model by parsing and validating input data from keyword arguments. Raises [*ValidationError*][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model. *self* is explicitly positional-only to allow *self* as a field name. clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance.
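The practical meaning of "deep copy" for "clone()" is that nested objects are duplicated too, so changes to the copy do not leak into the original. The sketch below demonstrates this with "copy.deepcopy" on hypothetical stand-in dataclasses ("ParagraphSketch", "SentenceSketch"); it is an assumption-laden illustration and does not use or reproduce ContextGem's classes.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SentenceSketch:
    """Hypothetical stand-in for a sentence-like object."""
    raw_text: str


@dataclass
class ParagraphSketch:
    """Hypothetical stand-in for a paragraph-like container."""
    raw_text: str
    sentences: list = field(default_factory=list)

    def clone(self):
        # Deep copy: nested sentence objects are duplicated, not shared
        return copy.deepcopy(self)


original = ParagraphSketch(
    raw_text="First. Second.",
    sentences=[SentenceSketch("First."), SentenceSketch("Second.")],
)
duplicate = original.clone()

# Extending the copy's list leaves the original untouched
duplicate.sentences.append(SentenceSketch("Third."))
print(len(original.sentences), len(duplicate.sentences))
```

A shallow copy would share the same sentence objects between both instances, which is why a deep copy is the safer choice when cloned objects are processed independently.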
classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. 
Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. raw_text: NonEmptyStr sentences: list[_Sentence] additional_context: NonEmptyStr | None custom_data: JSONDictField # ==== api/sentences ==== Sentences ********* Module for handling document sentences. This module provides the Sentence class, which represents a structured unit of text within a document paragraph. Sentences are the fundamental building blocks of text analysis, containing the raw text content of individual statements. The module supports validation to ensure data integrity and integrates with the paragraph structure to maintain the hierarchical organization of document content. class contextgem.public.sentences.Sentence(**data) Bases: "_Sentence" Represents a sentence within a document paragraph. Sentences are immutable text units that serve as the fundamental building blocks for document analysis. The raw text content is preserved and cannot be modified after initialization to maintain data integrity. 
   Variables:
      **raw_text** (*str*) -- The complete text content of the
      sentence. This value is frozen after initialization.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **additional_context** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **raw_text** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

   Note:

     Normally, you do not need to construct sentences manually, as
     they are populated automatically from the document's "raw_text"
     or "paragraphs" attributes. Only use this constructor for
     advanced use cases, such as when you have a custom
     paragraph/sentence segmentation tool.

   Example:

      Sentence definition

         from contextgem import Sentence

         # Create a sentence with raw text content
         sentence = Sentence(raw_text="This is a simple sentence.")

         # Sentences are immutable - their content cannot be changed after creation
         try:
             sentence.raw_text = "Attempting to modify the sentence."
         except ValueError as e:
             print(f"Error when trying to modify sentence: {e}")

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *ValidationError* ("pydantic_core.ValidationError") if the
   input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.
This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. 
Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. raw_text: NonEmptyStr additional_context: NonEmptyStr | None custom_data: JSONDictField # ==== api/pipelines ==== Pipelines ********* Module for handling document processing pipelines. This module provides the ExtractionPipeline class, which represents a reusable collection of pre-defined aspects and concepts that can be assigned to documents. Pipelines enable standardized document analysis by packaging common extraction patterns into reusable units. Pipelines serve as templates for document processing, allowing consistent application of the same analysis approach across multiple documents. They encapsulate both the structural organization (aspects) and the specific information to extract (concepts) in a single, assignable object. class contextgem.public.pipelines.ExtractionPipeline(**data) Bases: "_ExtractionPipeline" Represents a reusable collection of predefined aspects and concepts for document analysis. 
   Extraction pipelines serve as templates that can be assigned to
   multiple documents, ensuring consistent application of the same
   analysis criteria. They package common extraction patterns into
   reusable units, allowing for standardized document processing.

   Variables:
      * **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
        extract from documents. Aspects represent structural
        categories of information. Defaults to an empty list.

      * **concepts** (*list**[**_Concept**]*) -- A list of concepts
        to identify within documents. Concepts represent specific
        information elements to extract. Defaults to an empty list.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

   Note:

     A pipeline is a reusable configuration of extraction steps. You
     can use the same pipeline to extract data from multiple
     documents.
   Example:

      Extraction pipeline definition

         from contextgem import (
             Aspect,
             BooleanConcept,
             DateConcept,
             Document,
             ExtractionPipeline,
             StringConcept,
         )

         # Create a pipeline for NDA (Non-Disclosure Agreement) review
         nda_pipeline = ExtractionPipeline(
             aspects=[
                 Aspect(
                     name="Confidential information",
                     description="Clauses defining the confidential information",
                 ),
                 Aspect(
                     name="Exclusions",
                     description="Clauses defining exclusions from confidential information",
                 ),
                 Aspect(
                     name="Obligations",
                     description="Clauses defining confidentiality obligations",
                 ),
                 Aspect(
                     name="Liability",
                     description="Clauses defining liability for breach of the agreement",
                 ),
                 # ... Add more aspects as needed
             ],
             concepts=[
                 StringConcept(
                     name="Anomaly",
                     description="Anomaly in the contract, e.g. out-of-context or nonsensical clauses",
                     llm_role="reasoner_text",
                     add_references=True,  # Add references to the source text
                     reference_depth="sentences",  # Reference to the sentence level
                     add_justifications=True,  # Add justifications for the anomaly
                     justification_depth="balanced",  # Balanced level of justification detail
                     justification_max_sents=5,  # Maximum number of sentences in the justification
                 ),
                 BooleanConcept(
                     name="Is mutual",
                     description="Whether the NDA is mutual (bidirectional) or one-way",
                     singular_occurrence=True,
                     llm_role="reasoner_text",  # Use the reasoner role for this concept
                 ),
                 DateConcept(
                     name="Effective date",
                     description="The date when the NDA agreement becomes effective",
                     singular_occurrence=True,
                 ),
                 StringConcept(
                     name="Term",
                     description="The term of the NDA",
                 ),
                 StringConcept(
                     name="Governing law",
                     description="The governing law of the agreement",
                     singular_occurrence=True,
                 ),
                 # ... Add more concepts as needed
             ],
         )

         # Assign the pipeline to the NDA document
         nda_document = Document(raw_text="[NDA text]")
         nda_document.assign_pipeline(nda_pipeline)

         # Now the document is ready for processing with the NDA review pipeline!
         # The document can be processed to extract the defined aspects and concepts

         # Extract all aspects and concepts from the NDA using an LLM group
         # with LLMs with roles "extractor_text" and "reasoner_text".
         # llm_group.extract_all(nda_document)

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *ValidationError* ("pydantic_core.ValidationError") if the
   input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and
      returns the updated instance. This method ensures that the
      provided aspects are deeply copied to avoid any unintended
      state modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
         be added. Each aspect is deeply copied to ensure the
         original list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts*
      attribute of the instance. This method ensures that the
      provided list of concepts is deep-copied to prevent unintended
      side effects from modifying the input list outside of this
      method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts
         to be added. It will be deep-copied before being added to
         the instance's *concepts* attribute.

      Returns:
         Returns the instance itself after the modification.

      Return type:
         Self

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.
Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. get_aspect_by_name(name) Finds and returns an aspect with the specified name from the list of available aspects, if the instance has *aspects* attribute. Parameters: **name** (*str*) -- The name of the aspect to find. Returns: The aspect with the specified name. Return type: _Aspect Raises: **ValueError** -- If no aspect with the specified name is found. get_aspects_by_names(names) Retrieve a list of _Aspect objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of aspect names to retrieve. The names must be provided as a list of strings. Returns: A list of _Aspect objects corresponding to provided names. 
Return type: list[_Aspect] get_concept_by_name(name) Retrieves a concept from the list of concepts based on the provided name, if the instance has *concepts* attribute. Parameters: **name** (*str*) -- The name of the concept to search for. Returns: The *_Concept* object with the specified name. Return type: _Concept Raises: **ValueError** -- If no concept with the specified name is found. get_concepts_by_names(names) Retrieve a list of _Concept objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of concept names to retrieve. The names must be provided as a list of strings. Returns: A list of _Concept objects corresponding to provided names. Return type: list[_Concept] property llm_roles: set[str] A set of LLM roles associated with the object's aspects and concepts. Returns: A set containing unique LLM roles gathered from aspects and concepts. Return type: set[str] remove_all_aspects() Removes all aspects from the instance and returns the updated instance. This method clears the *aspects* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all aspects removed remove_all_concepts() Removes all concepts from the instance and returns the updated instance. This method clears the *concepts* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all concepts removed remove_all_instances() Removes all assigned instances from the object and resets them as empty lists. Returns the modified instance. Returns: The modified object with all assigned instances removed. Return type: Self remove_aspect_by_name(name) Removes an aspect from the assigned aspects by its name. 
Parameters: **name** (*str*) -- The name of the aspect to be removed Returns: Updated instance with the aspect removed. Return type: Self remove_aspects_by_names(names) Removes multiple aspects from an object based on the provided list of names. Parameters: **names** (*list**[**str**]*) -- A list of names identifying the aspects to be removed. Returns: The updated object after the specified aspects have been removed. Return type: Self remove_concept_by_name(name) Removes a concept from the assigned concepts by its name. Parameters: **name** (*str*) -- The name of the concept to be removed Returns: Updated instance with the concept removed. Return type: Self remove_concepts_by_names(names) Removes concepts from the object by their names. Parameters: **names** (*list**[**str**]*) -- A list of concept names to be removed. Returns: Returns the updated instance after removing the specified concepts. Return type: Self to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. 
      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   custom_data: JSONDictField

class contextgem.public.pipelines.DocumentPipeline(**data)

   Bases: "_DocumentPipeline"

   Deprecated wrapper for ExtractionPipeline.

   Deprecated since version 0.14.1: DocumentPipeline is deprecated
   and will be removed in v1.0.0. Use ExtractionPipeline instead.

   This class was renamed to ExtractionPipeline to better reflect its
   purpose and scope:

   * **Clearer semantics**: "ExtractionPipeline" explicitly describes
     what the pipeline does

   * **Consistency**: Aligns with the framework's naming conventions
     for extraction-focused components

   **Migration**: Simply replace "DocumentPipeline" with
   "ExtractionPipeline" in your imports. All functionality remains
   identical.

   Initialize DocumentPipeline with deprecation warning.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and
      returns the updated instance.
This method ensures that the provided aspects are deeply copied to avoid any unintended state modification of the original reusable aspects. Parameters: **aspects** (*list**[**_Aspect**]*) -- A list of aspects to be added. Each aspect is deeply copied to ensure the original list remains unaltered. Returns: Updated instance containing the newly added aspects. Return type: Self add_concepts(concepts) Adds a list of new concepts to the existing *concepts* attribute of the instance. This method ensures that the provided list of concepts is deep-copied to prevent unintended side effects from modifying the input list outside of this method. Parameters: **concepts** (*list**[**_Concept**]*) -- A list of concepts to be added. It will be deep-copied before being added to the instance's *concepts* attribute. Returns: Returns the instance itself after the modification. Return type: Self clone() Creates and returns a deep copy of the current instance. Return type: "typing.Self" Returns: A deep copy of the current instance. classmethod from_dict(obj_dict) Reconstructs an instance of the class from a dictionary representation. This method deserializes a dictionary containing the object's attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component. Parameters: **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary containing the serialized object data. Returns: A new instance of the class with restored attributes. Return type: Self classmethod from_disk(file_path) Loads an instance of the class from a JSON file stored on disk. This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the *from_json* method. Parameters: **file_path** (*str** | **Path*) -- Path to the JSON file to load (must end with '.json'). Can be a string or a Path object. 
Returns: An instance of the class populated with the data from the file. Return type: Self Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If deserialization fails. classmethod from_json(json_string) Creates an instance of the class from a JSON string representation. This method deserializes the provided JSON string into a dictionary and uses the *from_dict* method to construct the class instance. It validates that the class name in the serialized data matches the current class. Parameters: **json_string** (*str*) -- JSON string containing the serialized object data. Returns: A new instance of the class with restored state. Return type: Self Raises: **TypeError** -- If the class name in the serialized data doesn't match. get_aspect_by_name(name) Finds and returns an aspect with the specified name from the list of available aspects, if the instance has *aspects* attribute. Parameters: **name** (*str*) -- The name of the aspect to find. Returns: The aspect with the specified name. Return type: _Aspect Raises: **ValueError** -- If no aspect with the specified name is found. get_aspects_by_names(names) Retrieve a list of _Aspect objects corresponding to the provided list of names. Parameters: **names** ("list"["str"]) -- List of aspect names to retrieve. The names must be provided as a list of strings. Returns: A list of _Aspect objects corresponding to provided names. Return type: list[_Aspect] get_concept_by_name(name) Retrieves a concept from the list of concepts based on the provided name, if the instance has *concepts* attribute. Parameters: **name** (*str*) -- The name of the concept to search for. Returns: The *_Concept* object with the specified name. Return type: _Concept Raises: **ValueError** -- If no concept with the specified name is found. get_concepts_by_names(names) Retrieve a list of _Concept objects corresponding to the provided list of names. 
Parameters: **names** ("list"["str"]) -- List of concept names to retrieve. The names must be provided as a list of strings. Returns: A list of _Concept objects corresponding to provided names. Return type: list[_Concept] property llm_roles: set[str] A set of LLM roles associated with the object's aspects and concepts. Returns: A set containing unique LLM roles gathered from aspects and concepts. Return type: set[str] remove_all_aspects() Removes all aspects from the instance and returns the updated instance. This method clears the *aspects* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all aspects removed remove_all_concepts() Removes all concepts from the instance and returns the updated instance. This method clears the *concepts* attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining. Return type: "typing.Self" Returns: The updated instance with all concepts removed remove_all_instances() Removes all assigned instances from the object and resets them as empty lists. Returns the modified instance. Returns: The modified object with all assigned instances removed. Return type: Self remove_aspect_by_name(name) Removes an aspect from the assigned aspects by its name. Parameters: **name** (*str*) -- The name of the aspect to be removed Returns: Updated instance with the aspect removed. Return type: Self remove_aspects_by_names(names) Removes multiple aspects from an object based on the provided list of names. Parameters: **names** (*list**[**str**]*) -- A list of names identifying the aspects to be removed. Returns: The updated object after the specified aspects have been removed. Return type: Self remove_concept_by_name(name) Removes a concept from the assigned concepts by its name. 
Parameters: **name** (*str*) -- The name of the concept to be removed Returns: Updated instance with the concept removed. Return type: Self remove_concepts_by_names(names) Removes concepts from the object by their names. Parameters: **names** (*list**[**str**]*) -- A list of concept names to be removed. Returns: Returns the updated instance after removing the specified concepts. Return type: Self to_dict() Transforms the current object into a dictionary representation. Converts the object to a dictionary that includes: - All public attributes - Special handling for specific public and private attributes When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed. Returns: A dictionary representation of the current object with all necessary data for serialization Return type: dict[str, Any] to_disk(file_path) Saves the serialized instance to a JSON file at the specified path. This method converts the instance to a dictionary representation using *to_dict()*, then writes it to disk as a formatted JSON file with UTF-8 encoding. Parameters: **file_path** (*str** | **Path*) -- Path where the JSON file should be saved (must end with '.json'). Can be a string or a Path object. Return type: "None" Returns: None Raises: * **ValueError** -- If the file path doesn't end with '.json'. * **RuntimeError** -- If there's an error during the file writing process. to_json() Converts the object to its JSON string representation. Serializes the object into a JSON-formatted string using the dictionary representation provided by the *to_dict()* method. Returns: A JSON string representation of the object. Return type: str property unique_id: str Returns the ULID of the instance. 
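The save/load contract described above (a '.json' suffix check that raises "ValueError", formatted UTF-8 JSON on disk, and a class-name check that raises "TypeError") can be illustrated with a small standard-library sketch. `SerializableSketch` and the `class_name` payload key are hypothetical stand-ins invented for this example; they do not reflect ContextGem's actual serialized format, and the documented wrapping of failures into "RuntimeError" is left out for brevity.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


class SerializableSketch:
    """Hypothetical stand-in that mimics the documented save/load contract."""

    def __init__(self, name: str) -> None:
        self.name = name

    def to_dict(self) -> dict[str, Any]:
        # The "class_name" key is an assumption made for this sketch.
        return {"class_name": type(self).__name__, "name": self.name}

    def to_json(self) -> str:
        # Formatted JSON built from the dictionary representation.
        return json.dumps(self.to_dict(), indent=2, ensure_ascii=False)

    def to_disk(self, file_path: str | Path) -> None:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("file path must end with '.json'")
        path.write_text(self.to_json(), encoding="utf-8")

    @classmethod
    def from_json(cls, json_string: str) -> SerializableSketch:
        data = json.loads(json_string)
        # Reject payloads that were serialized from a different class.
        if data.get("class_name") != cls.__name__:
            raise TypeError("serialized class name does not match")
        return cls(name=data["name"])

    @classmethod
    def from_disk(cls, file_path: str | Path) -> SerializableSketch:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("file path must end with '.json'")
        return cls.from_json(path.read_text(encoding="utf-8"))


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "pipeline.json"
    SerializableSketch("NDA review").to_disk(target)
    restored = SerializableSketch.from_disk(target)

# A path without the '.json' suffix is rejected before anything is written.
try:
    SerializableSketch("NDA review").to_disk("pipeline.txt")
    suffix_error = None
except ValueError as exc:
    suffix_error = str(exc)
```

With the real classes, the equivalent round trip would go through the documented "to_disk(file_path)" and "from_disk(file_path)" methods on the same class.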
   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   custom_data: JSONDictField


# ==== api/decorators ====

Decorators
**********

Public decorators for extending or integrating with the framework.

This module contains decorators that are part of the public API and
intended for end users to apply to their own functions or classes.

contextgem.public.decorators.register_tool(func, /)

   Registers a function as a tool handler for LLM chat with tools.

   Validates that the function has an inspectable signature and
   accepts keyword arguments (no positional-only parameters). Marks
   the function so the runtime can recognize and call it by name.

   Parameters:
      **func** (*ToolHandler*) -- A callable to be used as a tool
      handler.

   Returns:
      The same function, marked as a registered tool.

   Return type:
      ToolHandler

   Raises:
      * **TypeError** -- If the provided object is not callable.

      * **ValueError** -- If the signature cannot be inspected or has
        positional-only parameters, or if the function name is empty.
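The validation steps listed above can be approximated with the standard-library "inspect" module. The sketch below is a hypothetical re-implementation for illustration only: the name `register_tool_sketch` and the `_is_registered_tool` marker attribute are assumptions, and ContextGem's own decorator may perform these checks differently.

```python
import inspect
from typing import Any, Callable


def register_tool_sketch(func: Callable[..., Any], /) -> Callable[..., Any]:
    """Hypothetical decorator mirroring the documented validation steps."""
    if not callable(func):
        raise TypeError("tool handler must be callable")
    if not getattr(func, "__name__", ""):
        raise ValueError("tool handler must have a non-empty name")
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError) as exc:
        raise ValueError("tool handler signature cannot be inspected") from exc
    # Handlers are called with keyword arguments, so positional-only
    # parameters could never be supplied.
    for param in signature.parameters.values():
        if param.kind is inspect.Parameter.POSITIONAL_ONLY:
            raise ValueError(f"positional-only parameter not allowed: {param.name}")
    # Marker attribute name is an assumption made for this sketch.
    func._is_registered_tool = True
    return func


@register_tool_sketch
def get_weather(city: str, unit: str = "celsius") -> str:
    """Example handler that accepts keyword arguments only by convention."""
    return f"Weather for {city} in {unit}"


def bad_handler(city, /):
    """Handler with a positional-only parameter, which should be rejected."""
    return city


try:
    register_tool_sketch(bad_handler)
    rejected = False
except ValueError:
    rejected = True
```

Returning the same function object, rather than a wrapper, matches the documented return value ("The same function, marked as a registered tool") and keeps the handler's name and signature intact for later lookup.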