ContextGem - Effortless LLM extraction from documents
====================================================================================================

Copyright (c) 2025 Shcherbak AI AS
All rights reserved
Developed by Sergii Shcherbak

This software is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# ==== Documentation Content ====


# ==== motivation ====

Why ContextGem?
***************

ContextGem is an LLM framework designed to strike the right balance
between ease of use, customizability, and accuracy for structured data
and insights extraction from documents.

ContextGem offers the **easiest and fastest way** to build LLM
extraction workflows for document analysis through powerful
abstractions of the most time-consuming parts.


⏱️ Development Overhead of Other Frameworks
===========================================

Most popular LLM frameworks for extracting structured data from
documents require extensive boilerplate code to extract even basic
information. As a developer using these frameworks, you're typically
expected to:

📝 Prompt Engineering

* Write custom prompts from scratch for each extraction scenario

* Maintain different prompt templates for different extraction
  workflows

* Adapt prompts manually when extraction requirements change

🔧 Technical Implementation

* Define your own data models and implement validation logic

* Implement complex chaining for multi-LLM workflows

* Implement nested context extraction logic (*e.g. document > sections
  > paragraphs > entities*)

* Configure text segmentation logic for correct reference mapping

* Configure concurrent I/O processing logic to speed up complex
  extraction workflows

**Result:** All of this manual work significantly increases
development time and complexity.
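
To give a sense of the concurrent I/O item in the list above, here is a
minimal standard-library sketch of the kind of logic a developer would
otherwise write by hand. It is an illustration only, not ContextGem's
implementation: "fake_llm_call", "run_concurrently" and "max_parallel"
are hypothetical names, and the LLM request is simulated with a short
sleep.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a network-bound LLM request."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"answer to: {prompt}"


async def run_concurrently(prompts: list[str], max_parallel: int = 3) -> list[str]:
    """Run all prompts concurrently, with at most `max_parallel` in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(prompt: str) -> str:
        # The semaphore caps the number of simultaneous requests
        async with semaphore:
            return await fake_llm_call(prompt)

    # gather() returns results in the same order as its inputs
    return await asyncio.gather(*(guarded(p) for p in prompts))


results = asyncio.run(run_concurrently(["parties", "term", "governing law"]))
print(results)
```

A real workflow would also need rate limiting and error handling per
request, which is part of the overhead described above.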


💡 The ContextGem Solution
==========================

ContextGem addresses these challenges by providing a flexible,
intuitive framework that extracts structured data and insights from
documents with minimal effort. The most complex and time-consuming
parts are handled with **powerful abstractions**, eliminating
boilerplate code and reducing development overhead.

With ContextGem, you benefit from a "batteries included" approach,
coupled with simple, intuitive syntax.


ContextGem and Other Open-Source LLM Frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-----+-----------------------------------------------+------------+----------------------+
|     | Key built-in abstractions                     | **Context  | Other frameworks*    |
|     |                                               | Gem**      |                      |
|=====|===============================================|============|======================|
| 💎  | **Automated dynamic prompts**  Automatically  | 🟢         | ◯                    |
|     | constructs comprehensive prompts for your     |            |                      |
|     | specific extraction needs.                    |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Automated data modelling and validators**   | 🟢         | ◯                    |
|     | Automatically creates data models and         |            |                      |
|     | validation logic.                             |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Precise granular reference mapping          | 🟢         | ◯                    |
|     | (paragraphs & sentences)**  Automatically     |            |                      |
|     | maps extracted data to the relevant parts of  |            |                      |
|     | the document, which will always match in the  |            |                      |
|     | source document, with customizable            |            |                      |
|     | granularity.                                  |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Justifications (reasoning backing the       | 🟢         | ◯                    |
|     | extraction)**  Automatically provides         |            |                      |
|     | justifications for each extraction, with      |            |                      |
|     | customizable granularity.                     |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Neural segmentation (SaT)**  Automatically  | 🟢         | ◯                    |
|     | segments the document into paragraphs and     |            |                      |
|     | sentences using state-of-the-art SaT models,  |            |                      |
|     | compatible with many languages.               |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Multilingual support (I/O without           | 🟢         | ◯                    |
|     | prompting)**  Supports multiple languages in  |            |                      |
|     | input and output without additional           |            |                      |
|     | prompting.                                    |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Single, unified extraction pipeline         | 🟢         | 🟡                   |
|     | (declarative, reusable, fully serializable)** |            |                      |
|     | Lets you define a complete extraction         |            |                      |
|     | workflow in a single, unified, reusable       |            |                      |
|     | pipeline, using simple declarative syntax.    |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Grouped LLMs with role-specific tasks**     | 🟢         | 🟡                   |
|     | Lets you easily group LLMs with different     |            |                      |
|     | roles to process role-specific tasks in the   |            |                      |
|     | pipeline.                                     |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Nested context extraction**  Automatically  | 🟢         | 🟡                   |
|     | manages nested context based on the pipeline  |            |                      |
|     | definition (e.g. document > aspects > sub-    |            |                      |
|     | aspects > concepts).                          |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Unified, fully serializable results storage | 🟢         | 🟡                   |
|     | model (document)**  All extraction results    |            |                      |
|     | are stored on the document object, including  |            |                      |
|     | aspects, sub-aspects, and concepts. This      |            |                      |
|     | object is fully serializable, and all the     |            |                      |
|     | extraction results can be restored, with just |            |                      |
|     | one line of code.                             |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Extraction task calibration with examples** | 🟢         | 🟡                   |
|     | Lets you easily define and attach output      |            |                      |
|     | examples that guide the LLM's extraction      |            |                      |
|     | behavior, without manually modifying prompts. |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Built-in concurrent I/O processing**        | 🟢         | 🟡                   |
|     | Automatically manages concurrent I/O          |            |                      |
|     | processing to speed up complex extraction     |            |                      |
|     | workflows, with a simple switch               |            |                      |
|     | ("use_concurrency=True").                     |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Automated usage & costs tracking**          | 🟢         | 🟡                   |
|     | Automatically tracks usage (calls, tokens,    |            |                      |
|     | costs) of all LLM calls.                      |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Fallback and retry logic**  Built-in retry  | 🟢         | 🟢                   |
|     | logic and easily attachable fallback LLMs.    |            |                      |
+-----+-----------------------------------------------+------------+----------------------+
| 💎  | **Multiple LLM providers**  Compatible with a | 🟢         | 🟢                   |
|     | wide range of commercial and locally hosted   |            |                      |
|     | LLMs.                                         |            |                      |
+-----+-----------------------------------------------+------------+----------------------+

   🟢 - fully supported - no additional setup required
   🟡 - partially supported - requires additional setup
   ◯ - not supported - requires custom logic

   * See ContextGem and other frameworks for specific implementation
   examples comparing ContextGem with other popular open-source LLM
   frameworks. (Comparison as of 24 March 2025.)
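
As a rough illustration of the "Fallback and retry logic" row, the
pattern can be sketched with the standard library alone. This is a
generic sketch, not the API of ContextGem or any other framework:
"call_with_retry_and_fallback", "flaky_primary" and "stable_fallback"
are hypothetical names, and the failing provider is simulated.

```python
import time
from typing import Callable


def call_with_retry_and_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    max_retries: int = 2,
    delay_seconds: float = 0.0,
) -> str:
    """Try `primary` up to `max_retries + 1` times, then call `fallback` once."""
    for attempt in range(max_retries + 1):
        try:
            return primary(prompt)
        except Exception:
            # Wait before the next attempt, but not after the last one
            if attempt < max_retries:
                time.sleep(delay_seconds)
    return fallback(prompt)


calls = {"primary": 0}


def flaky_primary(prompt: str) -> str:
    """Simulated primary LLM that always times out."""
    calls["primary"] += 1
    raise TimeoutError("simulated provider timeout")


def stable_fallback(prompt: str) -> str:
    """Simulated fallback LLM that always succeeds."""
    return f"fallback answer to: {prompt}"


answer = call_with_retry_and_fallback(flaky_primary, stable_fallback, "term")
print(calls["primary"], answer)
```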


🎯 Focused Approach
===================

ContextGem is intentionally optimized for **in-depth single-document
analysis** to deliver maximum extraction accuracy and precision. While
this focused approach enables superior results for individual
documents, ContextGem currently does not support cross-document
querying or corpus-wide information retrieval. For these use cases,
modern RAG frameworks (e.g. LlamaIndex) remain more appropriate.


# ==== vs_other_frameworks ====

ContextGem and other frameworks
*******************************

Thanks to its powerful abstractions, ContextGem is the **easiest and
fastest way** to build LLM extraction workflows for document analysis.


✏️ Basic Example
================

Below is a basic example of an extraction workflow - *extraction of
anomalies from a document* - implemented side-by-side in ContextGem
and other frameworks. (All implementations are self-contained.
Comparison as of 24 March 2025.)

Even implementing this basic extraction workflow requires
significantly more effort in other frameworks:

* 🔧 **Manual model definition**: Developers must define Pydantic
  validation models for structured output

* 📝 **Prompt engineering**: Crafting comprehensive prompts that guide
  the LLM effectively

* 🔄 **Output parsing logic**: Setting up parsers to handle the LLM's
  response

* 📄 **Reference mapping**: Writing custom logic for mapping
  references in the source document

In contrast, ContextGem handles all these complexities automatically.
Users simply describe what to extract in natural language, provide
basic configuration parameters, and the framework takes care of the
rest.
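
For a sense of what hand-written reference mapping involves, here is a
minimal standard-library sketch. It assumes a naive regex-based
sentence splitter and verbatim matching; it is not how ContextGem maps
references internally, and "split_sentences" and "map_reference" are
hypothetical helper names.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break on newlines and after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def map_reference(quote: str, sentences: list[str]) -> list[int]:
    """Return indices of sentences containing the quote verbatim
    (case-insensitive, whitespace-normalized). Empty list means unverified."""
    normalized = " ".join(quote.split()).lower()
    return [
        i
        for i, sentence in enumerate(sentences)
        if normalized in " ".join(sentence.split()).lower()
    ]


source_text = (
    "Consultancy Agreement\n"
    "The term of the agreement is 1 year from the Effective Date.\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"
    "This agreement is governed by the laws of Norway.\n"
)
sentences = split_sentences(source_text)

# A quote that exists in the source maps to its sentence index
verified = map_reference("purple elephant  danced gracefully", sentences)
# A quote the LLM misremembered maps to nothing and can be rejected
unverified = map_reference("The green elephant danced", sentences)
print(verified, unverified)
```

Checking a recited quote against the source this way catches
references that the LLM altered or invented, but it breaks down on
abbreviations and on sentences that span line breaks.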

-[ **ContextGem** ]-

⚡ Fastest way

ContextGem is the fastest and easiest way to implement an LLM
extraction workflow. All the boilerplate code is handled behind the
scenes.

**Major time savers:**

* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that
  requires minimal code

* 📝 **Automatic prompt engineering**: ContextGem automatically
  constructs a prompt tailored to the extraction task

* 🔄 **Automatic model definition**: ContextGem automatically defines
  the Pydantic model for structured output

* 🧩 **Automatic output parsing**: ContextGem automatically parses the
  LLM's response

* 🔍 **Automatic reference tracking**: Precise references are
  automatically extracted and mapped to the original document

* 📏 **Flexible reference granularity**: References can be tracked at
  different levels (paragraphs, sentences)

Anomaly extraction example (ContextGem)

   # Quick Start Example - Extracting anomalies from a document, with source references and justifications

   import os

   from contextgem import Document, DocumentLLM, StringConcept


   # Sample document text (shortened for brevity)
   doc = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # 💎 anomaly
           "Time-traveling dinosaurs will review all deliverables before acceptance.\n"  # 💎 another anomaly
           "This agreement is governed by the laws of Norway...\n"
       ),
   )

   # Attach a document-level concept
   doc.concepts = [
       StringConcept(
           name="Anomalies",  # in longer contexts, this concept is hard to capture with RAG
           description="Anomalies in the document",
           add_references=True,
           reference_depth="sentences",
           add_justifications=True,
           justification_depth="brief",
           # see the docs for more configuration options
       )
       # add more concepts to the document, if needed
       # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
   ]
   # Or use `doc.add_concepts([...])`

   # Define an LLM for extracting information from the document
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # or another provider/LLM
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key for the LLM provider
       # see the docs for more configuration options
   )

   # Extract information from the document
   doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`

   # Access extracted information in the document object
   anomalies_concept = doc.concepts[0]
   # or `doc.get_concept_by_name("Anomalies")`
   for item in anomalies_concept.extracted_items:
       print("Anomaly:")
       print(f"  {item.value}")
       print("Justification:")
       print(f"  {item.justification}")
       print("Reference paragraphs:")
       for p in item.reference_paragraphs:
           print(f"  - {p.raw_text}")
       print("Reference sentences:")
       for s in item.reference_sentences:
           print(f"  - {s.raw_text}")
       print()

-[ LangChain ]-

LangChain is a popular and versatile framework for building LLM
applications through composable components. It offers excellent
flexibility and a rich ecosystem of integrations. While powerful,
feature-rich, and widely adopted in the industry, it requires more
manual configuration and setup work for structured data extraction
tasks compared to ContextGem's streamlined approach.

**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
  that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic
  validation models for structured output

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
  response

* 🔍 **Manual reference mapping**: Writing custom logic for mapping
  references

Anomaly extraction example (LangChain)

   # LangChain implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent

   from langchain.output_parsers import PydanticOutputParser
   from langchain.prompts import PromptTemplate
   from langchain_core.runnables import RunnableLambda, RunnablePassthrough
   from langchain_openai import ChatOpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       """An anomaly found in the document."""

       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       reference: str = Field(
           description="The sentence containing the anomaly"
       )  # LLM reciting a reference is error-prone and unreliable


   class AnomaliesList(BaseModel):
       """List of anomalies found in the document."""

       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   def extract_anomalies_with_langchain(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using LangChain.

       Args:
           document_text: The text content of the document
           api_key: OpenAI API key (defaults to environment variable)

       Returns:
           List of extracted anomalies with justifications and references
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=openai_api_key, temperature=0)

       # Create a parser for structured output
       parser = PydanticOutputParser(pydantic_object=AnomaliesList)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       template = dedent(
           """
       You are an expert document analyzer. Your task is to identify any anomalies in the document.
       Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
       with the rest of the document's context and purpose.
       
       Document:
       {document_text}
       
       Identify all anomalies in the document. For each anomaly, provide:
       1. The anomalous text
       2. A brief justification explaining why it's an anomaly
       3. The complete sentence containing the anomaly for reference
       
       {format_instructions}
       """
       )

       prompt = PromptTemplate(
           template=template,
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       # Create a runnable chain
       chain = (
           {"document_text": lambda x: x}
           | RunnablePassthrough.assign()
           | prompt
           | llm
           | RunnableLambda(lambda x: parser.parse(x.content))
       )

       # Run the chain and extract anomalies
       parsed_output = chain.invoke(document_text)

       return parsed_output.anomalies


   # Example usage
   # Sample document text (shortened for brevity)
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   # Extract anomalies
   anomalies = extract_anomalies_with_langchain(document_text)

   # Print results
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")
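
The parser in the chain above hides the output parsing step. As a
rough, standard-library sketch of what such a step has to do (this is
not LangChain's implementation; "parse_anomalies" is a hypothetical
helper, and the required field names are taken from the "Anomaly"
model above):

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built here to keep the example readable


def parse_anomalies(raw_response: str) -> list[dict[str, str]]:
    """Extract and validate the JSON payload from a raw LLM response."""
    # Models often wrap JSON in a Markdown code fence; unwrap it if present
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, raw_response, re.DOTALL)
    payload = match.group(1) if match else raw_response.strip()

    data = json.loads(payload)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict) or not isinstance(data.get("anomalies"), list):
        raise ValueError("expected a JSON object with an 'anomalies' list")

    required = {"text", "justification", "reference"}
    for item in data["anomalies"]:
        if not isinstance(item, dict):
            raise ValueError("each anomaly must be a JSON object")
        missing = required - item.keys()
        if missing:
            raise ValueError(f"anomaly is missing fields: {sorted(missing)}")
    return data["anomalies"]


raw = (
    FENCE + "json\n"
    '{"anomalies": [{"text": "purple elephant", '
    '"justification": "unrelated to the contract", '
    '"reference": "The purple elephant danced."}]}\n' + FENCE
)
parsed = parse_anomalies(raw)
print(parsed[0]["text"])
```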

-[ LlamaIndex ]-

LlamaIndex is a powerful and versatile framework for building LLM
applications with data, particularly excelling at RAG workflows and
document retrieval. It offers a comprehensive set of tools for data
indexing and querying. While highly effective for its intended use
cases, for structured data extraction tasks (non-RAG setup), it
requires more manual configuration and setup work compared to
ContextGem's streamlined approach.

**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
  that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic
  validation models for structured output

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
  response

* 🔍 **Manual reference mapping**: Writing custom logic for mapping
  references

Anomaly extraction example (LlamaIndex)

   # LlamaIndex implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent

   from llama_index.core.output_parsers import PydanticOutputParser
   from llama_index.core.program import LLMTextCompletionProgram
   from llama_index.llms.openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       """An anomaly found in the document."""

       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       reference: str = Field(
           description="The sentence containing the anomaly"
       )  # LLM reciting a reference is error-prone and unreliable


   class AnomaliesList(BaseModel):
       """List of anomalies found in the document."""

       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   def extract_anomalies_with_llama_index(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using LlamaIndex.

       Args:
           document_text: The text content of the document
           api_key: OpenAI API key (defaults to environment variable)

       Returns:
           List of extracted anomalies with justifications and references
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
       You are an expert document analyzer. Your task is to identify any anomalies in the document.
       Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
       with the rest of the document's context and purpose.
       
       Document:
       {document_text}
       
       Identify all anomalies in the document. For each anomaly, provide:
       1. The anomalous text
       2. A brief justification explaining why it's an anomaly
       3. The complete sentence containing the anomaly for reference
       """
       )

       # Use PydanticOutputParser to directly parse the LLM output into our structured format
       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=AnomaliesList),
           prompt_template_str=prompt_template,
           llm=llm,
           verbose=True,
       )

       # Execute the program
       try:
           result = program(document_text=document_text)
           return result.anomalies
       except Exception as e:
           print(f"Error parsing LLM response: {e}")
           return []


   # Example usage
   # Sample document text (shortened for brevity)
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   # Extract anomalies
   anomalies = extract_anomalies_with_llama_index(document_text)

   # Print results
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")

-[ LlamaIndex (RAG) ]-

LlamaIndex with RAG setup is a powerful and sophisticated framework
for document retrieval and analysis, offering exceptional capabilities
for knowledge-intensive applications. Its comprehensive architecture
excels at handling complex document interactions and information
retrieval tasks across large document collections. While it provides
robust and versatile capabilities for building advanced document-based
applications, it does require more manual configuration and
specialized setup for structured extraction tasks compared to
ContextGem's streamlined and intuitive approach.

**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
  that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic
  validation models for structured output

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
  response

* 🔍 **Complex reference mapping**: Getting precise references right
  requires additional configuration, such as setting up a sentence
  splitter and CitationQueryEngine, adjusting chunk sizes, etc.

Anomaly extraction example (LlamaIndex RAG)

   # LlamaIndex (RAG) implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent
   from typing import Any

   from llama_index.core import Document, Settings, VectorStoreIndex
   from llama_index.core.base.response.schema import RESPONSE_TYPE
   from llama_index.core.node_parser import SentenceSplitter
   from llama_index.core.output_parsers import PydanticOutputParser
   from llama_index.core.query_engine import CitationQueryEngine
   from llama_index.core.response_synthesizers.base import BaseSynthesizer
   from llama_index.core.retrievers import VectorIndexRetriever
   from llama_index.llms.openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       # This field will hold the citation info (e.g., node references)
       source_id: str | None = Field(
           description="Automatically added source reference", default=None
       )


   class AnomaliesList(BaseModel):
       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   # Custom synthesizer that instructs the LLM to extract anomalies in JSON format.
   class AnomalyExtractorSynthesizer(BaseSynthesizer):
       def __init__(self, llm=None, nodes=None):
           super().__init__()
           self._llm = llm or Settings.llm
           # Nodes are still provided in case additional context is needed.
           self._nodes = nodes or []

       def _get_prompts(self) -> dict[str, Any]:
           return {}

       def _update_prompts(self, prompts: dict[str, Any]):
           return

       async def aget_response(
           self, query_str: str, text_chunks: list[str], **kwargs: Any
       ) -> RESPONSE_TYPE:
           return self.get_response(query_str, text_chunks, **kwargs)

       def get_response(
           self, query_str: str, text_chunks: list[str], **kwargs: Any
       ) -> str:
           all_text = "\n".join(text_chunks)

           # Prompt must be manually drafted
           # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
           prompt_str = dedent(
               """
           You are an expert document analyzer. Your task is to identify anomalies in the document.
           Anomalies are statements or phrases that seem out of place or inconsistent with the document's context.

           Document:
           {all_text}

           For each anomaly, provide:
           1. The anomalous text (only the specific phrase).
           2. A brief justification for why it is an anomaly.

           Format your answer as a JSON object:
           {{
               "anomalies": [
                   {{
                       "text": "anomalous text",
                       "justification": "reason for anomaly"
                   }}
               ]
           }}
           """
           )
           print(prompt_str)
           output_parser = PydanticOutputParser(output_cls=AnomaliesList)
           response = self._llm.complete(prompt_str.format(all_text=all_text))

           try:
               parsed_response = output_parser.parse(response.text)
               self._last_anomalies = parsed_response
               return parsed_response.model_dump_json()
           except Exception as e:
               print(f"Error parsing LLM response: {e}")
               print(f"Raw response: {response.text}")
               return "{}"


   def extract_anomalies_with_citations(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using LlamaIndex with citation support.

       Args:
           document_text: The content of the document.
           api_key: OpenAI API key (if not provided, read from environment variable).

       Returns:
           List of extracted anomalies with automatically added source references.
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       llm = OpenAI(model="gpt-4o-mini", api_key=openai_api_key, temperature=0)
       Settings.llm = llm

       # Create a Document and split it into nodes
       doc = Document(text=document_text)
       splitter = SentenceSplitter(
           paragraph_separator="\n",
           chunk_size=100,
           chunk_overlap=0,
       )
       nodes = splitter.get_nodes_from_documents([doc])
       print(f"Document split into {len(nodes)} nodes")

       # Build a vector index and retriever using all nodes.
       index = VectorStoreIndex(nodes)
       retriever = VectorIndexRetriever(index=index, similarity_top_k=len(nodes))

       # Create a custom synthesizer.
       synthesizer = AnomalyExtractorSynthesizer(llm=llm, nodes=nodes)

       # Initialize CitationQueryEngine by passing the expected components.
       citation_query_engine = CitationQueryEngine(
           retriever=retriever,
           llm=llm,
           response_synthesizer=synthesizer,
           citation_chunk_size=100,  # Adjust as needed
           citation_chunk_overlap=10,  # Adjust as needed
       )

       try:
           response = citation_query_engine.query(
               "Extract all anomalies from this document"
           )
           # If the synthesizer stored the anomalies, attach the citation info
           if hasattr(synthesizer, "_last_anomalies"):
               anomalies = synthesizer._last_anomalies.anomalies
               formatted_citations = (
                   response.get_formatted_sources()
                   if hasattr(response, "get_formatted_sources")
                   else None
               )
               for anomaly in anomalies:
                   anomaly.source_id = formatted_citations
               return anomalies
           return []

       except Exception as e:
           print(f"Error querying document: {e}")
           return []


   # Example usage
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   anomalies = extract_anomalies_with_citations(document_text)
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")

-[ Instructor ]-

Instructor is a popular framework that specializes in structured data
extraction with LLMs using Pydantic. It offers excellent type safety
and validation capabilities, making it a solid choice for many
extraction tasks. While powerful for structured outputs, Instructor
requires more manual setup for document analysis workflows.

**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
  that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic
  validation models for structured output

* 🔍 **Manual reference mapping**: Writing custom logic for mapping
  references
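The reference-mapping overhead from the last bullet can be illustrated
with a minimal sketch. It uses only the Python standard library
("difflib") to map a sentence recited by an LLM back to the closest
line of the source text. The function name "map_reference", the
line-based splitting, and the 0.8 similarity threshold are assumptions
made for this illustration; they are not part of Instructor or
ContextGem.

```python
from difflib import SequenceMatcher


def map_reference(recited: str, document_text: str, threshold: float = 0.8):
    """Return (line_index, line) for the most similar source line, or None."""
    best_index, best_line, best_score = None, None, 0.0
    needle = recited.strip().lower()
    for index, line in enumerate(document_text.splitlines()):
        if not line.strip():
            continue  # skip empty lines
        score = SequenceMatcher(None, needle, line.strip().lower()).ratio()
        if score > best_score:
            best_index, best_line, best_score = index, line, score
    # Reject weak matches instead of returning a misleading reference
    if best_score < threshold:
        return None
    return best_index, best_line


document_text = (
    "Consultancy Agreement\n"
    "The term of the agreement is 1 year from the Effective Date...\n"
    "The purple elephant danced gracefully on the moon while eating ice cream.\n"
    "This agreement is governed by the laws of Norway...\n"
)

# An LLM may change casing or drop punctuation when reciting a sentence
recited = "the purple elephant danced gracefully on the moon while eating ice cream"
match = map_reference(recited, document_text)
print(match)
```

Fuzzy matching of this kind tolerates small recitation errors, but it
still has to be written, tuned, and tested by the developer for every
reference granularity that is needed.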

Anomaly extraction example (Instructor)

   # Instructor implementation for extracting anomalies from a document, with source references and justifications

   import os
   from textwrap import dedent

   import instructor
   from openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class Anomaly(BaseModel):
       """An anomaly found in the document."""

       text: str = Field(description="The anomalous text found in the document")
       justification: str = Field(
           description="Brief justification for why this is an anomaly"
       )
       source_text: str = Field(
           description="The sentence containing the anomaly"
       )  # LLM reciting a reference is error-prone and unreliable


   class AnomaliesList(BaseModel):
       """List of anomalies found in the document."""

       anomalies: list[Anomaly] = Field(
           description="List of anomalies found in the document"
       )


   def extract_anomalies_with_instructor(
       document_text: str, api_key: str | None = None
   ) -> list[Anomaly]:
       """
       Extract anomalies from a document using Instructor.

       Args:
           document_text: The text content of the document
           api_key: OpenAI API key (defaults to environment variable)

       Returns:
           List of extracted anomalies with justifications and references
       """
       openai_api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY")
       client = OpenAI(api_key=openai_api_key)
       instructor_client = instructor.from_openai(client)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert document analyzer. Your task is to identify any anomalies in the document.
       Anomalies are statements, phrases, or content that seem out of place, irrelevant, or inconsistent
       with the rest of the document's context and purpose.
       
       Document:
       {document_text}
       
       Identify all anomalies in the document. For each anomaly, provide:
       1. The anomalous text - just the specific anomalous phrase
       2. A brief justification explaining why it's an anomaly
       3. The exact complete sentence containing the anomaly for reference
       
       Only identify real anomalies that truly don't belong in this type of document.
       """
       )

       # Extract structured data using Instructor
       response = instructor_client.chat.completions.create(
           model="gpt-4o-mini",
           response_model=AnomaliesList,
           messages=[
               {"role": "system", "content": "You are an expert document analyzer."},
               {"role": "user", "content": prompt},
           ],
           temperature=0,
       )
       return response.anomalies


   # Example usage
   # Sample document text (shortened for brevity)
   document_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # out-of-context / anomaly
       "This agreement is governed by the laws of Norway...\n"
   )

   # Extract anomalies
   anomalies = extract_anomalies_with_instructor(document_text)

   # Print results
   for anomaly in anomalies:
       print(f"Anomaly: {anomaly}")


🔬 Advanced Example
===================

As use cases grow more complex, the development overhead of
alternative frameworks becomes increasingly evident, while
ContextGem's abstractions deliver substantial time savings. As
extraction steps stack up, the implementation with other frameworks
quickly becomes *non-scalable*:

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts for
  each extraction step

* 🔧 **Manual model definition**: Defining Pydantic validation models
  for each element of extraction

* 🧩 **Manual output parsing**: Setting up parsers to handle the LLM's
  response

* 🔍 **Manual reference mapping**: Writing custom logic for mapping
  references

* 📄 **Complex pipeline configuration**: Writing custom logic for
  pipeline configuration and extraction components

* 📊 **Manual usage and cost tracking**: Implementing usage and cost
  tracking callbacks, which quickly grow in complexity when multiple
  LLMs are used in the pipeline

* 🔄 **Complex concurrency setup**: Implementing complex concurrency
  logic with asyncio

* 📝 **Embedding examples in prompts**: Writing output examples
  directly in the custom prompts

* 📋 **Manual result aggregation**: Writing code to collect and
  organize results
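To give a sense of the concurrency overhead listed above, here is a
minimal sketch of running several extraction steps concurrently with
"asyncio". The coroutine "fake_extract" is a hypothetical stand-in for
a real LLM call, and the step names and the concurrency limit are
assumptions made for this illustration only.

```python
import asyncio


async def fake_extract(step: str, document: str) -> dict:
    """Hypothetical stand-in for an LLM call; a real one would await an API client."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"step": step, "chars_seen": len(document)}


async def run_steps(
    steps: list[str], document: str, max_concurrency: int = 2
) -> list[dict]:
    """Run all steps concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(step: str) -> dict:
        async with semaphore:
            return await fake_extract(step, document)

    # gather() returns results in the same order as the input steps
    return await asyncio.gather(*(guarded(step) for step in steps))


steps = ["parties", "term", "governing_law"]
results = asyncio.run(run_steps(steps, "Consultancy Agreement\n..."))
for result in results:
    print(result)
```

A production version would also need retries, rate-limit handling, and
error propagation for each step, which is where most of the extra code
tends to accumulate.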

Below is a more advanced example of an extraction workflow - *using an
extraction pipeline for multiple documents, with concurrency and cost
tracking* - implemented side-by-side in ContextGem and other
frameworks. (All implementations are self-contained. Comparison as of
24 March 2025.)

-[ **ContextGem** ]-

⚡ Fastest way

ContextGem is the fastest and easiest way to implement an LLM
extraction workflow. All the boilerplate code is handled behind the
scenes.

**Major time savers:**

* ⌨️ **Simple syntax**: ContextGem uses a simple, intuitive API that
  requires minimal code

* 🔄 **Automatic model definition**: ContextGem automatically defines
  the Pydantic model for structured output

* 📝 **Automatic prompt engineering**: ContextGem automatically
  constructs a prompt tailored to the extraction task

* 🧩 **Automatic output parsing**: ContextGem automatically parses the
  LLM's response

* 🔍 **Automatic reference tracking**: Precise references are
  automatically extracted and mapped to the original document

* 📏 **Flexible reference granularity**: References can be tracked at
  different levels (paragraphs, sentences)

* 📄 **Easy pipeline definition**: Simple, declarative syntax for
  defining the extraction pipeline involving multiple LLMs, in a few
  lines of code

* 💰 **Automated usage and cost tracking**: Built-in token counting
  and cost calculation without additional setup

* 🔄 **Built-in concurrency**: Concurrent execution of extraction
  steps with a simple switch "use_concurrency=True"

* 📊 **Easy example definition**: Output examples can be easily
  defined without modifying any prompts

* 📋 **Built-in result aggregation**: Results are automatically
  collected and organized in a unified storage model (document)

Extraction pipeline example (ContextGem)

   # Advanced Usage Example - analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency and cost tracking

   import os

   from contextgem import (
       Aspect,
       DateConcept,
       Document,
       DocumentLLM,
       DocumentLLMGroup,
       ExtractionPipeline,
       JsonObjectConcept,
       JsonObjectExample,
       LLMPricing,
       NumericalConcept,
       RatingConcept,
       StringConcept,
       StringExample,
   )


   # Construct documents

   # Document 1 - Consultancy Agreement (shortened for brevity)
   doc1 = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "All intellectual property created during the provision of services shall belong to the Customer...\n"
           "This agreement is governed by the laws of Norway...\n"
           "Annex 1: Data processing agreement...\n"
           "Annex 2: Statement of Work...\n"
           "Annex 3: Service Level Agreement...\n"
       ),
   )

   # Document 2 - Service Level Agreement (shortened for brevity)
   doc2 = Document(
       raw_text=(
           "Service Level Agreement\n"
           "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
           "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
           "The Provider shall deliver IT support services as outlined in Schedule A...\n"
           "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
           "The Provider guarantees [99.9%] uptime for all critical systems...\n"
           "Either party may terminate with 60 days written notice...\n"
           "This agreement is governed by the laws of California...\n"
           "Schedule A: Service Descriptions...\n"
           "Schedule B: Response Time Requirements...\n"
       ),
   )

   # Create a reusable extraction pipeline
   contract_pipeline = ExtractionPipeline()

   # Define aspects and aspect-level concepts in the pipeline
   # Concepts in the aspects will be extracted from the extracted aspect context
   contract_pipeline.aspects = [  # or use .add_aspects([...])
       Aspect(
           name="Contract Parties",
           description="Clauses defining the parties to the agreement",
           concepts=[  # define aspect-level concepts, if any
               StringConcept(
                   name="Party names and roles",
                   description="Names of all parties entering into the agreement and their roles",
                   examples=[  # optional
                       StringExample(
                           content="X (Client)",  # guidance regarding the expected output format
                       )
                   ],
               )
           ],
       ),
       Aspect(
           name="Term",
           description="Clauses defining the term of the agreement",
           concepts=[
               NumericalConcept(
                   name="Contract term",
                   description="The term of the agreement in years",
                   numeric_type="int",  # or "float", or "any" for auto-detection
                   add_references=True,  # extract references to the source text
                   reference_depth="paragraphs",
               )
           ],
       ),
   ]

   # Define document-level concepts
   # Concepts in the document will be extracted from the whole document content
   contract_pipeline.concepts = [  # or use .add_concepts()
       DateConcept(
           name="Effective date",
           description="The effective date of the agreement",
       ),
       StringConcept(
           name="Contract type",
           description="The type of agreement",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
       ),
       StringConcept(
           name="Governing law",
           description="The law that governs the agreement",
       ),
       JsonObjectConcept(
           name="Attachments",
           description="The titles and concise descriptions of the attachments to the agreement",
           structure={"title": str, "description": str | None},
           examples=[  # optional
               JsonObjectExample(  # guidance regarding the expected output format
                   content={
                       "title": "Appendix A",
                       "description": "Code of conduct",
                   }
               ),
           ],
       ),
       RatingConcept(
           name="Duration adequacy",
           description="Contract duration adequacy considering the subject matter and best practices.",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
           rating_scale=(1, 10),
           add_justifications=True,  # add justifications for the rating
           justification_depth="balanced",  # provide a balanced justification
           justification_max_sents=3,
       ),
   ]

   # Assign pipeline to the documents
   # You can re-use the same pipeline for multiple documents
   doc1.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document
   doc2.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document

   # Create an LLM group for data extraction and reasoning
   llm_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="extractor_text",  # signifies the LLM is used for data extraction tasks
       pricing_details=LLMPricing(  # optional, for cost calculation
           input_per_1m_tokens=0.150,
           output_per_1m_tokens=0.600,
       ),
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   llm_reasoner = DocumentLLM(
       model="openai/o3-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="reasoner_text",  # signifies the LLM is used for reasoning tasks
       pricing_details=LLMPricing(  # optional, for cost calculation
           input_per_1m_tokens=1.10,
           output_per_1m_tokens=4.40,
       ),
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   # The LLM group is used for all extraction tasks within the pipeline
   llm_group = DocumentLLMGroup(llms=[llm_extractor, llm_reasoner])

   # Extract all information from the documents at once
   doc1 = llm_group.extract_all(
       doc1, use_concurrency=True
   )  # use concurrency to speed up extraction
   doc2 = llm_group.extract_all(
       doc2, use_concurrency=True
   )  # use concurrency to speed up extraction
   # Or use async variants .extract_all_async(...)

   # Get the extracted data
   print("Some extracted data from doc 1:")
   print("Contract Parties > Party names and roles:")
   print(
       doc1.get_aspect_by_name("Contract Parties")
       .get_concept_by_name("Party names and roles")
       .extracted_items
   )
   print("Attachments:")
   print(doc1.get_concept_by_name("Attachments").extracted_items)
   # ...

   print("\nSome extracted data from doc 2:")
   print("Term > Contract term:")
   print(
       doc2.get_aspect_by_name("Term")
       .get_concept_by_name("Contract term")
       .extracted_items[0]
       .value
   )
   print("Duration adequacy:")
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].value)
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].justification)
   # ...

   # Output processing costs (requires setting the pricing details for each LLM)
   print("\nProcessing costs:")
   print(llm_group.get_cost())
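The pricing values passed to "LLMPricing" above are expressed per 1M
tokens. As a rough illustration of the arithmetic behind such pricing,
the sketch below computes costs for two models from token counts. The
token counts are hypothetical, and the helper "call_cost" is an
assumption made for this example; it is not ContextGem's internal
implementation.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, copied from the LLMPricing values in the example above
PRICING = {
    "gpt-4o-mini": {"input": Decimal("0.150"), "output": Decimal("0.600")},
    "o3-mini": {"input": Decimal("1.10"), "output": Decimal("4.40")},
}
TOKENS_PER_UNIT = Decimal(1_000_000)


def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one model's usage: tokens * price per 1M tokens / 1M."""
    prices = PRICING[model]
    return (
        input_tokens * prices["input"] + output_tokens * prices["output"]
    ) / TOKENS_PER_UNIT


# Hypothetical (input, output) token counts; real counts come from provider usage data
usage = {"gpt-4o-mini": (20_000, 2_000), "o3-mini": (5_000, 1_000)}
costs = {model: call_cost(model, *tokens) for model, tokens in usage.items()}
total = sum(costs.values(), Decimal(0))

for model, cost in costs.items():
    print(f"{model}: ${cost}")
print(f"total: ${total}")
```

"Decimal" is used here instead of floats so that small per-call
amounts add up without rounding drift.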

-[ LangChain ]-

LangChain provides a powerful and flexible framework for building LLM
applications with excellent composability and a rich ecosystem of
integrations. While it offers great versatility for many use cases, it
does require additional manual setup and configuration for complex
extraction workflows.

**Development overhead:**

* 📝 **Manual prompt engineering**: Must craft detailed prompts for
  each extraction step

* 🔧 **Manual model definition**: Need to define Pydantic models and
  output parsers for structured data

* 🧩 **Complex chain configuration**: Requires manual setup of chains
  and their connections involving multiple LLMs

* 🔍 **Manual reference mapping**: Must implement custom logic to
  track source references

* 🔄 **Complex concurrency setup**: Implementing concurrent processing
  requires additional setup with asyncio

* 💰 **Cost tracking setup**: Requires custom logic for cost tracking
  for each LLM

* 💾 **No unified storage model**: Need to write additional code to
  collect and organize results
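The last bullet can be made concrete with a small sketch of the kind
of aggregation code a developer ends up writing. The "DocumentResults"
container and the sample outputs are hypothetical and exist only for
this illustration; they are not part of LangChain or ContextGem.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DocumentResults:
    """Hypothetical container collecting extraction outputs for one document."""

    name: str
    results: dict[str, Any] = field(default_factory=dict)
    errors: dict[str, str] = field(default_factory=dict)

    def add(self, key: str, value: Any) -> None:
        # Exceptions (e.g. from asyncio.gather(..., return_exceptions=True))
        # are stored separately so one failed step does not hide the others
        if isinstance(value, Exception):
            self.errors[key] = str(value)
        else:
            self.results[key] = value


# Hand-written sample outputs standing in for real chain results
raw_outputs = {
    "parties": [{"name": "Company A", "role": "Supplier"}],
    "terms": ValueError("could not parse LLM output"),
}

doc_results = DocumentResults(name="doc1")
for key, value in raw_outputs.items():
    doc_results.add(key, value)

print(doc_results.results)
print(doc_results.errors)
```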

Extraction pipeline example (LangChain)

   # LangChain implementation of analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency, and cost tracking
   # Jupyter notebook compatible version

   import asyncio
   import os
   import time
   from dataclasses import dataclass, field
   from textwrap import dedent

   import nest_asyncio


   nest_asyncio.apply()

   from langchain.callbacks import get_openai_callback
   from langchain.output_parsers import PydanticOutputParser
   from langchain.prompts import PromptTemplate
   from langchain_core.runnables import (
       RunnableLambda,
       RunnableParallel,
       RunnablePassthrough,
   )
   from langchain_openai import ChatOpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class PartyInfo(BaseModel):
       """Information about contract parties"""

       name: str = Field(description="Name of the party")
       role: str = Field(description="Role of the party (e.g., Client, Provider)")


   class Term(BaseModel):
       """Contract term information"""

       duration_years: int = Field(description="Duration in years")
       reference: str = Field(
           description="Reference text from document"
       )  # LLM reciting a reference is error-prone and unreliable


   class Attachment(BaseModel):
       """Contract attachment information"""

       title: str = Field(description="Title of the attachment")
       description: str | None = Field(description="Brief description of the attachment")


   class ContractRating(BaseModel):
       """Rating with justification"""

       score: int = Field(description="Rating score (1-10)")
       justification: str = Field(description="Justification for the rating")


   class ContractInfo(BaseModel):
       """Complete contract information"""

       contract_type: str = Field(description="Type of contract")
       effective_date: str | None = Field(description="Effective date of the contract")
       governing_law: str | None = Field(description="Governing law of the contract")


   class AspectExtraction(BaseModel):
       """Result of aspect extraction"""

       aspect_text: str = Field(
           description="Extracted text for this aspect"
       )  # this does not provide granular structured content, such as specific paragraphs and sentences


   class PartyExtraction(BaseModel):
       """Party extraction results"""

       parties: list[PartyInfo] = Field(description="List of parties in the contract")


   class TermExtraction(BaseModel):
       """Term extraction results"""

       terms: list[Term] = Field(description="Contract term details")


   class AttachmentExtraction(BaseModel):
       """Attachment extraction results"""

       attachments: list[Attachment] = Field(description="List of contract attachments")


   class DurationRatingExtraction(BaseModel):
       """Duration adequacy rating"""

       rating: ContractRating = Field(description="Rating of contract duration adequacy")


   # Configuration models must be manually defined
   @dataclass
   class ExtractorConfig:
       """Configuration for a specific extractor"""

       name: str
       description: str
       model_name: str = "gpt-4o-mini"  # Default model


   @dataclass
   class PipelineConfig:
       """Complete pipeline configuration"""

       # Aspect extractors
       party_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Parties",
               description="Clauses defining the parties to the agreement",
           )
       )

       term_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Term", description="Clauses defining the term of the agreement"
           )
       )

       # Document-level extractors
       contract_info_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Information",
               description="Basic contract information including type, date, and governing law",
           )
       )

       attachment_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Attachments",
               description="Contract attachments and their descriptions",
           )
       )

       duration_rating_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Duration Rating",
               description="Rating of contract duration adequacy",
               model_name="o3-mini",  # Using a more capable model for judgment
           )
       )


   # LLM configuration
   def get_llm(model_name="gpt-4o-mini", api_key=None):
       """Get a ChatOpenAI instance with the specified configuration"""
       # Skipped temperature etc. for brevity, as e.g. temperature is not supported by o3-mini
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
       return ChatOpenAI(model=model_name, openai_api_key=api_key)


   # Chain components must be manually defined
   def create_aspect_extractor(aspect_name, aspect_description, model_name="gpt-4o-mini"):
       """Create a chain to extract text related to a specific aspect"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=AspectExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
           You are an expert document analyzer. Extract the text related to the following aspect from the document.
           
           Document:
           {document_text}
           
           Aspect: {aspect_name}
           Description: {aspect_description}
           
           Extract all text related to this aspect.
           {format_instructions}
           """
           ),
           input_variables=["document_text", "aspect_name", "aspect_description"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )  # this does not provide granular structured content, such as specific paragraphs and sentences

       chain = prompt | llm | parser

       # Return a callable that works with both sync and async code
       def extractor(doc):
           return chain.invoke(
               {
                   "document_text": doc,
                   "aspect_name": aspect_name,
                   "aspect_description": aspect_description,
               }
           )

       # Add an async version that will be used when awaited
       async def async_extractor(doc):
           return await chain.ainvoke(
               {
                   "document_text": doc,
                   "aspect_name": aspect_name,
                   "aspect_description": aspect_description,
               }
           )

       extractor.ainvoke = async_extractor
       return extractor


   def create_party_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract party information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=PartyExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
           You are an expert document analyzer. Extract all party information from the following contract text.
           
           Contract text:
           {aspect_text}
           
           For each party, extract their name and role in the agreement.
           {format_instructions}
           """
           ),
           input_variables=["aspect_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_term_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract term information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=TermExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
           You are an expert document analyzer. Extract term information from the following contract text.
           
           Contract text:
           {aspect_text}
           
           Extract the contract term duration in years. Include the relevant reference text.
           {format_instructions}
           """
           ),
           input_variables=["aspect_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_contract_info_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract basic contract information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=ContractInfo)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
           You are an expert document analyzer. Extract the following information from the contract document.
           
           Contract document:
           {document_text}
           
           Extract the contract type, effective date if mentioned, and governing law if specified.
           {format_instructions}
           """
           ),
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_attachment_extractor(model_name="gpt-4o-mini"):
       """Create a chain to extract attachment information"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=AttachmentExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
           You are an expert document analyzer. Extract information about all attachments, annexes, 
           schedules, or appendices mentioned in the contract.
           
           Contract document:
           {document_text}
           
           For each attachment, extract:
           1. The title/name of the attachment (e.g., "Appendix A", "Schedule 1", "Annex 2")
           2. A brief description of what the attachment contains (if mentioned in the document)
           
           Example format:
           {{"title": "Appendix A", "description": "Code of conduct"}}
           
           {format_instructions}
           """
           ),
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   def create_duration_rating_extractor(model_name="o3-mini"):
       """Create a chain to rate contract duration adequacy"""
       llm = get_llm(model_name=model_name)
       parser = PydanticOutputParser(pydantic_object=DurationRatingExtraction)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = PromptTemplate(
           template=dedent(
               """
           You are an expert contract analyst. Evaluate the adequacy of the contract duration 
           considering the subject matter and best practices.
           
           Contract document:
           {document_text}
           
           Rate the duration adequacy on a scale of 1-10, where:
           1 = Extremely inadequate duration
           10 = Perfectly adequate duration
           
           Provide a brief justification for your rating (2-3 sentences).
           {format_instructions}
           """
           ),
           input_variables=["document_text"],
           partial_variables={"format_instructions": parser.get_format_instructions()},
       )

       chain = prompt | llm | parser
       return chain


   # Complete pipeline definition
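   def create_document_pipeline(config=None):
       """Create a complete document analysis pipeline and return it along with its components"""
       # Build the default config per call rather than sharing one instance
       # created at function definition time
       if config is None:
           config = PipelineConfig()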

       # Create aspect extractors
       party_aspect_extractor = create_aspect_extractor(
           config.party_extractor.name,
           config.party_extractor.description,
           config.party_extractor.model_name,
       )

       term_aspect_extractor = create_aspect_extractor(
           config.term_extractor.name,
           config.term_extractor.description,
           config.term_extractor.model_name,
       )

       # Create concept extractors for aspects
       party_extractor = create_party_extractor(config.party_extractor.model_name)
       term_extractor = create_term_extractor(config.term_extractor.model_name)

       # Create document-level extractors
       contract_info_extractor = create_contract_info_extractor(
           config.contract_info_extractor.model_name
       )
       attachment_extractor = create_attachment_extractor(
           config.attachment_extractor.model_name
       )
       duration_rating_extractor = create_duration_rating_extractor(
           config.duration_rating_extractor.model_name
       )

       # Create aspect extraction pipeline
       party_pipeline = (
           RunnablePassthrough()
           | party_aspect_extractor
           | RunnableLambda(lambda x: {"aspect_text": x.aspect_text})
           | party_extractor
       )

       term_pipeline = (
           RunnablePassthrough()
           | term_aspect_extractor
           | RunnableLambda(lambda x: {"aspect_text": x.aspect_text})
           | term_extractor
       )

       # Create document-level extraction pipeline
       document_extraction = RunnableParallel(
           contract_info=contract_info_extractor,
           attachments=attachment_extractor,
           duration_rating=duration_rating_extractor,
       )

       # Combine into complete pipeline
       complete_pipeline = RunnableParallel(
           parties=party_pipeline, terms=term_pipeline, document_info=document_extraction
       )

       # Create a components dictionary for easy access
       components = {
           "party_pipeline": party_pipeline,
           "term_pipeline": term_pipeline,
           "contract_info_extractor": contract_info_extractor,
           "attachment_extractor": attachment_extractor,
           "duration_rating_extractor": duration_rating_extractor,
       }

       return complete_pipeline, components


   # Cost tracking
   class CostTracker:
       """Track LLM costs across multiple extractions"""

       def __init__(self):
           self.costs = {
               "gpt-4o-mini": {
                   "input_per_1m": 0.15,
                   "output_per_1m": 0.60,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
               "o3-mini": {
                   "input_per_1m": 1.10,
                   "output_per_1m": 4.40,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
           }
           self.total_cost = 0.0

       def track_usage(self, model_name, input_tokens, output_tokens):
           """Track token usage for a model"""
           # Keep only the base model name, e.g. "openai/gpt-4o-mini" -> "gpt-4o-mini"
           base_model = model_name.split("/")[-1]

           if base_model in self.costs:
               self.costs[base_model]["input_tokens"] += input_tokens
               self.costs[base_model]["output_tokens"] += output_tokens

               # Calculate costs separately for input and output tokens
               input_cost = input_tokens * (
                   self.costs[base_model]["input_per_1m"] / 1000000
               )
               output_cost = output_tokens * (
                   self.costs[base_model]["output_per_1m"] / 1000000
               )

               self.total_cost += input_cost + output_cost

       def get_costs(self):
           """Get cost summary"""
           model_costs = {}
           for model, data in self.costs.items():
               if data["input_tokens"] > 0 or data["output_tokens"] > 0:
                   input_cost = data["input_tokens"] * (data["input_per_1m"] / 1000000)
                   output_cost = data["output_tokens"] * (data["output_per_1m"] / 1000000)
                   model_costs[model] = {
                       "input_cost": input_cost,
                       "output_cost": output_cost,
                       "total_cost": input_cost + output_cost,
                       "input_tokens": data["input_tokens"],
                       "output_tokens": data["output_tokens"],
                   }

           return {
               "model_costs": model_costs,
               "total_cost": self.total_cost,
           }


   # Document processing functions
   async def process_document_async(
       document_text, pipeline_and_components, cost_tracker=None, use_concurrency=True
   ):
       """Process a document asynchronously and track costs"""
       pipeline, components = pipeline_and_components  # Unpack the pipeline and components
       results = {}

       # Track tokens used across all calls
       total_tokens = {
           "gpt-4o-mini": {"input": 0, "output": 0},
           "o3-mini": {"input": 0, "output": 0},
       }

       # Use the provided components
       async def process_parties():
           """Process parties using the party pipeline"""
           with get_openai_callback() as cb:
               party_results = await components["party_pipeline"].ainvoke(document_text)
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
           return party_results

       async def process_terms():
           """Process terms using the term pipeline"""
           with get_openai_callback() as cb:
               term_results = await components["term_pipeline"].ainvoke(document_text)
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
           return term_results

       async def process_contract_info():
           """Process contract info"""
           with get_openai_callback() as cb:
               info_results = await components["contract_info_extractor"].ainvoke(
                   document_text
               )
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
           return info_results

       async def process_attachments():
           """Process attachments"""
           with get_openai_callback() as cb:
               attachment_results = await components["attachment_extractor"].ainvoke(
                   document_text
               )
               total_tokens["gpt-4o-mini"]["input"] += cb.prompt_tokens
               total_tokens["gpt-4o-mini"]["output"] += cb.completion_tokens
           return attachment_results

       async def process_duration_rating():
           """Process duration rating"""
           with get_openai_callback() as cb:
               duration_results = await components["duration_rating_extractor"].ainvoke(
                   document_text
               )
               # Duration rating is done with o3-mini
               total_tokens["o3-mini"]["input"] += cb.prompt_tokens
               total_tokens["o3-mini"]["output"] += cb.completion_tokens
           return duration_results

       # Run extractions based on concurrency preference
       if use_concurrency:
           # Process all extractions concurrently for maximum speed
           (
               parties,
               terms,
               contract_info,
               attachments,
               duration_rating,
           ) = await asyncio.gather(
               process_parties(),
               process_terms(),
               process_contract_info(),
               process_attachments(),
               process_duration_rating(),
           )
       else:
           # Process extractions sequentially
           parties = await process_parties()
           terms = await process_terms()
           contract_info = await process_contract_info()
           attachments = await process_attachments()
           duration_rating = await process_duration_rating()

       # Update cost tracker if provided
       if cost_tracker:
           for model, tokens in total_tokens.items():
               cost_tracker.track_usage(model, tokens["input"], tokens["output"])

       # Structure results in an easy-to-use format
       results["contract_type"] = contract_info.contract_type
       results["governing_law"] = contract_info.governing_law
       results["effective_date"] = contract_info.effective_date
       results["parties"] = parties.parties
       results["term_years"] = terms.terms[0].duration_years if terms.terms else None
       results["term_reference"] = terms.terms[0].reference if terms.terms else None
       results["attachments"] = attachments.attachments
       results["duration_rating"] = duration_rating.rating

       return results


   def process_document(
       document_text, pipeline_and_components, cost_tracker=None, use_concurrency=True
   ):
       """
       Process a document and track costs.
       This is a Jupyter-compatible version that uses the existing event loop
       instead of creating a new one with asyncio.run().
       """
       # Get the current event loop
       loop = asyncio.get_event_loop()
       # Run the async function in the current event loop
       return loop.run_until_complete(
           process_document_async(
               document_text, pipeline_and_components, cost_tracker, use_concurrency
           )
       )


   # Example usage
   # Sample contract texts (shortened for brevity)
   doc1_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "All intellectual property created during the provision of services shall belong to the Customer...\n"
       "This agreement is governed by the laws of Norway...\n"
       "Annex 1: Data processing agreement...\n"
       "Annex 2: Statement of Work...\n"
       "Annex 3: Service Level Agreement...\n"
   )

   doc2_text = (
       "Service Level Agreement\n"
       "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
       "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
       "The Provider shall deliver IT support services as outlined in Schedule A...\n"
       "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
       "The Provider guarantees [99.9%] uptime for all critical systems...\n"
       "Either party may terminate with 60 days written notice...\n"
       "This agreement is governed by the laws of California...\n"
       "Schedule A: Service Descriptions...\n"
       "Schedule B: Response Time Requirements...\n"
   )


   # Function to pretty-print document results
   def print_document_results(doc_name, results):
       print(f"\nResults from {doc_name}:")
       print(f"Contract Type: {results['contract_type']}")
       print(f"Parties: {[f'{p.name} ({p.role})' for p in results['parties']]}")
       print(f"Term: {results['term_years']} years")
       print(
           f"Term Reference: {results['term_reference'] if results['term_reference'] else 'Not specified'}"
       )
       print(f"Governing Law: {results['governing_law']}")
       print(f"Attachments: {[(a.title, a.description) for a in results['attachments']]}")
       print(f"Duration Rating: {results['duration_rating'].score}/10")
       print(f"Rating Justification: {results['duration_rating'].justification}")


   # Create cost tracker
   cost_tracker = CostTracker()

   # Create pipeline with default configuration - returns both pipeline and components
   pipeline, pipeline_components = create_document_pipeline()

   # Process documents
   print("Processing document 1 with concurrency...")
   start_time = time.time()
   doc1_results = process_document(
       doc1_text, (pipeline, pipeline_components), cost_tracker, use_concurrency=True
   )
   print(f"Processing time: {time.time() - start_time:.2f} seconds")

   print("Processing document 2 with concurrency...")
   start_time = time.time()
   doc2_results = process_document(
       doc2_text, (pipeline, pipeline_components), cost_tracker, use_concurrency=True
   )
   print(f"Processing time: {time.time() - start_time:.2f} seconds")

   # Print results
   print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
   print_document_results("Document 2 (Service Level Agreement)", doc2_results)

   # Print cost information
   print("\nProcessing costs:")
   costs = cost_tracker.get_costs()
   for model, model_data in costs["model_costs"].items():
       print(f"\n{model}:")
       print(f"  Input cost: ${model_data['input_cost']:.4f}")
       print(f"  Output cost: ${model_data['output_cost']:.4f}")
       print(f"  Total cost: ${model_data['total_cost']:.4f}")
   print(f"\nTotal across all models: ${costs['total_cost']:.4f}")
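
The per-model figures printed above come from simple per-million-token
arithmetic. As a minimal sketch, assuming the "gpt-4o-mini" prices
hard-coded in "CostTracker" (USD 0.15 per 1M input tokens and USD 0.60
per 1M output tokens) and token counts invented purely for
illustration, the calculation reduces to:

```python
# Prices copied from the CostTracker example above (USD per 1M tokens).
INPUT_PER_1M = 0.15
OUTPUT_PER_1M = 0.60


def usage_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one model's usage at the prices above."""
    input_cost = input_tokens * INPUT_PER_1M / 1_000_000
    output_cost = output_tokens * OUTPUT_PER_1M / 1_000_000
    return input_cost + output_cost


# Hypothetical token counts, not measured from a real run.
cost = usage_cost(input_tokens=12_000, output_tokens=3_000)
print(f"${cost:.4f}")  # prints $0.0036
```

Because the prices are hard-coded, they have to be updated by hand
whenever a provider changes its pricing.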

-[ LlamaIndex ]-

LlamaIndex provides a robust data framework for LLM applications with
excellent capabilities for knowledge retrieval and RAG. It offers
powerful tools for working with documents and structured data, but
complex extraction workflows still leave the prompts, data models, and
pipeline wiring to the developer.

**Development overhead:**

* 📝 **Manual prompt engineering**: Must craft detailed prompts for
  each extraction task

* 🔧 **Manual model definition**: Need to define Pydantic models and
  output parsers for structured data

* 🧩 **Pipeline setup**: Requires manual configuration of extraction
  pipeline components involving multiple LLMs

* 🔍 **Limited reference tracking**: Basic source tracking, but
  requires additional work for fine-grained references

* 📊 **Embedding examples in prompts**: Examples must be manually
  incorporated into prompts

* 🔄 **Complex concurrency setup**: Implementing concurrent processing
  requires additional setup with asyncio

* 💰 **Cost tracking setup**: Requires custom logic for cost tracking
  for each LLM

* 💾 **No unified storage model**: Need to write additional code to
  collect and organize results

Extraction pipeline example (LlamaIndex)

   # LlamaIndex implementation of analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency, and cost tracking
   # Jupyter notebook compatible version

   import asyncio
   import os
   from textwrap import dedent

   import nest_asyncio


   nest_asyncio.apply()

   from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
   from llama_index.core.output_parsers import PydanticOutputParser
   from llama_index.core.program import LLMTextCompletionProgram
   from llama_index.llms.openai import OpenAI
   from pydantic import BaseModel, Field


   # Pydantic models must be manually defined
   class PartyInfo(BaseModel):
       """Information about contract parties"""

       name: str = Field(description="Name of the party")
       role: str = Field(description="Role of the party (e.g., Client, Provider)")


   class Term(BaseModel):
       """Contract term information"""

       duration_years: int = Field(description="Duration in years")
       reference: str = Field(
           description="Reference text from document"
       )  # LLM reciting a reference is error-prone and unreliable


   class Attachment(BaseModel):
       """Contract attachment information"""

       title: str = Field(description="Title of the attachment")
       description: str | None = Field(
           default=None, description="Brief description of the attachment"
       )


   class ContractRating(BaseModel):
       """Rating with justification"""

       score: int = Field(description="Rating score (1-10)")
       justification: str = Field(description="Justification for the rating")


   class ContractInfo(BaseModel):
       """Complete contract information"""

       contract_type: str = Field(description="Type of contract")
       effective_date: str | None = Field(
           default=None, description="Effective date of the contract"
       )
       governing_law: str | None = Field(
           default=None, description="Governing law of the contract"
       )


   class AspectExtraction(BaseModel):
       """Result of aspect extraction"""

       aspect_text: str = Field(
           description="Extracted text for this aspect"
       )  # this does not provide granular structured content, such as specific paragraphs and sentences


   class PartyExtraction(BaseModel):
       """Party extraction results"""

       parties: list[PartyInfo] = Field(description="List of parties in the contract")


   class TermExtraction(BaseModel):
       """Term extraction results"""

       terms: list[Term] = Field(description="Contract term details")


   class AttachmentExtraction(BaseModel):
       """Attachment extraction results"""

       attachments: list[Attachment] = Field(description="List of contract attachments")


   class DurationRatingExtraction(BaseModel):
       """Duration adequacy rating"""

       rating: ContractRating = Field(description="Rating of contract duration adequacy")


   # Cost tracking class
   class CostTracker:
       """Track LLM costs across multiple extractions"""

       def __init__(self):
           self.costs = {
               "gpt-4o-mini": {
                   "input_per_1m": 0.15,
                   "output_per_1m": 0.60,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
               "o3-mini": {
                   "input_per_1m": 1.10,
                   "output_per_1m": 4.40,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
           }
           self.total_cost = 0.0

       def track_usage(self, model_name, input_tokens, output_tokens):
           """Track token usage for a model"""
           # Keep only the base model name, e.g. "openai/gpt-4o-mini" -> "gpt-4o-mini"
           base_model = model_name.split("/")[-1]

           if base_model in self.costs:
               self.costs[base_model]["input_tokens"] += input_tokens
               self.costs[base_model]["output_tokens"] += output_tokens

               # Calculate costs separately for input and output tokens
               input_cost = input_tokens * (
                   self.costs[base_model]["input_per_1m"] / 1000000
               )
               output_cost = output_tokens * (
                   self.costs[base_model]["output_per_1m"] / 1000000
               )

               self.total_cost += input_cost + output_cost

       def get_costs(self):
           """Get cost summary"""
           model_costs = {}
           for model, data in self.costs.items():
               if data["input_tokens"] > 0 or data["output_tokens"] > 0:
                   input_cost = data["input_tokens"] * (data["input_per_1m"] / 1000000)
                   output_cost = data["output_tokens"] * (data["output_per_1m"] / 1000000)
                   model_costs[model] = {
                       "input_cost": input_cost,
                       "output_cost": output_cost,
                       "total_cost": input_cost + output_cost,
                       "input_tokens": data["input_tokens"],
                       "output_tokens": data["output_tokens"],
                   }

           return {
               "model_costs": model_costs,
               "total_cost": self.total_cost,
           }


   # Helper functions for extractors
   def get_llm(model_name="gpt-4o-mini", api_key=None, temperature=0, token_counter=None):
       """Get an OpenAI instance with the specified configuration"""
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")

       # Create callback manager with token counter if provided
       callback_manager = None
       if token_counter is not None:
           callback_manager = CallbackManager([token_counter])

       return OpenAI(
           model=model_name,
           api_key=api_key,
           temperature=temperature,
           callback_manager=callback_manager,
       )


   def create_aspect_extractor(
       aspect_name, aspect_description, model_name="gpt-4o-mini", token_counter=None
   ):
       """Create an extractor to extract text related to a specific aspect"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           f"""
       You are an expert document analyzer. Extract the text related to the following aspect from the document.
       
       Document:
       {{document_text}}
       
       Aspect: {aspect_name}
       Description: {aspect_description}
       
       Extract all text related to this aspect.
       """
       )  # this does not provide granular structured content, such as specific paragraphs and sentences

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=AspectExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )
       return program


   def create_party_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for party information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
       You are an expert document analyzer. Extract all party information from the following contract text.
       
       Contract text:
       {aspect_text}
       
       For each party, extract their name and role in the agreement.
       """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=PartyExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )
       return program


   def create_term_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for term information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
       You are an expert document analyzer. Extract term information from the following contract text.
       
       Contract text:
       {aspect_text}
       
       Extract the contract term duration in years. Include the relevant reference text.
       """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=TermExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )
       return program


   def create_contract_info_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for basic contract information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
       You are an expert document analyzer. Extract the following information from the contract document.
       
       Contract document:
       {document_text}
       
       Extract the contract type, effective date if mentioned, and governing law if specified.
       """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=ContractInfo),
           prompt_template_str=prompt_template,
           llm=llm,
       )
       return program


   def create_attachment_extractor(model_name="gpt-4o-mini", token_counter=None):
       """Create an extractor for attachment information"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
       You are an expert document analyzer. Extract information about all attachments, annexes, 
       schedules, or appendices mentioned in the contract.
       
       Contract document:
       {document_text}
       
       For each attachment, extract:
       1. The title/name of the attachment (e.g., "Appendix A", "Schedule 1", "Annex 2")
       2. A brief description of what the attachment contains (if mentioned in the document)
       
       Example format:
       {"title": "Appendix A", "description": "Code of conduct"}
       """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=AttachmentExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )
       return program


   def create_duration_rating_extractor(model_name="o3-mini", token_counter=None):
       """Create an extractor to rate contract duration adequacy"""
       llm = get_llm(model_name=model_name, token_counter=token_counter)

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt_template = dedent(
           """
       You are an expert contract analyst. Evaluate the adequacy of the contract duration 
       considering the subject matter and best practices.
       
       Contract document:
       {document_text}
       
       Rate the duration adequacy on a scale of 1-10, where:
       1 = Extremely inadequate duration
       10 = Perfectly adequate duration
       
       Provide a brief justification for your rating (2-3 sentences).
       """
       )

       program = LLMTextCompletionProgram.from_defaults(
           output_parser=PydanticOutputParser(output_cls=DurationRatingExtraction),
           prompt_template_str=prompt_template,
           llm=llm,
       )
       return program


   # Main document processing functions
   async def process_document_async(
       document_text, cost_tracker=None, use_concurrency=True
   ):
       """Process a document asynchronously and track costs"""
       results = {}

       # Create separate token counting handlers for each model
       gpt4o_token_counter = TokenCountingHandler()
       o3_token_counter = TokenCountingHandler()

       # Create extractors with appropriate token counters
       party_aspect_extractor = create_aspect_extractor(
           "Contract Parties",
           "Clauses defining the parties to the agreement",
           token_counter=gpt4o_token_counter,
       )
       term_aspect_extractor = create_aspect_extractor(
           "Term",
           "Clauses defining the term of the agreement",
           token_counter=gpt4o_token_counter,
       )
       party_extractor = create_party_extractor(token_counter=gpt4o_token_counter)
       term_extractor = create_term_extractor(token_counter=gpt4o_token_counter)
       contract_info_extractor = create_contract_info_extractor(
           token_counter=gpt4o_token_counter
       )
       attachment_extractor = create_attachment_extractor(
           token_counter=gpt4o_token_counter
       )

       # Use separate token counter for o3-mini
       duration_rating_extractor = create_duration_rating_extractor(
           model_name="o3-mini", token_counter=o3_token_counter
       )

       # Define processing functions using native async methods
       async def process_party_aspect():
           response = await party_aspect_extractor.acall(document_text=document_text)
           return response

       async def process_term_aspect():
           response = await term_aspect_extractor.acall(document_text=document_text)
           return response

       # Get aspect texts
       if use_concurrency:
           party_aspect, term_aspect = await asyncio.gather(
               process_party_aspect(), process_term_aspect()
           )
       else:
           party_aspect = await process_party_aspect()
           term_aspect = await process_term_aspect()

       async def process_parties():
           party_results = await party_extractor.acall(
               aspect_text=party_aspect.aspect_text
           )
           return party_results

       async def process_terms():
           term_results = await term_extractor.acall(aspect_text=term_aspect.aspect_text)
           return term_results

       async def process_contract_info():
           contract_info = await contract_info_extractor.acall(document_text=document_text)
           return contract_info

       async def process_attachments():
           attachments = await attachment_extractor.acall(document_text=document_text)
           return attachments

       async def process_duration_rating():
           duration_rating = await duration_rating_extractor.acall(
               document_text=document_text
           )
           return duration_rating

       # Run extractions based on concurrency preference
       if use_concurrency:
           (
               parties,
               terms,
               contract_info,
               attachments,
               duration_rating,
           ) = await asyncio.gather(
               process_parties(),
               process_terms(),
               process_contract_info(),
               process_attachments(),
               process_duration_rating(),
           )
       else:
           parties = await process_parties()
           terms = await process_terms()
           contract_info = await process_contract_info()
           attachments = await process_attachments()
           duration_rating = await process_duration_rating()

       # Get token usage from the token counter and update cost tracker
       if cost_tracker:
           cost_tracker.track_usage(
               "gpt-4o-mini",
               gpt4o_token_counter.prompt_llm_token_count,
               gpt4o_token_counter.completion_llm_token_count,
           )
           cost_tracker.track_usage(
               "o3-mini",
               o3_token_counter.prompt_llm_token_count,
               o3_token_counter.completion_llm_token_count,
           )

       # Structure results in an easy-to-use format
       results["contract_type"] = contract_info.contract_type
       results["governing_law"] = contract_info.governing_law
       results["effective_date"] = contract_info.effective_date
       results["parties"] = parties.parties
       results["term_years"] = terms.terms[0].duration_years if terms.terms else None
       results["term_reference"] = terms.terms[0].reference if terms.terms else None
       results["attachments"] = attachments.attachments
       results["duration_rating"] = duration_rating.rating

       return results


   def process_document(document_text, cost_tracker=None, use_concurrency=True):
       """
       Process a document and track costs.
       This is a Jupyter-compatible version: it reuses the notebook's running
       event loop (made re-entrant by nest_asyncio.apply() above) instead of
       creating a new one with asyncio.run().
       """
       loop = asyncio.get_event_loop()
       return loop.run_until_complete(
           process_document_async(document_text, cost_tracker, use_concurrency)
       )


   # Function to pretty-print document results
   def print_document_results(doc_name, results):
       print(f"\nResults from {doc_name}:")
       print(f"Contract Type: {results['contract_type']}")
       print(f"Parties: {[f'{p.name} ({p.role})' for p in results['parties']]}")
       print(f"Term: {results['term_years']} years")
       print(
           f"Term Reference: {results['term_reference'] if results['term_reference'] else 'Not specified'}"
       )
       print(f"Governing Law: {results['governing_law']}")
       print(f"Attachments: {[(a.title, a.description) for a in results['attachments']]}")
       print(f"Duration Rating: {results['duration_rating'].score}/10")
       print(f"Rating Justification: {results['duration_rating'].justification}")


   # Example usage
   # Sample contract texts (shortened for brevity)
   doc1_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "All intellectual property created during the provision of services shall belong to the Customer...\n"
       "This agreement is governed by the laws of Norway...\n"
       "Annex 1: Data processing agreement...\n"
       "Annex 2: Statement of Work...\n"
       "Annex 3: Service Level Agreement...\n"
   )

   doc2_text = (
       "Service Level Agreement\n"
       "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
       "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
       "The Provider shall deliver IT support services as outlined in Schedule A...\n"
       "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
       "The Provider guarantees [99.9%] uptime for all critical systems...\n"
       "Either party may terminate with 60 days written notice...\n"
       "This agreement is governed by the laws of California...\n"
       "Schedule A: Service Descriptions...\n"
       "Schedule B: Response Time Requirements...\n"
   )


   # Create cost tracker
   cost_tracker = CostTracker()

   # Process documents
   print("Processing document 1 with concurrency...")
   doc1_results = process_document(doc1_text, cost_tracker, use_concurrency=True)

   print("Processing document 2 with concurrency...")
   doc2_results = process_document(doc2_text, cost_tracker, use_concurrency=True)

   # Print results
   print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
   print_document_results("Document 2 (Service Level Agreement)", doc2_results)

   # Print cost information
   print("\nProcessing costs:")
   costs = cost_tracker.get_costs()
   for model, model_data in costs["model_costs"].items():
       print(f"\n{model}:")
       print(f"  Input cost: ${model_data['input_cost']:.4f}")
       print(f"  Output cost: ${model_data['output_cost']:.4f}")
       print(f"  Total cost: ${model_data['total_cost']:.4f}")
   print(f"\nTotal across all models: ${costs['total_cost']:.4f}")

-[ Instructor ]-

Instructor is a powerful library focused on structured outputs from
LLMs with strong typing support through Pydantic. It excels at
extracting structured data with validation, but requires additional
work to build complex extraction pipelines.

**Development overhead:**

* 📝 **Manual prompt engineering**: Crafting comprehensive prompts
  that guide the LLM effectively

* 🔧 **Manual model definition**: Developers must define Pydantic
  validation models for structured output

* 🧩 **Manual pipeline assembly**: Requires custom code to connect
  extraction components involving multiple LLMs

* 🔍 **Manual reference mapping**: Must implement custom logic to
  track source references

* 📊 **Embedding examples in prompts**: Examples must be manually
  incorporated into prompts

* 🔄 **Complex concurrency setup**: Implementing concurrent processing
  requires additional setup with asyncio

* 💰 **Cost tracking setup**: Requires custom logic for cost tracking
  for each LLM
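
The concurrency item above usually comes down to the "asyncio.gather" fan-out pattern. The following is a minimal sketch with stubbed calls, assuming each extraction is an independent coroutine; "fake_extraction" and "run_pipeline" are hypothetical names used only for illustration, and the sleeps stand in for real LLM requests.

```python
import asyncio


async def fake_extraction(name: str, delay: float) -> str:
    """Stand-in for an LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_pipeline(use_concurrency: bool = True) -> list[str]:
    tasks = [
        ("parties", 0.02),
        ("terms", 0.01),
        ("attachments", 0.03),
    ]
    if use_concurrency:
        # gather() schedules all coroutines at once and returns results
        # in the order the coroutines were passed, not completion order.
        return await asyncio.gather(*(fake_extraction(n, d) for n, d in tasks))
    # Sequential fallback: each call waits for the previous one.
    return [await fake_extraction(n, d) for n, d in tasks]


results = asyncio.run(run_pipeline())
print(results)
```

Because "gather" preserves argument order, the results can be unpacked positionally, which is what the full pipeline example relies on. In a notebook with an already running loop, "await run_pipeline()" would be used in place of "asyncio.run(...)".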

Extraction pipeline example (Instructor)

   # Instructor implementation of analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency, and cost tracking
   # Jupyter notebook compatible version

   import asyncio
   import os
   from dataclasses import dataclass, field
   from textwrap import dedent

   import instructor
   import nest_asyncio
   from openai import AsyncOpenAI, OpenAI
   from pydantic import BaseModel, Field


   nest_asyncio.apply()


   # Pydantic models must be manually defined
   class PartyInfo(BaseModel):
       """Information about contract parties"""

       name: str = Field(description="Name of the party")
       role: str = Field(description="Role of the party (e.g., Client, Provider)")


   class Term(BaseModel):
       """Contract term information"""

       duration_years: int = Field(description="Duration in years")
       reference: str = Field(
           description="Reference text from document"
       )  # LLM reciting a reference is error-prone and unreliable


   class Attachment(BaseModel):
       """Contract attachment information"""

       title: str = Field(description="Title of the attachment")
       description: str | None = Field(description="Brief description of the attachment")


   class ContractRating(BaseModel):
       """Rating with justification"""

       score: int = Field(description="Rating score (1-10)")
       justification: str = Field(description="Justification for the rating")


   class ContractInfo(BaseModel):
       """Complete contract information"""

       contract_type: str = Field(description="Type of contract")
       effective_date: str | None = Field(description="Effective date of the contract")
       governing_law: str | None = Field(description="Governing law of the contract")


   class AspectExtraction(BaseModel):
       """Result of aspect extraction"""

       aspect_text: str = Field(
           description="Extracted text for this aspect"
       )  # this does not provide granular structured content, such as specific paragraphs and sentences


   class PartyExtraction(BaseModel):
       """Party extraction results"""

       parties: list[PartyInfo] = Field(description="List of parties in the contract")


   class TermExtraction(BaseModel):
       """Term extraction results"""

       terms: list[Term] = Field(description="Contract term details")


   class AttachmentExtraction(BaseModel):
       """Attachment extraction results"""

       attachments: list[Attachment] = Field(description="List of contract attachments")


   class DurationRatingExtraction(BaseModel):
       """Duration adequacy rating"""

       rating: ContractRating = Field(description="Rating of contract duration adequacy")


   # Configuration models must be manually defined
   @dataclass
   class ExtractorConfig:
       """Configuration for a specific extractor"""

       name: str
       description: str
       model_name: str = "gpt-4o-mini"  # Default model


   @dataclass
   class PipelineConfig:
       """Complete pipeline configuration"""

       # Aspect extractors
       party_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Parties",
               description="Clauses defining the parties to the agreement",
           )
       )

       term_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Term", description="Clauses defining the term of the agreement"
           )
       )

       # Document-level extractors
       contract_info_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Contract Information",
               description="Basic contract information including type, date, and governing law",
           )
       )

       attachment_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Attachments",
               description="Contract attachments and their descriptions",
           )
       )

       duration_rating_extractor: ExtractorConfig = field(
           default_factory=lambda: ExtractorConfig(
               name="Duration Rating",
               description="Rating of contract duration adequacy",
               model_name="o3-mini",  # Using a more capable model for judgment
           )
       )


   # LLM client setup
   def get_client(api_key=None):
       """Get an OpenAI client with instructor integrated"""
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
       client = OpenAI(api_key=api_key)
       return instructor.from_openai(client)


   async def get_async_client(api_key=None):
       """Get an AsyncOpenAI client with instructor integrated"""
       api_key = api_key or os.environ.get("CONTEXTGEM_OPENAI_API_KEY", "")
       client = AsyncOpenAI(api_key=api_key)
       return instructor.from_openai(client)


   # Helper function to execute completions with token tracking
   async def execute_with_tracking(model, messages, response_model, cost_tracker=None):
       """
       Execute a completion request with token tracking.
       """
       # Create the Instructor client
       client = await get_async_client()

       # Make a single API call with Instructor
       response = await client.chat.completions.create(
           model=model, response_model=response_model, messages=messages
       )

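       # Access the raw response to get token usage. "_raw_response" is a
       # private attribute set by Instructor; if it is missing, usage is
       # silently left untracked.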
       if cost_tracker and hasattr(response, "_raw_response"):
           raw_response = response._raw_response
           if hasattr(raw_response, "usage"):
               prompt_tokens = raw_response.usage.prompt_tokens
               completion_tokens = raw_response.usage.completion_tokens
               cost_tracker.track_usage(model, prompt_tokens, completion_tokens)

       return response


   def execute_sync(model, messages, response_model):
       """Execute a completion request synchronously (token usage is not tracked here)"""
       client = get_client()
       return client.chat.completions.create(
           model=model, response_model=response_model, messages=messages
       )


   # Unified extraction functions
   def extract_aspect(
       document_text,
       aspect_name,
       aspect_description,
       model_name="gpt-4o-mini",
       is_async=False,
       cost_tracker=None,
   ):
       """Extract text related to a specific aspect"""

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert document analyzer. Extract the text related to the following aspect from the document.
       
       Document:
       {document_text}
       
       Aspect: {aspect_name}
       Description: {aspect_description}
       
       Extract all text related to this aspect.
       """
       )  # this does not provide granular structured content, such as specific paragraphs and sentences

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, AspectExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, AspectExtraction)


   def extract_parties(
       aspect_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract party information"""

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert document analyzer. Extract all party information from the following contract text.
       
       Contract text:
       {aspect_text}
       
       For each party, extract their name and role in the agreement.
       """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, PartyExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, PartyExtraction)


   def extract_terms(
       aspect_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract term information"""

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert document analyzer. Extract term information from the following contract text.
       
       Contract text:
       {aspect_text}
       
       Extract the contract term duration in years. Include the relevant reference text.
       """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(model_name, messages, TermExtraction, cost_tracker)
       else:
           return execute_sync(model_name, messages, TermExtraction)


   def extract_contract_info(
       document_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract basic contract information"""

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert document analyzer. Extract the following information from the contract document.
       
       Contract document:
       {document_text}
       
       Extract the contract type, effective date if mentioned, and governing law if specified.
       """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(model_name, messages, ContractInfo, cost_tracker)
       else:
           return execute_sync(model_name, messages, ContractInfo)


   def extract_attachments(
       document_text, model_name="gpt-4o-mini", is_async=False, cost_tracker=None
   ):
       """Extract attachment information"""

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert document analyzer. Extract information about all attachments, annexes, 
       schedules, or appendices mentioned in the contract.
       
       Contract document:
       {document_text}
       
       For each attachment, extract:
       1. The title/name of the attachment (e.g., "Appendix A", "Schedule 1", "Annex 2")
       2. A brief description of what the attachment contains (if mentioned in the document)
       """
       )

       messages = [
           {"role": "system", "content": "You are an expert document analyzer."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, AttachmentExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, AttachmentExtraction)


   def extract_duration_rating(
       document_text, model_name="o3-mini", is_async=False, cost_tracker=None
   ):
       """Rate contract duration adequacy"""

       # Prompt must be manually drafted
       # This is a basic example, which is shortened for brevity. The prompt should be improved for better accuracy.
       prompt = dedent(
           f"""
       You are an expert contract analyst. Evaluate the adequacy of the contract duration 
       considering the subject matter and best practices.
       
       Contract document:
       {document_text}
       
       Rate the duration adequacy on a scale of 1-10, where:
       1 = Extremely inadequate duration
       10 = Perfectly adequate duration
       
       Provide a brief justification for your rating (2-3 sentences).
       """
       )

       messages = [
           {"role": "system", "content": "You are an expert contract analyst."},
           {"role": "user", "content": prompt},
       ]

       if is_async:
           return execute_with_tracking(
               model_name, messages, DurationRatingExtraction, cost_tracker
           )
       else:
           return execute_sync(model_name, messages, DurationRatingExtraction)


   # Cost tracking
   class CostTracker:
       """Track LLM costs across multiple extractions"""

       def __init__(self):
           self.costs = {
               "gpt-4o-mini": {
                   "input_per_1m": 0.15,
                   "output_per_1m": 0.60,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
               "o3-mini": {
                   "input_per_1m": 1.10,
                   "output_per_1m": 4.40,
                   "input_tokens": 0,
                   "output_tokens": 0,
               },
           }
           self.total_cost = 0.0

       def track_usage(self, model_name, input_tokens, output_tokens):
           """Track token usage for a model"""
           # Extract base model name (drop an optional "provider/" prefix)
           base_model = model_name.split("/")[-1]

           if base_model in self.costs:
               self.costs[base_model]["input_tokens"] += input_tokens
               self.costs[base_model]["output_tokens"] += output_tokens

               # Calculate costs separately for input and output tokens
               input_cost = input_tokens * (
                   self.costs[base_model]["input_per_1m"] / 1000000
               )
               output_cost = output_tokens * (
                   self.costs[base_model]["output_per_1m"] / 1000000
               )

               self.total_cost += input_cost + output_cost

       def get_costs(self):
           """Get cost summary"""
           model_costs = {}
           for model, data in self.costs.items():
               if data["input_tokens"] > 0 or data["output_tokens"] > 0:
                   input_cost = data["input_tokens"] * (data["input_per_1m"] / 1000000)
                   output_cost = data["output_tokens"] * (data["output_per_1m"] / 1000000)
                   model_costs[model] = {
                       "input_cost": input_cost,
                       "output_cost": output_cost,
                       "total_cost": input_cost + output_cost,
                       "input_tokens": data["input_tokens"],
                       "output_tokens": data["output_tokens"],
                   }

           return {
               "model_costs": model_costs,
               "total_cost": self.total_cost,
           }


   # Document processing functions
   async def process_document_async(
       document_text, config=None, cost_tracker=None, use_concurrency=True
   ):
       """Process a document asynchronously and track costs"""
       if config is None:
           config = PipelineConfig()

       results = {}

       # Define processing functions
       async def process_party_pipeline():
           # Extract party aspect
           party_aspect = await extract_aspect(
               document_text,
               config.party_extractor.name,
               config.party_extractor.description,
               model_name=config.party_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

           # Extract parties from the aspect
           parties = await extract_parties(
               party_aspect.aspect_text,
               model_name=config.party_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

           return parties

       async def process_term_pipeline():
           # Extract term aspect
           term_aspect = await extract_aspect(
               document_text,
               config.term_extractor.name,
               config.term_extractor.description,
               model_name=config.term_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

           # Extract terms from the aspect
           terms = await extract_terms(
               term_aspect.aspect_text,
               model_name=config.term_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

           return terms

       async def process_contract_info():
           return await extract_contract_info(
               document_text,
               model_name=config.contract_info_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

       async def process_attachments():
           return await extract_attachments(
               document_text,
               model_name=config.attachment_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

       async def process_duration_rating():
           return await extract_duration_rating(
               document_text,
               model_name=config.duration_rating_extractor.model_name,
               is_async=True,
               cost_tracker=cost_tracker,
           )

       # Run extractions based on concurrency preference
       if use_concurrency:
           # Process all extractions concurrently for maximum speed
           (
               parties,
               terms,
               contract_info,
               attachments,
               duration_rating,
           ) = await asyncio.gather(
               process_party_pipeline(),
               process_term_pipeline(),
               process_contract_info(),
               process_attachments(),
               process_duration_rating(),
           )
       else:
           # Process extractions sequentially
           parties = await process_party_pipeline()
           terms = await process_term_pipeline()
           contract_info = await process_contract_info()
           attachments = await process_attachments()
           duration_rating = await process_duration_rating()

       # Structure results in the same format as the LangChain implementation
       results["contract_type"] = contract_info.contract_type
       results["governing_law"] = contract_info.governing_law
       results["effective_date"] = contract_info.effective_date
       results["parties"] = parties.parties
       results["term_years"] = terms.terms[0].duration_years if terms.terms else None
       results["term_reference"] = terms.terms[0].reference if terms.terms else None
       results["attachments"] = attachments.attachments
       results["duration_rating"] = duration_rating.rating

       return results


   def process_document(
       document_text, config=None, cost_tracker=None, use_concurrency=True
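   ):
       """
       Process a document and track costs.
       Synchronous wrapper that reuses the current event loop instead of
       creating a new one with asyncio.run(); nest_asyncio.apply() above
       lets this run inside an already running loop (e.g. Jupyter).
       """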
       # Get the current event loop
       loop = asyncio.get_event_loop()
       # Run the async function in the current event loop
       return loop.run_until_complete(
           process_document_async(document_text, config, cost_tracker, use_concurrency)
       )


   # Example usage
   # Sample contract texts (shortened for brevity)
   doc1_text = (
       "Consultancy Agreement\n"
       "This agreement between Company A (Supplier) and Company B (Customer)...\n"
       "The term of the agreement is 1 year from the Effective Date...\n"
       "The Supplier shall provide consultancy services as described in Annex 2...\n"
       "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
       "All intellectual property created during the provision of services shall belong to the Customer...\n"
       "This agreement is governed by the laws of Norway...\n"
       "Annex 1: Data processing agreement...\n"
       "Annex 2: Statement of Work...\n"
       "Annex 3: Service Level Agreement...\n"
   )

   doc2_text = (
       "Service Level Agreement\n"
       "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
       "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
       "The Provider shall deliver IT support services as outlined in Schedule A...\n"
       "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
       "The Provider guarantees [99.9%] uptime for all critical systems...\n"
       "Either party may terminate with 60 days written notice...\n"
       "This agreement is governed by the laws of California...\n"
       "Schedule A: Service Descriptions...\n"
       "Schedule B: Response Time Requirements...\n"
   )


   # Function to pretty-print document results
   def print_document_results(doc_name, results):
       print(f"\nResults from {doc_name}:")
       print(f"Contract Type: {results['contract_type']}")
       print(f"Parties: {[f'{p.name} ({p.role})' for p in results['parties']]}")
       print(f"Term: {results['term_years']} years")
       print(
           f"Term Reference: {results['term_reference'] if results['term_reference'] else 'Not specified'}"
       )
       print(f"Governing Law: {results['governing_law']}")
       print(f"Attachments: {[(a.title, a.description) for a in results['attachments']]}")
       print(f"Duration Rating: {results['duration_rating'].score}/10")
       print(f"Rating Justification: {results['duration_rating'].justification}")


   # Create cost tracker
   cost_tracker = CostTracker()

   # Create pipeline with default configuration
   config = PipelineConfig()

   # Process documents
   print("Processing document 1 with concurrency...")
   doc1_results = process_document(doc1_text, config, cost_tracker, use_concurrency=True)

   print("Processing document 2 with concurrency...")
   doc2_results = process_document(doc2_text, config, cost_tracker, use_concurrency=True)

   # Print results
   print_document_results("Document 1 (Consultancy Agreement)", doc1_results)
   print_document_results("Document 2 (Service Level Agreement)", doc2_results)

   # Print cost information
   print("\nProcessing costs:")
   costs = cost_tracker.get_costs()
   for model, model_data in costs["model_costs"].items():
       print(f"\n{model}:")
       print(f"  Input cost: ${model_data['input_cost']:.4f}")
       print(f"  Output cost: ${model_data['output_cost']:.4f}")
       print(f"  Total cost: ${model_data['total_cost']:.4f}")
   print(f"\nTotal across all models: ${costs['total_cost']:.4f}")
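
The cost arithmetic used by the tracker above can be checked in isolation. The following is a minimal sketch, assuming the same per-1M-token prices that are hard-coded in the example (illustrative values that may not reflect current provider pricing); "estimate_cost" is a hypothetical helper name.

```python
# Illustrative prices copied from the example above (USD per 1M tokens).
PRICES_PER_1M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "o3-mini": {"input": 1.10, "output": 4.40},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one call (hypothetical helper)."""
    prices = PRICES_PER_1M[model]
    input_cost = input_tokens * prices["input"] / 1_000_000
    output_cost = output_tokens * prices["output"] / 1_000_000
    return input_cost + output_cost


cost = estimate_cost("gpt-4o-mini", input_tokens=2_000, output_tokens=500)
print(f"${cost:.6f}")
```

Input and output tokens are priced separately, so a call with few prompt tokens but a long completion can still be dominated by the output rate.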


# ==== how_it_works ====

How it works
************


📏 Leveraging LLM Context Windows
=================================

ContextGem leverages LLMs' long context windows to deliver superior
extraction accuracy. Unlike RAG approaches that often struggle with
complex concepts and nuanced insights, ContextGem is betting on the
continuously expanding context capacity, evolving capabilities of
modern LLMs, and constantly decreasing LLM costs. This approach
enables direct information extraction from full documents, eliminating
retrieval inconsistencies and capturing the complete context necessary
for accurate understanding.


🧩 Core Components
==================

ContextGem's main elements are the Document, Aspect, and Concept
models:


📄 **Document**
---------------

The "Document" model contains text and/or visual content representing
a specific document. Documents can vary in type and purpose, including
but not limited to:
   * *Contracts*: legal agreements defining terms and obligations.

   * *Invoices*: financial documents detailing transactions and
     payments.

   * *Curricula Vitae (CVs)*: resumes outlining an individual's
     professional experience and qualifications.

   * *General documents*: any other types of documents that may
     contain text or images.


🔍 **Aspect**
-------------

The "Aspect" model contains text representing a defined area or topic
within a document (or another aspect) that requires focused attention.
Each aspect reflects a specific subject or theme. For example:
   * *Contract aspects*: payment terms, parties involved, or
     termination clauses.

   * *Invoice aspects*: due dates, line-item breakdowns, or tax
     details.

   * *CV aspects*: work experience, education, or skills.

Aspects may have sub-aspects, for more granular extraction with nested
context. This hierarchical structure allows for progressive refinement
of focus areas, enabling precise extraction of information from
complex documents while maintaining the contextual relationships
between different levels of content.
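
To make the nesting concrete, here is a minimal, library-independent sketch of an aspect tree. The "AspectNode" class and "walk" helper are hypothetical illustrations, not ContextGem's API; they only show how each level sits inside the context of the level above it.

```python
from dataclasses import dataclass, field


@dataclass
class AspectNode:
    """Hypothetical stand-in for an aspect; not the ContextGem class."""

    name: str
    sub_aspects: list["AspectNode"] = field(default_factory=list)


def walk(node: AspectNode, path: tuple[str, ...] = ()):
    """Yield the full path of every aspect, parents before children."""
    current = path + (node.name,)
    yield current
    for child in node.sub_aspects:
        yield from walk(child, current)


ip_rights = AspectNode(
    "Intellectual Property Rights",
    sub_aspects=[AspectNode("Patent Indemnification")],
)

for aspect_path in walk(ip_rights):
    print(" > ".join(aspect_path))
```

Keeping the full path for each node is one simple way to preserve the relationship between a sub-aspect and its parent when results are reported.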


💡 **Concept**
--------------

The "Concept" model contains a unit of information or an entity
derived from an aspect or the broader document context. Concepts
represent a wide range of data points and insights, from simple
entities (names, dates, monetary values) to complex evaluations,
conclusions, and answers to specific questions. Concepts can be:
   * *Factual extractions*: such as a penalty clause in a contract, a
     total amount due in an invoice, or a certification in a CV.

   * *Analytical insights*: such as risk assessments, compliance
     evaluations, or pattern identifications.

   * *Reasoned conclusions*: such as determining whether a document
     meets specific criteria or answers particular questions.

   * *Interpretative judgments*: such as ratings, classifications, or
     qualitative assessments based on document content.

Concepts may be attached to an aspect or a document. The context for
the concept extraction will be the aspect or document, respectively.
This flexible attachment allows for both targeted extraction from
specific document sections and broader analysis across the entire
document content. When attached to aspects, concepts benefit from the
focused context, enabling more precise extraction of domain-specific
information. When attached to documents, concepts can leverage the
complete context to identify patterns, anomalies, or insights that
span multiple sections.
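
The attachment rule can be illustrated with a small, library-independent sketch. The "Toy*" classes and the "extraction_contexts" helper are hypothetical and are not ContextGem's classes; the sketch only shows which text serves as context for a concept depending on where the concept is attached.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConcept:
    name: str


@dataclass
class ToyAspect:
    name: str
    text: str
    concepts: list = field(default_factory=list)


@dataclass
class ToyDocument:
    text: str
    aspects: list = field(default_factory=list)
    concepts: list = field(default_factory=list)


def extraction_contexts(doc: ToyDocument) -> dict:
    """Map each concept name to the text it would be extracted from."""
    # Document-level concepts see the whole document.
    contexts = {c.name: doc.text for c in doc.concepts}
    # Aspect-level concepts see only the text of their aspect.
    for aspect in doc.aspects:
        for concept in aspect.concepts:
            contexts[concept.name] = aspect.text
    return contexts


doc = ToyDocument(
    text="Full contract text ...",
    aspects=[
        ToyAspect(
            "Term",
            "The term of the agreement is 1 year ...",
            concepts=[ToyConcept("Term in years")],
        )
    ],
    concepts=[ToyConcept("Contract type")],
)
contexts = extraction_contexts(doc)
print(contexts)
```

In this sketch the aspect-level concept receives a much shorter context than the document-level one, which is the trade-off between focus and completeness described above.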

Multiple concept types are supported: "StringConcept",
"BooleanConcept", "NumericalConcept", "DateConcept",
"JsonObjectConcept", "RatingConcept", and "LabelConcept".


Component Examples
^^^^^^^^^^^^^^^^^^

+-----------------+----------------------+----------------------+----------------------+-----------------------+
|                 | Document             | Aspect               | Sub-aspect           | Concept               |
|=================|======================|======================|======================|=======================|
| **Legal**       | *Software License    | Intellectual         | Patent               | Indemnification       |
|                 | Agreement*           | Property Rights      | Indemnification      | Coverage Scope        |
|                 |                      |                      |                      | ("JsonObjectConcept") |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **Financial**   | *Quarterly Earnings  | Revenue Analysis     | Regional Performance | Year-over-Year        |
|                 | Report*              |                      |                      | Growth Rate           |
|                 |                      |                      |                      | ("NumericalConcept")  |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **Healthcare**  | *Medical Research    | Methodology          | Patient Selection    | Inclusion/Exclusion   |
|                 | Paper*               |                      | Criteria             | Validity              |
|                 |                      |                      |                      | ("BooleanConcept")    |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **Technical**   | *System Architecture | Security Framework   | Authentication       | Implementation Risk   |
|                 | Document*            |                      | Protocols            | Rating                |
|                 |                      |                      |                      | ("RatingConcept")     |
+-----------------+----------------------+----------------------+----------------------+-----------------------+
| **HR**          | *Employee Handbook*  | Leave Policy         | Parental Leave       | Eligibility Start     |
|                 |                      |                      | Benefits             | Date ("DateConcept")  |
+-----------------+----------------------+----------------------+----------------------+-----------------------+


🔄 Extraction Workflow
======================

ContextGem uses the following models to extract information from
documents:


🤖 **DocumentLLM**
------------------

**A single configurable LLM with a specific role to extract specific
information from the document.**

The "role" of an LLM is an abstraction used to assign various LLMs
tasks of different complexity. For example, if an aspect/concept is
assigned "llm_role="extractor_text"", this aspect/concept is extracted
from the document using the LLM with "role="extractor_text"". This
helps to channel different tasks to different LLMs, ensuring that the
task is handled by the most appropriate model. Usually, domain
expertise is required to determine the most appropriate role for a
specific aspect/concept. But for simple use cases, when working with
text-only documents and a single LLM, you can skip the role assignment
completely, in which case the role will default to ""extractor_text"".

An LLM can have a configurable fallback LLM with the same role.

See "DocumentLLM" and 🏷️ LLM Roles for more details.
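
The fallback behaviour described above can be pictured with a small,
library-independent sketch. It is not ContextGem's implementation:
"SketchLLM", its "call" field, and its "run" method are hypothetical
names used only for illustration, and the "models" are plain Python
callables rather than real LLM backends.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchLLM:
    """Hypothetical stand-in for an LLM with a role and an optional fallback."""

    name: str
    role: str
    call: Callable[[str], str]  # stand-in for a real model call
    fallback: Optional["SketchLLM"] = None

    def __post_init__(self) -> None:
        # Mirror the rule stated above: a fallback shares the main LLM's role.
        if self.fallback is not None and self.fallback.role != self.role:
            raise ValueError("fallback LLM must have the same role as the main LLM")

    def run(self, prompt: str) -> str:
        """Try the main model; on failure, hand the prompt to the fallback."""
        try:
            return self.call(prompt)
        except Exception:
            if self.fallback is None:
                raise
            return self.fallback.run(prompt)


def failing_model(prompt: str) -> str:
    raise RuntimeError("simulated provider outage")


def backup_model(prompt: str) -> str:
    return f"backup answer to: {prompt}"


backup = SketchLLM(name="backup", role="extractor_text", call=backup_model)
main = SketchLLM(
    name="main", role="extractor_text", call=failing_model, fallback=backup
)
result = main.run("Extract the governing law clause")
print(result)
```

In this sketch the caller only ever talks to the main object; whether
the answer came from the main model or its fallback is an internal
detail.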


🤖🤖 **DocumentLLMGroup**
-------------------------

**A group of LLMs with different unique roles to extract different
information from the document.**

For more complex and granular extraction workflows, an LLM group can
be used to extract different information from the same document using
different LLMs with different roles. For example, a simpler LLM (e.g.
gpt-4o-mini) can extract specific aspects of the document, while a
more powerful LLM (e.g. o3-mini) handles the extraction of complex
concepts that require reasoning over the aspects' context.

Each LLM can have its own backend and configuration, and one fallback
LLM with the same role.

See "DocumentLLMGroup" and 🏷️ LLM Roles for more details.


LLM Group Workflow Example
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-----------------+----------------------+----------------------+----------------------+
|                 | LLM 1                | LLM 2                | LLM 3                |
|                 | ("extractor_text")   | ("reasoner_text")    | ("extractor_vision") |
|=================|======================|======================|======================|
| *Model*         | gpt-4.1-mini         | o4-mini              | gpt-4.1-mini         |
+-----------------+----------------------+----------------------+----------------------+
| *Task*          | Extract payment      | Detect anomalies in  | Extract invoice      |
|                 | terms from a         | the payment terms    | amounts              |
|                 | contract             |                      |                      |
+-----------------+----------------------+----------------------+----------------------+
| *Fallback LLM*  | gpt-4o-mini          | o3-mini              | gpt-4o-mini          |
| (optional)      |                      |                      |                      |
+-----------------+----------------------+----------------------+----------------------+
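
The routing shown in the table can be sketched, independently of the
library, as a lookup from role to model. This is only a conceptual
model of the idea and not how ContextGem is implemented;
"build_role_map", "route", and the task dictionaries are hypothetical.

```python
from __future__ import annotations

# Roles and model names are taken from the table above.
GROUP = [
    {"role": "extractor_text", "model": "gpt-4.1-mini"},
    {"role": "reasoner_text", "model": "o4-mini"},
    {"role": "extractor_vision", "model": "gpt-4.1-mini"},
]

# Hypothetical task records, each tagged with the role that should handle it.
TASKS = [
    {"name": "Payment terms", "llm_role": "extractor_text"},
    {"name": "Payment term anomalies", "llm_role": "reasoner_text"},
    {"name": "Invoice amounts", "llm_role": "extractor_vision"},
]


def build_role_map(group: list[dict[str, str]]) -> dict[str, str]:
    """Map each role to its model, rejecting duplicate roles in the group."""
    role_map: dict[str, str] = {}
    for llm in group:
        if llm["role"] in role_map:
            raise ValueError(f"duplicate role in group: {llm['role']}")
        role_map[llm["role"]] = llm["model"]
    return role_map


def route(tasks: list[dict[str, str]], role_map: dict[str, str]) -> dict[str, str]:
    """Pair every task with the model whose role matches the task's llm_role."""
    return {task["name"]: role_map[task["llm_role"]] for task in tasks}


assignments = route(TASKS, build_role_map(GROUP))
for task_name, model in assignments.items():
    print(f"{task_name} -> {model}")
```

Rejecting duplicate roles keeps the lookup unambiguous, which is the
reason each LLM in a group has a unique role.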

[image: ContextGem - How it works infographics]


ℹ️ What ContextGem Doesn't Offer (Yet)
======================================

While ContextGem excels at structured data extraction from individual
documents, it's important to understand its intentional design
boundaries:


**Not a RAG framework**
-----------------------

ContextGem focuses on in-depth single-document analysis, leveraging
long context windows of LLMs for maximum accuracy and precision. It
does not offer RAG capabilities for cross-document querying or corpus-
wide information retrieval. For these use cases, modern RAG frameworks
such as LlamaIndex remain more appropriate.


**Not an agent framework**
--------------------------

ContextGem is not designed as an agent framework. It now supports tool
calling in chat (function-calling) with optional parallel tool calls
and JSON schema validation of tool arguments, which enables building
lightweight, task-oriented agents using your own control loop together
with "ChatSession". For full agent orchestration (planning/critique,
goal decomposition, long-term memory, schedulers, multi-agent
coordination), we recommend frameworks specifically designed for this
purpose. ContextGem integrates cleanly as a high-accuracy document
extraction tool in larger agent systems thanks to its simple API and
structured outputs.


# ==== installation ====

Installation
************


🔧 Prerequisites
================

Before installing ContextGem, ensure you have:

* Python 3.10-3.13

* pip (Python package installer)


📦 Installation Methods
=======================


From PyPI
---------

The simplest way to install ContextGem is via pip:

   pip install -U contextgem

Or using uv (faster alternative):

   uv add contextgem


Development Installation
------------------------

For development, clone the repository and use uv:

   git clone https://github.com/shcherbak-ai/contextgem.git
   cd contextgem

   # Install uv if you don't have it
   pip install uv

   # Install dependencies including development extras
   uv sync --all-groups


✅ Verifying Installation
=========================

To verify that ContextGem is installed correctly, run:

   python -c "import contextgem; print(contextgem.__version__)"
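
If you prefer a check that reports a missing package instead of
failing with an import error, a sketch like the one below uses only
the standard library. "python_supported" and "installed_version" are
hypothetical helper names, and the version range is the one listed
under Prerequisites.

```python
from __future__ import annotations

import sys
from importlib import metadata
from typing import Optional


def python_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) falls within Python 3.10-3.13."""
    return (3, 10) <= version <= (3, 13)


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


current = (sys.version_info.major, sys.version_info.minor)
print("Python supported:", python_supported(current))
print("contextgem version:", installed_version("contextgem") or "not installed")
```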


# ==== quickstart ====

Quickstart examples
*******************

This guide will help you get started with ContextGem by walking
through basic extraction examples.

Below are complete, self-contained examples showing how to extract
data from a document using ContextGem.


🔄 Extraction Process
=====================

ContextGem follows a simple extraction process:

1. Create a "Document" instance with your content

2. Define "Aspect" instances for sections of interest

3. Define concept instances ("StringConcept", "BooleanConcept",
   "NumericalConcept", "DateConcept", "JsonObjectConcept",
   "RatingConcept") for specific data points to extract, and attach
   them to "Aspect" (for aspect context) or "Document" (for document
   context).

4. Use "DocumentLLM" or "DocumentLLMGroup" to perform the extraction

5. Access the extracted data in the document object


📋 Aspect Extraction from Document
==================================

Tip:

  Aspect extraction is useful for identifying and extracting specific
  sections or topics from documents. Common use cases include:

  * Extracting specific clauses from legal contracts

  * Identifying specific sections from financial reports

  * Isolating relevant topics from research papers

  * Extracting product features from technical documentation

   # Quick Start Example - Extracting aspect from a document

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Example document instance
   # Document content is shortened for brevity
   doc = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "This agreement is governed by the laws of Norway...\n"
       ),
   )

   # Define an aspect with optional concept(s), using natural language
   doc_aspect = Aspect(
       name="Governing law",
       description="Clauses defining the governing law of the agreement",
       reference_depth="sentences",
   )

   # Add aspects to the document
   doc.add_aspects([doc_aspect])
   # (add more aspects to the document, if needed)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
   )

   # Extract information from the document
   extracted_aspects = llm.extract_aspects_from_document(doc)
   # or use the async version: await llm.extract_aspects_from_document_async(doc)

   # Access extracted information
   print("Governing law aspect:")
   print(
       extracted_aspects[0].extracted_items
   )  # extracted aspect items with references to sentences
   # or doc.get_aspect_by_name("Governing law").extracted_items


🌳 Extracting Aspect with Sub-Aspects
=====================================

Tip:

  Sub-aspect extraction helps organize complex topics into logical
  components. Common use cases include:

  * Breaking down termination clauses in employment contracts into
    company rights, employee rights, and severance terms

  * Dividing financial report sections into revenue streams, expenses,
    and forecasts

  * Organizing product specifications into technical details,
    compatibility, and maintenance requirements

   # Quick Start Example - Extracting an aspect with sub-aspects

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Sample document (content shortened for brevity)
   contract_text = """
   EMPLOYMENT AGREEMENT
   ...
   8. TERMINATION
   8.1 Termination by the Company. The Company may terminate the Employee's employment for Cause at any time upon written notice. 
   "Cause" shall mean: (i) Employee's material breach of this Agreement; (ii) Employee's conviction of a felony; or 
   (iii) Employee's willful misconduct that causes material harm to the Company.
   8.2 Termination by the Employee. The Employee may terminate employment for Good Reason upon 30 days' written notice to the Company. 
   "Good Reason" shall mean a material reduction in Employee's base salary or a material diminution in Employee's duties.
   8.3 Severance. If the Employee's employment is terminated by the Company without Cause or by the Employee for Good Reason, 
   the Employee shall be entitled to receive severance pay equal to six (6) months of the Employee's base salary.
   ...
   """

   doc = Document(raw_text=contract_text)

   # Define termination aspect with practical sub-aspects
   termination_aspect = Aspect(
       name="Termination",
       description="Provisions related to the termination of employment",
       aspects=[  # assign sub-aspects (optional)
           Aspect(
               name="Company Termination Rights",
               description="Conditions under which the company can terminate employment",
           ),
           Aspect(
               name="Employee Termination Rights",
               description="Conditions under which the employee can terminate employment",
           ),
           Aspect(
               name="Severance Terms",
               description="Compensation or benefits provided upon termination",
           ),
       ],
   )

   # Add the aspect to the document. Sub-aspects are added with the parent aspect.
   doc.add_aspects([termination_aspect])
   # (add more aspects to the document, if needed)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key of the LLM provider
   )

   # Extract all information from the document
   doc = llm.extract_all(doc)

   # Get results with references in the document object
   print("\nTermination aspect:\n")
   termination_aspect = doc.get_aspect_by_name("Termination")
   for sub_aspect in termination_aspect.aspects:
       print(sub_aspect.name)
       for item in sub_aspect.extracted_items:
           print(item.value)
       print("\n")
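
To list a parent aspect together with its sub-aspects, the nested loop
above can be written as a recursive walk. The sketch below relies only
on the "name" and "aspects" attributes used in the example;
"FakeAspect" is a hypothetical stand-in so that the snippet runs
without an LLM call, and "walk_aspects" is not part of ContextGem.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class FakeAspect:
    """Hypothetical stand-in exposing the same `name` and `aspects` attributes."""

    name: str
    aspects: list["FakeAspect"] = field(default_factory=list)


def walk_aspects(aspect, depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, name) for an aspect and, recursively, its sub-aspects."""
    yield depth, aspect.name
    for sub_aspect in aspect.aspects:
        yield from walk_aspects(sub_aspect, depth + 1)


termination = FakeAspect(
    name="Termination",
    aspects=[
        FakeAspect(name="Company Termination Rights"),
        FakeAspect(name="Employee Termination Rights"),
        FakeAspect(name="Severance Terms"),
    ],
)

# Indent each name by its depth in the hierarchy.
for depth, name in walk_aspects(termination):
    print("  " * depth + name)
```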


🔍 Concept Extraction from Aspect
=================================

Tip:

  Concept extraction from aspects helps identify specific data points
  within already extracted sections or topics. Common use cases
  include:

  * Extracting payment amounts from a contract's payment terms

  * Extracting liability cap from a contract's liability section

  * Isolating timelines from delivery terms

  * Extracting a list of features from a product description

  * Identifying programming languages from a CV's experience section

   # Quick Start Example - Extracting a concept from an aspect

   import os

   from contextgem import Aspect, Document, DocumentLLM, StringConcept, StringExample


   # Example document instance
   # Document content is shortened for brevity
   doc = Document(
       raw_text=(
           "Employment Agreement\n"
           "This agreement between TechCorp Inc. (Employer) and Jane Smith (Employee)...\n"
           "The employment shall commence on January 15, 2023 and continue until terminated...\n"
           "The Employee shall work as a Senior Software Engineer reporting to the CTO...\n"
           "The Employee shall receive an annual salary of $120,000 paid monthly...\n"
           "The Employee is entitled to 20 days of paid vacation per year...\n"
           "The Employee agrees to a notice period of 30 days for resignation...\n"
           "This agreement is governed by the laws of California...\n"
       ),
   )

   # Define an aspect with a specific concept, using natural language
   doc_aspect = Aspect(
       name="Compensation",
       description="Clauses defining the compensation and benefits for the employee",
       reference_depth="sentences",
   )

   # Define a concept within the aspect
   aspect_concept = StringConcept(
       name="Annual Salary",
       description="The annual base salary amount specified in the employment agreement",
       examples=[  # optional
           StringExample(
               content="$X per year",  # guidance regarding format
           )
       ],
       add_references=True,
       reference_depth="sentences",
   )

   # Add the concept to the aspect
   doc_aspect.add_concepts([aspect_concept])
   # (add more concepts to the aspect, if needed)

   # Add the aspect to the document
   doc.add_aspects([doc_aspect])
   # (add more aspects to the document, if needed)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
   )

   # Extract information from the document
   doc = llm.extract_all(doc)
   # or use the async version: await llm.extract_all_async(doc)

   # Access extracted information in the document object
   print("Compensation aspect:")
   print(
       doc.get_aspect_by_name("Compensation").extracted_items
   )  # extracted aspect items with references to sentences
   print("Annual Salary concept:")
   print(
       doc.get_aspect_by_name("Compensation")
       .get_concept_by_name("Annual Salary")
       .extracted_items
   )  # extracted concept items with references to sentences


📝 Concept Extraction from Document (text)
==========================================

Tip:

  Concept extraction from text documents locates specific information
  directly from text. Common use cases include:

  * Extracting anomalies from entire legal documents

  * Identifying financial figures across multiple report sections

  * Extracting citations and references from academic papers

  * Identifying product specifications from technical manuals

  * Extracting contact information from business documents

   # Quick Start Example - Extracting a concept from a document

   import os

   from contextgem import Document, DocumentLLM, JsonObjectConcept


   # Example document instance
   # Document content is shortened for brevity
   doc = Document(
       raw_text=(
           "Statement of Work\n"
           "Project: Cloud Migration Initiative\n"
           "Client: Acme Corporation\n"
           "Contractor: TechSolutions Inc.\n\n"
           "Project Timeline:\n"
           "Start Date: March 1, 2025\n"
           "End Date: August 31, 2025\n\n"
           "Deliverables:\n"
           "1. Infrastructure assessment report (Due: March 15, 2025)\n"
           "2. Migration strategy document (Due: April 10, 2025)\n"
           "3. Test environment setup (Due: May 20, 2025)\n"
           "4. Production migration (Due: July 15, 2025)\n"
           "5. Post-migration support (Due: August 31, 2025)\n\n"
           "Budget: $250,000\n"
           "Payment Schedule: 20% upfront, 30% at midpoint, 50% upon completion\n"
       ),
   )

   # Define a document-level concept using e.g. JsonObjectConcept
   # This will extract structured data from the entire document
   doc_concept = JsonObjectConcept(
       name="Project Details",
       description="Key project information including timeline, deliverables, and budget",
       structure={
           "project_name": str,
           "client": str,
           "contractor": str,
           "budget": str,
           "payment_terms": str,
       },  # simply use a dictionary with type hints (including generic aliases and union types)
       add_references=True,
       reference_depth="paragraphs",
   )

   # Add the concept to the document
   doc.add_concepts([doc_concept])
   # (add more concepts to the document, if needed)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
   )

   # Extract information from the document
   extracted_concepts = llm.extract_concepts_from_document(doc)
   # or use the async version: await llm.extract_concepts_from_document_async(doc)

   # Access extracted information
   print("Project Details:")
   print(
       extracted_concepts[0].extracted_items
   )  # extracted concept items with references to paragraphs
   # Or doc.get_concept_by_name("Project Details").extracted_items
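
The "structure" dictionary maps each field name to the type expected
for its value. As a rough illustration of that contract, the
hypothetical helper "matches_structure" below checks a plain
dictionary against a flat structure of built-in types. It is not how
ContextGem validates extracted items, and it deliberately ignores
generic aliases and union types.

```python
from __future__ import annotations

from typing import Any

# Same flat structure as in the example above.
STRUCTURE: dict[str, type] = {
    "project_name": str,
    "client": str,
    "contractor": str,
    "budget": str,
    "payment_terms": str,
}


def matches_structure(value: dict[str, Any], structure: dict[str, type]) -> bool:
    """Return True if `value` has exactly the expected keys with matching types."""
    if set(value) != set(structure):
        return False
    return all(isinstance(value[key], expected) for key, expected in structure.items())


# Hand-written sample mirroring the Statement of Work text, not real LLM output.
sample = {
    "project_name": "Cloud Migration Initiative",
    "client": "Acme Corporation",
    "contractor": "TechSolutions Inc.",
    "budget": "$250,000",
    "payment_terms": "20% upfront, 30% at midpoint, 50% upon completion",
}

print(matches_structure(sample, STRUCTURE))
```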


🖼️ Concept Extraction from Document (vision)
============================================

Tip:

  Concept extraction using vision capabilities processes documents
  with complex layouts or images. Common use cases include:

  * Extracting data from scanned contracts or receipts

  * Identifying information from charts and graphs in reports

  * Identifying visual product features from marketing materials

   # Quick Start Example - Extracting concept from a document with an image

   import os
   from pathlib import Path

   from contextgem import Document, DocumentLLM, NumericalConcept, create_image


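   # Path to the sample invoice image in the repository's test assets
   # (adapted for doc tests); replace it with the path to your own image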
   current_file = Path(__file__).resolve()
   root_path = current_file.parents[4]
   image_path = root_path / "tests" / "images" / "invoices" / "invoice.jpg"

   # Create an image instance using the create_image utility
   doc_image = create_image(image_path)

   # Example document instance holding only the image
   doc = Document(
       images=[doc_image],  # may contain multiple images
   )

   # Define a concept to extract the invoice total amount
   doc_concept = NumericalConcept(
       name="Invoice Total",
       description="The total amount to be paid as shown on the invoice",
       numeric_type="float",
       llm_role="extractor_vision",  # use vision model
   )

   # Add concept to the document
   doc.add_concepts([doc_concept])
   # (add more concepts to the document, if needed)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",  # Using a model with vision capabilities
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
       role="extractor_vision",  # mark LLM as vision model
   )

   # Extract information from the document
   extracted_concepts = llm.extract_concepts_from_document(doc)
   # or use async version: await llm.extract_concepts_from_document_async(doc)

   # Access extracted information
   print("Invoice Total:")
   print(extracted_concepts[0].extracted_items)  # extracted concept items
   # or doc.get_concept_by_name("Invoice Total").extracted_items


💬 Lightweight LLM Chat Interface
=================================

Note:

  While ContextGem is primarily designed for advanced structured data
  extraction, it also provides a lightweight, unified interface for
  interacting with LLMs via natural language - across both text and
  vision - with built-in fallback support.

Tip:

  To preserve message history across turns, pass a "ChatSession"
  instance via "chat_session=..." to "DocumentLLM.chat(...)" (or
  ".chat_async(...)"). Without a session, each "chat(...)" call is
  treated as a one-off message → response interaction.

   # Using LLMs for chat (text + vision), with fallback LLM support

   import os

   from contextgem import DocumentLLM
   from contextgem.public import ChatSession


   # Initialize main LLM for chat
   main_model = DocumentLLM(
       model="openai/gpt-4o",  # or another provider/model
       api_key=os.getenv("CONTEXTGEM_OPENAI_API_KEY"),  # your API key for the LLM provider
       system_message="",  # disable default system message for chat, or provide your own
   )

   # Optional: configure fallback LLM for reliability
   fallback_model = DocumentLLM(
       model="openai/gpt-4o-mini",  # or another provider/model
       api_key=os.getenv("CONTEXTGEM_OPENAI_API_KEY"),  # your API key for the LLM provider
       is_fallback=True,
       system_message="",  # also disable default system message for fallback, or provide your own
   )
   main_model.fallback_llm = fallback_model


   # Preserve conversation history across turns with a ChatSession
   session = ChatSession()
   first_response = main_model.chat(
       "Hi there!",
       # images=[Image(...)]  # optional: add images for vision models
       chat_session=session,
   )
   second_response = main_model.chat(
       "And what is EBITDA?",
       chat_session=session,
   )
   # or use async: `response = await main_model.chat_async(...)`

   # Or send a chat message without a session (one-off message → response)
   one_off_response = main_model.chat("Test")
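
The difference between session-based and one-off calls can be pictured
with a toy model that simply reports how many user messages it can
see. "SketchSession" and "sketch_chat" are hypothetical stand-ins and
do not reflect how "ChatSession" is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchSession:
    """Hypothetical stand-in for a chat session: an ordered message list."""

    messages: list[dict[str, str]] = field(default_factory=list)


def sketch_chat(prompt: str, session: Optional[SketchSession] = None) -> str:
    """Toy 'model' that reports how many user messages are visible to it."""
    history = session.messages if session is not None else []
    visible = history + [{"role": "user", "content": prompt}]
    user_turns = sum(1 for message in visible if message["role"] == "user")
    reply = f"I can see {user_turns} user message(s)"
    if session is not None:
        # Persist both sides of the turn so that the next call sees them.
        session.messages.append({"role": "user", "content": prompt})
        session.messages.append({"role": "assistant", "content": reply})
    return reply


session = SketchSession()
first = sketch_chat("Hi there!", session)
second = sketch_chat("And what is EBITDA?", session)
one_off = sketch_chat("Test")  # no session: nothing is remembered

print(first)
print(second)
print(one_off)
```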


🛠️ Chat with Tools
------------------

Tip:

  Provide OpenAI-compatible tool schemas via "tools=[...]" and
  register Python handlers with "@register_tool". Tool support is only
  used in "chat(...)" and "chat_async(...)".

Note:

  Tool handlers must return a string. If you need to return structured
  data, serialize it (e.g., with "json.dumps") before returning.

   import os

   from contextgem import ChatSession, DocumentLLM, register_tool


   # Define tool handlers and register them
   @register_tool
   def compute_invoice_total(items: list[dict]) -> str:
       total = 0
       for it in items:
           qty = float(it.get("qty", 0))
           price = float(it.get("price", 0))
           total += qty * price
       return str(total)


   # OpenAI-compatible tool schema passed to the model
   tools = [
       {
           "type": "function",
           "function": {
               "name": "compute_invoice_total",
               "description": "Compute invoice total as sum(qty*price) over items",
               "parameters": {
                   "type": "object",
                   "properties": {
                       "items": {
                           "type": "array",
                           "items": {
                               "type": "object",
                               "properties": {
                                   "qty": {"type": "number"},
                                   "price": {"type": "number"},
                               },
                               "required": ["qty", "price"],
                           },
                           "minItems": 1,
                       }
                   },
                   "required": ["items"],
               },
           },
       },
   ]


   # Configure an LLM that supports tool use
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
       system_message="You are a helpful assistant.",  # override default system message for chat
       tools=tools,
   )

   # Maintain history across turns
   session = ChatSession()

   prompt = (
       "What's the invoice total for the items "
       "[{'qty':2.0,'price':3.5},{'qty':1.0,'price':3.0}]? "
       "Prices are in USD."
   )

   answer = llm.chat(prompt, chat_session=session)
   print("Answer:", answer)
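
To make the handler contract concrete, the sketch below shows, without
using ContextGem at all, what a minimal dispatch step for one tool
call could look like. The registry, "sketch_register_tool", and
"dispatch" are hypothetical; the tool-call dictionary is written by
hand in the OpenAI-style shape, where arguments arrive as a JSON
string.

```python
from __future__ import annotations

import json
from typing import Callable

# Hypothetical registry; ContextGem's own `register_tool` is not used here.
TOOL_REGISTRY: dict[str, Callable[..., str]] = {}


def sketch_register_tool(func: Callable[..., str]) -> Callable[..., str]:
    """Record a handler under its function name and return it unchanged."""
    TOOL_REGISTRY[func.__name__] = func
    return func


@sketch_register_tool
def compute_invoice_total(items: list[dict]) -> str:
    total = sum(float(it.get("qty", 0)) * float(it.get("price", 0)) for it in items)
    # Structured results are serialized, since handlers must return a string.
    return json.dumps({"total": total, "currency": "USD"})


def dispatch(tool_call: dict) -> str:
    """Look up the requested handler and call it with the decoded arguments."""
    function = tool_call["function"]
    handler = TOOL_REGISTRY.get(function["name"])
    if handler is None:
        raise KeyError(f"no handler registered for tool {function['name']!r}")
    arguments = json.loads(function["arguments"])  # arguments are a JSON string
    result = handler(**arguments)
    if not isinstance(result, str):
        raise TypeError("tool handlers must return a string")
    return result


# Hand-written tool call for illustration, not captured from a real model.
tool_call = {
    "type": "function",
    "function": {
        "name": "compute_invoice_total",
        "arguments": json.dumps(
            {"items": [{"qty": 2.0, "price": 3.5}, {"qty": 1.0, "price": 3.0}]}
        ),
    },
}

print(dispatch(tool_call))
```

Returning a JSON string keeps the handler within the string-only
contract while still letting the model read individual fields.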


# ==== documents/document_config ====

Creating Documents
******************

This guide explains how to create and configure "Document" instances
to process textual and visual content for analysis.

Documents serve as the container for the content from which
information (aspects and concepts) can be extracted.


⚙️ Configuration Parameters
===========================

The minimum configuration for a document requires either "raw_text",
"paragraphs", or "images":

Document creation

   from pathlib import Path

   from contextgem import Document, Paragraph, create_image


   # Create a document with raw text content
   contract_document = Document(
       raw_text=(
           "...This agreement is effective as of January 1, 2025.\n\n"
           "All parties must comply with the terms outlined herein. The terms include "
           "monthly reporting requirements and quarterly performance reviews.\n\n"
           "Failure to adhere to these terms may result in termination of the agreement. "
           "Additionally, any breach of confidentiality will be subject to penalties as "
           "described in this agreement.\n\n"
           "This agreement shall remain in force for a period of three (3) years unless "
           "otherwise terminated according to the provisions stated above..."
       ),
       paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
   )

   # Create a document with more advanced paragraph segmentation using a SaT model
   report_document = Document(
       raw_text=(
           "Executive Summary "
           "This report outlines our quarterly performance. "
           "Revenue increased by [15%] compared to the previous quarter.\n\n"
           "Customer satisfaction metrics show positive trends across all regions..."
       ),
       paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
       sat_model_id="sat-3l-sm",  # Specify which SaT model to use
   )

   # Create a document with predefined paragraphs, e.g. when you use a custom
   # paragraph segmentation tool
   document_from_paragraphs = Document(
       paragraphs=[
           Paragraph(raw_text="This is the first paragraph."),
           Paragraph(raw_text="This is the second paragraph with more content."),
           Paragraph(raw_text="Final paragraph concluding the document."),
           # ...
       ]
   )

   # Create document with images

   # Path is adapted for doc tests; replace it with the path to your own image
   current_file = Path(__file__).resolve()
   root_path = current_file.parents[4]
   image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

   # Create a document with only images (no text)
   image_document = Document(
       images=[
           create_image(image_path),  # contextgem.Image instance
           # ...
       ]
   )

   # Create a document with both text and images
   mixed_document = Document(
       raw_text="This document contains both text and visual elements.",
       images=[
           create_image(image_path),  # contextgem.Image instance
           # ...
       ],
   )


The "Document" class accepts the following parameters:

+---------------------------+-----------------+-----------------+-----------------------------------------------+
| Parameter                 | Type            | Default Value   | Description                                   |
|===========================|=================|=================|===============================================|
| "raw_text"                | "str | None"    | "None"          | The main text of the document as a single     |
|                           |                 |                 | string.                                       |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "paragraphs"              | "list[Paragrap  | "[]"            | List of "Paragraph" instances in consecutive  |
|                           | h]"             |                 | order as they appear in the document.         |
|                           |                 |                 | Normally auto-populated from "raw_text".      |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "images"                  | "list[Image]"   | "[]"            | List of "Image" instances attached to or      |
|                           |                 |                 | representing the document. Used for visual    |
|                           |                 |                 | content analysis.                             |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "aspects"                 | "list[Aspect]"  | "[]"            | List of "Aspect" instances associated with    |
|                           |                 |                 | the document for focused analysis. Must have  |
|                           |                 |                 | unique names and descriptions. See Aspect     |
|                           |                 |                 | Extraction for more details.                  |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "concepts"                | "list[_Concept  | "[]"            | List of "_Concept" instances associated with  |
|                           | ]"              |                 | the document for information extraction. Must |
|                           |                 |                 | have unique names and descriptions. See       |
|                           |                 |                 | supported concept types in Supported          |
|                           |                 |                 | Concepts.                                     |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "paragraph_segmentation_  | "Literal["newl  | ""newlines""    | Mode for paragraph segmentation. ""newlines"" |
| mode"                     | ines", "sat"]"  |                 | splits on newline characters, ""sat"" uses a  |
|                           |                 |                 | SaT (Segment Any Text) model for intelligent  |
|                           |                 |                 | segmentation.                                 |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "sat_model_id"            | "SaTModelId"    | ""sat-3l-sm""   | SaT model ID for paragraph/sentence           |
|                           |                 |                 | segmentation or a local path to a SaT model.  |
|                           |                 |                 | See wtpsplit models for available options.    |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
| "pre_segment_sentences"   | "bool"          | "False"         | Whether to pre-segment sentences during       |
|                           |                 |                 | Document initialization. When "False",        |
|                           |                 |                 | sentence segmentation is deferred until       |
|                           |                 |                 | sentences are actually needed, improving      |
|                           |                 |                 | initialization performance.                   |
+---------------------------+-----------------+-----------------+-----------------------------------------------+
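
To make the segmentation options above more concrete, here is a
minimal sketch of what newline-based paragraph splitting and deferred
sentence segmentation amount to. It is not ContextGem's
implementation: "LazyDocument" is a hypothetical stand-in class, and
the punctuation-based sentence splitter is only a naive placeholder
for a real segmentation model such as SaT.

```python
import re
from functools import cached_property


class LazyDocument:
    """Hypothetical stand-in for illustration; not ContextGem's Document."""

    def __init__(self, raw_text: str, pre_segment_sentences: bool = False):
        self.raw_text = raw_text
        # "newlines"-style splitting: each non-empty line is one paragraph.
        self.paragraphs = [
            line.strip() for line in raw_text.splitlines() if line.strip()
        ]
        if pre_segment_sentences:
            # Accessing the property forces segmentation at construction time.
            _ = self.sentences

    @cached_property
    def sentences(self) -> list[list[str]]:
        # Naive punctuation-based splitter, used only as a placeholder
        # for a real sentence segmentation model.
        return [
            [s for s in re.split(r"(?<=[.!?])\s+", p) if s]
            for p in self.paragraphs
        ]


doc = LazyDocument("Title\n\nFirst sentence. Second sentence.\nLast paragraph.")
print(doc.paragraphs)
# Sentences are computed on first access and cached afterwards.
print(doc.sentences[1])
```

With the default "pre_segment_sentences=False", the constructor in
this sketch only pays for the cheap paragraph split; the sentence pass
runs once, the first time ".sentences" is read.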


🔄 DOCX Document Conversion
===========================

ContextGem provides a built-in "DocxConverter" to easily transform
DOCX files into LLM-ready "Document" instances.

For detailed usage examples and configuration options, see DOCX
Converter.


🎯 Adding Aspects and Concepts for Extraction
=============================================

Before extracting information from a document with an LLM, you must
define and add **aspects** and **concepts** to your document instance.
These components serve as the foundation for targeted analysis and
structured information extraction.

**Aspects** define the text segments (sections, topics, themes) to be
extracted from the document. They can be combined with concepts for
comprehensive analysis.

**Concepts** define specific data points to be extracted or inferred
from the document content: entities, insights, structured objects,
classifications, numerical calculations, dates, ratings, and
assessments.

For detailed guidance on creating and configuring these components,
see:

* Aspect Extraction - Complete guide to defining and using aspects

* Supported Concepts - All available concept types and how to use them


# ==== converters/docx ====

DOCX Converter
**************

ContextGem provides a built-in converter to easily transform DOCX
files into LLM-ready ContextGem document objects.

* 📑 **Comprehensive extraction of document elements**: paragraphs,
  headings, lists, tables, comments, footnotes, textboxes,
  headers/footers, links, embedded images, and inline formatting

* 🧩 **Document structure preservation** with rich metadata for
  improved LLM analysis

* 🛠️ **Built-in converter** that directly processes Word XML

Note:

  ✨ **Performance improvement in v0.17.1**: DOCX converter now
  converts files **~2X faster**.
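
As background on what "directly processes Word XML" involves: a DOCX
file is a ZIP archive whose main body is stored in
"word/document.xml", with paragraphs as "w:p" elements and text runs
as "w:t" elements. The sketch below reads paragraph text using only
the Python standard library. It is an illustration under those
assumptions, not the "DocxConverter" implementation; the helper names
"extract_paragraph_texts" and "build_minimal_docx" are hypothetical,
and the generated archive contains just enough for this parser (it is
not a complete Word file).

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from xml.sax.saxutils import escape

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraph_texts(docx_bytes: bytes) -> list[str]:
    """Return the text of each non-empty paragraph in the document body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t run elements.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def build_minimal_docx(paragraphs: list[str]) -> bytes:
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


sample = build_minimal_docx(["Service Agreement", "Fees & charges"])
print(extract_paragraph_texts(sample))
```

A real converter also has to resolve styles, numbering, tables,
footnotes, comments, and images from other parts of the archive, which
is the work "DocxConverter" takes off your hands.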


🚀 Usage
========

   # Using ContextGem's DocxConverter

   from contextgem import DocxConverter


   converter = DocxConverter()

   # Convert a DOCX file to an LLM-ready ContextGem Document
   # from path
   document = converter.convert("path/to/document.docx")
   # or from file object
   with open("path/to/document.docx", "rb") as docx_file_object:
       document = converter.convert(docx_file_object)

   # Perform data extraction on the resulting Document object
   # document.add_aspects(...)
   # document.add_concepts(...)
   # llm.extract_all(document)

   # You can also use a DocxConverter instance as a standalone text extractor
   docx_text = converter.convert_to_text_format(
       "path/to/document.docx",
       output_format="markdown",  # or "raw"
   )


🔄 Conversion Process
=====================

The "DocxConverter" performs the following operations when converting
a DOCX file to a ContextGem Document with the "convert()" method:

+---------------------------+----------------------------------------------------+---------------------------+
| Elements                  | Extraction Details                                 | Control Parameter         |
|                           |                                                    | (Default)                 |
|===========================|====================================================|===========================|
| **Text**                  | Extracts the full document text as raw text, and   | "apply_markdown=True"     |
|                           | optionally applies markdown processing and         |                           |
|                           | formatting while preserving raw text separately    |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Paragraphs**            | Extracts "Paragraph" objects with rich metadata    | *Always included*         |
|                           | serving as additional context for LLM (e.g.,       |                           |
|                           | *"Style: Normal, Table: 3, Row: 1, Column: 3,      |                           |
|                           | Table Cell"*)                                      |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Headings**              | Preserves heading levels and formats as markdown   | *Always included*         |
|                           | headings when in markdown mode                     |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Lists**                 | Maintains list hierarchy, numbering, and           | *Always included*         |
|                           | formatting with proper indentation and list type   |                           |
|                           | information                                        |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Tables**                | Preserves table structure and formats tables in    | "include_tables=True"     |
|                           | markdown mode                                      |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Headers & Footers**     | Captures document headers and footers with         | "include_headers=True" /  |
|                           | appropriate metadata                               | "include_footers=True"    |
+---------------------------+----------------------------------------------------+---------------------------+
| **Footnotes**             | Extracts footnotes with references and preserves   | "include_footnotes=True"  |
|                           | connection to original text                        |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Comments**              | Preserves document comments with author            | "include_comments=True"   |
|                           | information and timestamps                         |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Links**                 | Processes and formats hyperlinks, preserving both  | "include_links=True"      |
|                           | link text and target URLs                          |                           |
+---------------------------+----------------------------------------------------+---------------------------+
| **Text Boxes**            | Extracts text from various text box formats        | "include_textboxes=True"  |
+---------------------------+----------------------------------------------------+---------------------------+
| **Inline Formatting**     | Applies inline formatting such as bold, italic,    | "include_inline_formatti  |
|                           | underline, etc. when in markdown mode              | ng=True"                  |
+---------------------------+----------------------------------------------------+---------------------------+
| **Images**                | Extracts embedded images and converts them to      | "include_images=True"     |
|                           | "Image" objects for further processing with vision |                           |
|                           | models                                             |                           |
+---------------------------+----------------------------------------------------+---------------------------+


ℹ️ Current Limitations
======================

DocxConverter has the following limitations:

* Drawings such as charts are skipped because they are difficult to
  represent in text format.

* Inline markdown formatting (bold, italic, etc.) and hyperlink
  formatting are not supported in specially marked sections (headers,
  footers, footnotes, comments).

* Extraction of a generated table of contents (ToC) is not supported.
  (A ToC is a list of document headings with page numbers that Word
  generates automatically from heading styles.)


# ==== aspects/aspects ====

Aspect Extraction
*****************

"Aspect" is a fundamental component of ContextGem that represents a
defined area or topic within a document that requires focused
attention. Aspects help identify and extract specific sections or
themes from documents according to predefined criteria.


📝 Overview
===========

Aspects serve as containers for organizing and structuring document
content extraction. They allow you to:

* **Extract document sections**: Identify and extract specific parts
  of documents (e.g., contract clauses, report sections, policy terms)

* **Organize content hierarchically**: Create sub-aspects to break
  down complex topics into logical components

* **Define extraction scope**: Focus on specific areas of interest
  before applying detailed concept extraction

While concepts extract specific data points, aspects extract entire
sections or topics from documents, providing context for subsequent
detailed analysis.


⭐ Key Features
===============


Hierarchical Organization
-------------------------

Aspects support nested structures through sub-aspects, allowing you to
break down complex topics:

* **Parent aspects** represent broad topics (e.g., *"Termination
  Clauses"*)

* **Sub-aspects** represent specific components (e.g., *"Notice
  Period"*, *"Severance Terms"*, *"Company Rights"*)


Integration with Concepts
-------------------------

Aspects can contain "_Concept" instances for detailed data extraction
within the identified sections, creating a two-stage extraction
workflow.

Note:

  See supported concept types in Supported Concepts. All public
  concept types inherit from the internal "_Concept" base class.
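
The two-stage idea can be sketched without any LLM: first narrow the
document down to the paragraphs that belong to an aspect, then run
detailed extraction on that subset only. In the sketch below, keyword
matching and a regular expression are deliberately crude stand-ins for
the LLM calls, and both function names are hypothetical; ContextGem's
actual extraction logic is not shown here.

```python
import re

paragraphs = [
    "3. Payment Terms",
    "The Client shall pay a monthly service fee of $5,000.",
    "4. Confidentiality",
    "Each party shall keep the other party's information confidential.",
]


def select_aspect_paragraphs(paragraphs, keywords):
    """Stage 1 stand-in: keep paragraphs that mention any keyword."""
    return [p for p in paragraphs if any(k in p.lower() for k in keywords)]


def extract_amounts(paragraphs):
    """Stage 2 stand-in: pull dollar amounts from the selected paragraphs."""
    amounts = []
    for p in paragraphs:
        for match in re.findall(r"\$([\d,]+(?:\.\d+)?)", p):
            amounts.append(float(match.replace(",", "")))
    return amounts


# Stage 1 narrows the scope; stage 2 only sees the narrowed text.
payment_section = select_aspect_paragraphs(paragraphs, ["payment", "fee"])
fees = extract_amounts(payment_section)
print(payment_section)
print(fees)
```

The benefit of the split is that the second stage works on a smaller,
more relevant context than the full document.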


💻 Basic Usage
==============


Simple Aspect Extraction
------------------------

Here's how to extract a specific section from a document:

   # ContextGem: Aspect Extraction

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Create a document instance
   doc = Document(
       raw_text=(
           "Software License Agreement\n"
           "This software license agreement (Agreement) is entered into between Tech Corp (Licensor) and Client Corp (Licensee).\n"
           "...\n"
           "2. Term and Termination\n"
           "This Agreement shall commence on the Effective Date and shall continue for a period of three (3) years, "
           "unless earlier terminated in accordance with the provisions hereof. Either party may terminate this Agreement "
           "upon thirty (30) days written notice to the other party.\n"
           "\n"
           "3. Payment Terms\n"
           "Licensee agrees to pay Licensor an annual license fee of $10,000, payable within thirty (30) days of the "
           "invoice date. Late payments shall incur a penalty of 1.5% per month.\n"
           "...\n"
       ),
   )

   # Define an aspect to extract the termination clause
   termination_aspect = Aspect(
       name="Termination Clauses",
       description="Sections describing how and when the agreement can be terminated, including notice periods and conditions",
   )

   # Add the aspect to the document
   doc.add_aspects([termination_aspect])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the aspect from the document
   termination_aspect = llm.extract_aspects_from_document(doc)[0]

   # Access the extracted information
   print("Extracted Termination Clauses:")
   for item in termination_aspect.extracted_items:
       print(f"- {item.value}")


Aspect with Sub-Aspects
-----------------------

Breaking down complex topics into components:

   # ContextGem: Aspect Extraction with Sub-Aspects

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Create a document instance
   doc = Document(
       raw_text=(
           "Employment Agreement\n"
           "This Employment Agreement is entered into between Global Tech Inc. (Company) and John Smith (Employee).\n"
           "\n"
           "Section 8: Termination\n"
           "8.1 Termination by Company\n"
           "The Company may terminate this agreement at any time with or without cause by providing thirty (30) days "
           "written notice to the Employee. In case of termination for cause, no notice period is required.\n"
           "\n"
           "8.2 Termination by Employee\n"
           "The Employee may terminate this agreement by providing fourteen (14) days written notice to the Company. "
           "The Employee must complete all pending assignments before the termination date.\n"
           "\n"
           "8.3 Severance Benefits\n"
           "Upon termination without cause, the Employee shall receive severance pay equal to two (2) weeks of base salary "
           "for each year of service, with a minimum of four (4) weeks and a maximum of twenty-six (26) weeks. "
           "Severance benefits are contingent upon signing a release agreement.\n"
           "\n"
           "8.4 Return of Company Property\n"
           "Upon termination, the Employee must immediately return all Company property, including laptops, access cards, "
           "confidential documents, and any other materials belonging to the Company.\n"
           "\n"
           "Section 9: Non-Competition\n"
           "The Employee agrees not to engage in any business that competes with the Company for a period of twelve (12) "
           "months following termination of employment within a 50-mile radius of the Company's headquarters.\n"
       ),
   )

   # Define the main termination aspect with sub-aspects
   termination_aspect = Aspect(
       name="Termination Provisions",
       description="All provisions related to employment termination including conditions, procedures, and consequences",
       aspects=[
           Aspect(
               name="Company Termination Rights",
               description="Conditions and procedures for the company to terminate the employee, including notice periods and cause requirements",
           ),
           Aspect(
               name="Employee Termination Rights",
               description="Conditions and procedures for the employee to terminate employment, including notice requirements and obligations",
           ),
           Aspect(
               name="Severance Benefits",
               description="Compensation and benefits provided to the employee upon termination, including calculation methods and conditions",
           ),
           Aspect(
               name="Post-Termination Obligations",
               description="Employee obligations that continue after termination, including property return and non-competition requirements",
           ),
       ],
   )

   # Add the aspect to the document
   doc.add_aspects([termination_aspect])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract aspects from the document
   termination_aspect = llm.extract_aspects_from_document(doc)[0]

   # Access the extracted information
   print("All Termination Provisions:")
   for item in termination_aspect.extracted_items:
       print(f"- {item.value}")
   print("\nSub-Aspects:")
   for sub_aspect in termination_aspect.aspects:
       print(f"\n{sub_aspect.name}:")
       for item in sub_aspect.extracted_items:
           print(f"- {item.value}")
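
If you need the hierarchy as plain data rather than printed output, a
small recursive helper is usually enough. The sketch below relies only
on the attributes used in the example above (".name",
".extracted_items", ".aspects", and ".value"); "aspect_to_dict" is a
hypothetical helper, not part of ContextGem, and the stand-in objects
exist only so that the snippet runs without an LLM call.

```python
from types import SimpleNamespace


def aspect_to_dict(aspect) -> dict:
    """Recursively collect extracted values from an aspect and its sub-aspects."""
    return {
        "name": aspect.name,
        "items": [item.value for item in aspect.extracted_items],
        "sub_aspects": [aspect_to_dict(sub) for sub in aspect.aspects],
    }


# Stand-in objects that mimic only the attributes used above.
severance = SimpleNamespace(
    name="Severance Benefits",
    extracted_items=[SimpleNamespace(value="8.3 Severance Benefits")],
    aspects=[],
)
termination = SimpleNamespace(
    name="Termination Provisions",
    extracted_items=[SimpleNamespace(value="Section 8: Termination")],
    aspects=[severance],
)

summary = aspect_to_dict(termination)
print(summary["name"])
print(summary["sub_aspects"][0]["items"])
```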


⚙️ Parameters
=============

When creating an "Aspect", you can configure the following parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the aspect. Must be   |
|                      |                 |                 | unique among sibling aspects.                      |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A detailed description of what the aspect          |
|                      |                 |                 | represents and what content should be extracted.   |
|                      |                 |                 | Must be unique among sibling aspects.              |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "aspects"            | "list[Aspect]"  | "[]"            | *Optional*. List of sub-aspects for hierarchical   |
|                      |                 |                 | organization. Limited to one nesting level.        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "concepts"           | "list[_Concept  | "[]"            | *Optional*. List of concepts associated with the   |
|                      | ]"              |                 | aspect for detailed data extraction within the     |
|                      |                 |                 | aspect's scope. See supported concept types in     |
|                      |                 |                 | Supported Concepts.                                |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for aspect         |
|                      |                 | xt""            | extraction. Available values: ""extractor_text"",  |
|                      |                 |                 | ""reasoner_text"". For more details, see 🏷️ LLM    |
|                      |                 |                 | Roles. Note that aspects only support text-based   |
|                      |                 |                 | extraction. For this reason, aspects cannot have   |
|                      |                 |                 | vision LLM roles (i.e. "llm_role" parameter value  |
|                      |                 |                 | ending with "_vision"). Concepts with vision LLM   |
|                      |                 |                 | roles cannot be used within aspects.               |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | The structural depth of references. Available      |
|                      |                 |                 | values: ""paragraphs"", ""sentences"". Paragraph   |
|                      |                 |                 | references are always populated for an aspect's    |
|                      |                 |                 | extracted items, because these items represent     |
|                      |                 |                 | existing text segments. Sentence references are    |
|                      |                 |                 | only populated when                                |
|                      |                 |                 | "reference_depth="sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether the LLM will output justification for each |
|                      |                 |                 | extracted item. Justifications provide valuable    |
|                      |                 |                 | insights into why specific text segments were      |
|                      |                 |                 | extracted for the aspect, helping you understand   |
|                      |                 |                 | the LLM's reasoning, verify extraction accuracy,   |
|                      |                 |                 | and debug unexpected results. This is particularly |
|                      |                 |                 | useful when working with complex aspects.          |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | The level of detail for justifications. Available  |
| h"                   |                 |                 | values: ""brief"", ""balanced"",                   |
|                      |                 |                 | ""comprehensive"".                                 |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum number of sentences in a justification.    |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+


📊 Extracted Items
==================

When an "Aspect" is extracted, it is populated with **a list of
extracted items** accessible through the ".extracted_items" property.
Each item is an instance of the "_StringItem" class with the following
attributes:

+----------------------+----------------------+--------------------------------------------------------------+
| Attribute            | Type                 | Description                                                  |
|======================|======================|==============================================================|
| "value"              | str                  | The extracted text segment representing the aspect           |
+----------------------+----------------------+--------------------------------------------------------------+
| "justification"      | str                  | Explanation of why this text segment was identified as       |
|                      |                      | relevant to the aspect (only if "add_justifications=True")   |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_paragrap  | list["Paragraph"]    | List of paragraph objects that contain the extracted aspect  |
| hs"                  |                      | content (always populated for an aspect's extracted items)   |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_sentence  | list["Sentence"]     | List of sentence objects that contain the extracted aspect   |
| s"                   |                      | content (only if "reference_depth="sentences"")              |
+----------------------+----------------------+--------------------------------------------------------------+
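
Because some of these attributes are populated only under certain
settings, code that reports on extracted items should treat them as
optional. The following sketch shows one way to do that.
"describe_item" is a hypothetical helper, the stand-in object mimics
only the attributes listed in the table, and the assumption that
unpopulated fields are "None" or empty is mine rather than something
stated above.

```python
from types import SimpleNamespace


def describe_item(item) -> str:
    """Format one extracted item, skipping fields that were not populated."""
    lines = [f"value: {item.value}"]
    # Only filled when add_justifications=True (assumed None/empty otherwise).
    justification = getattr(item, "justification", None)
    if justification:
        lines.append(f"justification: {justification}")
    lines.append(f"paragraph references: {len(item.reference_paragraphs)}")
    # Only filled when reference_depth="sentences" (assumed empty otherwise).
    sentences = getattr(item, "reference_sentences", None) or []
    if sentences:
        lines.append(f"sentence references: {len(sentences)}")
    return "\n".join(lines)


# Stand-in for an extracted item; real items come from an extracted Aspect.
item = SimpleNamespace(
    value="Either party may terminate upon thirty (30) days written notice.",
    justification=None,
    reference_paragraphs=[object()],
    reference_sentences=[],
)
print(describe_item(item))
```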


🚀 Advanced Usage
=================


Aspects with Concepts
---------------------

Combining aspect extraction with detailed concept extraction:

   # ContextGem: Aspect Extraction with Concepts

   import os

   from contextgem import Aspect, Document, DocumentLLM, NumericalConcept, StringConcept


   # Create a document instance
   doc = Document(
       raw_text=(
           "Service Agreement\n"
           "This Service Agreement is between DataFlow Solutions (Provider) and Enterprise Corp (Client).\n"
           "\n"
           "3. Payment Terms\n"
           "3.1 Service Fees\n"
           "The Client shall pay the Provider a monthly service fee of $5,000 for basic services. "
           "Additional premium features are available for an extra $1,200 per month. "
           "Setup fee is a one-time payment of $2,500.\n"
           "\n"
           "3.2 Payment Schedule\n"
           "All payments are due within 15 business days of invoice receipt. "
           "Invoices will be sent on the first day of each month for the upcoming service period. "
           "Late payments will incur a penalty of 2% per month on the outstanding balance.\n"
           "\n"
           "3.3 Payment Methods\n"
           "Payments may be made by bank transfer, corporate check, or ACH. "
           "Credit card payments are accepted for amounts under $1,000 with a 3% processing fee. "
           "Wire transfer fees are the responsibility of the Client.\n"
           "\n"
           "3.4 Refund Policy\n"
           "Services are non-refundable once delivered. However, if services are terminated "
           "with 30 days notice, any prepaid fees for future periods will be refunded on a pro-rata basis.\n"
       ),
   )

   # Define an aspect with associated concepts
   payment_aspect = Aspect(
       name="Payment Terms",
       description="All clauses and provisions related to payment, including fees, schedules, methods, and policies",
       concepts=[
           NumericalConcept(
               name="Monthly Service Fee",
               description="The regular monthly fee for basic services",
               numeric_type="float",
           ),
           NumericalConcept(
               name="Premium Features Fee",
               description="Additional monthly fee for premium features",
               numeric_type="float",
           ),
           NumericalConcept(
               name="Setup Fee",
               description="One-time initial setup or onboarding fee",
               numeric_type="float",
           ),
           NumericalConcept(
               name="Payment Due Days",
               description="Number of days the client has to make payment after receiving invoice",
               numeric_type="int",
           ),
           NumericalConcept(
               name="Late Payment Penalty Rate",
               description="Percentage penalty charged per month for late payments",
               numeric_type="float",
           ),
           StringConcept(
               name="Accepted Payment Methods",
               description="List of payment methods that are accepted by the provider",
           ),
           StringConcept(
               name="Refund Policy",
               description="Conditions and procedures for refunds or credits",
           ),
       ],
   )

   # Add the aspect to the document
   doc.add_aspects([payment_aspect])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract aspects and their concepts from the document
   doc = llm.extract_all(doc)

   # Access the extracted payment terms aspect and concepts
   payment_terms_aspect = doc.get_aspect_by_name("Payment Terms")
   print("Extracted Payment Terms Section:")
   for item in payment_terms_aspect.extracted_items:
       print(f"- {item.value}")
   print("\nExtracted Payment Details:")
   for concept in payment_terms_aspect.concepts:
       print(f"\n{concept.name}:")
       for item in concept.extracted_items:
           print(f"- {item.value}")

   # Access specific extracted values
   monthly_fee = payment_terms_aspect.get_concept_by_name("Monthly Service Fee")
   print(f"\nMonthly Service Fee: ${monthly_fee.extracted_items[0].value}")
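
Indexing "extracted_items[0]" raises an "IndexError" when the list is
empty, for example if nothing was extracted for a concept. A defensive
way to read the results is sketched below. "concept_values" and
"first_value" are hypothetical helpers built on the attributes used in
the example above (".concepts", ".name", ".extracted_items",
".value"); the stand-in objects exist only so that the snippet runs
without an LLM call.

```python
from types import SimpleNamespace


def concept_values(aspect) -> dict:
    """Map each concept name to the list of values extracted for it."""
    return {
        concept.name: [item.value for item in concept.extracted_items]
        for concept in aspect.concepts
    }


def first_value(aspect, concept_name, default=None):
    """Return the first extracted value for a concept, or a default if none."""
    values = concept_values(aspect).get(concept_name, [])
    return values[0] if values else default


# Stand-in for an extracted aspect with two concepts, one of them empty.
aspect = SimpleNamespace(
    concepts=[
        SimpleNamespace(
            name="Monthly Service Fee",
            extracted_items=[SimpleNamespace(value=5000.0)],
        ),
        SimpleNamespace(name="Setup Fee", extracted_items=[]),
    ]
)
print(first_value(aspect, "Monthly Service Fee"))
print(first_value(aspect, "Setup Fee", default="not found"))
```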


Complex Hierarchical Structure
------------------------------

Creating a comprehensive document analysis structure with aspects,
sub-aspects and concepts:

   # ContextGem: Complex Hierarchical Aspect Extraction with Sub-Aspects and Concepts

   import os

   from contextgem import (
       Aspect,
       BooleanConcept,
       Document,
       DocumentLLM,
       NumericalConcept,
       StringConcept,
   )


   # Create a document instance
   doc = Document(
       raw_text=(
           "Software Development and Licensing Agreement\n"
           "\n"
           "1. Intellectual Property Rights\n"
           "1.1 Ownership of Developed Software\n"
           "All software developed under this Agreement shall remain the exclusive property of the Developer. "
           "The Client receives a non-exclusive license to use the software as specified in Section 2.\n"
           "\n"
           "1.2 Client Data and Content\n"
           "The Client retains all rights to data and content provided to the Developer. "
           "The Developer may not use Client data for any purpose other than fulfilling this Agreement.\n"
           "\n"
           "1.3 Third-Party Components\n"
           "The software may include third-party open-source components. The Client agrees to comply "
           "with all applicable open-source licenses.\n"
           "\n"
           "2. License Terms\n"
           "2.1 Grant of License\n"
           "Developer grants Client a perpetual, non-transferable license to use the software "
           "for internal business purposes only, limited to 100 concurrent users.\n"
           "\n"
           "2.2 License Restrictions\n"
           "Client may not redistribute, sublicense, or create derivative works. "
           "Reverse engineering is prohibited except as required by law.\n"
           "\n"
           "3. Payment and Financial Terms\n"
           "3.1 Development Fees\n"
           "Total development fee is $150,000, payable in three installments: "
           "$50,000 upon signing, $50,000 at 50% completion, and $50,000 upon delivery.\n"
           "\n"
           "3.2 Ongoing License Fees\n"
           "Annual license fee of $12,000 is due each year starting from the first anniversary. "
           "Fees may increase by up to 5% annually with 60 days notice.\n"
           "\n"
           "3.3 Payment Terms\n"
           "All payments due within 30 days of invoice. Late payments incur 1.5% monthly penalty.\n"
           "\n"
           "4. Liability and Risk Allocation\n"
           "4.1 Limitation of Liability\n"
           "Developer's total liability shall not exceed the total amount paid under this Agreement. "
           "Neither party shall be liable for indirect, consequential, or punitive damages.\n"
           "\n"
           "4.2 Indemnification\n"
           "Client agrees to indemnify Developer against third-party claims arising from Client's use "
           "of the software, except for claims related to Developer's IP infringement.\n"
           "\n"
           "4.3 Insurance Requirements\n"
           "Developer shall maintain professional liability insurance of at least $1,000,000. "
           "Client shall maintain general liability insurance of at least $2,000,000.\n"
       ),
   )

   # Define a complex hierarchical structure
   contract_aspects = [
       Aspect(
           name="Intellectual Property Provisions",
           description="All provisions related to intellectual property rights, ownership, and usage",
           aspects=[
               Aspect(
                   name="Software Ownership",
                   description="Clauses defining who owns the developed software and related IP rights",
                   concepts=[
                       StringConcept(
                           name="Software Owner",
                           description="The party that owns the developed software",
                       ),
                       BooleanConcept(
                           name="Exclusive Ownership",
                           description="Whether the ownership is exclusive to one party",
                       ),
                   ],
               ),
               Aspect(
                   name="Client Data Rights",
                   description="Provisions about client data ownership and developer's permitted use",
                   concepts=[
                       StringConcept(
                           name="Data Usage Restrictions",
                           description="Limitations on how developer can use client data",
                       ),
                   ],
               ),
               Aspect(
                   name="Third-Party Components",
                   description="Terms regarding use of third-party or open-source components",
                   concepts=[
                       BooleanConcept(
                           name="Open Source Included",
                           description="Whether the software includes open-source components",
                       ),
                   ],
               ),
           ],
       ),
       Aspect(
           name="License Grant and Restrictions",
           description="Terms defining the software license granted to the client and any restrictions",
           aspects=[
               Aspect(
                   name="License Scope",
                   description="The extent and limitations of the license granted",
                   concepts=[
                       StringConcept(
                           name="License Type",
                           description="The type of license granted (exclusive, non-exclusive, etc.)",
                       ),
                       NumericalConcept(
                           name="User Limit",
                           description="Maximum number of concurrent users allowed",
                           numeric_type="int",
                       ),
                       BooleanConcept(
                           name="Perpetual License",
                           description="Whether the license is perpetual or time-limited",
                       ),
                   ],
               ),
               Aspect(
                   name="Usage Restrictions",
                   description="Prohibited uses and activities under the license",
                   concepts=[
                       BooleanConcept(
                           name="Redistribution Allowed",
                           description="Whether client can redistribute the software",
                       ),
                       BooleanConcept(
                           name="Derivative Works Allowed",
                           description="Whether client can create derivative works",
                       ),
                   ],
               ),
           ],
       ),
       Aspect(
           name="Financial Terms",
           description="All payment-related provisions including fees, schedules, and penalties",
           concepts=[
               NumericalConcept(
                   name="Total Development Fee",
                   description="The total amount for software development",
                   numeric_type="float",
               ),
               NumericalConcept(
                   name="Annual License Fee",
                   description="Yearly fee for using the software",
                   numeric_type="float",
               ),
               NumericalConcept(
                   name="Payment Due Days",
                   description="Number of days to make payment after invoice",
                   numeric_type="int",
               ),
           ],
       ),
       Aspect(
           name="Risk and Liability Management",
           description="Provisions for managing risks, liability limitations, and insurance requirements",
           aspects=[
               Aspect(
                   name="Liability Limitations",
                   description="Caps and exclusions on each party's liability",
                   concepts=[
                       StringConcept(
                           name="Liability Cap",
                           description="Maximum amount of liability for each party",
                       ),
                       StringConcept(
                           name="Excluded Damages",
                           description="Types of damages that are excluded from liability",
                       ),
                   ],
               ),
               Aspect(
                   name="Insurance Requirements",
                   description="Required insurance coverage for each party",
                   concepts=[
                       NumericalConcept(
                           name="Developer Insurance Amount",
                           description="Minimum professional liability insurance for developer",
                           numeric_type="float",
                       ),
                       NumericalConcept(
                           name="Client Insurance Amount",
                           description="Minimum general liability insurance for client",
                           numeric_type="float",
                       ),
                   ],
               ),
           ],
       ),
   ]

   # Add all aspects to the document
   doc.add_aspects(contract_aspects)

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract aspects and concepts
   doc = llm.extract_all(doc)

   # Access the hierarchical extraction results
   print("=== CONTRACT ANALYSIS RESULTS ===\n")

   for main_aspect in doc.aspects:
       print(f"{main_aspect.name.upper()}")
       for item in main_aspect.extracted_items:
           print(f"- {item.value}")

       # Access main aspect concepts
       if main_aspect.concepts:
           print("  Main Aspect Concepts:")
           for concept in main_aspect.concepts:
               print(f"    • {concept.name}:")
               for item in concept.extracted_items:
                   print(f"      - {item.value}")

       # Access sub-aspects
       if main_aspect.aspects:
           print("  Sub-Aspects:")
           for sub_aspect in main_aspect.aspects:
               print(f"    {sub_aspect.name}")
               for item in sub_aspect.extracted_items:
                   print(f"    - {item.value}")

               # Access sub-aspect concepts
               if sub_aspect.concepts:
                   print("    Sub-Aspect Concepts:")
                   for concept in sub_aspect.concepts:
                       print(f"      • {concept.name}:")
                       for item in concept.extracted_items:
                           print(f"        - {item.value}")

       print()


Justifications for Extraction
-----------------------------

Justifications explain why specific text segments were identified as
relevant to an aspect, helping users understand the reasoning behind
extractions and evaluate their relevance. When enabled with
"add_justifications=True", each extracted item includes a generated
explanation, available through its "justification" attribute.

Example:

   # ContextGem: Aspect Extraction with Justifications

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Create a document instance
   doc = Document(
       raw_text=(
           "NON-DISCLOSURE AGREEMENT\n"
           "\n"
           'This Non-Disclosure Agreement ("Agreement") is entered into between TechCorp Inc. '
           '("Disclosing Party") and Innovation Labs LLC ("Receiving Party") on January 15, 2024.\n'
           "...\n"
       ),
   )

   # Define a single aspect focused on NDA direction with justifications
   nda_direction_aspect = Aspect(
       name="NDA Direction",
       description="Provisions informing the NDA direction (whether mutual or one-way) and information flow between parties",
       add_justifications=True,
       justification_depth="balanced",
       justification_max_sents=4,
   )

   # Add the aspect to the document
   doc.aspects = [nda_direction_aspect]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the aspect with justifications
   nda_direction_aspect = llm.extract_aspects_from_document(doc)[0]
   for i, item in enumerate(nda_direction_aspect.extracted_items, 1):
       print(f"- {i}. {item.value}")
       print(f"  Justification: {item.justification}")
       print()

Note:

  References are always included for aspects. The
  "reference_paragraphs" field is automatically populated in extracted
  items of aspects, as they represent existing text segments in the
  document. The "reference_sentences" field is only populated when
  "reference_depth" is set to ""sentences"". You can access these
  references as follows:

     # Always available for aspects
     aspect.extracted_items[0].reference_paragraphs

     # Only populated if reference_depth="sentences"
     aspect.extracted_items[0].reference_sentences
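
Because sentence-level references are only present in some configurations, downstream code may want to fall back to paragraphs. The following is a minimal sketch of such a fallback. The helper "finest_references" is hypothetical, and the sketch assumes that an unpopulated "reference_sentences" field is empty or missing and that reference objects expose ".raw_text"; stand-in objects with made-up text are used instead of a real extraction result.

```python
from types import SimpleNamespace


def finest_references(item):
    """Return sentence-level references when populated, else paragraph-level ones."""
    # Assumption: an unpopulated field is an empty list or absent
    sentences = getattr(item, "reference_sentences", None)
    return list(sentences) if sentences else list(item.reference_paragraphs)


# Stand-in reference objects (not real ContextGem instances)
paragraph = SimpleNamespace(
    raw_text="TechCorp Inc. is the Disclosing Party. Innovation Labs LLC is the Receiving Party."
)
sentence = SimpleNamespace(raw_text="TechCorp Inc. is the Disclosing Party.")

paragraphs_only = SimpleNamespace(
    reference_paragraphs=[paragraph], reference_sentences=[]
)
with_sentences = SimpleNamespace(
    reference_paragraphs=[paragraph], reference_sentences=[sentence]
)

for item in (paragraphs_only, with_sentences):
    for ref in finest_references(item):
        print(ref.raw_text)
```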


💡 Best Practices
=================


Aspect Definition
-----------------

* **Be specific**: Provide clear, detailed descriptions that help the
  LLM understand exactly what content constitutes the aspect

* **Use domain terminology**: Include relevant domain-specific terms
  that help identify the target content

* **Define scope clearly**: Specify what should and shouldn't be
  included in the aspect


Structuring Complex Content
---------------------------

* **Logical decomposition**: Break down complex topics into logical,
  non-overlapping components

* **Meaningful relationships**: Ensure sub-aspects and/or concepts
  genuinely belong to their parent aspect


Integration Strategy
--------------------

* **Two-stage extraction**: Use aspects to identify relevant sections
  first, then apply sub-aspects and/or concepts for detailed data
  extraction

* **Scope alignment**: Ensure sub-aspects and/or concepts are relevant
  to their containing aspects

* **Reference tracking**: Enable references when you need to trace
  extracted data back to source locations


🎯 Example Use Cases
====================

These are examples of how aspects may be used in different domains:


Contract Analysis
-----------------

* **Termination Clauses**: Extract and analyze termination conditions,
  notice periods, and severance terms

* **Payment Terms**: Identify payment schedules, amounts, and
  conditions

* **Liability Sections**: Extract liability caps, limitations, and
  indemnification clauses

* **Intellectual Property**: Identify IP ownership, licensing, and
  usage rights


Financial Reports
-----------------

* **Revenue Sections**: Extract revenue recognition, breakdown by
  segments, and growth analysis

* **Compliance Sections**: Identify regulatory compliance statements
  and audit findings

* **Key Performance Indicators**: Extract precise numerical metrics
  like EBITDA margins, debt-to-equity ratios, and year-over-year
  percentage changes


Technical Documentation
-----------------------

* **Product Specifications**: Extract technical requirements,
  features, and performance criteria

* **Installation Procedures**: Identify setup steps, configuration
  requirements, and dependencies

* **Troubleshooting Sections**: Extract problem descriptions,
  diagnostic steps, and solutions

* **API Documentation**: Identify endpoints, parameters, and usage
  examples


Research Papers
---------------

* **Methodology Sections**: Extract research methods, data collection,
  and analysis approaches

* **Results Sections**: Identify findings, statistical outcomes, and
  experimental results

* **Discussion Sections**: Extract interpretation, implications, and
  future research directions


# ==== concepts/supported_concepts ====

Supported Concepts
******************

In ContextGem, Concepts are building blocks for defining the
structured data you want to extract from documents. Each concept type
is designed for different kinds of information, allowing you to build
complex extraction schemas.


Available Concept Types
=======================

ContextGem provides several types of concepts, each tailored for
specific extraction needs:

* 📝 StringConcept: For extracting text values

* ✅ BooleanConcept: For extracting boolean (True/False) values

* 🔢 NumericalConcept: For extracting numerical values (integers or
  floats)

* 📅 DateConcept: For extracting date objects

* ⭐ RatingConcept: For extracting numerical ratings within a defined
  scale

* 📊 JsonObjectConcept: For extracting structured data with multiple
  fields

* 🏷️ LabelConcept: For classification using predefined labels
  (multi-class or multi-label)

This section provides detailed documentation for each concept type,
including usage examples and best practices.
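
As a rough illustration of how the list above maps to Python value types, the sketch below suggests a concept type name for a sample value. The helper "suggest_concept_type" is hypothetical and not part of ContextGem, and the mapping is only a reading of the descriptions above. RatingConcept and LabelConcept are left out because choosing them depends on intent (a rating scale or a predefined label set) rather than on the value's type.

```python
from datetime import date


def suggest_concept_type(sample):
    """Map a sample Python value to the name of a concept type listed above."""
    # Check bool before int: bool is a subclass of int in Python
    if isinstance(sample, bool):
        return "BooleanConcept"
    if isinstance(sample, (int, float)):
        return "NumericalConcept"
    if isinstance(sample, date):
        return "DateConcept"
    if isinstance(sample, dict):
        return "JsonObjectConcept"
    if isinstance(sample, str):
        return "StringConcept"
    raise TypeError(f"No concept type suggested for {type(sample).__name__}")


for sample in ("John Smith", True, 30, date(2025, 3, 15), {"city": "Oslo"}):
    print(f"{sample!r} -> {suggest_concept_type(sample)}")
```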


# ==== concepts/string_concept ====

StringConcept
*************

"StringConcept" is a versatile concept type in ContextGem that
extracts text-based information from documents, ranging from simple
data fields to complex analytical insights.


📝 Overview
===========

"StringConcept" is used when you need to extract text values from
documents, including:

* **Simple fields**: names, titles, descriptions, identifiers

* **Complex analyses**: conclusions, assessments, recommendations,
  summaries

* **Detected elements**: anomalies, patterns, key findings, critical
  insights

This concept type offers flexibility to extract both factual
information and interpretive content that requires advanced
understanding.


💻 Usage Example
================

Here's a simple example of how to use "StringConcept" to extract a
person's name from a document:

   # ContextGem: StringConcept Extraction

   import os

   from contextgem import Document, DocumentLLM, StringConcept


   # Create a Document object from text
   doc = Document(raw_text="My name is John Smith and I am 30 years old.")

   # Define a StringConcept to extract a person's name
   name_concept = StringConcept(
       name="Person name",
       description="Full name of the person",
   )

   # Attach the concept to the document
   doc.add_concepts([name_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   name_concept = llm.extract_concepts_from_document(doc)[0]

   # Get the extracted value
   print(name_concept.extracted_items[0].value)  # Output: "John Smith"
   # Or access the extracted value from the document object
   print(doc.concepts[0].extracted_items[0].value)  # Output: "John Smith"


⚙️ Parameters
=============

When creating a "StringConcept", you can specify the following
parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what the concept represents |
|                      |                 |                 | and what should be extracted                       |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "examples"           | "list[StringEx  | "[]"            | Optional. Example values that help the LLM better  |
|                      | ample]"         |                 | understand what to extract and the expected format |
|                      |                 |                 | (e.g., *"Party Name (Role)"* format for contract   |
|                      |                 |                 | parties). This additional guidance helps improve   |
|                      |                 |                 | extraction accuracy and consistency.               |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for extracting the |
|                      |                 | xt""            | concept. Available values: ""extractor_text"",     |
|                      |                 |                 | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". For more details, see 🏷️  |
|                      |                 |                 | LLM Roles.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether to include justifications for extracted    |
|                      |                 |                 | items. Justifications provide explanations of why  |
|                      |                 |                 | the LLM extracted specific values and the          |
|                      |                 |                 | reasoning behind the extraction, which is          |
|                      |                 |                 | especially useful for complex extractions or when  |
|                      |                 |                 | debugging results.                                 |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | Justification detail level. Available values:      |
| h"                   |                 |                 | ""brief"", ""balanced"", ""comprehensive"".        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum sentences in a justification.              |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_references"     | "bool"          | "False"         | Whether to include source references for extracted |
|                      |                 |                 | items. References indicate the specific locations  |
|                      |                 |                 | in the document where the information was either   |
|                      |                 |                 | directly found or from which it was inferred,      |
|                      |                 |                 | helping to trace back extracted values to their    |
|                      |                 |                 | source content even when the extraction involves   |
|                      |                 |                 | reasoning or interpretation.                       |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | Source reference granularity. Available values:    |
|                      |                 |                 | ""paragraphs"", ""sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "singular_occurrenc  | "bool"          | "False"         | Whether this concept is restricted to having only  |
| e"                   |                 |                 | one extracted item. If "True", only a single       |
|                      |                 |                 | item will be extracted. This is                    |
|                      |                 |                 | particularly relevant when it might be unclear for |
|                      |                 |                 | the LLM whether to focus on the concept as a       |
|                      |                 |                 | single item or extract multiple items. For         |
|                      |                 |                 | example, when extracting the total amount of       |
|                      |                 |                 | payments in a contract, where payments might be    |
|                      |                 |                 | mentioned in different parts of the document but   |
|                      |                 |                 | you only want the final total. Note that with      |
|                      |                 |                 | advanced LLMs, this constraint may not be strictly |
|                      |                 |                 | required as they can often infer the appropriate   |
|                      |                 |                 | number of items to extract from the concept's      |
|                      |                 |                 | name, description, and type (e.g., "document       |
|                      |                 |                 | title" vs "key findings").                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "custom_data"        | "dict"          | "{}"            | Optional. Dictionary for storing any additional    |
|                      |                 |                 | data that you want to associate with the concept.  |
|                      |                 |                 | This data must be JSON-serializable. This data is  |
|                      |                 |                 | not used for extraction but can be useful for      |
|                      |                 |                 | custom processing or downstream tasks.             |
+----------------------+-----------------+-----------------+----------------------------------------------------+
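
Since "custom_data" must be JSON-serializable, it can help to check a dictionary before attaching it to a concept. The following is a minimal sketch using only the standard library. The helper "check_custom_data" is hypothetical; how ContextGem itself validates this field is not shown here.

```python
import json


def check_custom_data(custom_data):
    """Return custom_data unchanged, or raise ValueError if it is not JSON-serializable."""
    try:
        json.dumps(custom_data)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"custom_data is not JSON-serializable: {exc}") from exc
    return custom_data


# Plain strings, numbers, lists, and dicts serialize without issue
ok = check_custom_data({"reviewer": "legal", "priority": 2})
print(ok)

# A set is not JSON-serializable, so the check rejects it
try:
    check_custom_data({"tags": {"contract", "nda"}})
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```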


🚀 Advanced Usage
=================


✏️ Adding Examples
------------------

You can add examples to improve the extraction accuracy and set the
expected format for a "StringConcept":

   # ContextGem: StringConcept Extraction with Examples

   import os

   from contextgem import Document, DocumentLLM, StringConcept, StringExample


   # Create a Document object from text
   contract_text = """
   SERVICE AGREEMENT
   This Service Agreement (the "Agreement") is entered into as of January 15, 2025 by and between:
   XYZ Innovations Inc., a Delaware corporation with offices at 123 Tech Avenue, San Francisco, CA
   ("Provider"), and
   Omega Enterprises LLC, a New York limited liability company with offices at 456 Business Plaza,
   New York, NY ("Customer").
   """
   doc = Document(raw_text=contract_text)

   # Create a StringConcept for extracting parties and their roles
   parties_concept = StringConcept(
       name="Contract parties",
       description="Names of parties and their roles in the contract",
       examples=[
           StringExample(content="Acme Corporation (Supplier)"),
           StringExample(content="TechGroup Inc. (Client)"),
       ],  # add examples providing additional guidance to the LLM
   )

   # Attach the concept to the document
   doc.add_concepts([parties_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   parties_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted parties and their roles
   print("Extracted parties and roles:")
   for item in parties_concept.extracted_items:
       print(f"- {item.value}")

   # Expected output:
   # - XYZ Innovations Inc. (Provider)
   # - Omega Enterprises LLC (Customer)
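
When examples establish a format such as "Party Name (Role)", the extracted strings can be split into fields afterwards. The sketch below shows one way to do this with a regular expression. The helper "split_party" is hypothetical and not part of ContextGem, and because an LLM is not guaranteed to follow the example format, values without a trailing role are returned with a role of "None".

```python
import re

# Name, optional whitespace, then a trailing "(Role)" at the end of the string
PARTY_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<role>[^()]+)\)\s*$")


def split_party(value):
    """Split 'Party Name (Role)' into (name, role); role is None if absent."""
    text = value.strip()
    match = PARTY_PATTERN.match(text)
    if match is None:
        return text, None
    return match.group("name"), match.group("role").strip()


# Values in the format shown in the expected output above
for value in ("XYZ Innovations Inc. (Provider)", "Omega Enterprises LLC (Customer)"):
    name, role = split_party(value)
    print(f"{name} | {role}")
```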


🔍 References and Justifications for Extraction
-----------------------------------------------

You can configure a "StringConcept" to include justifications and
references. Justifications help explain the reasoning behind extracted
values, especially for complex or inferred information like
conclusions or assessments, while references point to the specific
parts of the document that informed the extraction:

   # ContextGem: StringConcept Extraction with References and Justifications

   import os

   from contextgem import Document, DocumentLLM, StringConcept


   # Sample document text containing financial information
   financial_text = """
   2024 Financial Performance Summary

   Revenue increased to $120 million in fiscal year 2024, representing 15% growth compared to the previous year. This growth was primarily driven by the expansion of our enterprise client base and the successful launch of our premium service tier.

   The Board has recommended a dividend of $1.25 per share, which will be payable to shareholders of record as of March 15, 2025.
   """

   # Create a Document from the text
   doc = Document(raw_text=financial_text)

   # Create a StringConcept with justifications and references enabled
   key_figures_concept = StringConcept(
       name="Financial key figures",
       description="Important financial metrics and figures mentioned in the report",
       add_justifications=True,  # enable justifications to understand extraction reasoning
       justification_depth="balanced",
       justification_max_sents=3,  # allow up to 3 sentences for each justification
       add_references=True,  # include references to source text
       reference_depth="sentences",  # reference specific sentences rather than paragraphs
   )

   # Attach the concept to the document
   doc.add_concepts([key_figures_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4o-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept
   key_figures_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted items with justifications and references
   print("Extracted financial key figures:")
   for item in key_figures_concept.extracted_items:
       print(f"\nFigure: {item.value}")
       print(f"Justification: {item.justification}")
       print("Source references:")
       for sent in item.reference_sentences:
           print(f"- {sent.raw_text}")


📊 Extracted Items
==================

When a "StringConcept" is extracted, it is populated with **a list of
extracted items** accessible through the ".extracted_items" property.
Each item is an instance of the "_StringItem" class with the following
attributes:

+----------------------+----------------------+--------------------------------------------------------------+
| Attribute            | Type                 | Description                                                  |
|======================|======================|==============================================================|
| "value"              | str                  | The extracted text string                                    |
+----------------------+----------------------+--------------------------------------------------------------+
| "justification"      | str                  | Explanation of why this string was extracted (only if        |
|                      |                      | "add_justifications=True")                                   |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_paragrap  | list["Paragraph"]    | List of paragraph objects that informed the extraction (only |
| hs"                  |                      | if "add_references=True")                                    |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_sentence  | list["Sentence"]     | List of sentence objects that informed the extraction (only  |
| s"                   |                      | if "add_references=True" and "reference_depth="sentences"")  |
+----------------------+----------------------+--------------------------------------------------------------+
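
For logging or export, the attributes in the table above can be flattened into plain dictionaries. The following is a minimal sketch: the helper "item_to_record" is hypothetical, it assumes that paragraph and sentence objects expose ".raw_text", and it treats missing or unpopulated attributes as empty. A stand-in item with made-up values is used so that the sketch runs without an LLM call.

```python
import json
from types import SimpleNamespace


def item_to_record(item):
    """Flatten an extracted item into a dict of JSON-friendly fields."""
    paragraphs = getattr(item, "reference_paragraphs", None) or []
    sentences = getattr(item, "reference_sentences", None) or []
    return {
        "value": item.value,
        "justification": getattr(item, "justification", None),
        "paragraphs": [p.raw_text for p in paragraphs],
        "sentences": [s.raw_text for s in sentences],
    }


# Stand-in item (not real extraction output)
source = SimpleNamespace(raw_text="The Board has recommended a dividend of $1.25 per share.")
item = SimpleNamespace(
    value="Dividend of $1.25 per share",
    justification=None,  # as if add_justifications were left disabled
    reference_paragraphs=[source],
    reference_sentences=[source],
)

record = item_to_record(item)
print(json.dumps(record, indent=2))
```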


💡 Best Practices
=================

Here are some best practices to optimize your use of "StringConcept":

* Provide a clear and specific description that helps the LLM
  understand exactly what to extract.

* Include examples (using "StringExample") to improve extraction
  accuracy and demonstrate the expected format (e.g., *"Party Name
  (Role)"* for contract parties or *"Revenue: $X million"* for
  financial figures).

* Enable justifications (using "add_justifications=True") when you
  need to see why the LLM extracted certain values.

* Enable references (using "add_references=True") when you need to
  trace back to where in the document the information was found or
  understand what evidence informed extracted values (especially for
  inferred information).

* When relevant, enforce single-item extraction (using
  "singular_occurrence=True"). This is particularly useful when it
  might be unclear to the LLM whether to treat the concept as a
  single item or to extract multiple items, for example when
  extracting the total amount of payments in a contract, where
  payments might be mentioned in different parts of the document but
  you only want the final total.


# ==== concepts/boolean_concept ====

BooleanConcept
**************

"BooleanConcept" is a specialized concept type that evaluates document
content and produces True/False assessments based on specific
criteria, conditions, or properties you define.


📝 Overview
===========

"BooleanConcept" is used when you need to determine if a document
contains or satisfies specific attributes, properties, or conditions
that can be represented as True or False values, such as:

* **Presence checks**: contains confidential information, includes
  specific clauses, mentions certain topics

* **Compliance assessments**: meets regulatory requirements, follows
  specific formatting standards

* **Binary classifications**: is favorable/unfavorable, is
  complete/incomplete, is approved/rejected


💻 Usage Example
================

Here's a simple example of how to use "BooleanConcept" to determine if
a document mentions confidential information:

   # ContextGem: BooleanConcept Extraction

   import os

   from contextgem import BooleanConcept, Document, DocumentLLM


   # Create a Document object from text
   doc = Document(
       raw_text="This document contains confidential information and should not be shared publicly."
   )

   # Define a BooleanConcept to detect confidential content
   confidentiality_concept = BooleanConcept(
       name="Is confidential",
       description="Whether the document contains confidential information",
   )

   # Attach the concept to the document
   doc.add_concepts([confidentiality_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   confidentiality_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted value
   print(confidentiality_concept.extracted_items[0].value)  # Output: True
   # Or access the extracted value from the document object
   print(doc.concepts[0].extracted_items[0].value)  # Output: True
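
The example above indexes "extracted_items[0]" directly, which assumes
that at least one item was extracted. A minimal sketch of a more
defensive accessor follows, on the assumption that the list may be
empty when nothing is extracted. "first_value" is a hypothetical
helper, not part of ContextGem, and "SimpleNamespace" objects stand in
for real extracted items:

```python
from types import SimpleNamespace


def first_value(items, default=None):
    """Return the value of the first extracted item, or `default` if the list is empty."""
    return items[0].value if items else default


# Stand-ins for extracted items; real ones come from `concept.extracted_items`.
found = [SimpleNamespace(value=True)]
nothing = []

print(first_value(found))    # True
print(first_value(nothing))  # None
```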


⚙️ Parameters
=============

When creating a "BooleanConcept", you can specify the following
parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what condition or property  |
|                      |                 |                 | the concept evaluates and the criteria for         |
|                      |                 |                 | determining true or false values                   |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for extracting the |
|                      |                 | xt""            | concept. Available values: ""extractor_text"",     |
|                      |                 |                 | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". For more details, see 🏷️  |
|                      |                 |                 | LLM Roles.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether to include justifications for extracted    |
|                      |                 |                 | items. Justifications provide explanations of why  |
|                      |                 |                 | the LLM extracted specific values and the          |
|                      |                 |                 | reasoning behind the extraction, which is          |
|                      |                 |                 | especially useful for complex extractions or when  |
|                      |                 |                 | debugging results.                                 |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | Justification detail level. Available values:      |
| h"                   |                 |                 | ""brief"", ""balanced"", ""comprehensive"".        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum sentences in a justification.              |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_references"     | "bool"          | "False"         | Whether to include source references for extracted |
|                      |                 |                 | items. References indicate the specific locations  |
|                      |                 |                 | in the document where evidence supporting the      |
|                      |                 |                 | boolean determination was found, helping to trace  |
|                      |                 |                 | back the true/false value to relevant content that |
|                      |                 |                 | influenced the decision.                           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | Source reference granularity. Available values:    |
|                      |                 |                 | ""paragraphs"", ""sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "singular_occurrenc  | "bool"          | "False"         | Whether this concept is restricted to having only  |
| e"                   |                 |                 | one extracted item. If "True", only a single       |
|                      |                 |                 | item will be extracted. For boolean                |
|                      |                 |                 | concepts, this parameter is particularly useful    |
|                      |                 |                 | when you want to make a single true/false          |
|                      |                 |                 | determination about the entire document (e.g.,     |
|                      |                 |                 | "contains confidential information") or a unique   |
|                      |                 |                 | determination about a specific aspect (e.g., "is   |
|                      |                 |                 | the payment schedule finalized"). This helps       |
|                      |                 |                 | distinguish between evaluating overall document    |
|                      |                 |                 | properties versus identifying multiple instances   |
|                      |                 |                 | where a condition might be true/false. Note that   |
|                      |                 |                 | with advanced LLMs, this constraint may not be     |
|                      |                 |                 | required as they can often infer the appropriate   |
|                      |                 |                 | number of items to extract from the concept's      |
|                      |                 |                 | name, description, and type (e.g., "contains       |
|                      |                 |                 | confidential information" vs "compliance           |
|                      |                 |                 | violations").                                      |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "custom_data"        | "dict"          | "{}"            | Optional. Dictionary for storing any additional    |
|                      |                 |                 | data that you want to associate with the concept.  |
|                      |                 |                 | This data must be JSON-serializable. This data is  |
|                      |                 |                 | not used for extraction but can be useful for      |
|                      |                 |                 | custom processing or downstream tasks.             |
+----------------------+-----------------+-----------------+----------------------------------------------------+
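
Since "custom_data" must be JSON-serializable, it can help to check a
dictionary before attaching it to a concept. The sketch below uses
only the standard "json" module. "is_json_serializable" is a
hypothetical helper, and how ContextGem itself validates the value is
not shown here:

```python
import json


def is_json_serializable(data):
    """Return True if `data` can be encoded with the standard json module."""
    try:
        json.dumps(data)
    except (TypeError, ValueError):
        return False
    return True


ok_data = {"owner": "legal-team", "priority": 2, "tags": ["nda", "review"]}
bad_data = {"reviewers": {"alice", "bob"}}  # a set cannot be encoded as JSON

print(is_json_serializable(ok_data))   # True
print(is_json_serializable(bad_data))  # False
```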


🚀 Advanced Usage
=================


🔍 References and Justifications for Extraction
-----------------------------------------------

You can configure a "BooleanConcept" to include justifications and
references. Justifications help explain the reasoning behind
true/false determinations, while references point to the specific
parts of the document that influenced the decision:

   # ContextGem: BooleanConcept Extraction with References and Justifications

   import os

   from contextgem import BooleanConcept, Document, DocumentLLM


   # Sample document text containing policy information
   policy_text = """
   Company Data Retention Policy (Updated 2024)

   All customer data must be encrypted at rest and in transit using industry-standard encryption protocols.
   Personal information should be retained for no longer than 3 years after the customer relationship ends.
   Employees are required to complete data privacy training annually.
   """

   # Create a Document from the text
   doc = Document(raw_text=policy_text)

   # Create a BooleanConcept with justifications and references enabled
   compliance_concept = BooleanConcept(
       name="Has encryption requirement",
       description="Whether the document specifies that data must be encrypted",
       add_justifications=True,  # Enable justifications to understand reasoning
       justification_depth="brief",
       justification_max_sents=1,  # Allow at most 1 sentence per justification
       add_references=True,  # Include references to source text
       reference_depth="sentences",  # Reference specific sentences rather than paragraphs
   )

   # Attach the concept to the document
   doc.add_concepts([compliance_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4o-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept
   compliance_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted value with justification and references
   print(f"Has encryption requirement: {compliance_concept.extracted_items[0].value}")
   print(f"\nJustification: {compliance_concept.extracted_items[0].justification}")
   print("\nSource references:")
   for sent in compliance_concept.extracted_items[0].reference_sentences:
       print(f"- {sent.raw_text}")


📊 Extracted Items
==================

When a "BooleanConcept" is extracted, it is populated with **a list of
extracted items** accessible through the ".extracted_items" property.
Each item is an instance of the "_BooleanItem" class with the
following attributes:

+----------------------+----------------------+--------------------------------------------------------------+
| Attribute            | Type                 | Description                                                  |
|======================|======================|==============================================================|
| "value"              | bool                 | The extracted boolean value (True or False)                  |
+----------------------+----------------------+--------------------------------------------------------------+
| "justification"      | str                  | Explanation of why this boolean value was determined (only   |
|                      |                      | if "add_justifications=True")                                |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_paragrap  | list["Paragraph"]    | List of paragraph objects that influenced the boolean        |
| hs"                  |                      | determination (only if "add_references=True")                |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_sentence  | list["Sentence"]     | List of sentence objects that influenced the boolean         |
| s"                   |                      | determination (only if "add_references=True" and             |
|                      |                      | "reference_depth="sentences"")                               |
+----------------------+----------------------+--------------------------------------------------------------+
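
Because "justification" and the reference attributes are only
populated when the corresponding options are enabled, code that
consumes items may want to treat them as optional. The following is a
sketch under that assumption. "describe_item" is a hypothetical
helper, and "SimpleNamespace" objects stand in for real items and
sentences:

```python
from types import SimpleNamespace


def describe_item(item):
    """Build a printable summary from the attributes listed in the table above."""
    lines = [f"Value: {item.value}"]
    # Treat a missing, None, or empty justification as "not provided".
    justification = getattr(item, "justification", None)
    if justification:
        lines.append(f"Justification: {justification}")
    # Same for sentence-level references.
    for sent in getattr(item, "reference_sentences", None) or []:
        lines.append(f"Reference: {sent.raw_text}")
    return "\n".join(lines)


full_item = SimpleNamespace(
    value=True,
    justification="The policy requires encryption at rest and in transit.",
    reference_sentences=[
        SimpleNamespace(raw_text="All customer data must be encrypted at rest and in transit.")
    ],
)
bare_item = SimpleNamespace(value=False)

print(describe_item(full_item))
print(describe_item(bare_item))
```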


💡 Best Practices
=================

Here are some best practices to optimize your use of "BooleanConcept":

* Provide a clear and specific description that tells the LLM exactly
  what condition to evaluate, and use precise, unambiguous language in
  your concept names and descriptions. Since boolean concepts yield
  true/false values, describe the criteria for making the
  determination (e.g., *"whether the document mentions specific
  compliance requirements"* rather than just *"compliance
  requirements"*). Avoid vague terms that could be interpreted in
  multiple ways. For example, use *"contains legally binding
  obligations"* instead of *"contains important content"* to ensure
  consistent and accurate determinations.

* Break down complex conditions into multiple simpler boolean concepts
  when appropriate. Instead of one concept checking *"document is
  complete and compliant and approved,"* consider separate concepts
  for each condition. This provides more granular insights and makes
  it easier to identify specific issues when any condition fails.

* Enable justifications (using "add_justifications=True") when you
  need to understand the reasoning behind the LLM's true/false
  determination.

* Enable references (using "add_references=True") when you need to
  trace back to specific parts of the document that influenced the
  boolean decision or verify the evidence used to make the
  determination.

* Use "singular_occurrence=True" to enforce only a single boolean
  determination for the entire document. This is particularly useful
  for concepts that should yield a single true/false answer, such as
  *"contains confidential information"* or *"is compliant with
  regulations,"* rather than identifying multiple instances where the
  condition might be true or false throughout the document.
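
The advice on splitting compound conditions can be sketched as
follows. The dictionaries hold only the documented "name" and
"description" parameters and are meant to be passed as keyword
arguments to "BooleanConcept". The wording of each description and the
"results" mapping are illustrative stand-ins, not output from a real
extraction:

```python
# Hypothetical concept specs: one narrow condition per concept.
concept_specs = [
    {"name": "Is complete", "description": "Whether the document contains all required sections"},
    {"name": "Is compliant", "description": "Whether the document meets the stated regulatory requirements"},
    {"name": "Is approved", "description": "Whether the document records a formal approval"},
]

# Stand-in results keyed by concept name; real values would be read from
# each concept's extracted_items[0].value after extraction.
results = {"Is complete": True, "Is compliant": False, "Is approved": True}

# Separate concepts make it easy to report exactly which condition failed.
failed = [spec["name"] for spec in concept_specs if not results[spec["name"]]]
overall = not failed

print(overall)  # False
print(failed)   # ['Is compliant']
```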


# ==== concepts/numerical_concept ====

NumericalConcept
****************

"NumericalConcept" is a specialized concept type that extracts,
calculates, or derives numerical values (integers, floats, or both)
from document content.


📝 Overview
===========

"NumericalConcept" enables powerful numerical data extraction and
analysis from documents, such as:

* **Direct extraction**: retrieving explicitly stated values like
  prices, percentages, dates, or measurements

* **Calculated values**: computing sums, averages, growth rates, or
  other derived metrics

* **Quantitative assessments**: determining counts, frequencies,
  totals, or numerical scores

The concept can work with integers, floating-point numbers, or both
types based on your configuration.


💻 Usage Example
================

Here's a simple example of how to use "NumericalConcept" to extract a
price from a document:

   # ContextGem: NumericalConcept Extraction

   import os

   from contextgem import Document, DocumentLLM, NumericalConcept


   # Create a Document object from text
   doc = Document(
       raw_text="The latest smartphone model costs $899.99 and will be available next week."
   )

   # Define a NumericalConcept to extract the price
   price_concept = NumericalConcept(
       name="Product price",
       description="The price of the product",
       numeric_type="float",  # We expect a decimal price
   )

   # Attach the concept to the document
   doc.add_concepts([price_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   price_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted value
   print(price_concept.extracted_items[0].value)  # Output: 899.99
   # Or access the extracted value from the document object
   print(doc.concepts[0].extracted_items[0].value)  # Output: 899.99


⚙️ Parameters
=============

When creating a "NumericalConcept", you can specify the following
parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what numerical value to     |
|                      |                 |                 | extract, which can include explicit values to      |
|                      |                 |                 | find, calculations to perform, or quantitative     |
|                      |                 |                 | assessments to derive from the document content    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "numeric_type"       | "str"           | ""any""         | The type of numerical values to extract. Available |
|                      |                 |                 | values: ""int"", ""float"", ""any"". When ""any""  |
|                      |                 |                 | is specified, the system will automatically        |
|                      |                 |                 | determine whether to use an integer or floating-   |
|                      |                 |                 | point representation based on the extracted value, |
|                      |                 |                 | choosing the most appropriate type for each        |
|                      |                 |                 | numerical item.                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for extracting the |
|                      |                 | xt""            | concept. Available values: ""extractor_text"",     |
|                      |                 |                 | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". For more details, see 🏷️  |
|                      |                 |                 | LLM Roles.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether to include justifications for extracted    |
|                      |                 |                 | items. Justifications provide explanations of why  |
|                      |                 |                 | the LLM extracted specific numerical values and    |
|                      |                 |                 | the reasoning behind the extraction, which is      |
|                      |                 |                 | especially useful for complex calculations,        |
|                      |                 |                 | inferred values, or when debugging results.        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | Justification detail level. Available values:      |
| h"                   |                 |                 | ""brief"", ""balanced"", ""comprehensive"".        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum sentences in a justification.              |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_references"     | "bool"          | "False"         | Whether to include source references for extracted |
|                      |                 |                 | items. References indicate the specific locations  |
|                      |                 |                 | in the document where the numerical values were    |
|                      |                 |                 | either directly found or from which they were      |
|                      |                 |                 | calculated or inferred, helping to trace back      |
|                      |                 |                 | extracted values to their source content even when |
|                      |                 |                 | the extraction involves complex calculations or    |
|                      |                 |                 | mathematical reasoning.                            |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | Source reference granularity. Available values:    |
|                      |                 |                 | ""paragraphs"", ""sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "singular_occurrenc  | "bool"          | "False"         | Whether this concept is restricted to having only  |
| e"                   |                 |                 | one extracted item. If "True", only a single       |
|                      |                 |                 | numerical value will be extracted. For numerical   |
|                      |                 |                 | concepts, this parameter is particularly useful    |
|                      |                 |                 | when you want to extract a single specific value   |
|                      |                 |                 | rather than identifying multiple numerical values  |
|                      |                 |                 | throughout the document. This helps distinguish    |
|                      |                 |                 | between single-value concepts versus multi-value   |
|                      |                 |                 | concepts (e.g., *"total contract value"* vs *"all  |
|                      |                 |                 | payment amounts"*). Note that with advanced LLMs,  |
|                      |                 |                 | this constraint may not be required as they can    |
|                      |                 |                 | often infer the appropriate number of items to     |
|                      |                 |                 | extract from the concept's name, description, and  |
|                      |                 |                 | type.                                              |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "custom_data"        | "dict"          | "{}"            | Optional. Dictionary for storing any additional    |
|                      |                 |                 | data that you want to associate with the concept.  |
|                      |                 |                 | This data must be JSON-serializable. This data is  |
|                      |                 |                 | not used for extraction but can be useful for      |
|                      |                 |                 | custom processing or downstream tasks.             |
+----------------------+-----------------+-----------------+----------------------------------------------------+


🚀 Advanced Usage
=================


🔍 References and Justifications for Extraction
-----------------------------------------------

You can configure a "NumericalConcept" to include justifications and
references. Justifications explain the reasoning behind the extracted
values. References point to the parts of the document where the values
were found or from which they were calculated or inferred, so you can
trace a result back to its source even when the extraction involves
calculations:

   # ContextGem: NumericalConcept Extraction with References and Justifications

   import os

   from contextgem import Document, DocumentLLM, NumericalConcept


   # Document with values that require calculation/inference
   report_text = """
   Quarterly Sales Report - Q2 2023

   Product A: Sold 450 units at $75 each
   Product B: Sold 320 units at $125 each
   Product C: Sold 180 units at $95 each

   Marketing expenses: $28,500
   Operating costs: $42,700
   """

   # Create a Document from the text
   doc = Document(raw_text=report_text)

   # Create a NumericalConcept for total revenue
   total_revenue_concept = NumericalConcept(
       name="Total quarterly revenue",
       description="The total revenue calculated by multiplying units sold by their price",
       add_justifications=True,
       justification_depth="comprehensive",  # Detailed justification to show calculation steps
       justification_max_sents=4,  # Maximum number of sentences for justification
       add_references=True,
       reference_depth="paragraphs",  # Reference specific paragraphs
       singular_occurrence=True,  # Ensure that the data is merged into a single item
   )

   # Attach the concept to the document
   doc.add_concepts([total_revenue_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/o4-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept
   total_revenue_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the inferred value with its justification and references
   print("Calculated total quarterly revenue:")
   for item in total_revenue_concept.extracted_items:
       print(f"\nTotal Revenue: {item.value}")
       print(f"Calculation Justification: {item.justification}")
       print("Source references:")
       for para in item.reference_paragraphs:
           print(f"- {para.raw_text}")
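
To sanity-check the value an LLM returns for this report, the expected
total can be computed by hand from the figures in the sample text.
This is plain arithmetic, independent of ContextGem, and it assumes
that revenue means units sold times unit price, as the concept
description states:

```python
# (units sold, unit price in USD) taken from the sample report text
sales = {
    "Product A": (450, 75),
    "Product B": (320, 125),
    "Product C": (180, 95),
}

revenue_by_product = {name: units * price for name, (units, price) in sales.items()}
expected_total = sum(revenue_by_product.values())

print(revenue_by_product)  # {'Product A': 33750, 'Product B': 40000, 'Product C': 17100}
print(expected_total)      # 90850
```

Marketing expenses and operating costs are not part of revenue, so
they are left out of the sum.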


📊 Extracted Items
==================

When a "NumericalConcept" is extracted, it is populated with **a list
of extracted items** accessible through the ".extracted_items"
property. Each item is an instance of the "_NumericalItem" class with
the following attributes:

+----------------------+----------------------+--------------------------------------------------------------+
| Attribute            | Type                 | Description                                                  |
|======================|======================|==============================================================|
| "value"              | int or float         | The extracted numerical value, either an integer or          |
|                      |                      | floating-point number depending on the "numeric_type"        |
|                      |                      | setting                                                      |
+----------------------+----------------------+--------------------------------------------------------------+
| "justification"      | str                  | Explanation of why this numerical value was extracted (only  |
|                      |                      | if "add_justifications=True")                                |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_paragrap  | list["Paragraph"]    | List of paragraph objects where the numerical value was      |
| hs"                  |                      | found or from which it was calculated or inferred (only if   |
|                      |                      | "add_references=True")                                       |
+----------------------+----------------------+--------------------------------------------------------------+
| "reference_sentence  | list["Sentence"]     | List of sentence objects where the numerical value was found |
| s"                   |                      | or from which it was calculated or inferred (only if         |
|                      |                      | "add_references=True" and "reference_depth="sentences"")     |
+----------------------+----------------------+--------------------------------------------------------------+


💡 Best Practices
=================

Here are some best practices to optimize your use of
"NumericalConcept":

* Provide a clear and specific description that helps the LLM
  understand exactly what numerical values to extract, using precise
  and unambiguous language in your concept names and descriptions. For
  numerical concepts, be explicit about the exact values you're
  seeking (e.g., *"the total contract value in USD"* rather than just
  *"contract value"*). Avoid vague terms that could lead to incorrect
  extractions—for example, use *"quarterly revenue figures in
  millions"* instead of *"revenue numbers"* to ensure consistent and
  accurate extractions.

* Use the appropriate "numeric_type" based on what you expect to
  extract or calculate:

  * Use ""int"" for counts, quantities, or whole numbers

  * Use ""float"" for prices, measurements, or values that may have
    decimal points

  * Use ""any"" when you're not sure or need to extract both types

* Break down complex numerical extractions into multiple simpler
  numerical concepts when appropriate. Instead of one concept
  extracting *"all financial metrics,"* consider separate concepts for
  *"revenue figures,"* *"expense amounts,"* and *"profit margins."*
  This provides more structured data and makes it easier to process
  the results for specific purposes.

* Enable justifications (using "add_justifications=True") when you
  need to understand the reasoning behind the LLM's numerical
  extractions, especially when calculations or conversions are
  involved.

* Enable references (using "add_references=True") when you need to
  trace back to specific parts of the document that contained the
  numerical values or were used to calculate derived values.

* Use "singular_occurrence=True" to restrict extraction to a single
  numerical value. This is particularly useful for concepts that
  should yield a unique value, such as *"total contract value"* or
  *"effective interest rate,"* rather than identifying multiple
  numerical values throughout the document.
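
The "numeric_type" guidance above can also be mirrored in downstream
code that checks extracted values before using them. This is a sketch
only: "matches_numeric_type" is a hypothetical helper, and it does not
describe how ContextGem validates values internally:

```python
def matches_numeric_type(value, numeric_type="any"):
    """Check a value against the "int", "float", or "any" setting."""
    if isinstance(value, bool):  # bool is a subclass of int in Python; reject it
        return False
    if numeric_type == "int":
        return isinstance(value, int)
    if numeric_type == "float":
        return isinstance(value, float)
    if numeric_type == "any":
        return isinstance(value, (int, float))
    raise ValueError(f"Unknown numeric_type: {numeric_type!r}")


print(matches_numeric_type(42, "int"))        # True
print(matches_numeric_type(899.99, "float"))  # True
print(matches_numeric_type(899.99, "int"))    # False
print(matches_numeric_type(7, "any"))         # True
```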


# ==== concepts/date_concept ====

DateConcept
***********

"DateConcept" is a specialized concept type that extracts, interprets,
and processes date information from documents, returning standardized
"datetime.date" objects.


📝 Overview
===========

"DateConcept" is used when you need to extract date information from
documents, allowing you to:

* **Extract explicit dates**: Identify dates that are directly
  mentioned in various formats (e.g., "January 15, 2025",
  "15/01/2025", "2025-01-15")

* **Infer implicit dates**: Deduce dates from contextual information
  (e.g., "next Monday", "two weeks from signing", "the following
  quarter")

* **Calculate derived dates**: Determine dates based on other temporal
  references (e.g., "30 days after delivery", "the fiscal year
  ending")

* **Normalize date representations**: Convert various date formats
  into standardized Python "datetime.date" objects for consistent
  processing

This concept type is particularly valuable for extracting temporal
information from documents such as:

* Contract effective dates, expiration dates, and renewal periods

* Report publication dates and data collection periods

* Event scheduling information and deadline specifications

* Historical dates and chronological sequences
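
To make the normalization target concrete, the following
standard-library sketch parses the three explicit formats listed above
into the same "datetime.date" value. It only illustrates the output
type and is not how ContextGem works internally, since the framework
relies on an LLM rather than fixed format strings. The "FORMATS" list
and the "parse_explicit_date" helper are hypothetical:

```python
from datetime import date, datetime

# Hypothetical format list covering the three explicit examples above.
# Note: "%B" matches English month names under the default C locale.
FORMATS = ["%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]


def parse_explicit_date(text: str) -> date:
    """Try each known format and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


samples = ["January 15, 2025", "15/01/2025", "2025-01-15"]
parsed = [parse_explicit_date(s) for s in samples]

# All three spellings normalize to the same date object
print(parsed)
```

Implicit and derived dates (such as "two weeks from signing") cannot be
handled with format strings like these, which is where LLM-based
extraction is needed.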


💻 Usage Example
================

Here's a simple example of how to use "DateConcept" to extract a
publication date from a document:

   # ContextGem: DateConcept Extraction

   import os

   from contextgem import DateConcept, Document, DocumentLLM


   # Create a Document object from text
   doc = Document(
       raw_text="The research paper was published on March 15, 2025 and has been cited 42 times since."
   )

   # Define a DateConcept to extract the publication date
   date_concept = DateConcept(
       name="Publication date",
       description="The date when the paper was published",
   )

   # Attach the concept to the document
   doc.add_concepts([date_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   date_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted value
   print(
       type(date_concept.extracted_items[0].value), date_concept.extracted_items[0].value
   )
   # Output: <class 'datetime.date'> 2025-03-15

   # Or access the extracted value from the document object
   print(
       type(doc.concepts[0].extracted_items[0].value),
       doc.concepts[0].extracted_items[0].value,
   )
   # Output: <class 'datetime.date'> 2025-03-15


⚙️ Parameters
=============

When creating a "DateConcept", you can specify the following
parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what date information to    |
|                      |                 |                 | extract, which can include explicit dates to find, |
|                      |                 |                 | implicit dates to infer, or temporal relationships |
|                      |                 |                 | to identify. For date concepts, be specific about  |
|                      |                 |                 | the exact date information sought (e.g., *"the     |
|                      |                 |                 | contract signing date"* rather than just *"dates   |
|                      |                 |                 | in the document"*) to ensure consistent and        |
|                      |                 |                 | accurate extractions.                              |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for extracting the |
|                      |                 | xt""            | concept. Available values: ""extractor_text"",     |
|                      |                 |                 | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". For more details, see 🏷️  |
|                      |                 |                 | LLM Roles.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether to include justifications for extracted    |
|                      |                 |                 | items. Justifications provide explanations of why  |
|                      |                 |                 | specific dates were extracted, which is especially |
|                      |                 |                 | valuable when dates are inferred from contextual   |
|                      |                 |                 | clues (e.g., *"next quarter"* or *"30 days after   |
|                      |                 |                 | signing"*) or when resolving ambiguous date        |
|                      |                 |                 | references in the document.                        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | Justification detail level. Available values:      |
| h"                   |                 |                 | ""brief"", ""balanced"", ""comprehensive"".        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum sentences in a justification.              |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_references"     | "bool"          | "False"         | Whether to include source references for extracted |
|                      |                 |                 | items. References indicate the specific locations  |
|                      |                 |                 | in the document where date information was found,  |
|                      |                 |                 | derived, or inferred from. This is particularly    |
|                      |                 |                 | useful for tracing dates back to their original    |
|                      |                 |                 | context, understanding how relative dates were     |
|                      |                 |                 | calculated (e.g., *"30 days after delivery"*), or  |
|                      |                 |                 | verifying how the system resolved ambiguous        |
|                      |                 |                 | temporal references (e.g., *"next fiscal year"*).  |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | Source reference granularity. Available values:    |
|                      |                 |                 | ""paragraphs"", ""sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "singular_occurrenc  | "bool"          | "False"         | Whether this concept is restricted to having only  |
| e"                   |                 |                 | one extracted item. If "True", only a single date  |
|                      |                 |                 | will be extracted. For date concepts, this         |
|                      |                 |                 | parameter is particularly useful when you want to  |
|                      |                 |                 | extract a specific, unique date in the document    |
|                      |                 |                 | (e.g., *"publication date"* or *"contract signing  |
|                      |                 |                 | date"*) rather than identifying multiple dates     |
|                      |                 |                 | throughout the document. Note that with advanced   |
|                      |                 |                 | LLMs, this constraint may not be required as they  |
|                      |                 |                 | can often infer the appropriate cardinality from   |
|                      |                 |                 | the concept's name, description, and type.         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "custom_data"        | "dict"          | "{}"            | Optional. Dictionary for storing any additional    |
|                      |                 |                 | data that you want to associate with the concept.  |
|                      |                 |                 | This data must be JSON-serializable. This data is  |
|                      |                 |                 | not used for extraction but can be useful for      |
|                      |                 |                 | custom processing or downstream tasks.             |
+----------------------+-----------------+-----------------+----------------------------------------------------+
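
Because "custom_data" must be JSON-serializable, it can help to check a
dictionary before passing it to the concept. The sketch below uses only
the standard library; "is_json_serializable" is a hypothetical helper,
and ContextGem's own validation may behave differently:

```python
import json
from datetime import date


def is_json_serializable(data: dict) -> bool:
    """Return True if json.dumps can encode the dictionary."""
    try:
        json.dumps(data)
        return True
    except (TypeError, ValueError):
        return False


ok_data = {"category": "contracts", "priority": 1}
# date objects are not JSON types, so this dictionary is rejected
bad_data = {"review_deadline": date(2025, 1, 15)}
# converting the date to an ISO string makes it serializable
fixed_data = {"review_deadline": date(2025, 1, 15).isoformat()}

print(is_json_serializable(ok_data))
print(is_json_serializable(bad_data))
print(is_json_serializable(fixed_data))
```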


🚀 Advanced Usage
=================


🔍 References and Justifications for Extraction
-----------------------------------------------

You can configure a "DateConcept" to include justifications and
references. Justifications explain the reasoning behind extracted
dates, especially for complex or inferred temporal information (such
as dates derived from expressions like *"30 days after delivery"* or
*"next fiscal year"*), while references point to the specific parts of
the document that contain the date information or from which it was
inferred:

   # ContextGem: DateConcept Extraction with References and Justifications

   import os

   from contextgem import DateConcept, Document, DocumentLLM


   # Sample document text containing project timeline information
   project_text = """
   Project Timeline: Website Redesign

   The website redesign project officially kicked off on March 1, 2024.
   The development team has estimated the project will take 4 months to complete.

   Key milestones:
   - Design phase: 1 month
   - Development phase: 2 months  
   - Testing and deployment: 1 month

   The marketing team needs the final completion date to plan the launch campaign.
   """

   # Create a Document from the text
   doc = Document(raw_text=project_text)

   # Create a DateConcept to calculate the project completion date
   completion_date_concept = DateConcept(
       name="Project completion date",
       description="The final completion date for the website redesign project",
       add_justifications=True,  # enable justifications to understand extraction logic
       justification_depth="balanced",
       justification_max_sents=3,  # allow up to 3 sentences for the calculation justification
       add_references=True,  # include references to source text
       reference_depth="sentences",  # reference specific sentences rather than paragraphs
       singular_occurrence=True,  # extract only one calculated date
   )

   # Attach the concept to the document
   doc.add_concepts([completion_date_concept])

   # Configure DocumentLLM
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept
   completion_date_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the calculated completion date with justification and references
   print("Calculated project completion date:")
   # Get the single calculated date
   extracted_item = completion_date_concept.extracted_items[0]
   print(f"\nCompletion Date: {extracted_item.value}")  # expected output: 2024-07-01
   print(f"Calculation Justification: {extracted_item.justification}")
   print("Source references used for calculation:")
   for sent in extracted_item.reference_sentences:
       print(f"- {sent.raw_text}")
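
The expected value noted in the example can be checked by hand: the
milestones add up to 1 + 2 + 1 = 4 months, and four calendar months
after March 1, 2024 is July 1, 2024. A minimal standard-library sketch
of that arithmetic follows. "add_months" is a hypothetical helper, not
part of ContextGem, and the LLM's actual answer may differ depending on
how it interprets the timeline:

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


kickoff = date(2024, 3, 1)
milestone_months = [1, 2, 1]  # design, development, testing and deployment
completion = add_months(kickoff, sum(milestone_months))
print(completion)  # 2024-07-01
```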


📊 Extracted Items
==================

When a "DateConcept" is extracted, it is populated with **a list of
extracted items** accessible through the ".extracted_items" property.
Each item is an instance of the "_DateItem" class with the following
attributes:

+------------------------+----------------------+--------------------------------------------------------------+
| Attribute              | Type                 | Description                                                  |
|========================|======================|==============================================================|
| "value"                | datetime.date        | The extracted date as a Python "datetime.date" object        |
+------------------------+----------------------+--------------------------------------------------------------+
| "justification"        | str                  | Explanation of why this date was extracted (only if          |
|                        |                      | "add_justifications=True")                                   |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_paragraphs" | list["Paragraph"]    | List of paragraph objects where the date was found or from   |
|                        |                      | which it was calculated, derived, or inferred (only if       |
|                        |                      | "add_references=True")                                       |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_sentences"  | list["Sentence"]     | List of sentence objects where the date was found or from    |
|                        |                      | which it was calculated, derived, or inferred (only if       |
|                        |                      | "add_references=True" and "reference_depth="sentences"")     |
+------------------------+----------------------+--------------------------------------------------------------+


💡 Best Practices
=================

Here are some best practices to optimize your use of "DateConcept":

* Provide a clear and specific description that helps the LLM
  understand exactly what date to extract, using precise and
  unambiguous language (e.g., *"contract signing date"* rather than
  just *"date"*).

* For dates that require interpretation or calculation (like *"30 days
  after delivery"* or *"end of next fiscal year"*), include these
  requirements explicitly in your description to ensure the LLM
  performs the necessary temporal reasoning.

* Break down complex date extractions into multiple simpler date
  concepts when appropriate. Instead of one concept extracting *"all
  contract dates,"* consider separate concepts for *"contract signing
  date,"* *"effective date,"* and *"termination date."*

* Enable justifications (using "add_justifications=True") when you
  need to understand the reasoning behind date calculations or
  extractions, especially for relative or inferred dates.

* Enable references (using "add_references=True") when you need to
  trace extracted dates back to the specific parts of the document
  that contain the original date information or from which the dates
  were calculated (e.g., deriving a project completion date from a
  start date plus duration information).

* Use "singular_occurrence=True" to restrict extraction to a single
  date. This is particularly useful for concepts that should yield a
  unique calculated date, such as *"project completion deadline"*,
  where multiple timeline elements need to be synthesized into a
  single target date, or when multiple date mentions actually refer
  to the same event.

* Leverage the returned Python "datetime.date" objects for direct
  integration with date-based calculations, comparisons, or formatting
  in your application logic.
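
As an illustration of the last point, the sketch below uses a
hard-coded "datetime.date" as a stand-in for an extracted item's value
(the date from the usage example) and applies ordinary standard-library
operations to it. The variable names and the 30-day follow-up rule are
assumptions made for the example:

```python
from datetime import date, timedelta

# Stand-in for an extracted item's value (the date from the usage example)
published = date(2025, 3, 15)

# Date arithmetic: a follow-up deadline 30 days after publication
review_due = published + timedelta(days=30)

# Comparison against a fixed cutoff
before_q2 = published < date(2025, 4, 1)

# Formatting for storage and display
iso_text = published.isoformat()
display_text = published.strftime("%d %B %Y")

print(review_due, before_q2, iso_text, display_text)
# 2025-04-14 True 2025-03-15 15 March 2025
```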


# ==== concepts/rating_concept ====

RatingConcept
*************

"RatingConcept" is a specialized concept type that calculates, infers,
and derives rating values from documents within a clearly defined
numerical scale.


📝 Overview
===========

"RatingConcept" enables sophisticated rating analysis from documents,
allowing you to:

* **Derive implicit ratings**: Calculate ratings based on sentiment
  analysis, key criteria, or contextual evaluation

* **Generate evaluative scores**: Produce numerical assessments that
  quantify quality, relevance, or performance

* **Normalize diverse signals**: Convert qualitative assessments into
  consistent numerical ratings within your defined scale

* **Synthesize overall scores**: Combine multiple factors or opinions
  into comprehensive rating assessments

This concept type is particularly valuable for generating evaluative
information from documents such as:

* Product and service reviews where sentiment must be quantified on a
  standardized scale

* Performance assessments requiring numerical quality or satisfaction
  scoring

* Risk evaluations needing severity or probability measurements

* Content analyses where subjective characteristics must be rated
  objectively


💻 Usage Example
================

Here's a simple example of how to use "RatingConcept" to derive a
product quality rating from a description that contains no explicit
rating:

   # ContextGem: RatingConcept Extraction

   import os

   from contextgem import Document, DocumentLLM, RatingConcept


   # Create a Document object from text describing a product without an explicit rating
   smartphone_description = (
       "This smartphone features a 5000mAh battery that lasts all day with heavy use. "
       "The display is 6.7 inch AMOLED with 120Hz refresh rate. "
       "Camera system includes a 50MP main sensor, 12MP ultrawide, and 8MP telephoto lens. "
       "The phone runs on the latest processor with 8GB RAM and 256GB storage. "
       "It has IP68 water resistance and Gorilla Glass Victus protection."
   )

   doc = Document(raw_text=smartphone_description)

   # Define a RatingConcept that requires analysis to determine a rating
   product_quality = RatingConcept(
       name="Product Quality Rating",
       description=(
           "Evaluate the overall quality of the smartphone based on its specifications, "
           "features, and adherence to industry best practices"
       ),
       rating_scale=(1, 10),
       add_justifications=True,  # include justification for the rating
       justification_depth="balanced",
       justification_max_sents=5,
   )

   # Attach the concept to the document
   doc.add_concepts([product_quality])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document - the LLM will analyze and assign a rating
   product_quality = llm.extract_concepts_from_document(doc)[0]

   # Print the calculated rating
   print(f"Quality Rating: {product_quality.extracted_items[0].value}")
   # Print the justification
   print(f"Justification: {product_quality.extracted_items[0].justification}")


⚙️ Parameters
=============

When creating a "RatingConcept", you can specify the following
parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what should be evaluated    |
|                      |                 |                 | and rated, including the criteria for assigning    |
|                      |                 |                 | different values within the rating scale (e.g.,    |
|                      |                 |                 | "Evaluate product quality based on features,       |
|                      |                 |                 | durability, and performance where 1 represents     |
|                      |                 |                 | poor quality and 10 represents exceptional         |
|                      |                 |                 | quality"). The more specific the description, the  |
|                      |                 |                 | more consistent and accurate the ratings will be.  |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "rating_scale"       | "tuple[int,     | (Required)      | Defines the boundaries for valid ratings as a      |
|                      | int]"           |                 | tuple of (start, end) values (e.g., "(1, 5)" for a |
|                      |                 |                 | 1-5 star rating, or "(0, 100)" for a percentage-   |
|                      |                 |                 | based evaluation). This parameter establishes the  |
|                      |                 |                 | numerical range within which all ratings must      |
|                      |                 |                 | fall, ensuring consistency across evaluations.     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for extracting the |
|                      |                 | xt""            | concept. Available values: ""extractor_text"",     |
|                      |                 |                 | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". For more details, see 🏷️  |
|                      |                 |                 | LLM Roles.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether to include justifications for extracted    |
|                      |                 |                 | items. Justifications provide explanations of why  |
|                      |                 |                 | the LLM assigned specific rating values and the    |
|                      |                 |                 | reasoning behind the evaluation, which is          |
|                      |                 |                 | especially useful for understanding the factors    |
|                      |                 |                 | that influenced the rating. For example, a         |
|                      |                 |                 | justification might explain that a smartphone      |
|                      |                 |                 | received an 8/10 quality rating based on its       |
|                      |                 |                 | premium build materials, advanced camera system,   |
|                      |                 |                 | and long battery life, despite lacking expandable  |
|                      |                 |                 | storage.                                           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | Justification detail level. Available values:      |
| h"                   |                 |                 | ""brief"", ""balanced"", ""comprehensive"".        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum sentences in a justification.              |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_references"     | "bool"          | "False"         | Whether to include source references for extracted |
|                      |                 |                 | items. References indicate the specific locations  |
|                      |                 |                 | in the document that provided information or       |
|                      |                 |                 | evidence used to determine the rating. This is     |
|                      |                 |                 | particularly useful for understanding which parts  |
|                      |                 |                 | of the document influenced the rating assessment,  |
|                      |                 |                 | allowing you to trace evaluations back to relevant |
|                      |                 |                 | content that supports the numerical value          |
|                      |                 |                 | assigned.                                          |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | Source reference granularity. Available values:    |
|                      |                 |                 | ""paragraphs"", ""sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "singular_occurrenc  | "bool"          | "False"         | Whether this concept is restricted to having only  |
| e"                   |                 |                 | one extracted item. If "True", only a single       |
|                      |                 |                 | rating will be extracted. For rating concepts,     |
|                      |                 |                 | this parameter is particularly useful when you     |
|                      |                 |                 | want to extract a single overall score (e.g.,      |
|                      |                 |                 | *"overall product quality"*) rather than           |
|                      |                 |                 | identifying multiple ratings throughout the        |
|                      |                 |                 | document for different aspects or features. This   |
|                      |                 |                 | helps distinguish between a global evaluation      |
|                      |                 |                 | versus component-specific ratings. Note that with  |
|                      |                 |                 | advanced LLMs, this constraint may not be required |
|                      |                 |                 | as they can often infer the appropriate number of  |
|                      |                 |                 | ratings to extract from the concept's name,        |
|                      |                 |                 | description, and rating context.                   |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "custom_data"        | "dict"          | "{}"            | Optional. Dictionary for storing any additional    |
|                      |                 |                 | data that you want to associate with the concept.  |
|                      |                 |                 | This data must be JSON-serializable. This data is  |
|                      |                 |                 | not used for extraction but can be useful for      |
|                      |                 |                 | custom processing or downstream tasks.             |
+----------------------+-----------------+-----------------+----------------------------------------------------+
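
Since "rating_scale" fixes the range in which every rating must fall, a
downstream sanity check and a conversion between scales are both easy
to express. The sketch below is standard-library Python with two
hypothetical helpers, "is_within_scale" and "normalize". It assumes
both bounds are inclusive and is independent of any validation that
ContextGem itself performs:

```python
def is_within_scale(value: int, rating_scale: tuple[int, int]) -> bool:
    """Check that a rating lies inside the (start, end) range, bounds included."""
    start, end = rating_scale
    return start <= value <= end


def normalize(value: int, rating_scale: tuple[int, int]) -> float:
    """Map a rating onto 0.0-1.0 so ratings on different scales can be compared."""
    start, end = rating_scale
    if not is_within_scale(value, rating_scale):
        raise ValueError(f"{value} is outside the scale {rating_scale}")
    return (value - start) / (end - start)


print(is_within_scale(8, (1, 10)))  # True
print(normalize(4, (1, 5)))  # 0.75
print(normalize(75, (0, 100)))  # 0.75
```

Normalizing this way makes a 4 on a 1-5 scale directly comparable to a
75 on a 0-100 scale.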


🚀 Advanced Usage
=================


🔍 References and Justifications for Extraction
-----------------------------------------------

When extracting a "RatingConcept", it's often useful to include
justifications that explain the reasoning behind the score and
references that point to the supporting text:

   # ContextGem: RatingConcept Extraction with References and Justifications

   import os

   from contextgem import Document, DocumentLLM, RatingConcept


   # Sample document text about a software product with various aspects
   software_review = """
   Software Review: ProjectManager Pro 5.0

   User Interface: The interface is clean and modern, with intuitive navigation. New users can quickly find what they need without extensive training. The dashboard provides a comprehensive overview of project status.

   Performance: The application loads quickly even with large projects. Resource-intensive operations like generating reports occasionally cause minor lag on older systems. The mobile app performs exceptionally well, even on limited bandwidth.

   Features: Project templates are well-designed and cover most common project types. Task dependencies are easily managed, and the Gantt chart visualization is excellent. However, the software lacks advanced risk management tools that competitors offer.

   Support: The documentation is comprehensive and well-organized. Customer service response time averages 4 hours, which is acceptable but not industry-leading. The knowledge base needs more video tutorials.
   """

   # Create a Document from the text
   doc = Document(raw_text=software_review)

   # Create a RatingConcept with justifications and references enabled
   usability_rating_concept = RatingConcept(
       name="Software usability rating",
       description="Evaluate the overall usability of the software on a scale of 1-10 based on UI design, intuitiveness, and learning curve",
       rating_scale=(1, 10),
       add_justifications=True,  # enable justifications to explain the rating
       justification_depth="comprehensive",  # provide detailed reasoning
       justification_max_sents=5,  # allow up to 5 sentences for justification
       add_references=True,  # include references to source text
       reference_depth="sentences",  # reference specific sentences rather than paragraphs
   )

   # Attach the concept to the document
   doc.add_concepts([usability_rating_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept
   usability_rating_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted rating item with justification and references
   extracted_item = usability_rating_concept.extracted_items[0]
   print(f"Software Usability Rating: {extracted_item.value}/10")
   print(f"\nJustification: {extracted_item.justification}")
   print("\nSource references:")
   for sent in extracted_item.reference_sentences:
       print(f"- {sent.raw_text}")


⭐⭐ Multiple Rating Categories
-------------------------------

You can extract multiple rating categories from a document by creating
separate rating concepts:

   # ContextGem: Multiple RatingConcept Extraction

   import os

   from contextgem import Document, DocumentLLM, RatingConcept


   # Sample document text about a restaurant review with multiple quality aspects to rate
   restaurant_review = """
   Restaurant Review: Bella Cucina

   Atmosphere: The restaurant has a warm, inviting ambiance with soft lighting and comfortable seating. The décor is elegant without being pretentious, and the noise level allows for easy conversation.

   Food Quality: The ingredients were fresh and high-quality. The pasta was perfectly cooked al dente, and the sauces were flavorful and well-balanced. The seafood dish had slightly overcooked shrimp, but the fish was excellent.

   Service: Our server was knowledgeable about the menu and wine list. Water glasses were kept filled, and plates were cleared promptly. However, there was a noticeable delay between appetizers and main courses.

   Value: Portion sizes were generous for the price point. The wine list offers selections at various price points, though markup is slightly higher than average for comparable restaurants in the area.
   """

   # Create a Document from the text
   doc = Document(raw_text=restaurant_review)

   # Define a consistent rating scale to be used across all rating categories
   restaurant_rating_scale = (1, 5)

   # Define multiple rating concepts for different quality aspects of the restaurant
   atmosphere_rating = RatingConcept(
       name="Atmosphere Rating",
       description="Rate the restaurant's atmosphere and ambiance",
       rating_scale=restaurant_rating_scale,
   )

   food_rating = RatingConcept(
       name="Food Quality Rating",
       description="Rate the quality, preparation, and taste of the food",
       rating_scale=restaurant_rating_scale,
   )

   service_rating = RatingConcept(
       name="Service Rating",
       description="Rate the efficiency, knowledge, and attentiveness of the service",
       rating_scale=restaurant_rating_scale,
   )

   value_rating = RatingConcept(
       name="Value Rating",
       description="Rate the value for money considering portion sizes and pricing",
       rating_scale=restaurant_rating_scale,
   )

   # Attach all concepts to the document
   doc.add_concepts([atmosphere_rating, food_rating, service_rating, value_rating])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract all concepts from the document
   extracted_concepts = llm.extract_concepts_from_document(doc)

   # Print all ratings
   print("Restaurant Ratings (1-5 scale):")
   for concept in extracted_concepts:
       if concept.extracted_items:
           print(f"{concept.name}: {concept.extracted_items[0].value}/5")

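   # Calculate and print the overall average rating, skipping any concept
   # for which no rating was extracted
   rated_values = [
       concept.extracted_items[0].value
       for concept in extracted_concepts
       if concept.extracted_items
   ]
   if rated_values:
       average_rating = sum(rated_values) / len(rated_values)
       print(f"\nOverall Rating: {average_rating:.1f}/5")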


📊 Extracted Items
==================

When a "RatingConcept" is extracted, it is populated with **a list of
extracted items** accessible through the ".extracted_items" property.
Each item is an instance of the "_IntegerItem" class with the
following attributes:

+------------------------+----------------------+--------------------------------------------------------------+
| Attribute              | Type                 | Description                                                  |
|========================|======================|==============================================================|
| "value"                | int                  | The extracted rating value as an integer within the defined  |
|                        |                      | rating scale                                                 |
+------------------------+----------------------+--------------------------------------------------------------+
| "justification"        | str                  | Explanation of why this rating was extracted (only if        |
|                        |                      | "add_justifications=True")                                   |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_paragraphs" | list["Paragraph"]    | List of paragraph objects that influenced the rating         |
|                        |                      | determination (only if "add_references=True")                |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_sentences"  | list["Sentence"]     | List of sentence objects that influenced the rating          |
|                        |                      | determination (only if "add_references=True" and             |
|                        |                      | "reference_depth="sentences"")                               |
+------------------------+----------------------+--------------------------------------------------------------+
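
As a minimal sketch of how these attributes might be consumed
downstream, the helper below collects them into a plain dictionary.
The "SimpleNamespace" objects are hypothetical stand-ins for the
library's item, paragraph, and sentence objects, and
"summarize_rating_item" is an illustrative name that is not part of
ContextGem. The sketch assumes only the attribute names listed in the
table, plus a "raw_text" attribute on referenced paragraphs and
sentences:

```python
from types import SimpleNamespace


def summarize_rating_item(item, rating_scale):
    """Collect the documented attributes of one extracted rating item."""
    low, high = rating_scale
    # Justifications and references are only present when enabled on the
    # concept; this sketch falls back to None / empty lists otherwise.
    paragraphs = getattr(item, "reference_paragraphs", None) or []
    sentences = getattr(item, "reference_sentences", None) or []
    return {
        "value": item.value,
        "in_scale": low <= item.value <= high,
        "justification": getattr(item, "justification", None),
        "paragraphs": [p.raw_text for p in paragraphs],
        "sentences": [s.raw_text for s in sentences],
    }


# Hand-written stand-in for an item extracted with add_justifications=True,
# add_references=True and reference_depth="sentences" (not real LLM output)
item = SimpleNamespace(
    value=8,
    justification="The interface is described as clean and intuitive.",
    reference_paragraphs=[
        SimpleNamespace(raw_text="User Interface: The interface is clean and modern.")
    ],
    reference_sentences=[
        SimpleNamespace(raw_text="The interface is clean and modern.")
    ],
)

summary = summarize_rating_item(item, rating_scale=(1, 10))
print(summary["value"], summary["in_scale"], len(summary["sentences"]))
```

Falling back to empty defaults lets the same helper handle items from
concepts that were created without justifications or references.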


💡 Best Practices
=================

* Create descriptive names for your rating concepts that clearly
  indicate what aspect is being evaluated (e.g., *"Product Usability
  Rating"* rather than just *"Rating"*).

* Enhance extraction quality by including clear definitions of what
  each point on the scale represents in your concept description
  (e.g., *"1 = poor, 3 = average, 5 = excellent"*).

* Provide specific evaluation criteria in your concept description to
  guide the LLM's assessment process. For example, when rating
  software usability, specify that factors like interface
  intuitiveness, learning curve, and navigation efficiency should be
  considered.

* Enable justifications (using "add_justifications=True") when you
  need to understand the reasoning behind a rating. This is most
  valuable for evaluations with complex criteria, where the rationale
  may not be obvious from the score alone.

* Enable references (using "add_references=True") to trace ratings
  back to specific evidence in the document that informed the
  evaluation.

* Apply "singular_occurrence=True" for concepts that should yield a
  single comprehensive rating (like an overall product score) rather
  than multiple ratings throughout the document.
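
The naming, scale-definition, and criteria practices above can be
combined when composing a concept description. The sketch below shows
one possible way to do that. "build_rating_description" and its
parameters are hypothetical helpers written for illustration, not part
of ContextGem; the resulting string would simply be passed as the
"description" of a "RatingConcept":

```python
def build_rating_description(aspect, criteria, scale_labels):
    """Compose a rating description with explicit criteria and scale anchors."""
    if not scale_labels:
        raise ValueError("scale_labels must define at least one scale point")
    low, high = min(scale_labels), max(scale_labels)
    anchors = ", ".join(
        f"{point} = {label}" for point, label in sorted(scale_labels.items())
    )
    return (
        f"Rate {aspect} on a scale of {low}-{high} ({anchors}). "
        f"Consider: {', '.join(criteria)}."
    )


description = build_rating_description(
    aspect="the usability of the software",
    criteria=["interface intuitiveness", "learning curve", "navigation efficiency"],
    scale_labels={1: "poor", 3: "average", 5: "excellent"},
)
print(description)

# The string and the matching scale would then be passed to the concept, e.g.:
# RatingConcept(
#     name="Product Usability Rating",
#     description=description,
#     rating_scale=(1, 5),
#     singular_occurrence=True,
# )
```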


# ==== concepts/json_object_concept ====

JsonObjectConcept
*****************

"JsonObjectConcept" is a powerful concept type that extracts
structured data in the form of JSON objects from documents, enabling
sophisticated information organization and retrieval.


📝 Overview
===========

"JsonObjectConcept" is used when you need to extract complex,
structured information from unstructured text, including:

* **Nested data structures**: objects with multiple fields,
  hierarchical information, and related attributes

* **Standardized formats**: consistent data extraction following
  predefined schemas for reliable downstream processing

* **Complex entity extraction**: comprehensive extraction of entities
  with multiple attributes and relationships

This concept type offers the flexibility to define precise schemas
that match your data requirements, ensuring that extracted information
maintains structural integrity and relationships between different
data elements.


💻 Usage Example
================

Here's a simple example of how to use "JsonObjectConcept" to extract
product information:

   # ContextGem: JsonObjectConcept Extraction

   import os
   from pprint import pprint
   from typing import Literal

   from contextgem import Document, DocumentLLM, JsonObjectConcept


   # Define product information text
   product_text = """
   Product: Smart Fitness Watch X7
   Price: $199.99
   Features: Heart rate monitoring, GPS tracking, Sleep analysis
   Battery Life: 5 days
   Water Resistance: IP68
   Available Colors: Black, Silver, Blue
   Customer Rating: 4.5/5
   """

   # Create a Document object from text
   doc = Document(raw_text=product_text)

   # Define a JsonObjectConcept with a structure for product information
   product_concept = JsonObjectConcept(
       name="Product Information",
       description="Extract detailed product information including name, price, features, and specifications",
       structure={
           "name": str,
           "price": float,
           "features": list[str],
           "specifications": {
               "battery_life": str,
               "water_resistance": Literal["IP67", "IP68", "IPX7", "Not water resistant"],
           },
           "available_colors": list[str],
           "customer_rating": float,
       },
   )

   # Attach the concept to the document
   doc.add_concepts([product_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   product_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted structured data
   extracted_product = product_concept.extracted_items[0].value
   pprint(extracted_product)


⚙️ Parameters
=============

When creating a "JsonObjectConcept", you can specify the following
parameters:

+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| Parameter                 | Type                      | Default Value      | Description                                        |
|===========================|===========================|====================|====================================================|
| "name"                    | "str"                     | (Required)         | A unique name identifier for the concept           |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "description"             | "str"                     | (Required)         | A clear description of what the concept represents |
|                           |                           |                    | and what should be extracted                       |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "structure"               | "type | dict[str, Any]"   | (Required)         | JSON object schema defining the data structure to  |
|                           |                           |                    | be extracted. Can be specified as a Python class   |
|                           |                           |                    | with type annotations or a dictionary with field   |
|                           |                           |                    | names as keys and their corresponding types as     |
|                           |                           |                    | values. This schema can represent simple flat      |
|                           |                           |                    | structures or complex nested hierarchies with      |
|                           |                           |                    | multiple levels of organization. The LLM will      |
|                           |                           |                    | attempt to extract data that conforms to this      |
|                           |                           |                    | structure, enabling precise and consistent         |
|                           |                           |                    | extraction of complex information patterns.        |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "examples"                | "list[JsonObjectExample]" | "[]"               | Optional. Example JSON objects illustrating the    |
|                           |                           |                    | concept usage. Such examples must conform to the   |
|                           |                           |                    | "structure" schema. Examples significantly improve |
|                           |                           |                    | extraction accuracy by showing the LLM concrete    |
|                           |                           |                    | instances of the expected output format and        |
|                           |                           |                    | content patterns. This is particularly valuable    |
|                           |                           |                    | for complex schemas with nested structures or when |
|                           |                           |                    | there are specific formatting conventions that     |
|                           |                           |                    | should be followed (e.g., how dates, identifiers,  |
|                           |                           |                    | or specialized fields should be represented).      |
|                           |                           |                    | Examples also help clarify how to handle edge      |
|                           |                           |                    | cases or ambiguous information in the source       |
|                           |                           |                    | document.                                          |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "llm_role"                | "str"                     | ""extractor_text"" | The role of the LLM responsible for extracting the |
|                           |                           |                    | concept. Available values: ""extractor_text"",     |
|                           |                           |                    | ""reasoner_text"", ""extractor_vision"",           |
|                           |                           |                    | ""reasoner_vision"", ""extractor_multimodal"",     |
|                           |                           |                    | ""reasoner_multimodal"". For more details, see 🏷️  |
|                           |                           |                    | LLM Roles.                                         |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "add_justifications"      | "bool"                    | "False"            | Whether to include justifications for extracted    |
|                           |                           |                    | items. Justifications provide explanations of why  |
|                           |                           |                    | the LLM extracted specific JSON structures and the |
|                           |                           |                    | reasoning behind field values. This is especially  |
|                           |                           |                    | valuable for complex structures where the          |
|                           |                           |                    | extraction process involves inference or when      |
|                           |                           |                    | multiple data points must be synthesized. For      |
|                           |                           |                    | example, a justification might explain how the LLM |
|                           |                           |                    | determined a product's category based on various   |
|                           |                           |                    | features mentioned across different paragraphs, or |
|                           |                           |                    | why certain optional fields were populated or left |
|                           |                           |                    | empty based on available information in the        |
|                           |                           |                    | document.                                          |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "justification_depth"     | "str"                     | ""brief""          | Justification detail level. Available values:      |
|                           |                           |                    | ""brief"", ""balanced"", ""comprehensive"".        |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "justification_max_sents" | "int"                     | "2"                | Maximum sentences in a justification.              |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "add_references"          | "bool"                    | "False"            | Whether to include source references for extracted |
|                           |                           |                    | items. References indicate the specific locations  |
|                           |                           |                    | in the document that informed the extraction of    |
|                           |                           |                    | the JSON structure. This is particularly valuable  |
|                           |                           |                    | for complex objects where field values may be      |
|                           |                           |                    | calculated or inferred from multiple scattered     |
|                           |                           |                    | pieces of information throughout the document.     |
|                           |                           |                    | References help trace back extracted values to     |
|                           |                           |                    | their source evidence, validate the extraction     |
|                           |                           |                    | reasoning, and understand which parts of the       |
|                           |                           |                    | document contributed to the synthesis of           |
|                           |                           |                    | structured data, especially for fields requiring   |
|                           |                           |                    | interpretation, not only direct extraction.        |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "reference_depth"         | "str"                     | ""paragraphs""     | Source reference granularity. Available values:    |
|                           |                           |                    | ""paragraphs"", ""sentences"".                     |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "singular_occurrence"     | "bool"                    | "False"            | Whether this concept is restricted to having only  |
|                           |                           |                    | one extracted item. If "True", only a single JSON  |
|                           |                           |                    | object will be extracted. For JSON object          |
|                           |                           |                    | concepts, this parameter is particularly useful    |
|                           |                           |                    | when you want to extract a comprehensive           |
|                           |                           |                    | structured representation of a single entity       |
|                           |                           |                    | (e.g., "product specifications" or "company        |
|                           |                           |                    | profile") rather than multiple instances of        |
|                           |                           |                    | structured data scattered throughout the document. |
|                           |                           |                    | This is especially valuable when extracting        |
|                           |                           |                    | complex nested objects that aggregate information  |
|                           |                           |                    | from different parts of the document into a        |
|                           |                           |                    | cohesive whole. Note that with advanced LLMs, this |
|                           |                           |                    | constraint may not be required as they can often   |
|                           |                           |                    | infer the appropriate number of objects to extract |
|                           |                           |                    | from the concept's name, description, and schema   |
|                           |                           |                    | structure.                                         |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
| "custom_data"             | "dict"                    | "{}"               | Optional. Dictionary for storing any additional    |
|                           |                           |                    | data that you want to associate with the concept.  |
|                           |                           |                    | This data must be JSON-serializable. This data is  |
|                           |                           |                    | not used for extraction but can be useful for      |
|                           |                           |                    | custom processing or downstream tasks.             |
+---------------------------+---------------------------+--------------------+----------------------------------------------------+
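
Because "custom_data" must be JSON-serializable, it can help to check
a dictionary before attaching it to a concept. The following is a
small sketch using only the Python standard library.
"is_json_serializable" is a hypothetical helper name, and the check is
independent of whatever validation the library itself applies:

```python
import json
from datetime import date


def is_json_serializable(data):
    """Return True if `data` can be encoded with the standard json module."""
    try:
        json.dumps(data)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular reference
        return False
    return True


ok_data = {"source": "product_catalog", "batch": 7, "tags": ["fitness", "wearable"]}
bad_data = {"created": date(2025, 1, 1)}  # date objects are not JSON-serializable

print(is_json_serializable(ok_data))   # True
print(is_json_serializable(bad_data))  # False
```

Values such as dates can be converted to strings (for example with
"isoformat()") before being stored.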


🏗️ Defining Structure
=====================

The "structure" parameter defines the schema for the data you want to
extract. "JsonObjectConcept" uses Pydantic models internally to
validate all structures, ensuring type safety and data integrity.
Structures can be defined with dictionaries or with classes.
Dictionary-based definitions are the simpler option and still benefit
from Pydantic validation under the hood.

There are four ways to define the structure:

1. **Using a dictionary with type annotations:**

   from contextgem import JsonObjectConcept


   product_info_concept = JsonObjectConcept(
       name="Product Information",
       description="Product details",
       structure={
           "name": str,
           "price": float,
           "is_available": bool,
           "ratings": list[float],
       },
   )

2. **Using nested dictionaries for complex structures:**

   from contextgem import JsonObjectConcept


   device_config_concept = JsonObjectConcept(
       name="Device Configuration",
       description="Configuration details for a networked device",
       structure={
           "device": {"id": str, "type": str, "model": str},
           "network": {"ip_address": str, "subnet_mask": str, "gateway": str},
           "settings": {"enabled": bool, "mode": str},
       },
   )

3. **Using a Python class with type annotations:**

While dictionary structures provide the simplest way to define JSON
schemas, you may prefer class definitions if they better fit your
codebase style. The example below uses a Pydantic model:

   from pydantic import BaseModel

   from contextgem import JsonObjectConcept


   # Use a Pydantic model to define the structure of the JSON object
   class ProductSpec(BaseModel):
       name: str
       version: str
       features: list[str]


   product_spec_concept = JsonObjectConcept(
       name="Product Specification",
       description="Technical specifications for a product",
       structure=ProductSpec,
   )

4. **Using nested classes for complex structures:**

Hierarchical data is already supported by nested dictionaries, but you
can also express it with nested class definitions. This approach
offers a more object-oriented style that may better align with your
existing codebase, especially when your application code already uses
dataclasses or Pydantic models.

When using nested class definitions, all classes in the structure must
inherit from the "JsonObjectClassStruct" utility class to enable
automatic conversion of the whole class hierarchy to a dictionary
structure:

   from dataclasses import dataclass

   from contextgem import JsonObjectConcept
   from contextgem.public.utils import JsonObjectClassStruct


   # Use dataclasses to define the structure of the JSON object.
   # All classes in the nested class structure must inherit from JsonObjectClassStruct
   # to enable automatic conversion of the class hierarchy to a dictionary structure
   # for JsonObjectConcept
   @dataclass
   class Location(JsonObjectClassStruct):
       latitude: float
       longitude: float
       altitude: float


   @dataclass
   class Sensor(JsonObjectClassStruct):
       id: str
       type: str
       location: Location  # reference to another class
       active: bool


   @dataclass
   class SensorNetwork(JsonObjectClassStruct):
       network_id: str
       primary_sensor: Sensor  # reference to another class
       backup_sensors: list[Sensor]  # list of another class


   sensor_network_concept = JsonObjectConcept(
       name="IoT Sensor Network",
       description="Configuration for a network of IoT sensors",
       structure=SensorNetwork,  # nested class structure
   )
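
To make the role of the schema more concrete, the sketch below checks
a plain dictionary against a dictionary-based structure like the ones
shown above. It is a simplified stand-in written for illustration
only: ContextGem validates structures with Pydantic internally, and
"conforms" is a hypothetical helper that handles just basic types,
"list[...]", "Literal[...]", and nested dictionaries. The candidate
value is hand-written, not actual extraction output:

```python
from typing import Literal, get_args, get_origin


def conforms(value, spec):
    """Return True if `value` matches the type or nested structure `spec`."""
    if isinstance(spec, dict):
        # Nested structure: keys must match exactly, values checked recursively
        return (
            isinstance(value, dict)
            and value.keys() == spec.keys()
            and all(conforms(value[key], sub) for key, sub in spec.items())
        )
    origin = get_origin(spec)
    if origin is Literal:
        return value in get_args(spec)
    if origin is list:
        (item_spec,) = get_args(spec)
        return isinstance(value, list) and all(conforms(v, item_spec) for v in value)
    if spec is float:
        # Accept ints where floats are expected, but never bools
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, spec)


structure = {
    "name": str,
    "price": float,
    "features": list[str],
    "specifications": {
        "battery_life": str,
        "water_resistance": Literal["IP67", "IP68", "IPX7", "Not water resistant"],
    },
}

candidate = {
    "name": "Smart Fitness Watch X7",
    "price": 199.99,
    "features": ["Heart rate monitoring", "GPS tracking"],
    "specifications": {"battery_life": "5 days", "water_resistance": "IP68"},
}

print(conforms(candidate, structure))  # True
candidate["specifications"]["water_resistance"] = "IP69"
print(conforms(candidate, structure))  # False: value is not a Literal option
```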


🚀 Advanced Usage
=================


✏️ Adding Examples
------------------

You can provide examples of structured JSON objects to improve
extraction accuracy, especially for complex schemas or when there
might be ambiguity in how to organize or format the extracted
information:

   # ContextGem: JsonObjectConcept Extraction with Examples

   import os
   from pprint import pprint

   from contextgem import Document, DocumentLLM, JsonObjectConcept, JsonObjectExample


   # Medical report text with ambiguous fields and formatting variations
   medical_report = """
   PATIENT ASSESSMENT
   Date: March 15, 2023
   Patient: John Doe (ID: 12345)

   Vital Signs:
   BP: 125/82 mmHg
   HR: 72 bpm
   Temp: 98.6°F
   SpO2: 98%

   Chief Complaint:
   Patient presents with persistent cough for 2 weeks, mild fever in evenings (up to 100.4°F), and fatigue. 
   No shortness of breath. Patient reports recent travel to Southeast Asia 3 weeks ago.

   Assessment:
   Physical examination shows slight wheezing in upper right lung. No signs of pneumonia on chest X-ray.
   WBC slightly elevated at 11,500. Patient appears in stable condition but fatigued.

   Impression:
   1. Acute bronchitis, likely viral
   2. Rule out early TB given travel history
   3. Fatigue, likely secondary to infection

   Plan:
   - Rest for 5 days
   - Symptomatic treatment with over-the-counter cough suppressant
   - Follow-up in 1 week
   - TB test ordered

   Dr. Sarah Johnson, MD
   """
   doc = Document(raw_text=medical_report)

   # Create a JsonObjectConcept for extracting medical assessment data
   # Without examples, the LLM might struggle with ambiguous fields or formatting variations
   medical_assessment_concept = JsonObjectConcept(
       name="Medical Assessment",
       description="Key information from a patient medical assessment",
       structure={
           "patient": {
               "id": str,
               "vital_signs": {
                   "blood_pressure": str,
                   "heart_rate": int,
                   "temperature": float,
                   "oxygen_saturation": int,
               },
           },
           "clinical": {
               "symptoms": list[str],
               "diagnosis": list[str],
               "travel_history": bool,
           },
           "treatment": {"recommendations": list[str], "follow_up_days": int},
       },
       # Examples provide helpful guidance on how to:
       # 1. Map data from unstructured text to structured fields
       # 2. Handle formatting variations (BP as "120/80" vs separate systolic/diastolic)
       # 3. Extract implicit information (converting "SpO2: 98%" to just 98)
       examples=[
           JsonObjectExample(
               content={
                   "patient": {
                       "id": "87654",
                       "vital_signs": {
                           "blood_pressure": "130/85",
                           "heart_rate": 68,
                           "temperature": 98.2,
                           "oxygen_saturation": 99,
                       },
                   },
                   "clinical": {
                       "symptoms": ["headache", "dizziness", "nausea"],
                       "diagnosis": ["Migraine", "Dehydration"],
                       "travel_history": False,
                   },
                   "treatment": {
                       "recommendations": [
                           "Hydration",
                           "Pain medication",
                           "Dark room rest",
                       ],
                       "follow_up_days": 14,
                   },
               }
           ),
           JsonObjectExample(
               content={
                   "patient": {
                       "id": "23456",
                       "vital_signs": {
                           "blood_pressure": "145/92",
                           "heart_rate": 88,
                           "temperature": 100.8,
                           "oxygen_saturation": 96,
                       },
                   },
                   "clinical": {
                       "symptoms": ["sore throat", "cough", "fever"],
                       "diagnosis": ["Strep throat", "Pharyngitis"],
                       "travel_history": True,
                   },
                   "treatment": {
                       "recommendations": ["Antibiotics", "Throat lozenges", "Rest"],
                       "follow_up_days": 7,
                   },
               }
           ),
       ],
   )

   # Attach the concept to the document
   doc.add_concepts([medical_assessment_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   medical_assessment_concept = llm.extract_concepts_from_document(doc)[0]

   # Print the extracted medical assessment
   print("Extracted medical assessment:")
   assessment = medical_assessment_concept.extracted_items[0].value
   pprint(assessment)
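
Since examples must conform to the "structure" schema, a quick shape
check before constructing "JsonObjectExample" objects can catch typos
in field names. The sketch below compares only key paths, not value
types. "key_paths" and "shape_mismatches" are hypothetical helpers for
illustration and are not part of ContextGem:

```python
def key_paths(mapping, prefix=""):
    """Return the set of dotted key paths in a nested dictionary."""
    paths = set()
    for key, value in mapping.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= key_paths(value, prefix=f"{path}.")
        else:
            paths.add(path)
    return paths


def shape_mismatches(structure, example_content):
    """Return (missing, unexpected) key paths of an example relative to a structure."""
    expected = key_paths(structure)
    actual = key_paths(example_content)
    return sorted(expected - actual), sorted(actual - expected)


structure = {
    "patient": {"id": str, "vital_signs": {"blood_pressure": str, "heart_rate": int}},
    "treatment": {"recommendations": list[str], "follow_up_days": int},
}

example_content = {
    "patient": {"id": "87654", "vital_signs": {"blood_pressure": "130/85", "heart_rate": 68}},
    "treatment": {"recommendations": ["Hydration"], "follow_up": 14},  # misspelled key
}

missing, unexpected = shape_mismatches(structure, example_content)
print("Missing:", missing)        # ['treatment.follow_up_days']
print("Unexpected:", unexpected)  # ['treatment.follow_up']
```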


🔍 References and Justifications for Extraction
-----------------------------------------------

You can configure a "JsonObjectConcept" to include justifications and
references, which provide transparency into the extraction process.
Justifications explain the reasoning behind the extracted values,
while references point to the specific parts of the document that were
used as sources for the extraction:

   # ContextGem: JsonObjectConcept Extraction with References and Justifications

   import os
   from pprint import pprint
   from typing import Literal

   from contextgem import Document, DocumentLLM, JsonObjectConcept


   # Sample document text containing a customer complaint
   customer_complaint = """
   CUSTOMER COMPLAINT #CR-2023-0472
   Date: November 15, 2023
   Customer: Sarah Johnson

   Description:
   I purchased the Ultra Premium Blender (Model XJ-5000) from your online store on October 3, 2023. The product was delivered on October 10, 2023. After using it only 5 times, the motor started making loud grinding noises and then completely stopped working on November 12.

   I've tried troubleshooting using the manual, including checking for obstructions and resetting the device, but nothing has resolved the issue. I expected much better quality given the premium price point ($249.99) and the 5-year warranty advertised.

   I've been a loyal customer for over 7 years and have purchased several kitchen appliances from your company. This is the first time I've experienced such a significant quality issue. I would like a replacement unit or a full refund.

   Previous interactions:
   - Spoke with customer service representative Alex on Nov 13 (Ref #CS-98721)
   - Was told to submit this formal complaint after troubleshooting was unsuccessful
   - No resolution offered during initial call

   Contact: sarah.johnson@example.com | (555) 123-4567
   """

   # Create a Document from the text
   doc = Document(raw_text=customer_complaint)

   # Create a JsonObjectConcept with justifications and references enabled
   complaint_analysis_concept = JsonObjectConcept(
       name="Complaint analysis",
       description="Detailed analysis of a customer complaint",
       structure={
           "issue_type": Literal[
               "product defect",
               "delivery problem",
               "billing error",
               "service issue",
               "other",
           ],
           "warranty_applicable": bool,
           "severity": Literal["low", "medium", "high", "critical"],
           "customer_loyalty_status": Literal["new", "regular", "loyal", "premium"],
           "recommended_resolution": Literal[
               "replacement", "refund", "repair", "partial refund", "other"
           ],
           "priority_level": Literal["low", "standard", "high", "urgent"],
           "expected_business_impact": Literal["minimal", "moderate", "significant"],
       },
       add_justifications=True,
       justification_depth="comprehensive",  # provide detailed justifications
       justification_max_sents=10,  # provide up to 10 sentences for each justification
       add_references=True,
       reference_depth="sentences",  # provide references to the sentences in the document
   )

   # Attach the concept to the document
   doc.add_concepts([complaint_analysis_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept
   complaint_analysis_concept = llm.extract_concepts_from_document(doc)[0]

   # Get the extracted complaint analysis
   complaint_analysis_item = complaint_analysis_concept.extracted_items[0]

   # Print the structured analysis
   print("Complaint Analysis\n")
   pprint(complaint_analysis_item.value)

   print("\nJustification:")
   print(complaint_analysis_item.justification)

   # Print key source references
   print("\nReferences:")
   for sent in complaint_analysis_item.reference_sentences:
       print(f"- {sent.raw_text}")


💡 Best Practices
=================

* Keep your JSON structures simple yet comprehensive, focusing on the
  essential fields needed for your use case to avoid LLM prompt
  overloading.

* Include realistic examples (using "JsonObjectExample") that
  precisely match your schema to guide extraction, especially for
  ambiguous or specialized data formats.

* Provide detailed descriptions for your "JsonObjectConcept" that
  specify exactly what structured data to extract and how fields
  should be interpreted.

* For complex JSON objects, use nested dictionaries or class
  hierarchies to organize related fields logically.

* Enable justifications (using "add_justifications=True") when
  interpretation rationale is important, especially for extractions
  that involve judgment or qualitative assessment, such as sentiment
  analysis (positive/negative), priority assignment (high/medium/low),
  or data categorization where the LLM must make interpretive
  decisions rather than extract explicit facts.

* Enable references (using "add_references=True") when you need to
  verify the document source of extracted values for compliance or
  verification purposes. This is especially valuable when the LLM is
  not just directly extracting explicit text, but also interpreting or
  inferring information from context. For example, in legal document
  analysis where traceability of information is essential for auditing
  or validation, references help track both explicit statements and
  the implicit information the model has derived from them.

* Use "singular_occurrence=True" when you expect exactly one instance
  of the structured data in the document (e.g., a single product
  specification, one patient medical record, or a unique customer
  complaint). This is useful for documents with a clear singular
  focus. Conversely, omit this parameter ("False" is the default) when
  you need to extract multiple instances of the same structure from a
  document, such as multiple product listings in a catalog, several
  patient records in a hospital report, or various customer complaints
  in a feedback compilation.
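
To illustrate the point about examples matching your schema, the
sketch below pre-checks an example dictionary against a structure
dictionary before it is wrapped in "JsonObjectExample". It uses only
the standard library. The helper name "example_matches_structure" is
hypothetical and not part of ContextGem, which performs its own
validation; the sketch only understands plain types, "Literal", and
nested dictionaries, and silently skips anything else (such as
"list[str]").

```python
from typing import Any, Literal, get_args, get_origin


def example_matches_structure(
    example: dict[str, Any], structure: dict[str, Any]
) -> list[str]:
    """Return a sorted list of mismatches (empty if the example fits).

    Hypothetical pre-check, not part of ContextGem. Supports plain types
    (str, int, bool, ...), typing.Literal, and nested dicts only.
    """
    problems = []
    for key in structure.keys() - example.keys():
        problems.append(f"missing key: {key}")
    for key in example.keys() - structure.keys():
        problems.append(f"unexpected key: {key}")
    for key in structure.keys() & example.keys():
        expected, value = structure[key], example[key]
        origin = get_origin(expected)
        if isinstance(expected, dict):
            if isinstance(value, dict):
                nested = example_matches_structure(value, expected)
                problems += [f"{key} -> {p}" for p in nested]
            else:
                problems.append(f"{key}: expected a nested object")
        elif origin is Literal:
            if value not in get_args(expected):
                problems.append(f"{key}: {value!r} is not an allowed value")
        elif origin is None and isinstance(expected, type):
            # bool is a subclass of int, so reject True/False for int fields
            wrong_bool = expected is int and isinstance(value, bool)
            if not isinstance(value, expected) or wrong_bool:
                problems.append(f"{key}: expected {expected.__name__}")
    return sorted(problems)


structure = {
    "severity": Literal["low", "medium", "high", "critical"],
    "warranty_applicable": bool,
}
good_example = {"severity": "high", "warranty_applicable": True}
bad_example = {"severity": "severe"}

print(example_matches_structure(good_example, structure))
print(example_matches_structure(bad_example, structure))
```

Running a check like this in a unit test can catch an example that has
drifted from the structure after a schema change, before any LLM call
is made.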


# ==== concepts/label_concept ====

LabelConcept
************

"LabelConcept" is a classification concept type in ContextGem that
categorizes documents or content using predefined labels, supporting
both single-label and multi-label classification approaches.


🏷️ Overview
===========

"LabelConcept" is used when you need to classify content into
predefined categories, including:

* **Document classification**: contract types, document categories,
  legal classifications

* **Content categorization**: topics, themes, subjects, areas of focus

* **Quality assessment**: compliance levels, risk categories, priority
  levels

* **Multi-faceted tagging**: multiple applicable labels for
  comprehensive classification

This concept type supports two classification modes:

* **Multi-class**: Always selects exactly one label from the
  predefined set (mutually exclusive labels) - used for classifying
  the content into a single type or category. A label is always
  returned, even if none perfectly fit the content.

* **Multi-label**: Selects zero, one, or multiple labels from the
  predefined set (non-exclusive labels) - used when multiple topics or
  attributes can apply simultaneously. Returns only applicable labels,
  or no labels if none apply.

Note:

  **For multi-label classification**: When none of the predefined
  labels apply to the content being classified, no extracted items
  will be returned for the concept (empty "extracted_items" list).
  This ensures that only applicable labels are selected.

  **For multi-class classification**: A label is always returned, as
  this classification type requires selecting the best-fitting option
  from the predefined set, even if none perfectly match the content.

Important:

  **For multi-class classification**: Since multi-class classification
  will always return exactly one label, you should consider including
  a general "other" label (such as "N/A", "misc", "unspecified", etc.)
  to handle cases where none of the specific labels apply, unless your
  labels are broad enough to cover all cases, or you know that the
  classified content always falls under one of the predefined labels
  without edge cases. This ensures appropriate classification even
  when the content doesn't clearly fit into any of the predefined
  specific categories.


💻 Usage Example
================

Here's a basic example of how to use "LabelConcept" for document type
classification:

   # ContextGem: Contract Type Classification using LabelConcept

   import os

   from contextgem import Document, DocumentLLM, LabelConcept


   # Create a Document object from legal document text
   legal_doc_text = """
   NON-DISCLOSURE AGREEMENT

   This Non-Disclosure Agreement ("Agreement") is entered into as of January 15, 2025, by and between TechCorp Inc., a Delaware corporation ("Disclosing Party"), and DataSystems LLC, a California limited liability company ("Receiving Party").

   WHEREAS, Disclosing Party possesses certain confidential information relating to its proprietary technology and business operations;

   NOW, THEREFORE, in consideration of the mutual covenants contained herein, the parties agree as follows:

   1. CONFIDENTIAL INFORMATION
   The term "Confidential Information" shall mean any and all non-public information...

   2. OBLIGATIONS OF RECEIVING PARTY
   Receiving Party agrees to hold all Confidential Information in strict confidence...
   """

   doc = Document(raw_text=legal_doc_text)

   # Define a LabelConcept for contract type classification
   contract_type_concept = LabelConcept(
       name="Contract Type",
       description="Classify the type of contract",
       labels=["NDA", "Consultancy Agreement", "Privacy Policy", "Other"],
       classification_type="multi_class",  # only one label can be selected (mutually exclusive labels)
       singular_occurrence=True,  # expect only one classification result
   )

   # Attach the concept to the document
   doc.add_concepts([contract_type_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   contract_type_concept = llm.extract_concepts_from_document(doc)[0]

   # Check if any labels were extracted
   if contract_type_concept.extracted_items:
       # Get the classified document type
       classified_type = contract_type_concept.extracted_items[0].value
       print(f"Document classified as: {classified_type}")  # Output: ['NDA']
   else:
       print("No applicable labels found for this document")


⚙️ Parameters
=============

When creating a "LabelConcept", you can specify the following
parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "name"               | "str"           | (Required)      | A unique name identifier for the concept           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "description"        | "str"           | (Required)      | A clear description of what the concept represents |
|                      |                 |                 | and how classification should be performed         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "labels"             | "list[str]"     | (Required)      | List of predefined labels for classification. Must |
|                      |                 |                 | contain at least 2 unique labels                   |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "classification_typ  | "str"           | ""multi_class"" | Classification mode. Available values:             |
| e"                   |                 |                 | ""multi_class"" (select exactly one label),        |
|                      |                 |                 | ""multi_label"" (select zero, one, or multiple     |
|                      |                 |                 | labels).                                           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "llm_role"           | "str"           | ""extractor_te  | The role of the LLM responsible for extracting the |
|                      |                 | xt""            | concept. Available values: ""extractor_text"",     |
|                      |                 |                 | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". For more details, see 🏷️  |
|                      |                 |                 | LLM Roles.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_justifications" | "bool"          | "False"         | Whether to include justifications for extracted    |
|                      |                 |                 | items. Justifications provide explanations of why  |
|                      |                 |                 | specific labels were selected and the reasoning    |
|                      |                 |                 | behind the classification decision.                |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_dept  | "str"           | ""brief""       | Justification detail level. Available values:      |
| h"                   |                 |                 | ""brief"", ""balanced"", ""comprehensive"".        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "justification_max_  | "int"           | "2"             | Maximum sentences in a justification.              |
| sents"               |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "add_references"     | "bool"          | "False"         | Whether to include source references for extracted |
|                      |                 |                 | items. References indicate the specific locations  |
|                      |                 |                 | in the document that informed the classification   |
|                      |                 |                 | decision.                                          |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reference_depth"    | "str"           | ""paragraphs""  | Source reference granularity. Available values:    |
|                      |                 |                 | ""paragraphs"", ""sentences"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "singular_occurrenc  | "bool"          | "False"         | Whether this concept is restricted to having only  |
| e"                   |                 |                 | one extracted item. If "True", only a single       |
|                      |                 |                 | extracted item will be extracted. This is          |
|                      |                 |                 | particularly useful for global document            |
|                      |                 |                 | classifications where only one classification      |
|                      |                 |                 | result is expected.                                |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "custom_data"        | "dict"          | "{}"            | Optional. Dictionary for storing any additional    |
|                      |                 |                 | data that you want to associate with the concept.  |
|                      |                 |                 | This data must be JSON-serializable. This data is  |
|                      |                 |                 | not used for extraction but can be useful for      |
|                      |                 |                 | custom processing or downstream tasks.             |
+----------------------+-----------------+-----------------+----------------------------------------------------+
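
The "labels" constraint in the table (at least 2 unique labels) and
the earlier recommendation to include a catch-all label for multi-
class classification can be checked before a concept is constructed.
The following is a minimal sketch under stated assumptions:
"prepare_labels" and "FALLBACK_LABELS" are hypothetical names, not
part of ContextGem, and the list of catch-all names is taken from the
examples in this guide.

```python
# Illustrative catch-all names, taken from the examples in this guide
FALLBACK_LABELS = {"other", "n/a", "misc", "unspecified", "mixed"}


def prepare_labels(
    labels: list[str], classification_type: str = "multi_class"
) -> list[str]:
    """Deduplicate labels and, for multi-class, ensure a catch-all label exists.

    Hypothetical pre-check, not part of ContextGem.
    """
    if classification_type not in ("multi_class", "multi_label"):
        raise ValueError(f"unknown classification_type: {classification_type!r}")
    # dict.fromkeys keeps the first occurrence of each label, in order
    unique = list(dict.fromkeys(label.strip() for label in labels))
    if len(unique) < 2:
        raise ValueError("at least 2 unique labels are required")
    has_fallback = any(label.lower() in FALLBACK_LABELS for label in unique)
    if classification_type == "multi_class" and not has_fallback:
        unique.append("Other")
    return unique


print(prepare_labels(["NDA", "Privacy Policy", "NDA"]))
print(prepare_labels(["Finance", "Legal"], "multi_label"))
```

Appending "Other" automatically is only one possible design. If your
labels already cover every case, emitting a warning instead of
changing the list may be the better choice.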


🚀 Advanced Usage
=================


🏷️ Multi-Class vs Multi-Label Classification
--------------------------------------------

Choose the appropriate classification type based on your use case:

**Multi-Class Classification** ("classification_type="multi_class""):

* Always selects exactly one label from the predefined set (mutually
  exclusive labels)

* A label is always returned, even if none perfectly fit the content

* Ideal for: document types, priority levels, status categories

* Example: A document must be classified as one type: "NDA",
  "Consultancy Agreement", or "Privacy Policy" (or "Other" if none
  apply)

**Multi-Label Classification** ("classification_type="multi_label""):

* Selects zero, one, or multiple labels from the predefined set (non-
  exclusive labels)

* Returns only applicable labels; can return no labels if none apply

* Ideal for: content topics, applicable regulations, feature tags

* Example: A document can cover multiple topics: "Finance", "Legal",
  "Technology", or none of these topics

Here's an example demonstrating multi-label classification for content
topic identification:

   # ContextGem: Multi-Label Classification with LabelConcept

   import os

   from contextgem import Document, DocumentLLM, LabelConcept


   # Create a Document object with business document text covering multiple topics
   business_doc_text = """
   QUARTERLY BUSINESS REVIEW - Q4 2024

   FINANCIAL PERFORMANCE
   Revenue for Q4 2024 reached $2.8 million, exceeding our target by 12%. The finance team has prepared detailed budget projections for 2025, with anticipated growth of 18% across all divisions.

   TECHNOLOGY INITIATIVES
   Our development team has successfully implemented the new cloud infrastructure, reducing operational costs by 25%. The IT department is now focusing on cybersecurity enhancements and data analytics platform upgrades.

   HUMAN RESOURCES UPDATE
   We welcomed 15 new employees this quarter, bringing our total headcount to 145. The HR team has launched a comprehensive employee wellness program and updated our remote work policies.

   LEGAL AND COMPLIANCE
   All regulatory compliance requirements have been met for Q4. The legal department has reviewed and updated our data privacy policies in accordance with recent legislation changes.

   MARKETING STRATEGY
   The marketing team launched three successful campaigns this quarter, resulting in a 40% increase in lead generation. Our digital marketing efforts have expanded to include LinkedIn advertising and content marketing.
   """

   doc = Document(raw_text=business_doc_text)

   # Define a LabelConcept for topic classification allowing multiple topics
   content_topics_concept = LabelConcept(
       name="Document Topics",
       description="Identify all relevant business topics covered in this document",
       labels=[
           "Finance",
           "Technology",
           "HR",
           "Legal",
           "Marketing",
           "Operations",
           "Sales",
           "Strategy",
       ],
       classification_type="multi_label",  # multiple labels can be selected (non-exclusive labels)
   )


   # Attach the concept to the document
   doc.add_concepts([content_topics_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   content_topics_concept = llm.extract_concepts_from_document(doc)[0]

   # Check if any labels were extracted
   if content_topics_concept.extracted_items:
       # Get all identified topics
       identified_topics = content_topics_concept.extracted_items[0].value
       print(f"Document covers the following topics: {', '.join(identified_topics)}")
       # Expected output might include: Finance, Technology, HR, Legal, Marketing
   else:
       print("No applicable topic labels found for this document")


🔍 References and Justifications for Classification
---------------------------------------------------

You can configure a "LabelConcept" to include justifications and
references to understand classification decisions. This is
particularly valuable when dealing with complex documents that might
contain elements of multiple document types:

   # ContextGem: LabelConcept with References and Justifications

   import os

   from contextgem import Document, DocumentLLM, LabelConcept


   # Create a Document with content that might be challenging to classify
   mixed_content_text = """
   QUARTERLY BUSINESS REVIEW AND POLICY UPDATES
   GlobalTech Solutions Inc. - February 2025

   EMPLOYMENT AGREEMENT AND CONFIDENTIALITY PROVISIONS

   This Employment Agreement ("Agreement") is entered into between GlobalTech Solutions Inc. ("Company") and Sarah Johnson ("Employee") as of February 1, 2025.

   EMPLOYMENT TERMS
   Employee shall serve as Senior Software Engineer with responsibilities including software development, code review, and technical leadership. The position is full-time with an annual salary of $125,000.

   CONFIDENTIALITY OBLIGATIONS
   Employee acknowledges that during employment, they may have access to confidential information including proprietary algorithms, customer data, and business strategies. Employee agrees to maintain strict confidentiality of such information both during and after employment.

   NON-COMPETE PROVISIONS
   For a period of 12 months following termination, Employee agrees not to engage in any business activities that directly compete with Company's core services within the same geographic market.

   INTELLECTUAL PROPERTY
   All work products, inventions, and discoveries made during employment shall be the exclusive property of the Company.

   ADDITIONAL INFORMATION:

   FINANCIAL PERFORMANCE SUMMARY
   Q4 2024 revenue exceeded projections by 12%, reaching $3.2M. Cost optimization initiatives reduced operational expenses by 8%. The board approved a $500K investment in new data analytics infrastructure for 2025.

   PRODUCT LAUNCH TIMELINE
   The AI-powered customer analytics platform will launch Q2 2025. Marketing budget allocated: $200K for digital campaigns. Expected customer acquisition target: 150 new enterprise clients in the first quarter post-launch.
   """

   doc = Document(raw_text=mixed_content_text)

   # Define a LabelConcept with justifications and references enabled
   document_classification_concept = LabelConcept(
       name="Document Classification with Evidence",
       description="Classify this document type and provide reasoning for the classification",
       labels=[
           "Employment Contract",
           "NDA",
           "Consulting Agreement",
           "Service Agreement",
           "Partnership Agreement",
           "Other",
       ],
       classification_type="multi_class",  # a single label is always returned
       add_justifications=True,  # enable justifications to understand classification reasoning
       justification_depth="comprehensive",  # provide detailed reasoning
       justification_max_sents=5,  # allow up to 5 sentences for justification
       add_references=True,  # include references to source text
       reference_depth="paragraphs",  # reference specific paragraphs that informed classification
       singular_occurrence=True,  # expect only one classification result
   )

   # Attach the concept to the document
   doc.add_concepts([document_classification_concept])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract the concept from the document
   document_classification_concept = llm.extract_concepts_from_document(doc)[0]

   # Display the classification results with evidence
   if document_classification_concept.extracted_items:
       item = document_classification_concept.extracted_items[0]

       print("=== DOCUMENT CLASSIFICATION RESULTS ===")
       print(f"Classification: {item.value[0]}")
       print("\nJustification:")
       print(f"{item.justification}")

       print("\nEvidence from document:")
       for i, paragraph in enumerate(item.reference_paragraphs, 1):
           print(f"{i}. {paragraph.raw_text}")

   else:
       print("No classification could be determined - none of the predefined labels apply")

   # This example demonstrates how justifications help explain why the LLM
   # chose a specific classification and how references show which parts
   # of the document informed that decision


🎯 Document Aspect Analysis
---------------------------

"LabelConcept" can be used to classify extracted "Aspect" instances,
providing a powerful way to analyze and categorize specific
information that has been extracted from documents. This approach
allows you to first extract relevant content using aspects, then apply
classification logic to those extracted items.

Here's an example that demonstrates using "LabelConcept" to classify
the financial risk level of extracted financial obligations from legal
contracts:

   # ContextGem: Aspect Analysis with LabelConcept

   import os

   from contextgem import Aspect, Document, DocumentLLM, LabelConcept


   # Create a Document object from contract text
   contract_text = """
   SOFTWARE DEVELOPMENT AGREEMENT
   ...

   SECTION 5. PAYMENT TERMS
   Client shall pay Developer a total fee of $150,000 for the complete software development project, payable in three installments: $50,000 upon signing, $50,000 at milestone completion, and $50,000 upon final delivery.
   ...

   SECTION 8. MAINTENANCE AND SUPPORT
   Following project completion, Developer shall provide 12 months of maintenance and support services at a rate of $5,000 per month, totaling $60,000 annually.
   ...

   SECTION 12. PENALTY CLAUSES
   In the event of project delay beyond the agreed timeline, Developer shall pay liquidated damages of $2,000 per day of delay, with a maximum penalty cap of $50,000.
   ...

   SECTION 15. INTELLECTUAL PROPERTY LICENSING
   Client agrees to pay ongoing licensing fees of $10,000 annually for the use of Developer's proprietary frameworks and libraries integrated into the software solution.
   ...

   SECTION 18. TERMINATION COSTS
   Should Client terminate this agreement without cause, Client shall pay Developer 75% of all remaining unpaid fees, estimated at approximately $100,000 based on current project status.
   ...
   """

   doc = Document(raw_text=contract_text)

   # Define a LabelConcept to classify the financial risk level of the obligations
   risk_classification_concept = LabelConcept(
       name="Client Financial Risk Level",
       description=(
           "Classify the financial risk level for the Client's financial obligations based on:\n"
           "- Amount size and impact on Client's cash flow\n"
           "- Payment timing and predictability for the Client\n"
           "- Penalty or liability exposure for the Client\n"
           "- Ongoing vs. one-time obligations for the Client"
       ),
       labels=["Low Risk", "Moderate Risk", "High Risk", "Critical Risk"],
       classification_type="multi_class",
       add_justifications=True,
       justification_depth="comprehensive",  # provide a comprehensive justification
       justification_max_sents=10,  # set an adequate justification length
       singular_occurrence=True,  # global risk level for the client's financial obligations
   )

   # Define Aspect containing the concept
   financial_obligations_aspect = Aspect(
       name="Client Financial Obligations",
       description="Financial obligations that the Client must fulfill under the contract",
       concepts=[risk_classification_concept],
   )

   # Attach the aspect to the document
   doc.add_aspects([financial_obligations_aspect])

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract all data from the document
   doc = llm.extract_all(doc)

   # Get the extracted aspect and concept
   financial_obligations_aspect = doc.get_aspect_by_name(
       "Client Financial Obligations"
   )  # or `doc.aspects[0]`
   risk_classification_concept = financial_obligations_aspect.get_concept_by_name(
       "Client Financial Risk Level"
   )  # or `financial_obligations_aspect.concepts[0]`

   # Display the extracted information

   print("Extracted Client Financial Obligations:")
   for extracted_item in financial_obligations_aspect.extracted_items:
       print(f"- {extracted_item.value}")

   if risk_classification_concept.extracted_items:
       assert (
           len(risk_classification_concept.extracted_items) == 1
       )  # as we have set `singular_occurrence=True` on the concept
       risk_item = risk_classification_concept.extracted_items[0]
       print(f"\nClient Financial Risk Level: {risk_item.value[0]}")
       print(f"Justification: {risk_item.justification}")
   else:
       print("\nRisk level could not be determined")


📊 Extracted Items
==================

When a "LabelConcept" is extracted, it is populated with **a list of
extracted items** accessible through the ".extracted_items" property.
Each item is an instance of the "_LabelItem" class with the following
attributes:

+------------------------+----------------------+--------------------------------------------------------------+
| Attribute              | Type                 | Description                                                  |
|========================|======================|==============================================================|
| "value"                | list[str]            | List of selected labels (always a list for API consistency,  |
|                        |                      | even for multi-class with single selection)                  |
+------------------------+----------------------+--------------------------------------------------------------+
| "justification"        | str                  | Explanation of why these labels were selected (only if       |
|                        |                      | "add_justifications=True")                                   |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_paragraphs" | list["Paragraph"]    | List of paragraph objects that informed the classification   |
|                        |                      | (only if "add_references=True")                              |
+------------------------+----------------------+--------------------------------------------------------------+
| "reference_sentences"  | list["Sentence"]     | List of sentence objects that informed the classification    |
|                        |                      | (only if "add_references=True" and                           |
|                        |                      | "reference_depth="sentences"")                               |
+------------------------+----------------------+--------------------------------------------------------------+
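
A minimal sketch of reading these attributes. "LabelItemStub" is a
hypothetical stand-in that mirrors only the "value" and
"justification" attributes from the table, so the snippet runs without
an LLM call; in real code the items would come from the concept's
".extracted_items" property. "collect_labels" is likewise an
illustrative helper, not part of ContextGem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelItemStub:
    """Hypothetical stand-in mirroring two attributes of an extracted label item."""

    value: list[str]
    justification: Optional[str] = None


def collect_labels(extracted_items: list[LabelItemStub]) -> list[str]:
    """Return the unique labels across all items, in first-seen order."""
    seen: dict[str, None] = {}
    for item in extracted_items:
        for label in item.value:  # "value" is always a list of labels
            seen.setdefault(label, None)
    return list(seen)


items = [LabelItemStub(value=["NDA"], justification="Title and confidentiality clauses.")]
print(collect_labels(items))

# An empty list of items (possible for multi-label classification)
# simply yields an empty list of labels.
print(collect_labels([]))
```

Treating the empty case as a normal result, rather than indexing
".extracted_items[0]" directly, keeps the same code usable for both
classification types.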


💡 Best Practices
=================

Here are some best practices to optimize your use of "LabelConcept":

* **Choose meaningful labels**: Use clear, distinct labels that cover
  your classification needs without overlap.

* **Provide clear descriptions**: Explain what each classification
  represents and when each label should be applied.

* **Consider label granularity**: Balance between too few labels
  (insufficient precision) and too many labels (classification
  complexity).

* **For multi-class classification**: Consider including a general
  "other" label (such as "Other", "N/A", or "Mixed"), because a label
  is always returned, even when none of the specific labels fits the
  content. You can omit it if your labels are broad enough to cover
  all cases, or if you know that the classified content always falls
  under one of the predefined labels.

* **For multi-label classification**: Design your workflow to handle
  cases where none of the predefined labels apply (resulting in empty
  "extracted_items"), as this classification type can return zero
  labels.

* **Use the appropriate classification type**: Set
  "classification_type="multi_class"" for mutually exclusive
  categories where exactly one choice is required, and
  "classification_type="multi_label"" for potentially overlapping
  attributes where zero, one, or multiple labels can apply.

* **Enable justifications**: Use "add_justifications=True" to
  understand and validate classification decisions, especially for
  complex or ambiguous content.


# ==== pipelines/extraction_pipelines ====

Extraction Pipelines
********************

"ExtractionPipeline" lets you create reusable collections of
predefined aspects and concepts for consistent document analysis.
Pipelines serve as templates that can be applied to multiple
documents, ensuring standardized data extraction across your
application.


📝 Overview
===========

Extraction pipelines package common extraction patterns into reusable
units, allowing you to:

* **Standardize document processing**: Define a consistent set of
  aspects and concepts once, then apply them to multiple documents

* **Create reusable templates**: Build domain-specific pipelines
  (e.g., contract analysis, invoice processing, report analysis)

* **Ensure consistent analysis**: Maintain uniform extraction criteria
  across document batches

* **Simplify workflow management**: Organize complex extraction
  workflows into manageable, reusable components

Pipelines are particularly valuable when processing multiple documents
of the same type, where you need to extract the same categories of
information consistently.


⭐ Key Features
===============


Template-Based Extraction
-------------------------

Pipelines act as extraction templates that define what information to
extract from documents. Once created, a pipeline can be assigned to
any number of documents, ensuring consistent analysis criteria.


Aspect and Concept Organization
-------------------------------

Pipelines can contain both:

* **Aspects**: For extracting document sections and organizing content
  hierarchically

* **Concepts**: For extracting specific data points with intelligent
  inference

This allows you to create comprehensive extraction workflows that
combine broad content organization with detailed data extraction.


Reusability and Scalability
---------------------------

A single pipeline can be applied to multiple documents, making it
ideal for batch processing, automated workflows, and applications that
need to process similar document types repeatedly.


💻 Basic Usage
==============


Simple Pipeline Creation
------------------------

Here's how to create and use a basic extraction pipeline:

   from contextgem import (
       Aspect,
       BooleanConcept,
       DateConcept,
       Document,
       ExtractionPipeline,
       StringConcept,
   )


   # Create a pipeline for NDA (Non-Disclosure Agreement) review
   nda_pipeline = ExtractionPipeline(
       aspects=[
           Aspect(
               name="Confidential information",
               description="Clauses defining the confidential information",
           ),
           Aspect(
               name="Exclusions",
               description="Clauses defining exclusions from confidential information",
           ),
           Aspect(
               name="Obligations",
               description="Clauses defining confidentiality obligations",
           ),
           Aspect(
               name="Liability",
               description="Clauses defining liability for breach of the agreement",
           ),
           # ... Add more aspects as needed
       ],
       concepts=[
           StringConcept(
               name="Anomaly",
               description="Anomaly in the contract, e.g. out-of-context or nonsensical clauses",
               llm_role="reasoner_text",
               add_references=True,  # Add references to the source text
               reference_depth="sentences",  # Reference to the sentence level
               add_justifications=True,  # Add justifications for the anomaly
               justification_depth="balanced",  # Balanced level of detail in the justification
               justification_max_sents=5,  # Maximum number of sentences in the justification
           ),
           BooleanConcept(
               name="Is mutual",
               description="Whether the NDA is mutual (bidirectional) or one-way",
               singular_occurrence=True,
               llm_role="reasoner_text",  # Use the reasoner role for this concept
           ),
           DateConcept(
               name="Effective date",
               description="The date when the NDA agreement becomes effective",
               singular_occurrence=True,
           ),
           StringConcept(
               name="Term",
               description="The term of the NDA",
           ),
           StringConcept(
               name="Governing law",
               description="The governing law of the agreement",
               singular_occurrence=True,
           ),
           # ... Add more concepts as needed
       ],
   )

   # Assign the pipeline to the NDA document
   nda_document = Document(raw_text="[NDA text]")
   nda_document.assign_pipeline(nda_pipeline)

   # Now the document is ready for processing with the NDA review pipeline!
   # The document can be processed to extract the defined aspects and concepts

   # Extract all aspects and concepts from the NDA using an LLM group
   # that contains LLMs with the roles "extractor_text" and "reasoner_text":
   # llm_group.extract_all(nda_document)
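
To see why the LLM group for this pipeline needs both roles, the
sketch below groups the document-level concepts of the NDA pipeline by
their "llm_role". It uses plain tuples rather than ContextGem objects,
and it assumes that concepts without an explicit "llm_role" fall back
to "extractor_text". It only illustrates the routing idea and is not
ContextGem's internal implementation; "group_by_role" and
"missing_roles" are hypothetical helpers.

```python
from collections import defaultdict

DEFAULT_ROLE = "extractor_text"  # assumed fallback for concepts without llm_role

# (concept name, llm_role or None) pairs taken from the NDA pipeline
concepts = [
    ("Anomaly", "reasoner_text"),
    ("Is mutual", "reasoner_text"),
    ("Effective date", None),
    ("Term", None),
    ("Governing law", None),
]


def group_by_role(pairs):
    """Group concept names by the role of the LLM expected to process them."""
    grouped = defaultdict(list)
    for name, role in pairs:
        grouped[role or DEFAULT_ROLE].append(name)
    return dict(grouped)


def missing_roles(pairs, available_roles):
    """Return the roles required by the concepts but absent from the LLM group."""
    return sorted(set(group_by_role(pairs)) - set(available_roles))


print(group_by_role(concepts))
# A group with only an extractor LLM would leave some concepts unserved
print(missing_roles(concepts, ["extractor_text"]))
```

A check like this can be run before extraction to confirm that every
role used in a pipeline has a matching LLM in the group.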


Pipeline Assignment to Documents
--------------------------------

Once created, pipelines can be easily assigned to documents:

   from contextgem import Document, ExtractionPipeline

   # Create your pipeline
   my_pipeline = ExtractionPipeline(aspects=[...], concepts=[...])

   # Create documents
   doc1 = Document(raw_text="First document content...")
   doc2 = Document(raw_text="Second document content...")

   # Assign the same pipeline to multiple documents
   doc1.assign_pipeline(my_pipeline)
   doc2.assign_pipeline(my_pipeline)

   # Now both documents have the same extraction configuration


⚙️ Parameters
=============

When creating an "ExtractionPipeline", you can configure the following
parameters:

+----------------------+------------------+-----------------+----------------------------------------------------+
| Parameter            | Type             | Default Value   | Description                                        |
|======================|==================|=================|====================================================|
| "aspects"            | "list[Aspect]"   | "[]"            | *Optional*. List of "Aspect" instances to extract  |
|                      |                  |                 | from documents. Aspects represent structural       |
|                      |                  |                 | categories of information and can contain their    |
|                      |                  |                 | own sub-aspects and concepts for detailed          |
|                      |                  |                 | analysis. See Aspect Extraction for more           |
|                      |                  |                 | information.                                       |
+----------------------+------------------+-----------------+----------------------------------------------------+
| "concepts"           | "list[_Concept]" | "[]"            | *Optional*. List of "_Concept" instances to        |
|                      |                  |                 | identify within or infer from documents. These are |
|                      |                  |                 | document-level concepts that apply to the entire   |
|                      |                  |                 | document content. See supported concept types in   |
|                      |                  |                 | Supported Concepts.                                |
+----------------------+------------------+-----------------+----------------------------------------------------+


📊 Pipeline Assignment
======================

The "assign_pipeline()" method is used to apply a pipeline to a
document. This method:

* **Assigns aspects and concepts**: Transfers the pipeline's aspects
  and concepts to the document

* **Validates compatibility**: Ensures no conflicts with existing
  document configuration


Assignment Options
------------------

   # Basic assignment (will raise error if document already has aspects/concepts)
   document.assign_pipeline(my_pipeline)

   # Overwrite existing configuration
   document.assign_pipeline(my_pipeline, overwrite_existing=True)


🚀 Advanced Usage
=================


Multi-Document Processing
-------------------------

Pipelines excel at processing multiple documents of the same type.
Here's a comprehensive example:

   # Advanced Usage Example - analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency and cost tracking

   import os

   from contextgem import (
       Aspect,
       DateConcept,
       Document,
       DocumentLLM,
       DocumentLLMGroup,
       ExtractionPipeline,
       JsonObjectConcept,
       JsonObjectExample,
       LLMPricing,
       NumericalConcept,
       RatingConcept,
       StringConcept,
       StringExample,
   )


   # Construct documents

   # Document 1 - Consultancy Agreement (shortened for brevity)
   doc1 = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "All intellectual property created during the provision of services shall belong to the Customer...\n"
           "This agreement is governed by the laws of Norway...\n"
           "Annex 1: Data processing agreement...\n"
           "Annex 2: Statement of Work...\n"
           "Annex 3: Service Level Agreement...\n"
       ),
   )

   # Document 2 - Service Level Agreement (shortened for brevity)
   doc2 = Document(
       raw_text=(
           "Service Level Agreement\n"
           "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
           "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
           "The Provider shall deliver IT support services as outlined in Schedule A...\n"
           "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
           "The Provider guarantees [99.9%] uptime for all critical systems...\n"
           "Either party may terminate with 60 days written notice...\n"
           "This agreement is governed by the laws of California...\n"
           "Schedule A: Service Descriptions...\n"
           "Schedule B: Response Time Requirements...\n"
       ),
   )

   # Create a reusable extraction pipeline
   contract_pipeline = ExtractionPipeline()

   # Define aspects and aspect-level concepts in the pipeline
   # Concepts in the aspects will be extracted from the extracted aspect context
   contract_pipeline.aspects = [  # or use .add_aspects([...])
       Aspect(
           name="Contract Parties",
           description="Clauses defining the parties to the agreement",
           concepts=[  # define aspect-level concepts, if any
               StringConcept(
                   name="Party names and roles",
                   description="Names of all parties entering into the agreement and their roles",
                   examples=[  # optional
                       StringExample(
                           content="X (Client)",  # guidance regarding the expected output format
                       )
                   ],
               )
           ],
       ),
       Aspect(
           name="Term",
           description="Clauses defining the term of the agreement",
           concepts=[
               NumericalConcept(
                   name="Contract term",
                   description="The term of the agreement in years",
                   numeric_type="int",  # or "float", or "any" for auto-detection
                   add_references=True,  # extract references to the source text
                   reference_depth="paragraphs",
               )
           ],
       ),
   ]

   # Define document-level concepts
   # Concepts in the document will be extracted from the whole document content
   contract_pipeline.concepts = [  # or use .add_concepts()
       DateConcept(
           name="Effective date",
           description="The effective date of the agreement",
       ),
       StringConcept(
           name="Contract type",
           description="The type of agreement",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
       ),
       StringConcept(
           name="Governing law",
           description="The law that governs the agreement",
       ),
       JsonObjectConcept(
           name="Attachments",
           description="The titles and concise descriptions of the attachments to the agreement",
           structure={"title": str, "description": str | None},
           examples=[  # optional
               JsonObjectExample(  # guidance regarding the expected output format
                   content={
                       "title": "Appendix A",
                       "description": "Code of conduct",
                   }
               ),
           ],
       ),
       RatingConcept(
           name="Duration adequacy",
           description="Contract duration adequacy considering the subject matter and best practices.",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
           rating_scale=(1, 10),
           add_justifications=True,  # add justifications for the rating
           justification_depth="balanced",  # provide a balanced justification
           justification_max_sents=3,
       ),
   ]

   # Assign pipeline to the documents
   # You can re-use the same pipeline for multiple documents
   doc1.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document
   doc2.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document

   # Create an LLM group for data extraction and reasoning
   llm_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="extractor_text",  # signifies the LLM is used for data extraction tasks
       pricing_details=LLMPricing(  # optional, for costs calculation
           input_per_1m_tokens=0.150,
           output_per_1m_tokens=0.600,
       ),
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   llm_reasoner = DocumentLLM(
       model="openai/o3-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="reasoner_text",  # signifies the LLM is used for reasoning tasks
       pricing_details=LLMPricing(  # optional, for costs calculation
           input_per_1m_tokens=1.10,
           output_per_1m_tokens=4.40,
       ),
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   # The LLM group is used for all extraction tasks within the pipeline
   llm_group = DocumentLLMGroup(llms=[llm_extractor, llm_reasoner])

   # Extract all information from the documents at once
   doc1 = llm_group.extract_all(
       doc1, use_concurrency=True
   )  # use concurrency to speed up extraction
   doc2 = llm_group.extract_all(
       doc2, use_concurrency=True
   )  # use concurrency to speed up extraction
   # Or use async variants .extract_all_async(...)

   # Get the extracted data
   print("Some extracted data from doc 1:")
   print("Contract Parties > Party names and roles:")
   print(
       doc1.get_aspect_by_name("Contract Parties")
       .get_concept_by_name("Party names and roles")
       .extracted_items
   )
   print("Attachments:")
   print(doc1.get_concept_by_name("Attachments").extracted_items)
   # ...

   print("\nSome extracted data from doc 2:")
   print("Term > Contract term:")
   print(
       doc2.get_aspect_by_name("Term")
       .get_concept_by_name("Contract term")
       .extracted_items[0]
       .value
   )
   print("Duration adequacy:")
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].value)
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].justification)
   # ...

   # Output processing costs (requires setting the pricing details for each LLM)
   print("\nProcessing costs:")
   print(llm_group.get_cost())
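
The "LLMPricing" values in this example are expressed per one million
tokens. As a rough illustration of the arithmetic behind such a cost
figure (not of how "get_cost()" is implemented), the sketch below
computes a cost from token counts. The token counts are made-up
numbers, and "estimate_cost" is a hypothetical helper.

```python
from decimal import Decimal

TOKENS_PER_PRICING_UNIT = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, input_per_1m, output_per_1m):
    """Estimate the cost of a run from token counts and per-1M-token prices.

    Prices are passed as strings so that Decimal keeps them exact.
    """
    input_cost = Decimal(input_tokens) * Decimal(input_per_1m) / TOKENS_PER_PRICING_UNIT
    output_cost = Decimal(output_tokens) * Decimal(output_per_1m) / TOKENS_PER_PRICING_UNIT
    return input_cost + output_cost


# Prices of the extractor LLM from the example; token counts are illustrative only
cost = estimate_cost(200_000, 50_000, "0.150", "0.600")
print(cost)
```

Using "Decimal" instead of floats avoids rounding noise when many
small per-call costs are summed.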


Pipeline Serialization
----------------------

Pipelines can be serialized for storage and later reuse:

   from contextgem import ExtractionPipeline

   # Serialize an existing pipeline (here, `my_pipeline`)
   pipeline_json = my_pipeline.to_json()  # or to_dict() / to_disk()

   # Deserialize the pipeline
   pipeline_deserialized = ExtractionPipeline.from_json(
       pipeline_json
   )  # or from_dict() / from_disk()


💡 Best Practices
=================


Pipeline Design
---------------

* **Domain-specific organization**: Create pipelines tailored to
  specific document types (contracts, invoices, reports, etc.)

* **Logical grouping**: Group related aspects and concepts together
  for coherent analysis

* **Reusable templates**: Design pipelines to be generic enough for
  reuse across similar documents


Concept Placement Strategy
--------------------------

* **Document-level concepts**: Place concepts that apply to the entire
  document in the pipeline's "concepts" list

* **Aspect-level concepts**: Place concepts that are specific to
  particular document sections within the relevant aspects

* **Avoid duplication**: Don't create similar concepts at both
  document and aspect levels


🎯 Example Use Cases
====================


Invoice Processing Pipeline
---------------------------

   from contextgem import DateConcept, ExtractionPipeline, NumericalConcept, StringConcept

   # Document-level concepts only: flat invoice data needs no aspects
   invoice_pipeline = ExtractionPipeline(
       concepts=[
           StringConcept(name="Vendor Name", description="Name of the vendor/supplier"),
           StringConcept(name="Invoice Number", description="Unique invoice identifier"),
           DateConcept(name="Invoice Date", description="Date the invoice was issued"),
           DateConcept(name="Due Date", description="Payment due date"),
           NumericalConcept(name="Total Amount", description="Total invoice amount"),
           StringConcept(name="Currency", description="Currency of the invoice"),
       ]
   )


Research Paper Analysis Pipeline
--------------------------------

   from contextgem import Aspect, DateConcept, ExtractionPipeline, RatingConcept, StringConcept

   # Aspects capture the paper's sections; concepts capture document-level facts
   research_pipeline = ExtractionPipeline(
       aspects=[
           Aspect(name="Abstract", description="Paper abstract and summary"),
           Aspect(name="Methodology", description="Research methods and approach"),
           Aspect(name="Results", description="Findings and outcomes"),
           Aspect(name="Conclusions", description="Conclusions and implications"),
       ],
       concepts=[
           StringConcept(name="Research Field", description="Primary research domain"),
           StringConcept(name="Keywords", description="Paper keywords and topics"),
           DateConcept(name="Publication Date", description="When the paper was published"),
           RatingConcept(name="Novelty Score", description="Novelty of the research", rating_scale=(1, 10)),
       ]
   )


⚡ Pipeline Reuse Benefits
==========================

* **Consistency**: Ensures all documents are processed with identical
  extraction criteria

* **Efficiency**: Eliminates the need to recreate aspects and concepts
  for each document

* **Maintainability**: Changes to extraction logic only need to be
  made in one place


📚 Related Documentation
========================

* Aspect Extraction - Learn about aspect extraction

* Supported Concepts - Explore available concept types and how to use
  them

* Advanced usage examples - See advanced pipeline usage examples

* Extraction Methods - Understand LLM extraction methods

* Serializing objects and results - Learn about pipeline serialization
  and storage


# ==== llms/supported_llms ====

Supported LLMs
**************

ContextGem supports all LLM providers and models available through the
LiteLLM integration. This means you can use models from major cloud
providers like OpenAI, Anthropic, Google, Azure, and xAI, as well as
run local models through providers like Ollama and LM Studio.

ContextGem works with both types of LLM architectures:

* Reasoning/CoT-capable models (e.g., "openai/o4-mini",
  "ollama_chat/deepseek-r1:32b")

* Non-reasoning models (e.g., "openai/gpt-4.1",
  "ollama_chat/llama3.3:70b")

For a complete list of supported providers, see the LiteLLM Providers
documentation.


☁️ Cloud-based LLMs
===================

You can initialize cloud-based LLMs by specifying the provider and
model name in the format "<provider>/<model_name>":

Using cloud LLM providers

   from contextgem import DocumentLLM


   # Pattern for using any cloud LLM provider
   llm = DocumentLLM(
       model="<provider>/<model_name>",
       api_key="<api_key>",
   )

   # Example - Using OpenAI LLM
   llm_openai = DocumentLLM(
       model="openai/gpt-4.1-mini",
       api_key="<api_key>",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using Azure OpenAI LLM
   llm_azure_openai = DocumentLLM(
       model="azure/o4-mini",
       api_key="<api_key>",
       api_version="<api_version>",
       api_base="<api_base>",
       # see DocumentLLM API reference for all configuration options
   )
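
As a small illustration of the "<provider>/<model_name>" format, the
hypothetical helper below splits an identifier at the first slash
only, since the model name itself may contain slashes (as in
"lm_studio/mistralai/mistral-small-3.2"). It is a sketch for
validating configuration values and is not part of ContextGem or
LiteLLM.

```python
def split_model_id(model: str) -> tuple[str, str]:
    """Split '<provider>/<model_name>' into (provider, model_name)."""
    # partition() splits at the first "/" only, so slashes in the
    # model name are preserved.
    provider, separator, model_name = model.partition("/")
    if not separator or not provider or not model_name:
        raise ValueError(f"Expected '<provider>/<model_name>', got {model!r}")
    return provider, model_name


print(split_model_id("openai/gpt-4.1-mini"))
print(split_model_id("lm_studio/mistralai/mistral-small-3.2"))
```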


💻 Local LLMs
=============

For local LLMs, you'll need to specify the provider, model name, and
the appropriate API base URL:

Using local LLM providers

   from contextgem import DocumentLLM


   local_llm = DocumentLLM(
       model="ollama_chat/<model_name>",
       api_base="http://localhost:11434",  # Default Ollama endpoint
   )

   # Example - Using Llama 3.3 LLM via Ollama
   llm_llama = DocumentLLM(
       model="ollama_chat/llama3.3:70b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using DeepSeek R1 reasoning model via Ollama
   llm_deepseek = DocumentLLM(
       model="ollama_chat/deepseek-r1:32b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

Note:

  **Vision Models with Ollama**: For local vision models that process
  images, use the "ollama/" prefix instead of "ollama_chat/", as the
  latter does not yet support image inputs. For more details, see the
  relevant Ollama GitHub issue and LiteLLM GitHub issue.
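
The prefix rule from this note can be captured in a small helper. This
is only a sketch: "ollama_model_id" is a hypothetical function, and
"<vision_model_name>" is a placeholder for whichever local vision
model you run.

```python
def ollama_model_id(model_name: str, uses_images: bool = False) -> str:
    """Build an Ollama model identifier with the prefix suited to the input type."""
    # Per the note: "ollama/" for models that process images,
    # "ollama_chat/" for text-only use.
    prefix = "ollama" if uses_images else "ollama_chat"
    return f"{prefix}/{model_name}"


print(ollama_model_id("llama3.3:70b"))
print(ollama_model_id("<vision_model_name>", uses_images=True))
```

The resulting string can be passed as the "model" argument of
"DocumentLLM", together with the "api_base" of your Ollama endpoint.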

Note:

  **LM Studio Connection Error**: If you encounter a connection error
  ("litellm.APIError: APIError: Lm_studioException - Connection
  error") when using LM Studio, check that you have provided a dummy
  API key. While API keys are usually not expected for local models,
  this is a specific case where LM Studio requires one:

  LM Studio with dummy API key

     from contextgem import DocumentLLM


     llm = DocumentLLM(
         model="lm_studio/mistralai/mistral-small-3.2",
         api_base="http://localhost:1234/v1",
         api_key="dummy-key",  # dummy key to avoid connection error
     )

     # This is a known issue with calling LM Studio API in litellm:
     # https://github.com/openai/openai-python/issues/961


For a complete list of configuration options available when
initializing DocumentLLM instances, see the next section Configuring
LLM(s).


# ==== llms/llm_config ====

Configuring LLM(s)
******************

This guide explains how to configure "DocumentLLM" instances to
process documents using various LLM providers. ContextGem uses LiteLLM
under the hood, providing uniform access to a wide range of models.
For more information on supported LLMs, see Supported LLMs.


🚀 Basic Configuration
======================

The minimum configuration for a cloud-based LLM requires the "model"
parameter and an "api_key":

Using a cloud-based LLM

   from contextgem import DocumentLLM


   # Pattern for using any cloud LLM provider
   llm = DocumentLLM(
       model="<provider>/<model_name>",
       api_key="<api_key>",
   )

   # Example - Using OpenAI LLM
   llm_openai = DocumentLLM(
       model="openai/gpt-4.1-mini",
       api_key="<api_key>",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using Azure OpenAI LLM
   llm_azure_openai = DocumentLLM(
       model="azure/o4-mini",
       api_key="<api_key>",
       api_version="<api_version>",
       api_base="<api_base>",
       # see DocumentLLM API reference for all configuration options
   )
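
Rather than hard-coding "<api_key>", you may prefer to read the key
from an environment variable, as other examples in this documentation
do with "CONTEXTGEM_OPENAI_API_KEY". A minimal sketch, where
"load_api_key" is a hypothetical helper and the variable name is only
a convention:

```python
import os


def load_api_key(env_var: str = "CONTEXTGEM_OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, failing early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key
```

The returned value can then be passed as "api_key=load_api_key()" when
constructing a "DocumentLLM", so a missing key is reported before any
request is made.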

For local models, you usually need to specify the "api_base" instead
of the API key:

Using a local LLM

   from contextgem import DocumentLLM


   local_llm = DocumentLLM(
       model="ollama_chat/<model_name>",
       api_base="http://localhost:11434",  # Default Ollama endpoint
   )

   # Example - Using Llama 3.3 LLM via Ollama
   llm_llama = DocumentLLM(
       model="ollama_chat/llama3.3:70b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

   # Example - Using DeepSeek R1 reasoning model via Ollama
   llm_deepseek = DocumentLLM(
       model="ollama_chat/deepseek-r1:32b",
       api_base="http://localhost:11434",
       # see DocumentLLM API reference for all configuration options
   )

Note:

  **LM Studio Connection Error**: If you encounter a connection error
  ("litellm.APIError: APIError: Lm_studioException - Connection
  error") when using LM Studio, check that you have provided a dummy
  API key. While API keys are usually not expected for local models,
  this is a specific case where LM Studio requires one:

  LM Studio with dummy API key

     from contextgem import DocumentLLM


     llm = DocumentLLM(
         model="lm_studio/mistralai/mistral-small-3.2",
         api_base="http://localhost:1234/v1",
         api_key="dummy-key",  # dummy key to avoid connection error
     )

     # This is a known issue with calling LM Studio API in litellm:
     # https://github.com/openai/openai-python/issues/961



📝 Configuration Parameters
===========================

The "DocumentLLM" class accepts the following parameters:

+----------------------+-----------------+-----------------+----------------------------------------------------+
| Parameter            | Type            | Default Value   | Description                                        |
|======================|=================|=================|====================================================|
| "model"              | "str"           | (Required)      | Model identifier in format                         |
|                      |                 |                 | "<provider>/<model_name>". See LiteLLM Providers   |
|                      |                 |                 | for all supported providers.                       |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "api_key"            | "str | None"    | "None"          | API key for authentication. Required for most      |
|                      |                 |                 | cloud providers but not for local models.          |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "api_base"           | "str | None"    | "None"          | Base URL of the API endpoint. Required for local   |
|                      |                 |                 | models and some cloud providers (e.g. Azure        |
|                      |                 |                 | OpenAI).                                           |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "deployment_id"      | "str | None"    | "None"          | Deployment ID for the model. Primarily used with   |
|                      |                 |                 | Azure OpenAI.                                      |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "api_version"        | "str | None"    | "None"          | API version. Primarily used with Azure OpenAI.     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "role"               | "str"           | ""extractor_te  | Role type for the LLM. Values: ""extractor_text"", |
|                      |                 | xt""            | ""reasoner_text"", ""extractor_vision"",           |
|                      |                 |                 | ""reasoner_vision"", ""extractor_multimodal"",     |
|                      |                 |                 | ""reasoner_multimodal"". The role parameter is an  |
|                      |                 |                 | abstraction that can be explicitly assigned to     |
|                      |                 |                 | extraction components (aspects and concepts) in    |
|                      |                 |                 | the pipeline. ContextGem then routes extraction    |
|                      |                 |                 | tasks based on these assigned roles, matching      |
|                      |                 |                 | components with LLMs of the same role. This allows |
|                      |                 |                 | you to structure your pipeline with different      |
|                      |                 |                 | models for different tasks (e.g., using simpler    |
|                      |                 |                 | models for basic extractions and more powerful     |
|                      |                 |                 | models for complex reasoning). For more details,   |
|                      |                 |                 | see 🏷️ LLM Roles.                                  |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "system_message"     | "str | None"    | "None"          | If not provided (or set to None), ContextGem       |
|                      |                 |                 | automatically sets a default system message        |
|                      |                 |                 | optimized for extraction tasks, rendered based on  |
|                      |                 |                 | the configured "output_language". This default     |
|                      |                 |                 | system message template can be found here in the   |
|                      |                 |                 | source code. Note that for certain models (such as |
|                      |                 |                 | OpenAI's o1-preview), system messages are not      |
|                      |                 |                 | supported and will be ignored. Overriding this is  |
|                      |                 |                 | typically only necessary for advanced use cases,   |
|                      |                 |                 | such as custom priming or non-extraction tasks.    |
|                      |                 |                 | For simple chat interactions, consider setting     |
|                      |                 |                 | "system_message=''" to disable the default         |
|                      |                 |                 | entirely (meaning no system message will be sent). |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "max_tokens"         | "int"           | "4096"          | Maximum tokens in the generated response           |
|                      |                 |                 | (applicable to most models).                       |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "max_completion_tok  | "int"           | "16000"         | Maximum tokens for output completions in reasoning |
| ens"                 |                 |                 | (CoT-capable) models.                              |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "reasoning_effort"   | "str | None"    | "None"          | Reasoning effort for reasoning (CoT-capable)       |
|                      |                 |                 | models. Values: ""minimal"" (gpt-5 models only),   |
|                      |                 |                 | ""low"", ""medium"", ""high"".                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "timeout"            | "int"           | "120"           | Timeout in seconds for LLM API calls.              |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "num_retries_failed  | "int"           | "3"             | Number of retries when LLM request fails.          |
| _request"            |                 |                 |                                                    |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "max_retries_failed  | "int"           | "0"             | LLM provider-specific retry count for failed       |
| _request"            |                 |                 | requests.                                          |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "max_retries_invali  | "int"           | "3"             | Number of retries when LLM request succeeds but    |
| d_data"              |                 |                 | returns invalid data (JSON parsing and validation  |
|                      |                 |                 | fails).                                            |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "pricing_details"    | "LLMPricing |   | "None"          | "LLMPricing" object with pricing details for cost  |
|                      | None"           |                 | calculation.                                       |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "auto_pricing"       | "bool"          | "False"         | Enable automatic cost calculation using            |
|                      |                 |                 | "genai-prices" based on the configured "model".    |
|                      |                 |                 | Mutually exclusive with "pricing_details". Not     |
|                      |                 |                 | supported for local models (e.g., "ollama/",       |
|                      |                 |                 | "ollama_chat/", "lm_studio/").                     |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "auto_pricing_refre  | "bool"          | "False"         | When "auto_pricing" is enabled, allow              |
| sh"                  |                 |                 | "genai-prices" to auto-refresh its cached pricing  |
|                      |                 |                 | data at runtime.                                   |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "is_fallback"        | "bool"          | "False"         | Indicates whether the LLM is a fallback model.     |
|                      |                 |                 | Fallback LLMs are optionally assigned to the       |
|                      |                 |                 | primary LLM instance and are used when the primary |
|                      |                 |                 | LLM fails.                                         |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "fallback_llm"       | "DocumentLLM |  | "None"          | "DocumentLLM" to use as fallback if current one    |
|                      | None"           |                 | fails. Must have the same role as the primary LLM. |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "output_language"    | "str"           | ""en""          | Language for output text. Values: ""en"" or        |
|                      |                 |                 | ""adapt"" (adapts to document language). Setting   |
|                      |                 |                 | value to ""adapt"" ensures that the text output    |
|                      |                 |                 | (e.g. justifications, conclusions, explanations)   |
|                      |                 |                 | is in the same language as the document. This is   |
|                      |                 |                 | particularly useful when working with non-English  |
|                      |                 |                 | documents. For example, if you're extracting       |
|                      |                 |                 | anomalies from a contract in Spanish, setting      |
|                      |                 |                 | "output_language="adapt"" ensures that anomaly     |
|                      |                 |                 | justifications are also in Spanish, making them    |
|                      |                 |                 | immediately understandable by local end users      |
|                      |                 |                 | reviewing the document. This parameter applies     |
|                      |                 |                 | only when the default system message is used.      |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "temperature"        | "float | None"  | "0.3"           | Sampling temperature (0.0 to 1.0) controlling      |
|                      |                 |                 | response creativity. Lower values produce more     |
|                      |                 |                 | predictable outputs, higher values generate more   |
|                      |                 |                 | varied responses.                                  |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "top_p"              | "float | None"  | "0.3"           | Nucleus sampling value (0.0 to 1.0) controlling    |
|                      |                 |                 | output focus/randomness. Lower values make output  |
|                      |                 |                 | more deterministic, higher values produce more     |
|                      |                 |                 | diverse outputs.                                   |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "seed"               | "int | None"    | "None"          | Seed for random number generation to help produce  |
|                      |                 |                 | more consistent outputs across multiple runs. When |
|                      |                 |                 | set to a specific integer value, the LLM will      |
|                      |                 |                 | attempt to use this seed for sampling operations.  |
|                      |                 |                 | However, deterministic output is still not         |
|                      |                 |                 | guaranteed even with the same seed, as other       |
|                      |                 |                 | factors may influence the model's response.        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "tools"              | "list[dict] |   | "None"          | OpenAI-compatible tool schema used only for chat   |
|                      | None"           |                 | via "DocumentLLM.chat(...)"/".chat_async(...)".    |
|                      |                 |                 | Each tool must have a registered Python handler    |
|                      |                 |                 | decorated with "@register_tool" and available in   |
|                      |                 |                 | scope when creating the LLM. Handlers must return  |
|                      |                 |                 | a string; for structured data, serialize it (e.g., |
|                      |                 |                 | with "json.dumps") before returning. Ignored by    |
|                      |                 |                 | extraction methods. For more details, see 🛠️ Chat  |
|                      |                 |                 | with Tools.                                        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "tool_choice"        | "str | dict |   | "None"          | Tool choice control passed through to the provider |
|                      | None"           |                 | during chat. Ignored by extraction methods.        |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "parallel_tool_call  | "bool | None"   | "None"          | Enable parallel tool calls during chat tool usage, |
| s"                   |                 |                 | if supported by the model/provider. Ignored by     |
|                      |                 |                 | extraction methods.                                |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "tool_max_rounds"    | "int"           | "10"            | Safety limit on the number of tool-execution       |
|                      |                 |                 | rounds per chat request to prevent infinite loops. |
+----------------------+-----------------+-----------------+----------------------------------------------------+
| "async_limiter"      | "AsyncLimiter"  | "AsyncLimiter(  | Relevant when concurrency is enabled for           |
|                      |                 | 3, 10)"         | extraction tasks. Controls frequency of async LLM  |
|                      |                 |                 | API requests for concurrent tasks. Defaults to     |
|                      |                 |                 | allowing 3 acquisitions per 10-second period to    |
|                      |                 |                 | prevent rate limit issues. See aiolimiter          |
|                      |                 |                 | documentation for AsyncLimiter configuration       |
|                      |                 |                 | details. See Optimizing for Speed for an example   |
|                      |                 |                 | of how to easily set up concurrency for            |
|                      |                 |                 | extraction.                                        |
+----------------------+-----------------+-----------------+----------------------------------------------------+

Warning:

  **Auto-pricing accuracy:** When using "auto_pricing=True", cost
  estimates are approximate and will not be 100% accurate, because
  model providers do not publish exact price information for their
  APIs in a format that can be reliably processed. See Pydantic's
  genai-prices for more details.


💡 Advanced Configuration Examples
==================================


🔄 Configuring a Fallback LLM
-----------------------------

You can set up a fallback LLM that will be used if the primary LLM
fails:

Configuring a fallback LLM

   from contextgem import DocumentLLM


   # Primary LLM
   primary_llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="<your-openai-api-key>",
       role="extractor_text",  # default role
   )

   # Fallback LLM
   fallback_llm = DocumentLLM(
       model="anthropic/claude-3-5-haiku",
       api_key="<your-anthropic-api-key>",
       role="extractor_text",  # Must match the primary LLM's role
       is_fallback=True,
   )

   # Assign fallback LLM to primary
   primary_llm.fallback_llm = fallback_llm

   # Then use the primary LLM as usual
   # document = primary_llm.extract_all(document)
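
Conceptually, a fallback behaves like the sketch below: the primary
callable is tried first (with a number of retries), and the fallback
is used only when every attempt fails. This is a simplified,
library-independent illustration. The names "call_with_fallback",
"flaky_primary" and "stable_fallback" are hypothetical, and
ContextGem's actual retry and fallback logic may differ.

```python
from collections.abc import Callable


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    num_retries: int = 3,
) -> str:
    """Try `primary` up to 1 + num_retries times, then use `fallback`."""
    for _attempt in range(1 + num_retries):
        try:
            return primary(prompt)
        except RuntimeError:
            continue  # primary failed, try again
    return fallback(prompt)


attempts = []


def flaky_primary(prompt: str) -> str:
    # Hypothetical primary model that always fails
    attempts.append(prompt)
    raise RuntimeError("primary LLM unavailable")


def stable_fallback(prompt: str) -> str:
    # Hypothetical fallback model that always succeeds
    return f"fallback answer to: {prompt}"


result = call_with_fallback(flaky_primary, stable_fallback, "Extract parties")
print(result)
```

The fallback is only reached after the primary has exhausted its
attempts, which is why both LLMs must be able to handle the same kind
of task (the same role).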


💰 Setting Up Cost Tracking
---------------------------

You can configure pricing parameters to track costs:

Setting up LLM cost tracking

   from contextgem import DocumentLLM, LLMPricing


   # Option 1: Set up an LLM with pricing details
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="<your-openai-api-key>",
       pricing_details=LLMPricing(
           input_per_1m_tokens=0.150,  # Cost per 1M input tokens
           output_per_1m_tokens=0.600,  # Cost per 1M output tokens
       ),
   )

   # Option 2: Set up an LLM with auto-pricing
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="<your-openai-api-key>",
       auto_pricing=True,
   )

   # Perform some extraction tasks

   # Later, you can check the cost
   cost_info = llm.get_cost()
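
As a rough illustration of the arithmetic behind per-token pricing,
the sketch below computes a cost from token counts and per-1M-token
prices. The helper "estimate_cost" and the token counts are
hypothetical. This is not ContextGem's internal implementation, only
the formula implied by "input_per_1m_tokens" and
"output_per_1m_tokens".

```python
from decimal import Decimal


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_1m_tokens: str,
    output_per_1m_tokens: str,
) -> Decimal:
    """Hypothetical helper: cost = tokens / 1,000,000 * price per 1M tokens."""
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) / million * Decimal(input_per_1m_tokens)
    output_cost = Decimal(output_tokens) / million * Decimal(output_per_1m_tokens)
    return input_cost + output_cost


# Illustrative token counts (not measured), using the prices from Option 1
total = estimate_cost(200_000, 50_000, "0.150", "0.600")
print(total)
```

Prices are passed as strings and handled with "Decimal" so that small
per-token amounts are not distorted by binary floating-point rounding.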


🧠 Using Model-Specific Parameters
----------------------------------

For reasoning (CoT-capable) models (such as OpenAI's o1/o3/o4), you
can set reasoning-specific parameters:

Using model-specific parameters

   from contextgem import DocumentLLM


   llm = DocumentLLM(
       model="openai/o3-mini",
       api_key="<your-openai-api-key>",
       max_completion_tokens=8000,  # Specific to reasoning (CoT-capable) models
       reasoning_effort="medium",  # Optional
   )


⚙️ Explicit Capability Declaration
----------------------------------

Model vision capabilities are automatically detected using
"litellm.supports_vision()". If this function does not correctly
identify your model's capabilities, ContextGem will typically issue a
warning, and you can explicitly declare the capability by setting
"_supports_vision=True" on the LLM instance:

   from contextgem import DocumentLLM

   # Example: Explicitly declare vision capability
   # Warning will be issued if automatic vision capability detection fails
   llm = DocumentLLM(
       model="some_provider/custom_vision_model",
       api_base="http://localhost:3456/v1",
       role="extractor_vision"
   )
   # Declare capability if automatic detection fails (warning was issued)
   llm._supports_vision = True

Warning:

  Explicit capability declarations should only be used when automatic
  capability detection fails. Incorrectly setting this flag may lead
  to unexpected behavior or API errors.


🤖🤖 LLM Groups
===============

For complex document processing, you can organize multiple LLMs with
different roles into a group:

Using LLM group

   from contextgem import DocumentLLM, DocumentLLMGroup


   # Create LLMs with different roles
   text_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key="<your-openai-api-key>",
       role="extractor_text",
       output_language="adapt",
   )

   text_reasoner = DocumentLLM(
       model="openai/o3-mini",
       api_key="<your-openai-api-key>",
       role="reasoner_text",
       max_completion_tokens=16000,
       reasoning_effort="high",
       output_language="adapt",
   )

   # Create a group
   llm_group = DocumentLLMGroup(
       llms=[text_extractor, text_reasoner],
       output_language="adapt",  # All LLMs in the group must share the same output language setting
   )

   # Then use the group as usual
   # document = llm_group.extract_all(document)

See a practical example of using an LLM group in 🔄 Using a Multi-LLM
Pipeline to Extract Data from Several Documents.
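
The idea of role-based routing can be pictured with the following
library-independent sketch: each extraction component carries a role,
and is handed to the model in the group that has the same role. The
component names and the "llms_by_role" mapping are hypothetical, and
the sketch does not reproduce how ContextGem implements routing
internally.

```python
from collections import defaultdict

# Hypothetical components, each tagged with the LLM role that should process it
components = [
    ("Contract parties", "extractor_text"),
    ("Payment terms", "extractor_text"),
    ("Anomalies", "reasoner_text"),
]

# Hypothetical group: one model per role, mirroring the group defined above
llms_by_role = {
    "extractor_text": "openai/gpt-4o-mini",
    "reasoner_text": "openai/o3-mini",
}

# Route each component to the model whose role matches
routing: dict[str, list[str]] = defaultdict(list)
for name, role in components:
    if role not in llms_by_role:
        raise ValueError(f"No LLM with role '{role}' in the group")
    routing[llms_by_role[role]].append(name)

for model, names in routing.items():
    print(f"{model}: {names}")
```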


📊 Accessing Usage and Cost Statistics
======================================

You can track input/output token usage and costs:

Tracking usage and cost

   from contextgem import DocumentLLM


   llm = DocumentLLM(
       model="anthropic/claude-3-5-haiku",
       api_key="<your-anthropic-api-key>",
       auto_pricing=True,  # or set `pricing_details=LLMPricing(...)` manually
   )

   # Perform some extraction tasks

   # Get usage statistics
   usage_info = llm.get_usage()

   # Get cost statistics
   cost_info = llm.get_cost()

   # Reset usage and cost statistics
   llm.reset_usage_and_cost()

   # The same methods are available for LLM groups, with optional filtering by LLM role
   # usage_info = llm_group.get_usage(llm_role="extractor_text")
   # cost_info = llm_group.get_cost(llm_role="extractor_text")
   # llm_group.reset_usage_and_cost(llm_role="extractor_text")

The usage statistics include not only token counts but also detailed
information about each individual call made to the LLM. You can access
the call history, including prompts, responses, and timestamps:

Accessing detailed usage information

   from contextgem import DocumentLLM


   llm = DocumentLLM(
       model="openai/gpt-4.1",
       api_key="<your-openai-api-key>",
   )

   # Perform some extraction tasks

   usage_info = llm.get_usage()

   # Access the first usage container in the list (for the primary LLM)
   llm_usage = usage_info[0]

   # Get detailed call information
   for call in llm_usage.usage.calls:
       print(f"Prompt: {call.prompt}")
       print(f"Response: {call.response}")  # original, unprocessed response
       print(f"Sent at: {call.timestamp_sent}")
       print(f"Received at: {call.timestamp_received}")


# ==== llms/llm_extraction_methods ====

Extraction Methods
******************

This guide documents the extraction methods provided by the
"DocumentLLM" and "DocumentLLMGroup" classes for extracting aspects
and concepts from documents using large language models.


📄🧠 Complete Document Processing
=================================


"extract_all()"
---------------

Performs comprehensive extraction by processing a "Document" for all
"Aspect" and "_Concept" instances. This is the most commonly used
method for complete document analysis.

Note:

  See supported concept types in Supported Concepts. All public
  concept types inherit from the internal "_Concept" base class.

**Method Signature:**

   def extract_all(
       self,
       document: Document,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
       max_images_to_analyze_per_call: int = 0,
       raise_exception_on_extraction_error: bool = True,
   ) -> Document

Note:

  An async equivalent "extract_all_async()" is also available.

**Parameters:**

+-----------------+-----------------+------------+--------------------------------------------------------------+
| Parameter       | Type            | Default    | Description                                                  |
|=================|=================|============|==============================================================|
| "document"      | "Document"      | (Required) | The document with attached "Aspect" and/or "_Concept"        |
|                 |                 |            | instances to extract.                                        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "overwrite_exi  | "bool"          | "False"    | Whether to overwrite already processed "Aspect" and          |
| sting"          |                 |            | "_Concept" instances with newly extracted information. This  |
|                 |                 |            | is particularly useful when reprocessing documents with      |
|                 |                 |            | updated LLMs or extraction parameters.                       |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_items_per  | "int"           | "0"        | Maximum number of "Aspect" and/or "_Concept" instances with  |
| _call"          |                 |            | the same extraction parameters to process in a single LLM    |
|                 |                 |            | call (single LLM prompt). "0" means all aspect and/or        |
|                 |                 |            | concept instances with the same parameters in one call.      |
|                 |                 |            | This is particularly useful for complex tasks or long        |
|                 |                 |            | documents to prevent prompt overloading and allow the LLM to |
|                 |                 |            | focus on a smaller set of extraction tasks at once.          |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "use_concurren  | "bool"          | "False"    | Enable concurrent processing of multiple "Aspect" and/or     |
| cy"             |                 |            | "_Concept" instances. Can significantly reduce processing    |
|                 |                 |            | time by executing multiple extraction tasks in parallel,     |
|                 |                 |            | especially beneficial for documents with many aspects and    |
|                 |                 |            | concepts. However, it might cause rate limit errors with LLM |
|                 |                 |            | providers. When enabled, adjust the "async_limiter" on your  |
|                 |                 |            | "DocumentLLM" to control request frequency (default is 3     |
|                 |                 |            | acquisitions per 10 seconds). For optimal results, combine   |
|                 |                 |            | with "max_items_per_call=1" to maximize concurrency,         |
|                 |                 |            | although this will increase LLM API costs, as each           |
|                 |                 |            | aspect/concept will be processed in a separate LLM call (LLM |
|                 |                 |            | prompt). See Optimizing for Speed for examples of            |
|                 |                 |            | concurrency configuration.                                   |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_paragraph  | "int"           | "0"        | Maximum paragraphs to include in a single LLM call (single   |
| s_to_analyze_p  |                 |            | LLM prompt). "0" means all paragraphs. This parameter is     |
| er_call"        |                 |            | crucial when working with long documents that exceed the     |
|                 |                 |            | LLM's context window. By limiting the number of paragraphs   |
|                 |                 |            | per call, you can ensure the LLM processes the document in   |
|                 |                 |            | manageable segments while maintaining semantic coherence.    |
|                 |                 |            | This prevents token limit errors and often improves          |
|                 |                 |            | extraction quality by allowing the model to focus on smaller |
|                 |                 |            | portions of text at a time. For more details on handling     |
|                 |                 |            | long documents, see Dealing with Long Documents.             |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_images_to  | "int"           | "0"        | Maximum "Image" instances to analyze in a single LLM call    |
| _analyze_per_c  |                 |            | (single LLM prompt). "0" means all images. This parameter is |
| all"            |                 |            | crucial when working with documents containing multiple      |
|                 |                 |            | images that might exceed the LLM's context window. By        |
|                 |                 |            | limiting the number of images per call, you can ensure the   |
|                 |                 |            | LLM processes the document's visual content in manageable    |
|                 |                 |            | batches. Relevant only when extracting document-level        |
|                 |                 |            | concepts from document images. See 🖼️ Concept Extraction     |
|                 |                 |            | from Document (vision) for an example of extracting concepts |
|                 |                 |            | from document images.                                        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
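
To make the batching parameters concrete, the sketch below shows how a
limit such as "max_items_per_call" or
"max_paragraphs_to_analyze_per_call" could partition work into
separate LLM calls, with "0" meaning a single call that contains
everything. "split_into_calls" is a hypothetical helper for
illustration only. ContextGem's internal grouping (for example, by
shared extraction parameters) is more involved.

```python
def split_into_calls(items: list[str], max_per_call: int = 0) -> list[list[str]]:
    """Hypothetical helper: partition items into per-call batches.

    A limit of 0 means no limit, so all items go into one call.
    """
    if max_per_call < 0:
        raise ValueError("max_per_call must be >= 0")
    if not items:
        return []
    if max_per_call == 0:
        return [items]
    return [items[i : i + max_per_call] for i in range(0, len(items), max_per_call)]


paragraphs = [f"paragraph {n}" for n in range(1, 8)]  # 7 paragraphs

single_call = split_into_calls(paragraphs)  # limit 0: one call with everything
batched_calls = split_into_calls(paragraphs, max_per_call=3)

print(len(single_call))
print([len(batch) for batch in batched_calls])
```

A smaller limit means more LLM calls (and typically higher cost), but
each prompt carries less content for the model to handle at once.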


**Return Value:**

Returns the same "Document" instance passed as input, but with all
attached "Aspect" and "_Concept" instances populated with their
extracted items. Each aspect and concept has its "extracted_items"
field populated with the extracted information and, if applicable,
"reference_paragraphs"/"reference_sentences" set based on the
extraction parameters. The exact structure of references depends on
the "reference_depth" setting of each aspect and concept.

**Example Usage:**

Extracting all aspects and concepts from a document

   # ContextGem: Extracting All Aspects and Concepts from Document

   import os

   from contextgem import Aspect, Document, DocumentLLM, StringConcept


   # Sample text content
   text_content = """
   John Smith is a 30-year-old software engineer working at TechCorp. 
   He has 5 years of experience in Python development and leads a team of 8 developers.
   His annual salary is $95,000 and he graduated from MIT with a Computer Science degree.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define aspects and concepts directly on the document
   doc.aspects = [
       Aspect(
           name="Professional Information",
           description="Information about the person's career, job, and work experience",
       )
   ]

   doc.concepts = [
       StringConcept(
           name="Person name",
           description="Full name of the person",
       )
   ]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract all aspects and concepts from the document
   processed_doc = llm.extract_all(doc)

   # Access extracted aspect information
   aspect = processed_doc.aspects[0]
   print(f"Aspect: {aspect.name}")
   print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")

   # Access extracted concept information
   concept = processed_doc.concepts[0]
   print(f"Concept: {concept.name}")
   print(f"Extracted value: {concept.extracted_items[0].value}")


📄 Aspect Extraction Methods
============================


"extract_aspects_from_document()"
---------------------------------

Extracts "Aspect" instances from a "Document".

**Method Signature:**

   def extract_aspects_from_document(
       self,
       document: Document,
       from_aspects: list[Aspect] | None = None,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
   ) -> list[Aspect]

Note:

  An async equivalent "extract_aspects_from_document_async()" is also
  available.

**Parameters:**

+-----------------+-----------------+------------+--------------------------------------------------------------+
| Parameter       | Type            | Default    | Description                                                  |
|=================|=================|============|==============================================================|
| "document"      | "Document"      | (Required) | The document with attached "Aspect" instances to be          |
|                 |                 |            | extracted.                                                   |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "from_aspects"  | "list[Aspect] | | "None"     | Specific aspects to extract from the document. If "None",    |
|                 | None"           |            | extracts all aspects attached to the document. This allows   |
|                 |                 |            | you to selectively process only certain aspects rather than  |
|                 |                 |            | the entire set.                                              |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "overwrite_exi  | "bool"          | "False"    | Whether to overwrite already processed aspects with newly    |
| sting"          |                 |            | extracted information. This is particularly useful when      |
|                 |                 |            | reprocessing documents with updated LLMs or extraction       |
|                 |                 |            | parameters.                                                  |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_items_per  | "int"           | "0"        | Maximum number of "Aspect" instances with the same           |
| _call"          |                 |            | extraction parameters to process in a single LLM call        |
|                 |                 |            | (single LLM prompt). "0" means all aspect instances with     |
|                 |                 |            | same extraction params in one call. This is particularly     |
|                 |                 |            | useful for complex tasks or long documents to prevent prompt |
|                 |                 |            | overloading and allow the LLM to focus on a smaller set of   |
|                 |                 |            | extraction tasks at once.                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "use_concurren  | "bool"          | "False"    | Enable concurrent processing of multiple "Aspect" instances. |
| cy"             |                 |            | Can significantly reduce processing time by executing        |
|                 |                 |            | multiple extraction tasks concurrently, especially           |
|                 |                 |            | beneficial for documents with many aspects. However, it      |
|                 |                 |            | might cause rate limit errors with LLM providers. When       |
|                 |                 |            | enabled, adjust the "async_limiter" on your "DocumentLLM" to |
|                 |                 |            | control request frequency (default is 3 acquisitions per 10  |
|                 |                 |            | seconds). For optimal results, combine with                  |
|                 |                 |            | "max_items_per_call=1" to maximize concurrency, although     |
|                 |                 |            | this would increase LLM API costs because each aspect        |
|                 |                 |            | will be processed in a separate LLM call (LLM prompt). See   |
|                 |                 |            | Optimizing for Speed for examples of concurrency             |
|                 |                 |            | configuration.                                               |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_paragraph  | "int"           | "0"        | Maximum paragraphs to include in a single LLM call (single   |
| s_to_analyze_p  |                 |            | LLM prompt). "0" means all paragraphs. This parameter is     |
| er_call"        |                 |            | crucial when working with long documents that exceed the     |
|                 |                 |            | LLM's context window. By limiting the number of paragraphs   |
|                 |                 |            | per call, you can ensure the LLM processes the document in   |
|                 |                 |            | manageable segments while maintaining semantic coherence.    |
|                 |                 |            | This prevents token limit errors and often improves          |
|                 |                 |            | extraction quality by allowing the model to focus on smaller |
|                 |                 |            | portions of text at a time. For more details on handling     |
|                 |                 |            | long documents, see Dealing with Long Documents.             |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
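
To make the interaction of "max_items_per_call" and
"max_paragraphs_to_analyze_per_call" more concrete, the following
minimal sketch models how the two limits could split the work into
separate LLM calls. It is not ContextGem's internal implementation:
"chunked()" and "plan_calls()" are hypothetical helpers, and the
assumption that every batch of aspects is analyzed against every chunk
of paragraphs is a simplification made for illustration.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive chunks of `items`; a size of 0 means a single chunk."""
    if size < 0:
        raise ValueError("size must be >= 0")
    step = size or max(len(items), 1)
    for start in range(0, len(items), step):
        yield list(items[start : start + step])


def plan_calls(
    aspect_names: Sequence[str],
    paragraphs: Sequence[str],
    max_items_per_call: int = 0,
    max_paragraphs_to_analyze_per_call: int = 0,
) -> list[tuple[list[str], list[str]]]:
    """Return one (aspect batch, paragraph chunk) pair per hypothetical LLM call."""
    return [
        (batch, chunk)
        for batch in chunked(aspect_names, max_items_per_call)
        for chunk in chunked(paragraphs, max_paragraphs_to_analyze_per_call)
    ]


aspects = ["Company Overview", "Financial Performance", "Products and Services"]
paragraphs = [f"paragraph {i}" for i in range(1, 6)]

# Defaults (0 and 0): everything goes into a single call
print(len(plan_calls(aspects, paragraphs)))
# One aspect per call, two paragraphs per call: 3 batches x 3 chunks
print(len(plan_calls(aspects, paragraphs, 1, 2)))
```

Under these assumptions, tightening either limit multiplies the number
of calls, which is the cost side of the trade-off described in the
table above.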


**Return Value:**

Returns a list of "Aspect" instances that were processed during
extraction. If "from_aspects" was specified, returns only those
aspects; otherwise returns all aspects attached to the document. Each
aspect in the returned list will have its "extracted_items" field
populated with the extracted information, and its
"reference_paragraphs" field will always be set. The
"reference_sentences" field will only be populated when the aspect's
"reference_depth" is set to ""sentences"".

**Example Usage:**

Extracting aspects from a document

   # ContextGem: Extracting Aspects from Documents

   import os

   from contextgem import Aspect, Document, DocumentLLM


   # Sample text content
   text_content = """
   TechCorp is a leading software development company founded in 2015 with headquarters in San Francisco.
   The company specializes in cloud-based solutions and has grown to 500 employees across 12 countries.
   Their flagship product, CloudManager Pro, serves over 10,000 enterprise clients worldwide.
   TechCorp reported $50 million in revenue for 2023, representing a 25% growth from the previous year.
   The company is known for its innovative AI-powered analytics platform and excellent customer support.
   They recently expanded into the European market and plan to launch three new products in 2024.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define aspects to extract from the document
   doc.aspects = [
       Aspect(
           name="Company Overview",
           description="Basic information about the company, founding, location, and size",
       ),
       Aspect(
           name="Financial Performance",
           description="Revenue, growth metrics, and financial indicators",
       ),
       Aspect(
           name="Products and Services",
           description="Information about the company's products, services, and offerings",
       ),
   ]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract aspects from the document
   extracted_aspects = llm.extract_aspects_from_document(doc)

   # Access extracted aspect information
   for aspect in extracted_aspects:
       print(f"Aspect: {aspect.name}")
       print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")
       print("---")
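
The "from_aspects" and "overwrite_existing" parameters can be combined
to re-run extraction for only some of the attached aspects. The sketch
below wraps that pattern in a hypothetical helper,
"reextract_aspects()". To keep it runnable without an API key, it uses
a small stand-in class ("StubLLM") and placeholder objects instead of
real "DocumentLLM", "Document" and "Aspect" instances; only the method
name and keyword arguments are taken from the signature above.

```python
from types import SimpleNamespace


def reextract_aspects(llm, document, names):
    """Re-run extraction only for the attached aspects whose names are in `names`."""
    selected = [aspect for aspect in document.aspects if aspect.name in names]
    if not selected:
        return []
    return llm.extract_aspects_from_document(
        document,
        from_aspects=selected,  # process only the selected aspects
        overwrite_existing=True,  # replace previously extracted information
    )


class StubLLM:
    """Stand-in for DocumentLLM that records the keyword arguments it receives."""

    def __init__(self):
        self.last_kwargs = None

    def extract_aspects_from_document(self, document, **kwargs):
        self.last_kwargs = kwargs
        return kwargs["from_aspects"]


# Placeholder document with two attached aspects (not real ContextGem objects)
stub_doc = SimpleNamespace(
    aspects=[
        SimpleNamespace(name="Company Overview"),
        SimpleNamespace(name="Financial Performance"),
    ]
)
stub_llm = StubLLM()

refreshed = reextract_aspects(stub_llm, stub_doc, {"Financial Performance"})
print([aspect.name for aspect in refreshed])
```

With a real "DocumentLLM" and "Document", the same helper could be
called as "reextract_aspects(llm, doc, {"Financial Performance"})".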


🧠 Concept Extraction Methods
=============================


"extract_concepts_from_document()"
----------------------------------

Extracts "_Concept" instances from a "Document" object.

Note:

  See supported concept types in Supported Concepts. All public
  concept types inherit from the internal "_Concept" base class.

**Method Signature:**

   def extract_concepts_from_document(
       self,
       document: Document,
       from_concepts: list[_Concept] | None = None,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
       max_images_to_analyze_per_call: int = 0,
       raise_exception_on_extraction_error: bool = True,
   ) -> list[_Concept]

Note:

  An async equivalent "extract_concepts_from_document_async()" is also
  available.

**Parameters:**

+-----------------+-----------------+------------+--------------------------------------------------------------+
| Parameter       | Type            | Default    | Description                                                  |
|=================|=================|============|==============================================================|
| "document"      | "Document"      | (Required) | The document from which concepts are to be extracted.        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "from_concepts" | "list[_Concept] | "None"     | Specific concepts to extract from the document. If "None",   |
|                 | | None"         |            | extracts all concepts attached to the document. This allows  |
|                 |                 |            | you to selectively process only certain concepts rather than |
|                 |                 |            | the entire set.                                              |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "overwrite_exi  | "bool"          | "False"    | Whether to overwrite already processed concepts with newly   |
| sting"          |                 |            | extracted information. This is particularly useful when      |
|                 |                 |            | reprocessing documents with updated LLMs or extraction       |
|                 |                 |            | parameters.                                                  |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_items_per  | "int"           | "0"        | Maximum number of "_Concept" instances with the same         |
| _call"          |                 |            | extraction parameters to process in a single LLM call        |
|                 |                 |            | (single LLM prompt). "0" means all concept instances with    |
|                 |                 |            | same extraction params in one call. This is particularly     |
|                 |                 |            | useful for complex tasks or long documents to prevent prompt |
|                 |                 |            | overloading and allow the LLM to focus on a smaller set of   |
|                 |                 |            | extraction tasks at once.                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "use_concurren  | "bool"          | "False"    | Enable concurrent processing of multiple "_Concept"          |
| cy"             |                 |            | instances. Can significantly reduce processing time by       |
|                 |                 |            | executing multiple extraction tasks concurrently, especially |
|                 |                 |            | beneficial for documents with many concepts. However, it     |
|                 |                 |            | might cause rate limit errors with LLM providers. When       |
|                 |                 |            | enabled, adjust the "async_limiter" on your "DocumentLLM" to |
|                 |                 |            | control request frequency (default is 3 acquisitions per 10  |
|                 |                 |            | seconds). For optimal results, combine with                  |
|                 |                 |            | "max_items_per_call=1" to maximize concurrency, although     |
|                 |                 |            | this would increase LLM API costs because each concept       |
|                 |                 |            | will be processed in a separate LLM call (LLM prompt). See   |
|                 |                 |            | Optimizing for Speed for examples of concurrency             |
|                 |                 |            | configuration.                                               |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_paragraph  | "int"           | "0"        | Maximum paragraphs to include in a single LLM call (single   |
| s_to_analyze_p  |                 |            | LLM prompt). "0" means all paragraphs. This parameter is     |
| er_call"        |                 |            | crucial when working with long documents that exceed the     |
|                 |                 |            | LLM's context window. By limiting the number of paragraphs   |
|                 |                 |            | per call, you can ensure the LLM processes the document in   |
|                 |                 |            | manageable segments while maintaining semantic coherence.    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_images_to  | "int"           | "0"        | Maximum images to include in a single LLM call (single LLM   |
| _analyze_per_c  |                 |            | prompt). "0" means all images. This parameter is crucial     |
| all"            |                 |            | when extracting concepts from documents with multiple images |
|                 |                 |            | using vision-capable LLMs. It helps prevent overwhelming the |
|                 |                 |            | model with too many visual inputs at once, manages token     |
|                 |                 |            | usage more effectively, and enables more focused concept     |
|                 |                 |            | extraction from visual content. See 🖼️ Concept Extraction    |
|                 |                 |            | from Document (vision) for an example of extracting concepts |
|                 |                 |            | from document images.                                        |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
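
The default limiter setting mentioned for "use_concurrency" (3
acquisitions per 10 seconds) puts a floor on how quickly many small
calls can be issued. The arithmetic can be sketched as follows,
assuming a leaky-bucket style limiter that admits an initial burst of
"max_rate" calls and then one further call every "time_period /
max_rate" seconds. This model, the function name
"estimate_min_wait()" and its parameter names are assumptions for
illustration; the limiter actually configured on your "DocumentLLM"
may behave differently.

```python
def estimate_min_wait(
    n_calls: int, max_rate: int = 3, time_period: float = 10.0
) -> float:
    """Rough lower bound, in seconds, before the last of `n_calls` may start."""
    if n_calls < 0:
        raise ValueError("n_calls must be >= 0")
    if max_rate <= 0 or time_period <= 0:
        raise ValueError("max_rate and time_period must be positive")
    # Calls beyond the initial burst each wait for one slot to free up
    queued = max(0, n_calls - max_rate)
    return queued * time_period / max_rate


for n in (3, 9, 30):
    print(f"{n} calls: at least {estimate_min_wait(n):.1f} s of rate limiting")
```

Under this model, splitting the work into 30 single-concept calls with
the default limiter would spend at least 90 seconds waiting on the
limiter alone, which is why the limiter usually needs to be raised
(within your provider's rate limits) when "max_items_per_call=1" is
combined with concurrency.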


**Return Value:**

Returns a list of "_Concept" instances that were processed during
extraction. If "from_concepts" was specified, returns only those
concepts; otherwise returns all concepts attached to the document.
Each concept in the returned list will have its "extracted_items"
field populated with the extracted information and, if applicable,
"reference_paragraphs" / "reference_sentences" set based on the
extraction parameters.

**Example Usage:**

Extracting concepts from a document

   # ContextGem: Extracting Concepts Directly from Documents

   import os

   from contextgem import Document, DocumentLLM, NumericalConcept, StringConcept


   # Sample text content
   text_content = """
   GreenTech Solutions is an environmental technology company founded in 2018 in Portland, Oregon.
   The company develops sustainable energy solutions and has 75 employees working remotely across the United States.
   Their primary product, EcoMonitor, helps businesses track carbon emissions and has been adopted by 2,500 organizations.
   GreenTech Solutions reported strong financial performance with $8.5 million in revenue for 2024.
   The company's CEO, Sarah Johnson, announced plans to achieve carbon neutrality by 2025.
   They recently opened a new research facility in Seattle and hired 20 additional engineers.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define concepts to extract from the document
   doc.concepts = [
       StringConcept(
           name="Company Name",
           description="Full name of the company",
       ),
       StringConcept(
           name="CEO Name",
           description="Full name of the company's CEO",
       ),
       NumericalConcept(
           name="Employee Count",
           description="Total number of employees at the company",
           numeric_type="int",
       ),
       StringConcept(
           name="Annual Revenue",
           description="Company's total revenue for the year",
       ),
   ]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract concepts from the document
   extracted_concepts = llm.extract_concepts_from_document(doc)

   # Access extracted concept information
   print("Concepts extracted from document:")
   for concept in extracted_concepts:
       print(f"  {concept.name}: {[item.value for item in concept.extracted_items]}")


"extract_concepts_from_aspect()"
--------------------------------

Extracts "_Concept" instances associated with a given "Aspect" in a
"Document".

The aspect must be processed before concept extraction can occur.
That is, it must already have gone through aspect extraction, which
identifies the context (text segments) in the document that matches
the aspect's description. This extracted context then serves as the
basis for concept extraction, so concepts are identified specifically
within the scope of the aspect.

Note:

  See supported concept types in Supported Concepts. All public
  concept types inherit from the internal "_Concept" base class.

**Method Signature:**

   def extract_concepts_from_aspect(
       self,
       aspect: Aspect,
       document: Document,
       from_concepts: list[_Concept] | None = None,
       overwrite_existing: bool = False,
       max_items_per_call: int = 0,
       use_concurrency: bool = False,
       max_paragraphs_to_analyze_per_call: int = 0,
       raise_exception_on_extraction_error: bool = True,
   ) -> list[_Concept]

Note:

  An async equivalent "extract_concepts_from_aspect_async()" is also
  available.

**Parameters:**

+-----------------+-----------------+------------+--------------------------------------------------------------+
| Parameter       | Type            | Default    | Description                                                  |
|=================|=================|============|==============================================================|
| "aspect"        | "Aspect"        | (Required) | The aspect from which to extract concepts. Must be           |
|                 |                 |            | previously processed through aspect extraction before        |
|                 |                 |            | concepts can be extracted.                                   |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "document"      | "Document"      | (Required) | The document that contains the aspect with the attached      |
|                 |                 |            | concepts to be extracted.                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "from_concepts" | "list[_Concept] | "None"     | Specific concepts to extract from the aspect. If "None",     |
|                 | | None"         |            | extracts all concepts attached to the aspect. This allows    |
|                 |                 |            | you to selectively process only certain concepts rather than |
|                 |                 |            | the entire set.                                              |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "overwrite_exi  | "bool"          | "False"    | Whether to overwrite already processed concepts with newly   |
| sting"          |                 |            | extracted information. This is particularly useful when      |
|                 |                 |            | reprocessing documents with updated LLMs or extraction       |
|                 |                 |            | parameters.                                                  |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_items_per  | "int"           | "0"        | Maximum number of "_Concept" instances with the same         |
| _call"          |                 |            | extraction parameters to process in a single LLM call        |
|                 |                 |            | (single LLM prompt). "0" means all concept instances with    |
|                 |                 |            | same extraction params in one call. This is particularly     |
|                 |                 |            | useful for complex tasks to prevent prompt overloading and   |
|                 |                 |            | allow the LLM to focus on a smaller set of extraction tasks  |
|                 |                 |            | at once.                                                     |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "use_concurren  | "bool"          | "False"    | Enable concurrent processing of multiple "_Concept"          |
| cy"             |                 |            | instances. Can significantly reduce processing time by       |
|                 |                 |            | executing multiple extraction tasks concurrently, especially |
|                 |                 |            | beneficial for aspects with many concepts. However, it might |
|                 |                 |            | cause rate limit errors with LLM providers. When enabled,    |
|                 |                 |            | adjust the "async_limiter" on your "DocumentLLM" to control  |
|                 |                 |            | request frequency (default is 3 acquisitions per 10          |
|                 |                 |            | seconds). For optimal results, combine with                  |
|                 |                 |            | "max_items_per_call=1" to maximize concurrency, although     |
|                 |                 |            | this would increase LLM API costs because each concept       |
|                 |                 |            | will be processed in a separate LLM call (LLM prompt). See   |
|                 |                 |            | Optimizing for Speed for examples of concurrency             |
|                 |                 |            | configuration.                                               |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "max_paragraph  | "int"           | "0"        | Maximum number of the aspect's paragraphs to analyze in a    |
| s_to_analyze_p  |                 |            | single LLM call (single LLM prompt). "0" means all the       |
| er_call"        |                 |            | aspect's paragraphs. This parameter is crucial when working  |
|                 |                 |            | with long documents or aspects that cover extensive portions |
|                 |                 |            | of text that might exceed the LLM's context window. By       |
|                 |                 |            | limiting the number of paragraphs per call, you can break    |
|                 |                 |            | down analysis into manageable chunks or allow the LLM to     |
|                 |                 |            | focus more deeply on smaller sections of text at a time. For |
|                 |                 |            | more details on handling long documents, see Dealing with    |
|                 |                 |            | Long Documents.                                              |
+-----------------+-----------------+------------+--------------------------------------------------------------+
| "raise_excepti  | "bool"          | "True"     | Whether to raise an exception if the extraction fails due to |
| on_on_extracti  |                 |            | invalid data returned by an LLM or an error in the LLM API.  |
| on_error"       |                 |            | If True (default): if the LLM returns invalid data,          |
|                 |                 |            | "LLMExtractionError" will be raised, and if the LLM API call |
|                 |                 |            | fails, "LLMAPIError" will be raised. If False, a warning     |
|                 |                 |            | will be issued instead, and no extracted items will be       |
|                 |                 |            | returned.                                                    |
+-----------------+-----------------+------------+--------------------------------------------------------------+


**Return Value:**

Returns a list of "_Concept" instances that were processed during
extraction from the specified aspect. If "from_concepts" was
specified, returns only those concepts; otherwise returns all concepts
attached to the aspect. Each concept in the returned list will have
its "extracted_items" field populated with the extracted information
and, if applicable, "reference_paragraphs" / "reference_sentences" set
based on the extraction parameters.
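
Because aspects must be processed before their concepts, a document
with several aspects needs the two methods called in a fixed order.
The sketch below shows that ordering in a hypothetical helper,
"extract_concepts_per_aspect()". It relies only on the two method
names and the "name" / "extracted_items" / "value" attributes used
elsewhere on this page; "StubLLM" and the placeholder objects are
stand-ins so that the sketch runs without an API key.

```python
from types import SimpleNamespace


def extract_concepts_per_aspect(llm, document):
    """Extract aspects first, then the concepts attached to each aspect."""
    results = {}
    # Step 1: aspect extraction identifies the relevant context in the document
    for aspect in llm.extract_aspects_from_document(document):
        # Step 2: concepts are extracted within the scope of the processed aspect
        concepts = llm.extract_concepts_from_aspect(aspect, document)
        results[aspect.name] = {
            concept.name: [item.value for item in concept.extracted_items]
            for concept in concepts
        }
    return results


class StubLLM:
    """Stand-in for DocumentLLM that records the order of method calls."""

    def __init__(self):
        self.calls = []

    def extract_aspects_from_document(self, document):
        self.calls.append("aspects")
        return document.aspects

    def extract_concepts_from_aspect(self, aspect, document):
        self.calls.append(f"concepts:{aspect.name}")
        return aspect.concepts


# Placeholder objects mimicking an already populated result
revenue = SimpleNamespace(
    name="Annual Revenue",
    extracted_items=[SimpleNamespace(value="$12 million")],
)
stub_aspect = SimpleNamespace(name="Financial Performance", concepts=[revenue])
stub_doc = SimpleNamespace(aspects=[stub_aspect])
stub_llm = StubLLM()

summary = extract_concepts_per_aspect(stub_llm, stub_doc)
print(summary)
print(stub_llm.calls)
```

The helper makes one aspect extraction pass followed by one concept
extraction call per aspect, so the number of LLM requests grows with
the number of aspects.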

**Example Usage:**

Extracting concepts from an aspect

   # ContextGem: Extracting Concepts from Specific Aspects

   import os

   from contextgem import Aspect, Document, DocumentLLM, NumericalConcept, StringConcept


   # Sample text content
   text_content = """
   DataFlow Systems is an innovative fintech startup that was established in 2020 in Austin, Texas.
   The company has rapidly grown to 150 employees and operates in 8 major cities across North America.
   DataFlow's core platform, FinanceStream, is used by more than 5,000 small businesses for automated accounting.
   In their latest financial report, DataFlow Systems announced $12 million in annual revenue for 2024.
   This represents an impressive 40% increase compared to their 2023 performance.
   The company has secured $25 million in Series B funding and plans to expand internationally next year.
   """

   # Create a Document object from text
   doc = Document(raw_text=text_content)

   # Define an aspect to extract from the document
   financial_aspect = Aspect(
       name="Financial Performance",
       description="Revenue, growth metrics, and financial indicators",
   )

   # Add concepts to the aspect
   financial_aspect.concepts = [
       StringConcept(
           name="Annual Revenue",
           description="Total revenue reported for the year",
       ),
       NumericalConcept(
           name="Growth Rate",
           description="Percentage growth rate compared to previous year",
           numeric_type="float",
       ),
       NumericalConcept(
           name="Revenue Year",
           description="The year for which revenue is reported",
       ),
   ]

   # Attach the aspect to the document
   doc.aspects = [financial_aspect]

   # Configure DocumentLLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # First, extract the aspect from the document (required before concept extraction)
   extracted_aspects = llm.extract_aspects_from_document(doc)
   financial_aspect = extracted_aspects[0]

   # Extract concepts from the specific aspect
   extracted_concepts = llm.extract_concepts_from_aspect(financial_aspect, doc)

   # Access extracted concepts for the aspect
   print(f"Aspect: {financial_aspect.name}")
   print(f"Extracted items: {[item.value for item in financial_aspect.extracted_items]}")
   print("\nConcepts extracted from this aspect:")
   for concept in extracted_concepts:
       print(f"  {concept.name}: {[item.value for item in concept.extracted_items]}")


# ==== advanced_usage ====

Advanced usage examples
***********************

Below are complete, self-contained examples demonstrating advanced
usage of ContextGem.


🔍 Extracting Aspects Containing Concepts
=========================================

Tip:

  Concept extraction is useful for extracting specific data points
  from a document or an aspect. For example, a "Payment terms" aspect
  in a contract may have multiple concepts:

  * "Payment amount"

  * "Payment due date"

  * "Payment method"

   # Advanced Usage Example - extracting a single aspect with inner concepts from a legal document

   import os

   from contextgem import Aspect, Document, DocumentLLM, StringConcept, StringExample


   # Create a document instance with e.g. a legal contract text
   # The text is shortened for brevity
   doc = Document(
       raw_text=(
           "EMPLOYMENT AGREEMENT\n\n"
           'This Employment Agreement (the "Agreement") is made and entered into as of January 15, 2023 (the "Effective Date"), '
           'by and between ABC Corporation, a Delaware corporation (the "Company"), and Jane Smith, an individual (the "Employee").\n\n'
           "1. EMPLOYMENT TERM\n"
           "The Company hereby employs the Employee, and the Employee hereby accepts employment with the Company, upon the terms and "
           "conditions set forth in this Agreement. The term of this Agreement shall commence on the Effective Date and shall continue "
           'for a period of two (2) years, unless earlier terminated in accordance with Section 8 (the "Term").\n\n'
           "2. POSITION AND DUTIES\n"
           "During the Term, the Employee shall serve as Chief Technology Officer of the Company, with such duties and responsibilities "
           "as are commensurate with such position.\n\n"
           "8. TERMINATION\n"
           "8.1 Termination by the Company. The Company may terminate the Employee's employment for Cause at any time upon written notice. "
           "\"Cause\" shall mean: (i) Employee's material breach of this Agreement; (ii) Employee's conviction of a felony; or "
           "(iii) Employee's willful misconduct that causes material harm to the Company.\n"
           "8.2 Termination by the Employee. The Employee may terminate employment for Good Reason upon 30 days' written notice to the Company. "
           "\"Good Reason\" shall mean a material reduction in Employee's base salary or a material diminution in Employee's duties.\n"
           "8.3 Severance. If the Employee's employment is terminated by the Company without Cause or by the Employee for Good Reason, "
           "the Employee shall be entitled to receive severance pay equal to six (6) months of the Employee's base salary.\n\n"
           "IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first written above.\n\n"
           "ABC CORPORATION\n\n"
           "By: ______________________\n"
           "Name: John Johnson\n"
           "Title: CEO\n\n"
           "EMPLOYEE\n\n"
           "______________________\n"
           "Jane Smith"
       )
   )

   # Define an aspect focused on termination clauses
   termination_aspect = Aspect(
       name="Termination Provisions",
       description="Analysis of contract termination conditions, notice requirements, and severance terms.",
       reference_depth="paragraphs",
   )

   # Define concepts for the termination aspect
   termination_for_cause = StringConcept(
       name="Termination for Cause",
       description="Conditions under which the company can terminate the employee for cause.",
       examples=[  # optional, examples help the LLM to understand the concept better
           StringExample(content="Employee may be terminated for misconduct"),
           StringExample(content="Termination for breach of contract"),
       ],
       add_references=True,
       reference_depth="sentences",
   )
   notice_period = StringConcept(
       name="Notice Period",
       description="Required notification period before employment termination.",
       add_references=True,
       reference_depth="sentences",
   )
   severance_terms = StringConcept(
       name="Severance Package",
       description="Compensation and benefits provided upon termination.",
       add_references=True,
       reference_depth="sentences",
   )

   # Add concepts to the aspect
   termination_aspect.add_concepts([termination_for_cause, notice_period, severance_terms])

   # Add the aspect to the document
   doc.add_aspects([termination_aspect])

   # Create an LLM for extracting data from the document
   llm = DocumentLLM(
       model="openai/gpt-4o",  # You can use models from other providers as well, e.g. "anthropic/claude-3-5-sonnet"
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key for OpenAI or another LLM provider
   )

   # Extract all information from the document
   doc = llm.extract_all(doc)

   # Access the extracted information in the document object
   print("=== Termination Provisions Analysis ===")
   print(f"Extracted {len(doc.aspects[0].extracted_items)} items from the aspect")

   # Access extracted aspect concepts in the document object
   for concept in doc.aspects[0].concepts:
       print(f"--- {concept.name} ---")
       for item in concept.extracted_items:
           print(f"• {item.value}")
           print(f"  Reference sentences: {len(item.reference_sentences)}")


📊 Extracting Aspects and Concepts from a Document
==================================================

Tip:

  This example demonstrates how to extract both document-level
  concepts and aspect-specific concepts from a document with
  references. Using concurrency can significantly speed up extraction
  when working with multiple aspects and concepts.

  Document-level concepts apply to the entire document (like "Is
  Privacy Policy" or "Last Updated Date"), while aspect-specific
  concepts are tied to particular sections or themes within the
  document.

   # Advanced Usage Example - Extracting aspects and concepts from a document, with references,
   # using concurrency

   import os

   from aiolimiter import AsyncLimiter

   from contextgem import (
       Aspect,
       BooleanConcept,
       DateConcept,
       Document,
       DocumentLLM,
       JsonObjectConcept,
       StringConcept,
   )


   # Example privacy policy document (shortened for brevity)
   doc = Document(
       raw_text=(
           "Privacy Policy\n\n"
           "Last Updated: March 15, 2024\n\n"
           "1. Data Collection\n"
           "We collect various types of information from our users, including:\n"
           "- Personal information (name, email address, phone number)\n"
           "- Device information (IP address, browser type, operating system)\n"
           "- Usage data (pages visited, time spent on site)\n"
           "- Location data (with your consent)\n\n"
           "2. Data Usage\n"
           "We use your information to:\n"
           "- Provide and improve our services\n"
           "- Send you marketing communications (if you opt-in)\n"
           "- Analyze website performance\n"
           "- Comply with legal obligations\n\n"
           "3. Data Sharing\n"
           "We may share your information with:\n"
           "- Service providers (for processing payments and analytics)\n"
           "- Law enforcement (when legally required)\n"
           "- Business partners (with your explicit consent)\n\n"
           "4. Data Retention\n"
           "We retain personal data for 24 months after your last interaction with our services. "
           "Analytics data is kept for 36 months.\n\n"
           "5. User Rights\n"
           "You have the right to:\n"
           "- Access your personal data\n"
           "- Request data deletion\n"
           "- Opt-out of marketing communications\n"
           "- Lodge a complaint with supervisory authorities\n\n"
           "6. Contact Information\n"
           "For privacy-related inquiries, contact our Data Protection Officer at privacy@example.com\n"
       ),
   )

   # Define all document-level concepts in a single declaration
   document_concepts = [
       BooleanConcept(
           name="Is Privacy Policy",
           description="Verify if this document is a privacy policy",
           singular_occurrence=True,  # explicitly enforce singular extracted item (optional)
       ),
       DateConcept(
           name="Last Updated Date",
           description="The date when the privacy policy was last updated",
           singular_occurrence=True,  # explicitly enforce singular extracted item (optional)
       ),
       StringConcept(
           name="Contact Information",
           description="Contact details for privacy-related inquiries",
           add_references=True,
           reference_depth="sentences",
       ),
   ]

   # Define all aspects with their concepts in a single declaration
   aspects = [
       Aspect(
           name="Data Collection",
           description="Information about what types of data are collected from users",
           concepts=[
               JsonObjectConcept(
                   name="Collected Data Types",
                   description="List of different types of data collected from users",
                   structure={
                       "personal_info": list[str],
                       "technical_info": list[str],
                       "usage_info": list[str],
                   },  # simply use a dictionary with type hints (including generic aliases and union types)
                   add_references=True,
                   reference_depth="sentences",
               )
           ],
       ),
       Aspect(
           name="Data Retention",
           description="Information about how long different types of data are retained",
           concepts=[
               JsonObjectConcept(
                   name="Retention Periods",
                   description="The durations for which different types of data are retained",
                   structure={
                       "personal_info": str | None,
                       "technical_info": str | None,
                       "usage_info": str | None,
                   },  # use `str | None` type hints to allow for None values if not specified
                   add_references=True,
                   reference_depth="sentences",
                   singular_occurrence=True,  # explicitly enforce singular extracted item (optional)
               )
           ],
       ),
       Aspect(
           name="Data Subject Rights",
           description="Information about the rights users have regarding their data",
           concepts=[
               StringConcept(
                   name="Data Subject Rights",
                   description="Rights available to users regarding their personal data",
                   add_references=True,
                   reference_depth="sentences",
               )
           ],
       ),
   ]

   # Add aspects and concepts to the document
   doc.add_aspects(aspects)
   doc.add_concepts(document_concepts)

   # Create an LLM for extraction
   llm = DocumentLLM(
       model="openai/gpt-4o",  # or another LLM from e.g. Anthropic, Ollama, etc.
       api_key=os.environ.get(
           "CONTEXTGEM_OPENAI_API_KEY"
       ),  # your API key for the applicable LLM provider
   )
   llm.async_limiter = AsyncLimiter(
       3, 3
   )  # optional: customize the async limiter (here, at most 3 requests per 3 seconds)


   # Extract all information from the document, using concurrency
   doc = llm.extract_all(doc, use_concurrency=True)

   # Access / print extracted information on the document object

   print("Document Concepts:")
   for concept in doc.concepts:
       print(f"{concept.name}:")
       for item in concept.extracted_items:
           print(f"• {item.value}")
       print()

   print("Aspects and Concepts:")
   for aspect in doc.aspects:
       print(f"[{aspect.name}]")
       for item in aspect.extracted_items:
           print(f"• {item.value}")
       print()
       for concept in aspect.concepts:
           print(f"{concept.name}:")
           for item in concept.extracted_items:
               print(f"• {item.value}")
       print()


🔄 Using a Multi-LLM Pipeline to Extract Data from Several Documents
====================================================================

Tip:

  A pipeline is a reusable configuration of extraction steps. You can
  use the same pipeline to extract data from multiple documents.

  For example, if your app extracts data from invoices, you can
  configure a pipeline once, and then use it for each incoming
  invoice.

   # Advanced Usage Example - analyzing multiple documents with a single pipeline,
   # with different LLMs, concurrency and cost tracking

   import os

   from contextgem import (
       Aspect,
       DateConcept,
       Document,
       DocumentLLM,
       DocumentLLMGroup,
       ExtractionPipeline,
       JsonObjectConcept,
       JsonObjectExample,
       LLMPricing,
       NumericalConcept,
       RatingConcept,
       StringConcept,
       StringExample,
   )


   # Construct documents

   # Document 1 - Consultancy Agreement (shortened for brevity)
   doc1 = Document(
       raw_text=(
           "Consultancy Agreement\n"
           "This agreement between Company A (Supplier) and Company B (Customer)...\n"
           "The term of the agreement is 1 year from the Effective Date...\n"
           "The Supplier shall provide consultancy services as described in Annex 2...\n"
           "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
           "All intellectual property created during the provision of services shall belong to the Customer...\n"
           "This agreement is governed by the laws of Norway...\n"
           "Annex 1: Data processing agreement...\n"
           "Annex 2: Statement of Work...\n"
           "Annex 3: Service Level Agreement...\n"
       ),
   )

   # Document 2 - Service Level Agreement (shortened for brevity)
   doc2 = Document(
       raw_text=(
           "Service Level Agreement\n"
           "This agreement between TechCorp (Provider) and GlobalInc (Client)...\n"
           "The agreement shall commence on January 1, 2023 and continue for 2 years...\n"
           "The Provider shall deliver IT support services as outlined in Schedule A...\n"
           "The Client shall make monthly payments of $5,000 within 15 days of invoice receipt...\n"
           "The Provider guarantees [99.9%] uptime for all critical systems...\n"
           "Either party may terminate with 60 days written notice...\n"
           "This agreement is governed by the laws of California...\n"
           "Schedule A: Service Descriptions...\n"
           "Schedule B: Response Time Requirements...\n"
       ),
   )

   # Create a reusable extraction pipeline
   contract_pipeline = ExtractionPipeline()

   # Define aspects and aspect-level concepts in the pipeline
   # Concepts in the aspects will be extracted from the extracted aspect context
   contract_pipeline.aspects = [  # or use .add_aspects([...])
       Aspect(
           name="Contract Parties",
           description="Clauses defining the parties to the agreement",
           concepts=[  # define aspect-level concepts, if any
               StringConcept(
                   name="Party names and roles",
                   description="Names of all parties entering into the agreement and their roles",
                   examples=[  # optional
                       StringExample(
                           content="X (Client)",  # guidance regarding the expected output format
                       )
                   ],
               )
           ],
       ),
       Aspect(
           name="Term",
           description="Clauses defining the term of the agreement",
           concepts=[
               NumericalConcept(
                   name="Contract term",
                   description="The term of the agreement in years",
                   numeric_type="int",  # or "float", or "any" for auto-detection
                   add_references=True,  # extract references to the source text
                   reference_depth="paragraphs",
               )
           ],
       ),
   ]

   # Define document-level concepts
   # Concepts in the document will be extracted from the whole document content
   contract_pipeline.concepts = [  # or use .add_concepts()
       DateConcept(
           name="Effective date",
           description="The effective date of the agreement",
       ),
       StringConcept(
           name="Contract type",
           description="The type of agreement",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
       ),
       StringConcept(
           name="Governing law",
           description="The law that governs the agreement",
       ),
       JsonObjectConcept(
           name="Attachments",
           description="The titles and concise descriptions of the attachments to the agreement",
           structure={"title": str, "description": str | None},
           examples=[  # optional
               JsonObjectExample(  # guidance regarding the expected output format
                   content={
                       "title": "Appendix A",
                       "description": "Code of conduct",
                   }
               ),
           ],
       ),
       RatingConcept(
           name="Duration adequacy",
           description="Contract duration adequacy considering the subject matter and best practices.",
           llm_role="reasoner_text",  # for this concept, we use a more advanced LLM for reasoning
           rating_scale=(1, 10),
           add_justifications=True,  # add justifications for the rating
           justification_depth="balanced",  # provide a balanced justification
           justification_max_sents=3,
       ),
   ]

   # Assign pipeline to the documents
   # You can re-use the same pipeline for multiple documents
   doc1.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document
   doc2.assign_pipeline(
       contract_pipeline
   )  # assigns pipeline aspects and concepts to the document

   # Create an LLM group for data extraction and reasoning
   llm_extractor = DocumentLLM(
       model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="extractor_text",  # signifies the LLM is used for data extraction tasks
       pricing_details=LLMPricing(  # optional, for costs calculation
           input_per_1m_tokens=0.150,
           output_per_1m_tokens=0.600,
       ),
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   llm_reasoner = DocumentLLM(
       model="openai/o3-mini",  # or any other LLM from e.g. Anthropic, etc.
       api_key=os.environ["CONTEXTGEM_OPENAI_API_KEY"],  # your API key
       role="reasoner_text",  # signifies the LLM is used for reasoning tasks
       pricing_details=LLMPricing(  # optional, for costs calculation
           input_per_1m_tokens=1.10,
           output_per_1m_tokens=4.40,
       ),
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )
   # The LLM group is used for all extraction tasks within the pipeline
   llm_group = DocumentLLMGroup(llms=[llm_extractor, llm_reasoner])

   # Extract all information from the documents at once
   doc1 = llm_group.extract_all(
       doc1, use_concurrency=True
   )  # use concurrency to speed up extraction
   doc2 = llm_group.extract_all(
       doc2, use_concurrency=True
   )  # use concurrency to speed up extraction
   # Or use async variants .extract_all_async(...)

   # Get the extracted data
   print("Some extracted data from doc 1:")
   print("Contract Parties > Party names and roles:")
   print(
       doc1.get_aspect_by_name("Contract Parties")
       .get_concept_by_name("Party names and roles")
       .extracted_items
   )
   print("Attachments:")
   print(doc1.get_concept_by_name("Attachments").extracted_items)
   # ...

   print("\nSome extracted data from doc 2:")
   print("Term > Contract term:")
   print(
       doc2.get_aspect_by_name("Term")
       .get_concept_by_name("Contract term")
       .extracted_items[0]
       .value
   )
   print("Duration adequacy:")
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].value)
   print(doc2.get_concept_by_name("Duration adequacy").extracted_items[0].justification)
   # ...

   # Output processing costs (requires setting the pricing details for each LLM)
   print("\nProcessing costs:")
   print(llm_group.get_cost())
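
The per-million-token prices passed to "LLMPricing" above translate
into a cost by simple proportion. The following is only an arithmetic
sketch: the helper name "estimate_cost" and the token counts are
hypothetical, and it does not reproduce how ContextGem tracks usage
internally.

```python
from decimal import Decimal


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_1m_tokens: str,
    output_per_1m_tokens: str,
) -> Decimal:
    """Return the cost of one LLM call given per-1M-token prices (hypothetical helper)."""
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) / million * Decimal(input_per_1m_tokens)
    output_cost = Decimal(output_tokens) / million * Decimal(output_per_1m_tokens)
    return input_cost + output_cost


# Hypothetical token counts, priced with the extractor rates from the example above
cost = estimate_cost(20_000, 2_000, "0.150", "0.600")
print(cost)
```

Prices are passed as strings and handled with "Decimal" so that small
per-call amounts do not pick up binary floating-point rounding noise
when summed over many calls.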


# ==== logging_config ====

Logging Configuration
*********************

ContextGem provides comprehensive logging to help you monitor and
debug the extraction process. You can control logging behavior using
environment variables. ContextGem uses a **namespaced logger** under
the name "contextgem".


⚙️ Environment Variables
========================

ContextGem uses a single environment variable for logging
configuration:

**CONTEXTGEM_LOGGER_LEVEL**
   Sets the logging level. Valid values are:

   * "TRACE" - Most verbose, shows all log messages

   * "DEBUG" - Shows debug information and above

   * "INFO" - Shows informational messages and above (default)

   * "SUCCESS" - Shows success messages and above

   * "WARNING" - Shows warnings and errors only

   * "ERROR" - Shows errors and critical messages only

   * "CRITICAL" - Shows only critical messages

   * "OFF" - Completely disables logging

   **Default:** "INFO"

Warning:

  **Not recommended:** Setting the level to "OFF" or above "INFO"
  (such as "WARNING" or "ERROR") may cause you to miss helpful
  messages, guidance, recommendations, and important information about
  the extraction process. The default "INFO" level provides a good
  balance of useful information without being too verbose.
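
To make the "and above" wording of the level list concrete, here is a
minimal sketch of threshold filtering. It assumes the severity order
in which the levels are listed above; the helper "is_shown" and the
fallback for unknown values are illustrative assumptions, not part of
ContextGem.

```python
import os

# Levels from the list above, ordered from most to least verbose
LEVELS = ["TRACE", "DEBUG", "INFO", "SUCCESS", "WARNING", "ERROR", "CRITICAL"]


def is_shown(message_level: str, configured_level: str) -> bool:
    """Return True if a message at `message_level` passes the configured threshold."""
    if configured_level == "OFF":
        return False
    return LEVELS.index(message_level) >= LEVELS.index(configured_level)


# Fall back to the documented default when the variable is unset
configured = os.environ.get("CONTEXTGEM_LOGGER_LEVEL", "INFO").upper()
if configured != "OFF" and configured not in LEVELS:
    configured = "INFO"  # assumption for this sketch: treat unknown values as the default

for level in LEVELS:
    print(f"{level:<8} shown={is_shown(level, configured)}")
```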


🔧 Setting Environment Variables
================================

**Before importing ContextGem:**

   # Set logging level to WARNING
   export CONTEXTGEM_LOGGER_LEVEL=WARNING

   # Disable logging completely
   export CONTEXTGEM_LOGGER_LEVEL=OFF

**In Python before import:**

   import os

   # Set logging level to DEBUG
   os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "DEBUG"

   # Import ContextGem after setting environment variables
   import contextgem


🔄 Changing Settings at Runtime
===============================

If you need to change logging settings after importing ContextGem, use
the "reload_logger_settings()" function:

Changing logger settings at runtime

   import os

   from contextgem import reload_logger_settings


   # Initial logger settings are loaded from environment variables at import time

   # Change logger level to WARNING
   os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "WARNING"
   print("Setting logger level to WARNING")
   reload_logger_settings()
   # Now the logger will only show WARNING level and above messages

   # Disable the logger completely
   os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "OFF"
   print("Disabling the logger")
   reload_logger_settings()
   # Now the logger is disabled and won't show any messages

   # You can re-enable the logger by setting it back to a valid level
   # os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "INFO"
   # reload_logger_settings()
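
If the level only needs to change for a short stretch of code, the
set-and-reload pattern above can be wrapped in a context manager. This
is a sketch under stated assumptions: "temporary_env" and its
"on_change" parameter are hypothetical names, and a stand-in callback
is used so the snippet runs without ContextGem installed.

```python
import os
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def temporary_env(
    name: str, value: str, on_change: Optional[Callable[[], None]] = None
) -> Iterator[None]:
    """Set an environment variable inside a `with` block, then restore the old state."""
    previous = os.environ.get(name)
    os.environ[name] = value
    if on_change is not None:
        on_change()  # e.g. re-read logger settings after the change
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
        if on_change is not None:
            on_change()  # re-read settings again after restoring


# Stand-in callback that only records that it was invoked
calls = []
with temporary_env(
    "CONTEXTGEM_LOGGER_LEVEL", "DEBUG", on_change=lambda: calls.append("reload")
):
    print(os.environ["CONTEXTGEM_LOGGER_LEVEL"])
print(len(calls))
```

In real use, you would pass "reload_logger_settings" as "on_change" so
that the logger picks up the new level on entry and the restored level
on exit.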


📋 Log Format
=============

ContextGem logs use the following format:

   [contextgem] 2025-01-11 15:30:45.123 | INFO    | Your log message here

Each log entry includes:

* Logger name prefix ("[contextgem]")

* Timestamp (with milliseconds)

* Log level

* Log message
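
If you collect these logs for later analysis, a line can be split back
into its fields with a regular expression. The pattern below is
derived solely from the sample line shown above, so treat it as an
assumption and adjust it if your output differs; "parse_log_line" is a
hypothetical helper.

```python
import re
from typing import Optional

# Pattern for the layout of the sample line: prefix, timestamp, level, message
LOG_LINE = re.compile(
    r"^\[contextgem\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \| "
    r"(?P<level>[A-Z]+)\s*\| "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log line into its fields; return None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


sample = "[contextgem] 2025-01-11 15:30:45.123 | INFO    | Your log message here"
print(parse_log_line(sample))
```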


# ==== optimizations/optimization_choosing_llm ====

Choosing the Right LLM(s)
*************************


🧭 General Guidance
===================

Your choice of LLM directly affects the accuracy, speed, and cost of
your extraction pipeline. ContextGem integrates with various LLM
providers (via LiteLLM), enabling you to select models that best fit
your needs.

Since ContextGem specializes in deep single-document analysis, models
with large context windows are recommended. Each use case has unique
requirements, but our experience suggests the practical guidelines
below. For sensitive applications (e.g., contract review) where
accuracy is paramount and speed and cost are secondary concerns, using
the most capable model available for all tasks is often the safest
approach.


Choosing LLMs - General Guidance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+----------------------------------------------------+----------------------------------------------------+
| Aspect Extraction                                  | Concept Extraction                                 |
|====================================================|====================================================|
| A **smaller/distilled non-reasoning model**        | For *basic concepts* (e.g., titles, payment        |
| capable of identifying relevant document sections  | amounts, dates), the same **smaller/distilled non- |
| (e.g., "gpt-4o-mini"). This extraction resembles   | reasoning model** is often sufficient (e.g., "gpt- |
| multi-label classification. Complex aspects may    | 4o-mini"). For *complex concepts* requiring        |
| occasionally require larger or reasoning models.   | nuanced understanding within specific aspects or   |
|                                                    | the entire document, consider a **larger non-      |
|                                                    | reasoning model** (e.g., "gpt-4o"). For concepts   |
|                                                    | requiring advanced understanding or complex        |
|                                                    | reasoning (e.g., logical deductions, evaluation),  |
|                                                    | a **reasoning model** like "o3-mini" may be        |
|                                                    | appropriate.                                       |
+----------------------------------------------------+----------------------------------------------------+

See also:

  **Small Model Issues?** If you're experiencing issues with smaller
  models (e.g. 8B parameter models), such as JSON validation errors or
  inconsistent results, see our troubleshooting guide for specific
  solutions and workarounds.


🏷️ LLM Roles
============

The "role" of an LLM is an abstraction used to assign various LLMs
tasks of different complexity. For example, if an aspect/concept is
assigned "llm_role="extractor_text"", this aspect/concept is extracted
from the document using the LLM with "role="extractor_text"". This
helps to channel different tasks to different LLMs, ensuring that the
task is handled by the most appropriate model. Usually, domain
expertise is required to determine the most appropriate role for a
specific aspect/concept.

In LLM groups, unique role assignments are especially important: each
model in the group must have a distinct role so routing can
unambiguously send each aspect/concept to the intended model.
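
The routing idea can be pictured as a lookup table keyed by role. The
sketch below is a conceptual illustration only: "ModelSpec", "Task",
"build_routing_table" and "route" are hypothetical names, not
ContextGem classes or functions, and the library's internal routing
may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical stand-in for a configured LLM (not a ContextGem class)."""

    model: str
    role: str


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for an aspect or concept with an assigned role."""

    name: str
    llm_role: str = "extractor_text"  # mirrors the documented default role


def build_routing_table(models: list[ModelSpec]) -> dict[str, ModelSpec]:
    """Map each role to exactly one model; reject duplicate roles."""
    table: dict[str, ModelSpec] = {}
    for spec in models:
        if spec.role in table:
            raise ValueError(f"Duplicate role in group: {spec.role!r}")
        table[spec.role] = spec
    return table


def route(task: Task, table: dict[str, ModelSpec]) -> ModelSpec:
    """Pick the model whose role matches the task's llm_role."""
    try:
        return table[task.llm_role]
    except KeyError:
        raise LookupError(f"No model with role {task.llm_role!r}") from None


table = build_routing_table(
    [
        ModelSpec("openai/gpt-4o-mini", "extractor_text"),
        ModelSpec("openai/o3-mini", "reasoner_text"),
    ]
)
for task in [Task("Governing law"), Task("Duration adequacy", llm_role="reasoner_text")]:
    print(f"{task.name} -> {route(task, table).model}")
```

With two models sharing one role, the table could not tell which model
a task should go to, which is why duplicates are rejected up front in
this sketch.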

For simple use cases, when working with text-only documents and a
single LLM, you can skip the role assignments completely, in which
case the roles will default to ""extractor_text"".


Available LLM roles
^^^^^^^^^^^^^^^^^^^

+----------------------+----------------------+----------------------+----------------------+
| Role                 | Extraction Context   | Extracted Item Types | Required LLM         |
|                      |                      |                      | Capabilities         |
|======================|======================|======================|======================|
| ""extractor_text""   | Text                 | Aspects and concepts | No reasoning         |
|                      |                      | (aspect- and         | required             |
|                      |                      | document-level)      |                      |
+----------------------+----------------------+----------------------+----------------------+
| ""reasoner_text""    | Text                 | Aspects and concepts | Reasoning-capable    |
|                      |                      | (aspect- and         | model                |
|                      |                      | document-level)      |                      |
+----------------------+----------------------+----------------------+----------------------+
| ""extractor_vision"" | Images               | Document-level       | Vision-capable model |
|                      |                      | concepts             |                      |
+----------------------+----------------------+----------------------+----------------------+
| ""reasoner_vision""  | Images               | Document-level       | Vision-capable and   |
|                      |                      | concepts             | reasoning-capable    |
|                      |                      |                      | model                |
+----------------------+----------------------+----------------------+----------------------+
| ""extractor_multimo  | Text and/or images   | Document-level       | Multimodal model     |
| dal""                |                      | concepts             | supporting text and  |
|                      |                      |                      | image inputs         |
+----------------------+----------------------+----------------------+----------------------+
| ""reasoner_multimod  | Text and/or images   | Document-level       | Reasoning-capable    |
| al""                 |                      | concepts             | multimodal model     |
|                      |                      |                      | supporting text and  |
|                      |                      |                      | image inputs         |
+----------------------+----------------------+----------------------+----------------------+

Note:

  🧠 Only LLMs that support reasoning (chain of thought) should be
  assigned reasoning roles (""reasoner_text"", ""reasoner_vision"",
  ""reasoner_multimodal""). For these roles, internal prompts include
  reasoning-specific instructions that help such models produce
  higher-quality responses.

Note:

  👁️ Only LLMs that support vision can be assigned vision roles
  (""extractor_vision"", ""reasoner_vision"").

Note:

  🔀 Multimodal roles (""extractor_multimodal"",
  ""reasoner_multimodal"") reuse the existing text and vision
  extraction paths. If text exists, the text path runs first; if
  images exist, the vision path runs next. References are only
  supported for multimodal concepts when text is used.
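
The ordering described in the multimodal note can be summarized in a
few lines. This is an illustrative sketch of the stated behavior, with
hypothetical function names; it is not ContextGem's implementation.

```python
def plan_multimodal_paths(has_text: bool, has_images: bool) -> list[str]:
    """Return the extraction paths in the order described in the note above."""
    paths = []
    if has_text:
        paths.append("text")  # the text path runs first when text exists
    if has_images:
        paths.append("vision")  # the vision path runs next when images exist
    return paths


def references_supported(has_text: bool) -> bool:
    """References for multimodal concepts are only supported when text is used."""
    return has_text


for has_text, has_images in [(True, True), (True, False), (False, True)]:
    print(
        has_text,
        has_images,
        plan_multimodal_paths(has_text, has_images),
        references_supported(has_text),
    )
```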

Example of selecting different LLMs for different tasks

   # Example of selecting different LLMs for different tasks

   import os

   from contextgem import Aspect, Document, DocumentLLM, DocumentLLMGroup, StringConcept


   # Define LLMs
   base_llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       role="extractor_text",  # default
   )

   # Optional - attach a fallback LLM
   base_llm_fallback = DocumentLLM(
       model="openai/gpt-3.5-turbo",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       role="extractor_text",  # must have the same role as the parent LLM
       is_fallback=True,
   )
   base_llm.fallback_llm = base_llm_fallback

   advanced_llm = DocumentLLM(
       model="openai/o3-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       role="reasoner_text",
   )

   # You can organize LLMs in a group to use them in a pipeline
   llm_group = DocumentLLMGroup(
       llms=[base_llm, advanced_llm],
   )

   # Assign the existing LLMs to aspects/concepts
   document = Document(
       raw_text="document_text",
       aspects=[
           Aspect(
               name="aspect_name",
               description="aspect_description",
               llm_role="extractor_text",
               concepts=[
                   StringConcept(
                       name="concept_name",
                       description="concept_description",
                       llm_role="reasoner_text",
                   )
               ],
           )
       ],
   )

   # Then use the LLM group to extract all information from the document
   # This will use different LLMs for different aspects/concepts under the hood
   # document = llm_group.extract_all(document)


# ==== optimizations/optimization_accuracy ====

Optimizing for Accuracy
***********************

When accuracy is paramount, ContextGem offers several techniques to
improve extraction quality:

* **🚀 Use a Capable LLM**: Choose a powerful LLM for extraction.

* **🪄 Use Larger Segmentation Models**: Select a larger SaT model for
  intelligent segmentation of paragraphs or sentences, to ensure the
  highest segmentation accuracy in complex documents (e.g. contracts).

* **💡 Provide Examples**: For the most complex concepts, add examples
  to guide the LLM's extraction format and style.

* **🧠 Request Justifications**: For the most complex
  aspects/concepts, enable justifications to understand the LLM's
  reasoning and instruct the LLM to "think" when giving an answer.

* **📏 Limit Paragraphs Per Call**: This will reduce each prompt's
  length and ensure a more focused analysis.

* **🔢 Limit Aspects/Concepts Per Call**: Process a smaller number of
  aspects or concepts in each LLM call, preventing prompt overloading.

* **🔄 Use a Fallback LLM**: Configure a fallback LLM to retry failed
  extractions with a different model.

Example of optimizing extraction for accuracy

   # Example of optimizing extraction for accuracy

   import os

   from contextgem import Document, DocumentLLM, StringConcept, StringExample


   # Define document
   doc = Document(
       raw_text="Non-Disclosure Agreement...",
       sat_model_id="sat-6l-sm",  # default is "sat-3l-sm"
       paragraph_segmentation_mode="sat",  # default is "newlines"
       # sentence segmentation mode is always "sat", as other approaches proved to be less accurate
   )

   # Define document concepts
   doc.concepts = [
       StringConcept(
           name="Title",  # A very simple concept, just an example for testing purposes
           description="Title of the document",
           add_justifications=True,  # enable justifications
           justification_depth="brief",  # default
           examples=[
               StringExample(
                   content="Supplier Agreement",
               )
           ],
       ),
       # ... add other concepts ...
   ]

   # ... attach other aspects/concepts to the document ...

   # Define and configure LLM
   llm = DocumentLLM(
       model="openai/gpt-4o",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       fallback_llm=DocumentLLM(
           model="openai/gpt-4-turbo",
           api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
           is_fallback=True,
       ),  # configure a fallback LLM
   )

   # Extract data from document with specific configuration options
   doc = llm.extract_all(
       doc,
       max_paragraphs_to_analyze_per_call=30,  # limit the number of paragraphs to analyze in an individual LLM call
       max_items_per_call=1,  # limit the number of aspects/concepts to analyze in an individual LLM call
       use_concurrency=True,  # optional: enable concurrent extractions
   )

   # ... use the extracted data ...
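
The fallback idea from the list above can be illustrated without any
LLM at all. The following is a minimal, library-independent sketch:
"call_with_fallback", "flaky_primary", and "stable_fallback" are
hypothetical names used only for illustration, and ContextGem's own
retry behavior (enabled via "fallback_llm") may differ in its details.

```python
from typing import Callable


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
) -> tuple[str, str]:
    """Try the primary callable; on any exception, retry once with the fallback.

    Returns (source, result), where source is "primary" or "fallback".
    """
    try:
        return "primary", primary(prompt)
    except Exception:
        # The primary call failed (e.g. an API error), so the same
        # prompt is sent to the fallback callable instead.
        return "fallback", fallback(prompt)


# Hypothetical stand-ins for two LLM backends
def flaky_primary(prompt: str) -> str:
    raise RuntimeError("primary model failed")


def stable_fallback(prompt: str) -> str:
    return f"answer to: {prompt}"


source, result = call_with_fallback(flaky_primary, stable_fallback, "Title?")
print(source, "->", result)
```

In this sketch the fallback is tried exactly once; a production setup
would usually also log the primary failure before retrying.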


# ==== optimizations/optimization_speed ====

Optimizing for Speed
********************

For large-scale processing or time-sensitive applications, optimize
your pipeline for speed:

* **🚀 Enable and Configure Concurrency**: Process multiple
  extractions concurrently. Adjust the async limiter to adapt to your
  LLM API setup.

* **📦 Use Smaller Models**: Select smaller/distilled LLMs that
  perform faster. (See Choosing the Right LLM(s) for guidance on
  choosing the right model.)

* **🔄 Use a Fallback LLM**: Configure a fallback LLM to retry
  extractions that failed due to rate limits.

* **⚙️ Use Default Parameters**: With the default parameters, all
  extractions are processed in as few LLM calls as possible.

* **📉 Enable Justifications Only When Necessary**: Do not use
  justifications for simple aspects or concepts. This will reduce the
  number of tokens generated.

* **⚠️ Use Sentence-Level Reference Depth Sparingly**: Only use
  sentence-level reference depth for aspects or concepts when
  absolutely necessary, as it requires loading a SaT model and running
  sentence segmentation on text, which can be slow for long documents.

Example of optimizing extraction for speed

   # Example of optimizing extraction for speed

   import os

   from aiolimiter import AsyncLimiter

   from contextgem import Document, DocumentLLM


   # Define document
   document = Document(
       raw_text="document_text",
       # aspects=[Aspect(...), ...],
       # concepts=[Concept(...), ...],
   )

   # Define LLM with a fallback model
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       fallback_llm=DocumentLLM(
           model="openai/gpt-3.5-turbo",
           api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
           is_fallback=True,
       ),
   )
   llm.async_limiter = AsyncLimiter(
       10, 5
   )  # e.g. 10 acquisitions per 5-second period; adjust to your LLM API setup
   llm.fallback_llm.async_limiter = AsyncLimiter(  # type: ignore
       20, 5
   )  # e.g. 20 acquisitions per 5-second period; adjust to your LLM API setup


   # Use the LLM for extraction with concurrency enabled
   llm.extract_all(document, use_concurrency=True)

   # ... use the extracted data ...
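
To get a feel for what limiter settings such as "AsyncLimiter(10, 5)"
imply, the arithmetic can be sketched in plain Python. This is a rough
estimate only, assuming a leaky-bucket limiter that allows an initial
burst of "max_rate" acquisitions and then refills at "max_rate /
time_period" per second. The helpers "sustained_rate" and
"min_seconds_for_calls" are hypothetical and are not part of
ContextGem or aiolimiter; real timings also depend on LLM latency.

```python
def sustained_rate(max_rate: float, time_period: float) -> float:
    """Long-run acquisitions per second allowed by the limiter."""
    return max_rate / time_period


def min_seconds_for_calls(n_calls: int, max_rate: float, time_period: float) -> float:
    """Lower bound on the time needed to start n_calls under the limiter.

    Assumes the first `max_rate` calls can start immediately (burst)
    and the remaining calls are spaced at the sustained rate.
    """
    remaining = max(0, n_calls - max_rate)
    return remaining / sustained_rate(max_rate, time_period)


print(sustained_rate(10, 5))  # 2.0 acquisitions per second
print(min_seconds_for_calls(50, 10, 5))  # 20.0 seconds
```

If the estimate for a typical batch is far above your latency budget,
raising the limiter's rate (within your provider's limits) is likely
to help more than other tuning.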


# ==== optimizations/optimization_cost ====

Optimizing for Cost
*******************

ContextGem offers several strategies to optimize for cost efficiency
while maintaining extraction quality:

* **💸 Select Cost-Efficient Models**: Use smaller/distilled non-
  reasoning LLMs for extracting aspects and basic concepts (e.g.
  titles, payment amounts, dates).

* **⚙️ Use Default Parameters**: With the default parameters, all
  extractions are processed in as few LLM calls as possible.

* **📉 Enable Justifications Only When Necessary**: Do not use
  justifications for simple aspects or concepts. This will reduce the
  number of tokens generated.

* **📊 Monitor Usage and Cost**: Track LLM calls, token consumption,
  and cost to identify optimization opportunities.

Example of optimizing extraction for cost

   # Example of optimizing extraction for cost

   import os

   from contextgem import DocumentLLM, LLMPricing


   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
       pricing_details=LLMPricing(
           input_per_1m_tokens=0.150,
           output_per_1m_tokens=0.600,
       ),  # add pricing details to track costs
       # or set `auto_pricing=True` to automatically fetch pricing data from the LLM provider
   )

   # ... use the LLM for extraction ...

   # ... monitor usage and cost ...
   usage = llm.get_usage()  # get the usage details, including tokens and calls' details.
   cost = llm.get_cost()  # get the cost details, including input, output, and total costs.
   print(usage)
   print(cost)
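
For budgeting before any extraction is run, the per-1M-token pricing
arithmetic can be sketched by hand. This is an illustrative estimate
under stated assumptions, not ContextGem's internal cost calculation:
"estimate_cost" is a hypothetical helper, the token counts are made
up, and the prices are the example values from the snippet above.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_1m_tokens: float,
    output_per_1m_tokens: float,
) -> dict[str, float]:
    """Estimate cost from token counts and per-1M-token prices."""
    input_cost = input_tokens / 1_000_000 * input_per_1m_tokens
    output_cost = output_tokens / 1_000_000 * output_per_1m_tokens
    return {
        "input": round(input_cost, 6),
        "output": round(output_cost, 6),
        "total": round(input_cost + output_cost, 6),
    }


# Hypothetical workload: 200k input tokens and 20k output tokens
estimated = estimate_cost(200_000, 20_000, 0.150, 0.600)
print(estimated)
```

Because output tokens are priced higher than input tokens in this
example, trimming generated text (for instance, by disabling
unnecessary justifications) reduces cost more per token than
shortening prompts.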


# ==== optimizations/optimization_long_docs ====

Dealing with Long Documents
***************************

ContextGem offers specialized configuration options for efficiently
processing lengthy documents.


✂️ Segmentation Approach
========================

Unlike many systems that rely on chunking (e.g. RAG), ContextGem
intelligently segments documents into natural semantic units like
paragraphs and sentences. This preserves the contextual integrity of
the content while allowing you to configure:

* Maximum number of paragraphs per LLM call

* Maximum number of aspects/concepts to analyze per LLM call

* Maximum number of images per LLM call (if the document contains
  images)


⚙️ Effective Optimization Strategies
====================================

* **🔄 Use Long-Context Models**: Select models with large context
  windows. (See Choosing the Right LLM(s) for guidance on choosing the
  right model.)

* **📏 Limit Paragraphs Per Call**: This will reduce each prompt's
  length and ensure a more focused analysis.

* **🔢 Limit Aspects/Concepts Per Call**: Process a smaller number of
  aspects or concepts in each LLM call, preventing prompt overloading.

* **⚠️ Use Sentence-Level Reference Depth Sparingly**: Only use
  sentence-level reference depth for aspects or concepts when
  absolutely necessary, as it requires loading a SaT model and running
  sentence segmentation on text, which can be slow for long documents.

* **⚡ Optional: Enable Concurrency**: Enable running extractions
  concurrently if your API setup permits. This will reduce the overall
  processing time. (See Optimizing for Speed for guidance on
  configuring concurrency.)

Since each use case has unique requirements, experiment with different
configurations to find your optimal setup.

Example of configuring LLM extraction for long documents

   # Example of configuring LLM extraction to process long documents

   import os

   from contextgem import Document, DocumentLLM


   # Define document
   long_doc = Document(
       raw_text="long_document_text",
   )

   # ... attach aspects/concepts to the document ...

   # Define and configure LLM
   llm = DocumentLLM(
       model="openai/gpt-4o-mini",
       api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
   )

   # Extract data from document with specific configuration options
   long_doc = llm.extract_all(
       long_doc,
       max_paragraphs_to_analyze_per_call=50,  # limit the number of paragraphs to analyze in an individual LLM call
       max_items_per_call=2,  # limit the number of aspects/concepts to analyze in an individual LLM call
       use_concurrency=True,  # optional: enable concurrent extractions
   )

   # ... use the extracted data ...
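
The effect of the two limits on the number of LLM calls can be
approximated with a short calculation. The sketch below assumes one
call per combination of paragraph chunk and aspect/concept group, and
treats a limit of 0 as "no limit"; this is a simplifying assumption
for illustration, and ContextGem's actual scheduling may differ. The
helpers "chunk" and "estimate_calls" are hypothetical.

```python
import math


def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def estimate_calls(
    n_paragraphs: int, n_items: int, max_paragraphs: int, max_items: int
) -> int:
    """Rough call count, assuming one call per (paragraph chunk, item group) pair.

    A limit of 0 is treated as "no limit", i.e. a single chunk or group.
    """
    paragraph_chunks = math.ceil(n_paragraphs / max_paragraphs) if max_paragraphs else 1
    item_groups = math.ceil(n_items / max_items) if max_items else 1
    return paragraph_chunks * item_groups


paragraphs = [f"Paragraph {i}" for i in range(1, 121)]
print([len(c) for c in chunk(paragraphs, 50)])  # [50, 50, 20]
print(estimate_calls(120, 6, max_paragraphs=50, max_items=2))  # 9
```

Tighter limits give each call a shorter, more focused prompt but
multiply the number of calls, so they are best paired with
concurrency when the API setup allows it.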


# ==== optimizations/optimization_small_llm_troubleshooting ====

Troubleshooting Issues with Small Models
****************************************

Small language models (e.g. 8B parameter models) often struggle with
ContextGem's structured extraction tasks. This guide addresses common
issues and provides practical solutions.

See also:

  For general guidance on selecting appropriate models for your use
  case, see Choosing the Right LLM(s).


⚠️ Common Issues with Small Models
==================================

**"LLM did not return valid JSON" Error**
   Small models frequently fail to follow the precise JSON schema
   required by ContextGem's internal prompts. This manifests as:

   * "Error when validating parsed JSON: parsed_json is None"

   * "LLM did not return valid JSON"

**Inconsistent Results**
   Small models may produce:

   * Empty extraction results

   * Incomplete or partial extractions

   * Inconsistent formatting across multiple calls


🎯 Model Capability Requirements
================================

**Minimum Recommended Performance**
   All ContextGem tests use models with performance equivalent to or
   exceeding "openai/gpt-4o-mini". For reliable structured extraction,
   your model should:

   * Be able to follow detailed JSON schema instructions consistently

   * Have a sufficient context window to ingest the detailed prompt
     and the document content

   * Maintain attention across long prompts with complex instructions


🛠️ Mitigation Strategies for Small Models
=========================================

Important:

  **The most effective solution is usually to upgrade to a larger,
  more capable model** (such as "gpt-4o-mini" or larger). The
  strategies below are workarounds for situations where upgrading
  isn't possible.

If you must use a smaller model, try these approaches individually or
in combination:

**1. Reduce Task Complexity**

   # Extract one aspect/concept at a time instead of all at once
   results = llm.extract_all(
       document,
       max_items_per_call=1  # Analyze aspects/concepts individually
   )

**2. Limit Document Scope**

   # Process fewer document paragraphs per call
   results = llm.extract_all(
       document,
       max_paragraphs_to_analyze_per_call=50  # Default is 0 (all paragraphs)
   )

**3. Use More Specific Aspects/Concepts**

Instead of generic aspects/concepts:

   # ❌ Too generic - may confuse small models
   Aspect(
       name="Contract Terms",
       description="Contractual/legal details"
   )

Use targeted aspects/concepts:

   # ✅ More specific - easier for small models
   Aspect(
       name="Termination Terms",
       description="Provisions on contract termination"
   ),
   Aspect(
       name="Payment Terms",
       description="Provisions on payment schedules and amounts"
   )

**4. Choose the Right API**

For extracting document sections by topic, use the **Aspects API**
instead of the **Concepts API**:

   # ✅ Aspects API is designed specifically for extracting document sections by topic,
   # while Concepts API is designed for extracting/inferring specific values or entities
   # from a document or a specific section.
   from contextgem import Aspect

   project_scope = Aspect(
       name="Project Scope",
       description="Details about the scope of work"
   )

   # Attach the aspect to the document before extraction
   document.add_aspects([project_scope])

   # Paragraph references are automatically added to the extracted aspects
   results = llm.extract_aspects_from_document(document)

Instead of:

   # ❌ Concepts API's core purpose is to extract/infer specific values or entities
   # from a document or a specific section, rather than extracting document sections
   # by topic.
   from contextgem import StringConcept

   project_scope = StringConcept(
       name="Project Scope",
       description="Details about the scope of work",
       add_references=True
   )


🔍 Debugging LLM Responses
==========================

To see what your LLM was asked to return and what it actually
returned, inspect the prompt and the model's response:

   # Make an extraction call
   results = llm.extract_aspects_from_document(document)

   # Inspect the actual prompt sent to the LLM
   prompt = llm.get_usage()[-1].usage.calls[-1].prompt
   print("Prompt sent to LLM:")
   print(prompt)

   # Check the raw response (if available)
   response = llm.get_usage()[-1].usage.calls[-1].response
   print("LLM response:")
   print(response)
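
Once a raw response is in hand, a quick standalone check can show
whether it parses as JSON at all. The sketch below is a generic
debugging aid, assuming the response is available as a string;
"try_parse_json" and "strip_code_fence" are hypothetical helpers, not
part of ContextGem, and the sample response is invented.

```python
import json
from typing import Any

FENCE = "`" * 3  # a Markdown code fence marker


def try_parse_json(raw_response: str) -> tuple[bool, Any]:
    """Return (True, parsed) if the text is valid JSON, else (False, error message)."""
    try:
        return True, json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return False, f"{exc.msg} (line {exc.lineno}, column {exc.colno})"


def strip_code_fence(raw_response: str) -> str:
    """Remove a surrounding Markdown code fence, if the text has one."""
    text = raw_response.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        # Drop the opening fence line (which may carry a language tag)
        # and the closing fence line.
        text = "\n".join(lines[1:-1])
    return text


# Invented example of a response wrapped in a Markdown fence
raw = FENCE + 'json\n{"title": "Non-Disclosure Agreement"}\n' + FENCE
print(try_parse_json(raw)[0])  # False: the fence makes it invalid JSON
print(try_parse_json(strip_code_fence(raw)))
```

The error position reported for an invalid response often points
directly at the problem, such as extra commentary before the JSON or
output that was cut off mid-object.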


📊 Testing Local Models
=======================

Before committing to a local model for production, test it on the
extraction tasks shown in the documentation, such as:

* Aspect Extraction from Document

* Extracting Aspect with Sub-Aspects

* Concept Extraction from Aspect

* Concept Extraction from Document

Important:

  **Production Applications**: For production applications, especially
  those requiring high accuracy (like legal document analysis), using
  appropriately capable models is crucial. The cost of model inference
  is typically far outweighed by the cost of incorrect extractions or
  failed processing.


# ==== serialization ====

Serializing objects and results
*******************************

ContextGem provides multiple serialization methods to preserve your
document processing pipeline components and results. These methods
enable you to save your work, transfer data between systems, or
integrate with other applications.

When using serialization, all extracted data is preserved in the
serialized objects.


💾 Serialization Methods
========================

The following ContextGem objects support serialization:

* "Document" - Contains document content and extracted information

* "ExtractionPipeline" - Defines extraction structure and logic

* "DocumentLLM" - Stores LLM configuration for document processing

Each object supports three serialization methods:

* "to_json()" - Converts the object to a JSON string for cross-
  platform compatibility

* "to_dict()" - Converts the object to a Python dictionary for in-
  memory operations

* "to_disk(file_path)" - Saves the object directly to disk at the
  specified path


🔄 Deserialization Methods
==========================

To reconstruct objects from their serialized forms, use the
corresponding class methods:

* "from_json(json_string)" - Creates an object from a JSON string

* "from_dict(dict_object)" - Creates an object from a Python
  dictionary

* "from_disk(file_path)" - Loads an object from a file on disk


📝 Example Usage
================

   # Example of serializing and deserializing ContextGem document,
   # extraction pipeline, and LLM config.

   import os
   from pathlib import Path

   from contextgem import (
       Aspect,
       BooleanConcept,
       Document,
       DocumentLLM,
       DocxConverter,
       ExtractionPipeline,
       StringConcept,
   )


   # Create a document object
   converter = DocxConverter()
   docx_path = str(
       Path(__file__).resolve().parents[4]
       / "tests"
       / "docx_files"
       / "en_nda_with_anomalies.docx"
   )  # your file path here (Path adapted for testing)
   doc = converter.convert(docx_path, strict_mode=True)

   # Create an extraction pipeline
   extraction_pipeline = ExtractionPipeline(
       aspects=[
           Aspect(
               name="Categories of confidential information",
               description="Clauses describing confidential information covered by the NDA",
               concepts=[
                   StringConcept(
                       name="Types of disclosure",
                       description="Types of disclosure of confidential information",
                   ),
                   # ...
               ],
           ),
           # ...
       ],
       concepts=[
           BooleanConcept(
               name="Is mutual",
               description="Whether the NDA is mutual (both parties act as discloser/recipient)",
               add_justifications=True,
           ),
           # ...
       ],
   )

   # Attach the pipeline to the document
   doc.assign_pipeline(extraction_pipeline)

   # Configure a document LLM with your API parameters
   llm = DocumentLLM(
       model="azure/gpt-4.1-mini",
       api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
       api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
       api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
   )

   # Extract data from the document
   doc = llm.extract_all(doc)

   # Serialize the LLM config, pipeline and document
   llm_config_json = llm.to_json()  # or to_dict() / to_disk()
   extraction_pipeline_json = extraction_pipeline.to_json()  # or to_dict() / to_disk()
   processed_doc_json = doc.to_json()  # or to_dict() / to_disk()

   # Deserialize the LLM config, pipeline and document
   llm_deserialized = DocumentLLM.from_json(
       llm_config_json
   )  # or from_dict() / from_disk()
   extraction_pipeline_deserialized = ExtractionPipeline.from_json(
       extraction_pipeline_json
   )  # or from_dict() / from_disk()
   processed_doc_deserialized = Document.from_json(
       processed_doc_json
   )  # or from_dict() / from_disk()

   # All extracted data is preserved!
   assert processed_doc_deserialized.aspects[0].concepts[0].extracted_items


🚀 Use Cases
============

* **Caching Results**: Save processed documents to avoid repeating
  expensive LLM calls

* **Transfer Between Systems**: Export results from one environment
  and import in another

* **API Integration**: Convert objects to JSON for API responses

* **Workflow Persistence**: Save pipeline configurations for later
  reuse
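
The caching use case can be sketched with the standard library alone.
The example below is a minimal illustration under stated assumptions:
"load_or_compute" and "expensive_extraction" are hypothetical names,
and the JSON string stands in for the output of a call such as
"doc.to_json()".

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(cache_path: Path, compute: Callable[[], str]) -> str:
    """Return cached JSON text if present; otherwise compute, save, and return it."""
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    json_text = compute()
    cache_path.write_text(json_text, encoding="utf-8")
    return json_text


calls = []


def expensive_extraction() -> str:
    # Hypothetical stand-in for running an extraction and serializing the result
    calls.append(1)
    return '{"extracted": true}'


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "processed_doc.json"
    first = load_or_compute(cache_file, expensive_extraction)
    second = load_or_compute(cache_file, expensive_extraction)

print(first == second, len(calls))
```

With ContextGem objects, the cached string could then be passed to
"Document.from_json()", or "to_disk()" and "from_disk()" could be
used directly in place of the manual file handling.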


# ==== api/documents ====

Documents
*********

Module for handling documents.

This module provides the Document class, which represents a structured
or unstructured file containing written or visual content. Documents
can be processed to extract information, analyze content, and organize
data into paragraphs, sentences, aspects, and concepts.

class contextgem.public.documents.Document(**data)

   Bases: "_Document"

   Represents a document containing textual and visual content for
   analysis.

   A document serves as the primary container for content analysis
   within the ContextGem framework, enabling complex document
   understanding and information extraction workflows.

   Variables:
      * **raw_text** (*str** | **None*) -- The main text of the
        document as a single string. Defaults to None.

      * **paragraphs** (*list**[**Paragraph**]*) -- List of Paragraph
        instances in consecutive order as they appear in the document.
        Defaults to an empty list.

      * **images** (*list**[**Image**]*) -- List of Image instances
        attached to or representing the document. Defaults to an empty
        list.

      * **aspects** (*list**[**Aspect**]*) -- List of aspects
        associated with the document for focused analysis. Validated
        to ensure unique names and descriptions. Defaults to an empty
        list.

      * **concepts** (*list**[**_Concept**]*) -- List of concepts
        associated with the document for information extraction.
        Validated to ensure unique names and descriptions. Defaults to
        an empty list.

      * **paragraph_segmentation_mode** (*Literal**[**"newlines"**,
        **"sat"**]*) -- Mode for paragraph segmentation. When set to
        "sat", uses a SaT (Segment Any Text
        https://arxiv.org/abs/2406.16678) model. Defaults to
        "newlines".

      * **sat_model_id** (*SaTModelId*) -- SaT model ID for
        paragraph/sentence segmentation or a local path to a SaT
        model. For model IDs, defaults to "sat-3l-sm". See
        https://github.com/segment-any-text/wtpsplit for the list of
        available models. For local paths, provide either a string
        path or a Path object pointing to the directory containing the
        SaT model.

      * **pre_segment_sentences** (*bool*) -- Whether to pre-segment
        sentences during Document initialization. When False
        (default), sentence segmentation is deferred until sentences
        are actually needed, improving initialization performance.
        When True, sentences are segmented immediately during Document
        creation using the SaT model.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **raw_text** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **paragraphs** (*list**[**_Paragraph**]*)

      * **images** (*list**[**_Image**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **paragraph_segmentation_mode** (*Literal**[**'newlines'**,
        **'sat'**]*)

      * **sat_model_id** (*Literal**[**'sat-1l'**, **'sat-1l-sm'**,
        **'sat-3l'**, **'sat-3l-sm'**, **'sat-6l'**, **'sat-6l-sm'**,
        **'sat-9l'**, **'sat-12l'**, **'sat-12l-sm'**] **| **str** |
        **~pathlib._local.Path*)

      * **pre_segment_sentences** (*bool*)

   Note:
      Normally, you do not need to construct/populate paragraphs
      manually, as they are populated automatically from the
      document's "raw_text" attribute. Populate them manually only for
      advanced use cases, such as when you have a custom paragraph
      segmentation tool.

   Example:
      Document definition

         from pathlib import Path

         from contextgem import Document, Paragraph, create_image


         # Create a document with raw text content
         contract_document = Document(
             raw_text=(
                 "...This agreement is effective as of January 1, 2025.\n\n"
                 "All parties must comply with the terms outlined herein. The terms include "
                 "monthly reporting requirements and quarterly performance reviews.\n\n"
                 "Failure to adhere to these terms may result in termination of the agreement. "
                 "Additionally, any breach of confidentiality will be subject to penalties as "
                 "described in this agreement.\n\n"
                 "This agreement shall remain in force for a period of three (3) years unless "
                 "otherwise terminated according to the provisions stated above..."
             ),
             paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
         )

         # Create a document with more advanced paragraph segmentation using a SaT model
         report_document = Document(
             raw_text=(
                 "Executive Summary "
                 "This report outlines our quarterly performance. "
                 "Revenue increased by [15%] compared to the previous quarter.\n\n"
                 "Customer satisfaction metrics show positive trends across all regions..."
             ),
             paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
             sat_model_id="sat-3l-sm",  # Specify which SaT model to use
         )

         # Create a document with predefined paragraphs, e.g. when you use a custom
         # paragraph segmentation tool
         document_from_paragraphs = Document(
             paragraphs=[
                 Paragraph(raw_text="This is the first paragraph."),
                 Paragraph(raw_text="This is the second paragraph with more content."),
                 Paragraph(raw_text="Final paragraph concluding the document."),
                 # ...
             ]
         )

         # Create document with images

         # Path is adapted for doc tests
         current_file = Path(__file__).resolve()
         root_path = current_file.parents[4]
         image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

         # Create a document with only images (no text)
         image_document = Document(
             images=[
                 create_image(image_path),  # contextgem.Image instance
                 # ...
             ]
         )

         # Create a document with both text and images
         mixed_document = Document(
             raw_text="This document contains both text and visual elements.",
             images=[
                 create_image(image_path),  # contextgem.Image instance
                 # ...
             ],
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises "pydantic_core.ValidationError" if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and
      returns the updated instance. This method ensures that the
      provided aspects are deeply copied to avoid any unintended state
      modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
         be added. Each aspect is deeply copied to ensure the original
         list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts* attribute
      of the instance. This method ensures that the provided list of
      concepts is deep-copied to prevent unintended side effects from
      modifying the input list outside of this method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts
         to be added. It will be deep-copied before being added to the
         instance's *concepts* attribute.

      Returns:
         Returns the instance itself after the modification.

      Return type:
         Self

   assign_pipeline(pipeline, overwrite_existing=False)

      Assigns a given pipeline to the document. The method deep-copies
      the input pipeline to prevent any modifications to the state of
      aspects or concepts in the original pipeline. If the aspects or
      concepts are already associated with the document, an error is
      raised unless the *overwrite_existing* parameter is explicitly
      set to *True*.

      Parameters:
         * **pipeline** (*_ExtractionPipeline** |
           **_DocumentPipeline*) -- The ExtractionPipeline (or
           deprecated DocumentPipeline) object to attach to the
           document.

         * **overwrite_existing** (*bool*) -- A boolean flag. If set
           to True, any existing aspects and concepts assigned to the
           document will be overwritten by the new pipeline. Defaults
           to False.

      Return type:
         "typing.Self"

      Returns:
         Returns the current instance of the document after assigning
         the pipeline.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   get_aspect_by_name(name)

      Finds and returns an aspect with the specified name from the
      list of available aspects, if the instance has an *aspects*
      attribute.

      Parameters:
         **name** (*str*) -- The name of the aspect to find.

      Returns:
         The aspect with the specified name.

      Return type:
         _Aspect

      Raises:
         **ValueError** -- If no aspect with the specified name is
         found.

   get_aspects_by_names(names)

      Retrieve a list of _Aspect objects corresponding to the provided
      list of names.

      Parameters:
         **names** ("list"["str"]) -- List of aspect names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Aspect objects corresponding to provided names.

      Return type:
         list[_Aspect]

   get_concept_by_name(name)

      Retrieves a concept from the list of concepts based on the
      provided name, if the instance has a *concepts* attribute.

      Parameters:
         **name** (*str*) -- The name of the concept to search for.

      Returns:
         The *_Concept* object with the specified name.

      Return type:
         _Concept

      Raises:
         **ValueError** -- If no concept with the specified name is
         found.

   get_concepts_by_names(names)

      Retrieve a list of _Concept objects corresponding to the
      provided list of names.

      Parameters:
         **names** ("list"["str"]) -- List of concept names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Concept objects corresponding to provided names.

      Return type:
         list[_Concept]

   property llm_roles: set[str]

      A set of LLM roles associated with the object's aspects and
      concepts.

      Returns:
         A set containing unique LLM roles gathered from aspects and
         concepts.

      Return type:
         set[str]

   remove_all_aspects()

      Removes all aspects from the instance and returns the updated
      instance.

      This method clears the *aspects* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all aspects removed

   remove_all_concepts()

      Removes all concepts from the instance and returns the updated
      instance.

      This method clears the *concepts* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all concepts removed
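
      Because both removal methods return the same instance, calls can
      be chained. The sketch below shows the pattern on a hypothetical
      "Holder" class that only imitates the documented behavior
      (resetting the attribute to an empty list and returning "self");
      it is not a ContextGem class.

```python
from dataclasses import dataclass, field


@dataclass
class Holder:
    # Hypothetical stand-in for an object with aspects and concepts
    aspects: list[str] = field(default_factory=list)
    concepts: list[str] = field(default_factory=list)

    def remove_all_aspects(self) -> "Holder":
        # Reset to an empty list and return the same instance
        self.aspects = []
        return self

    def remove_all_concepts(self) -> "Holder":
        # Reset to an empty list and return the same instance
        self.concepts = []
        return self


holder = Holder(aspects=["Termination"], concepts=["Party names"])
# Each call returns the same object, so the calls compose
result = holder.remove_all_aspects().remove_all_concepts()
```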

   remove_all_instances()

      Removes all assigned instances from the object by resetting the
      corresponding attributes to empty lists. Returns the modified
      instance.

      Returns:
         The modified object with all assigned instances removed.

      Return type:
         Self

   remove_aspect_by_name(name)

      Removes an aspect from the assigned aspects by its name.

      Parameters:
         **name** (*str*) -- The name of the aspect to be removed

      Returns:
         Updated instance with the aspect removed.

      Return type:
         Self

   remove_aspects_by_names(names)

      Removes multiple aspects from an object based on the provided
      list of names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of names identifying
         the aspects to be removed.

      Returns:
         The updated object after the specified aspects have been
         removed.

      Return type:
         Self

   remove_concept_by_name(name)

      Removes a concept from the assigned concepts by its name.

      Parameters:
         **name** (*str*) -- The name of the concept to be removed

      Returns:
         Updated instance with the concept removed.

      Return type:
         Self

   remove_concepts_by_names(names)

      Removes concepts from the object by their names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of concept names to
         be removed.

      Returns:
         Returns the updated instance after removing the specified
         concepts.

      Return type:
         Self

   property sentences: list[_Sentence]

      Provides access to all sentences within the paragraphs of the
      document by flattening and combining sentences from each
      paragraph into a single list.

      Returns:
         A list of _Sentence objects that are contained within all
         paragraphs.

      Return type:
         list[_Sentence]
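
      The flattening described above can be sketched as follows. The
      "Para" and "Sent" dataclasses are hypothetical stand-ins for
      "_Paragraph" and "_Sentence", introduced only for illustration;
      the actual property may be implemented differently.

```python
from dataclasses import dataclass, field
from itertools import chain


@dataclass
class Sent:
    # Hypothetical stand-in for a sentence object
    raw_text: str


@dataclass
class Para:
    # Hypothetical stand-in for a paragraph holding its sentences
    sentences: list[Sent] = field(default_factory=list)


def flatten_sentences(paragraphs: list[Para]) -> list[Sent]:
    # Combine the sentences of every paragraph into one flat list,
    # keeping paragraph order and sentence order within each paragraph
    return list(chain.from_iterable(p.sentences for p in paragraphs))


paragraphs = [
    Para([Sent("First."), Sent("Second.")]),
    Para([Sent("Third.")]),
]
flat = flatten_sentences(paragraphs)
```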

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.
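
      A minimal sketch of the contract described above, using only the
      standard library. "save_json" is a hypothetical helper, and the
      choices of "indent=4" and "ensure_ascii=False" are assumptions
      made for illustration; the library's own formatting settings are
      not stated here.

```python
import json
import tempfile
from pathlib import Path


def save_json(data: dict, file_path: "str | Path") -> None:
    # Hypothetical helper mirroring the documented to_disk() contract:
    # reject paths that do not end with '.json' ...
    if not str(file_path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        # ... and write formatted JSON with UTF-8 encoding
        with open(file_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4, ensure_ascii=False)
    except OSError as exc:
        # Surface file-writing failures as RuntimeError
        raise RuntimeError(f"Failed to write {file_path}") from exc


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "document.json"
    save_json({"raw_text": "Sample text"}, target)
    restored = json.loads(target.read_text(encoding="utf-8"))
```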

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   raw_text: NonEmptyStr | None

   paragraphs: list[_Paragraph]

   images: list[_Image]

   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   paragraph_segmentation_mode: Literal['newlines', 'sat']

   sat_model_id: SaTModelId

   pre_segment_sentences: bool

   custom_data: JSONDictField


# ==== api/converters ====

Converters
**********

class contextgem.public.converters.DocxConverter

   Bases: "_DocxConverterBase"

   Converter for DOCX files into ContextGem documents.

   This class handles extraction of text, formatting, tables, images,
   footnotes, comments, and other elements from DOCX files by directly
   parsing Word XML.

   The converter is read-only and does not modify the source DOCX file
   in any way. It only extracts content for conversion to a ContextGem
   document object or to text formats.

   The resulting ContextGem document is populated with the following:

   * Raw text: The raw text of the DOCX file.

   * Paragraphs: Paragraph objects with the following metadata:

     * Raw text: The raw text of the paragraph.

     * Additional context: Metadata about the paragraph's style, list
       level, table cell position, being part of a footnote or
       comment, etc. This context provides additional information that
       is useful for LLM analysis and extraction.

   * Images: Image objects constructed from embedded images in the
     DOCX file.

   Example:
      DocxConverter usage example

         # Using ContextGem's DocxConverter

         from contextgem import DocxConverter


         converter = DocxConverter()

         # Convert a DOCX file to an LLM-ready ContextGem Document
         # from path
         document = converter.convert("path/to/document.docx")
         # or from file object
         with open("path/to/document.docx", "rb") as docx_file_object:
             document = converter.convert(docx_file_object)

         # Perform data extraction on the resulting Document object
         # document.add_aspects(...)
         # document.add_concepts(...)
         # llm.extract_all(document)

         # You can also use a DocxConverter instance as a standalone text extractor
         docx_text = converter.convert_to_text_format(
             "path/to/document.docx",
             output_format="markdown",  # or "raw"
         )

   convert_to_text_format(docx_path_or_file, output_format='markdown', include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_links=True, include_inline_formatting=True, strict_mode=False)

      Converts a DOCX file directly to text without creating a
      ContextGem Document.

      Parameters:
         * **docx_path_or_file** ("str" | "pathlib.Path" |
           "typing.BinaryIO") -- Path to the DOCX file (as string or
           Path object) or a file-like object

         * **output_format** ("typing.Literal"["'raw'", "'markdown'"])
           -- Output format ("markdown" or "raw") (default:
           "markdown")

         * **include_tables** ("bool") -- If True, include tables in
           the output (default: True)

         * **include_comments** ("bool") -- If True, include comments
           in the output (default: True)

         * **include_footnotes** ("bool") -- If True, include
           footnotes in the output (default: True)

         * **include_headers** ("bool") -- If True, include headers in
           the output (default: True)

         * **include_footers** ("bool") -- If True, include footers in
           the output (default: True)

         * **include_textboxes** ("bool") -- If True, include textbox
           content (default: True)

         * **include_links** ("bool") -- If True, process and format
           hyperlinks (default: True)

         * **include_inline_formatting** ("bool") -- If True, apply
           inline formatting (bold, italic, etc.) in markdown mode
           (default: True)

         * **strict_mode** ("bool") -- If True, raise exceptions for
           any processing error instead of skipping problematic
           elements (default: False)

      Return type:
         "str"

      Returns:
         Text in the specified format

      Note:

        When using markdown output format, the following conditions
        apply:

        * Document structure elements (headings, lists, tables) are
          preserved

        * Headings are converted to markdown heading syntax (# Heading
          1, ## Heading 2, etc.)

        * Lists are converted to markdown list syntax, preserving
          numbering and hierarchy

        * Tables are formatted using markdown table syntax

        * Footnotes, comments, headers, and footers are included as
          specially marked sections
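
      To make the heading mapping in the note concrete, the sketch
      below builds markdown heading lines from a text and a level.
      "to_markdown_heading" is a hypothetical helper written for this
      illustration only; it shows the shape of the output, not how the
      converter produces it.

```python
def to_markdown_heading(text: str, level: int) -> str:
    # Hypothetical helper: level 1 -> "# ...", level 2 -> "## ...", etc.
    if level < 1:
        raise ValueError("Heading level must be 1 or greater")
    return f"{'#' * level} {text}"


lines = [
    to_markdown_heading("Heading 1", 1),
    to_markdown_heading("Heading 2", 2),
]
```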

   convert(docx_path_or_file, apply_markdown=True, raw_text_to_md=None, include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_images=True, include_links=True, include_inline_formatting=True, strict_mode=False)

      Converts a DOCX file into a ContextGem Document object.

      Parameters:
         * **docx_path_or_file** ("str" | "pathlib.Path" |
           "typing.BinaryIO") -- Path to the DOCX file (as string or
           Path object) or a file-like object

         * **apply_markdown** ("bool") -- If True, applies markdown
           processing and formatting to the document content while
           preserving raw text separately (default: True)

         * **raw_text_to_md** ("bool" | "None") -- [DEPRECATED] Use
           apply_markdown instead. Will be removed in v1.0.0. Note:
           This parameter previously controlled whether raw_text would
           contain raw or markdown text. The new apply_markdown
           parameter instead controls whether to apply markdown
           processing while keeping raw text and processed text
           separate.

         * **include_tables** ("bool") -- If True, include tables in
           the output (default: True)

         * **include_comments** ("bool") -- If True, include comments
           in the output (default: True)

         * **include_footnotes** ("bool") -- If True, include
           footnotes in the output (default: True)

         * **include_headers** ("bool") -- If True, include headers in
           the output (default: True)

         * **include_footers** ("bool") -- If True, include footers in
           the output (default: True)

         * **include_textboxes** ("bool") -- If True, include textbox
           content (default: True)

         * **include_images** ("bool") -- If True, extract and include
           images (default: True)

         * **include_links** ("bool") -- If True, process and format
           hyperlinks (default: True)

         * **include_inline_formatting** ("bool") -- If True, apply
           inline formatting (bold, italic, etc.) in markdown mode
           (default: True)

         * **strict_mode** ("bool") -- If True, raise exceptions for
           any processing error instead of skipping problematic
           elements (default: False)

      Return type:
         "contextgem.public.documents.Document"

      Returns:
         A populated Document object
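
      The difference between strict and lenient processing, as
      described for "strict_mode", can be illustrated with a generic
      sketch. "process_elements" and its "handler" argument are
      hypothetical names; the converter's internal error handling may
      look different.

```python
from typing import Any, Callable, Iterable


def process_elements(
    elements: Iterable[Any],
    handler: Callable[[Any], Any],
    strict_mode: bool = False,
) -> list[Any]:
    # Hypothetical illustration of the strict_mode idea
    results = []
    for element in elements:
        try:
            results.append(handler(element))
        except Exception:
            if strict_mode:
                # Strict: propagate the first processing error
                raise
            # Lenient: skip the problematic element and continue
            continue
    return results


# "x" cannot be parsed as an integer, so it is skipped in lenient mode
lenient = process_elements(["1", "x", "3"], int)
```

      With "strict_mode=True", the same input would raise the
      underlying "ValueError" instead of silently dropping "x".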


# ==== api/aspects ====

Aspects
*******

Module for handling document aspects.

This module provides the Aspect class, which represents a defined area
or topic within a document that requires focused attention. Aspects
are used to identify and extract specific subjects or themes from
documents according to predefined criteria.

class contextgem.public.aspects.Aspect(**data)

   Bases: "_Aspect"

   Represents an aspect with associated metadata, sub-aspects,
   concepts, and logic for validation.

   An aspect is a defined area or topic within a document that
   requires focused attention. Each aspect corresponds to a specific
   subject or theme described in the task.

   Variables:
      * **name** (*str*) -- The name of the aspect. Required, non-
        empty string.

      * **description** (*str*) -- A detailed description of the
        aspect. Required, non-empty string.

      * **concepts** (*list**[**_Concept**]*) -- A list of concepts
        associated with the aspect. These concepts must be unique in
        both name and description and cannot include concepts with
        vision LLM roles.

      * **llm_role** (*LLMRoleAspect*) -- The role of the LLM
        responsible for aspect extraction. Default is
        "extractor_text". Valid roles are "extractor_text" and
        "reasoner_text".

      * **reference_depth** (*ReferenceDepth*) -- The structural depth
        of references (paragraphs or sentences). Defaults to
        "paragraphs". Affects the structure of "extracted_items".

      * **add_justifications** (*bool*) -- Whether the LLM will output
        justification for each extracted item. Inherited from base
        class. Defaults to False.

      * **justification_depth** (*JustificationDepth*) -- The level of
        detail for justifications. Inherited from base class. Defaults
        to "brief".

      * **justification_max_sents** (*int*) -- Maximum number of
        sentences in a justification. Inherited from base class.
        Defaults to 2.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

   Example:
      Aspect definition

         from contextgem import Aspect


         # Define an aspect focused on termination clauses
         termination_aspect = Aspect(
             name="Termination provisions",
             description="Contract termination conditions, notice requirements, and severance terms.",
             reference_depth="sentences",
             add_justifications=True,
             justification_depth="comprehensive",
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and
      returns the updated instance. This method ensures that the
      provided aspects are deeply copied to avoid any unintended state
      modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
         be added. Each aspect is deeply copied to ensure the original
         list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self
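
      The effect of deep-copying on reusable definitions can be shown
      with a small sketch. "Item" and "Container" are hypothetical
      stand-ins rather than ContextGem classes; only the copy-on-add
      behavior described above is imitated.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Item:
    # Hypothetical stand-in for a reusable aspect definition
    name: str
    tags: list[str] = field(default_factory=list)


@dataclass
class Container:
    # Hypothetical stand-in for an object that holds aspects
    items: list[Item] = field(default_factory=list)

    def add_items(self, items: list[Item]) -> "Container":
        # Store deep copies so later changes never reach the originals
        self.items.extend(copy.deepcopy(items))
        return self


reusable = Item("Termination provisions")
container = Container().add_items([reusable])
# Mutating the stored copy leaves the reusable original untouched
container.items[0].tags.append("extracted")
```

      This is why one definition can be attached to several objects
      without state from one leaking into another.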

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts* attribute
      of the instance. This method ensures that the provided list of
      concepts is deep-copied to prevent unintended side effects from
      modifying the input list outside of this method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts
         to be added. It will be deep-copied before being added to the
         instance's *concepts* attribute.

      Returns:
         Returns the instance itself after the modification.

      Return type:
         Self

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   property extracted_items: list[_ExtractedItem]

      Provides access to extracted items.

      Returns:
         A list containing the extracted items as *_ExtractedItem*
         objects.

      Return type:
         list[_ExtractedItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.
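
      The class-name check can be pictured with a standard-library
      sketch. The "__class__" key and the "Serializable",
      "AspectLike", and "ConceptLike" classes are assumptions made for
      this illustration; the key name and layout of the library's
      serialized data are not specified here.

```python
import json


class Serializable:
    # Hypothetical base class imitating the documented round trip
    def __init__(self, name: str) -> None:
        self.name = name

    def to_json(self) -> str:
        # Record the class name next to the data (assumed key name)
        payload = {"__class__": type(self).__name__, "name": self.name}
        return json.dumps(payload)

    @classmethod
    def from_json(cls, json_string: str) -> "Serializable":
        obj_dict = json.loads(json_string)
        # Refuse data that was serialized from a different class
        if obj_dict.get("__class__") != cls.__name__:
            raise TypeError(
                f"Expected data for {cls.__name__}, "
                f"got {obj_dict.get('__class__')}"
            )
        return cls(name=obj_dict["name"])


class AspectLike(Serializable):
    pass


class ConceptLike(Serializable):
    pass


serialized = AspectLike("Termination provisions").to_json()
restored = AspectLike.from_json(serialized)
```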

   get_aspect_by_name(name)

      Finds and returns an aspect with the specified name from the
      list of available aspects, if the instance has *aspects*
      attribute.

      Parameters:
         **name** (*str*) -- The name of the aspect to find.

      Returns:
         The aspect with the specified name.

      Return type:
         _Aspect

      Raises:
         **ValueError** -- If no aspect with the specified name is
         found.

   get_aspects_by_names(names)

      Retrieve a list of _Aspect objects corresponding to the provided
      list of names.

      Parameters:
         **names** ("list"["str"]) -- List of aspect names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Aspect objects corresponding to provided names.

      Return type:
         list[_Aspect]

   get_concept_by_name(name)

      Retrieves a concept from the list of concepts based on the
      provided name, if the instance has *concepts* attribute.

      Parameters:
         **name** (*str*) -- The name of the concept to search for.

      Returns:
         The *_Concept* object with the specified name.

      Return type:
         _Concept

      Raises:
         **ValueError** -- If no concept with the specified name is
         found.

   get_concepts_by_names(names)

      Retrieve a list of _Concept objects corresponding to the
      provided list of names.

      Parameters:
         **names** ("list"["str"]) -- List of concept names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Concept objects corresponding to provided names.

      Return type:
         list[_Concept]

   property llm_roles: set[str]

      A set of LLM roles associated with the object's aspects and
      concepts.

      Returns:
         A set containing unique LLM roles gathered from aspects and
         concepts.

      Return type:
         set[str]
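
      One way to picture the gathering of roles is the sketch below.
      It assumes that an aspect contributes its own role plus the
      roles of its concepts; that traversal, and the "AspectLike" and
      "ConceptLike" classes, are hypothetical and may not match the
      library's exact logic.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptLike:
    # Hypothetical stand-in for a concept
    llm_role: str


@dataclass
class AspectLike:
    # Hypothetical stand-in for an aspect with attached concepts
    llm_role: str
    concepts: list[ConceptLike] = field(default_factory=list)


def collect_llm_roles(
    aspects: list[AspectLike], concepts: list[ConceptLike]
) -> set[str]:
    # Start with the roles of the directly attached concepts
    roles = {c.llm_role for c in concepts}
    for aspect in aspects:
        # Assumed: add the aspect's own role and its concepts' roles
        roles.add(aspect.llm_role)
        roles.update(c.llm_role for c in aspect.concepts)
    return roles


roles = collect_llm_roles(
    aspects=[AspectLike("extractor_text", [ConceptLike("reasoner_text")])],
    concepts=[ConceptLike("extractor_text")],
)
```

      Using a set means a role shared by several aspects or concepts
      appears only once.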

   property reference_paragraphs: list[_Paragraph]

      Provides access to the instance's reference paragraphs, assigned
      during extraction.

      Returns:
         A list containing the paragraphs as *_Paragraph* objects.

      Return type:
         list[_Paragraph]

   property reference_sentences: list[_Sentence]

      Provides access to the instance's reference sentences, assigned
      during extraction.

      Returns:
         A list containing the sentences as *_Sentence* objects.

      Return type:
         list[_Sentence]

   remove_all_aspects()

      Removes all aspects from the instance and returns the updated
      instance.

      This method clears the *aspects* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all aspects removed

   remove_all_concepts()

      Removes all concepts from the instance and returns the updated
      instance.

      This method clears the *concepts* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all concepts removed

   remove_all_instances()

      Removes all assigned instances from the object by resetting the
      corresponding attributes to empty lists. Returns the modified
      instance.

      Returns:
         The modified object with all assigned instances removed.

      Return type:
         Self

   remove_aspect_by_name(name)

      Removes an aspect from the assigned aspects by its name.

      Parameters:
         **name** (*str*) -- The name of the aspect to be removed

      Returns:
         Updated instance with the aspect removed.

      Return type:
         Self

   remove_aspects_by_names(names)

      Removes multiple aspects from an object based on the provided
      list of names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of names identifying
         the aspects to be removed.

      Returns:
         The updated object after the specified aspects have been
         removed.

      Return type:
         Self

   remove_concept_by_name(name)

      Removes a concept from the assigned concepts by its name.

      Parameters:
         **name** (*str*) -- The name of the concept to be removed

      Returns:
         Updated instance with the concept removed.

      Return type:
         Self

   remove_concepts_by_names(names)

      Removes concepts from the object by their names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of concept names to
         be removed.

      Returns:
         Returns the updated instance after removing the specified
         concepts.

      Return type:
         Self

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   name: NonEmptyStr

   description: NonEmptyStr

   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   llm_role: LLMRoleAspect

   reference_depth: ReferenceDepth

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField


# ==== api/concepts ====

Concepts
********

Module for handling concepts at aspect and document levels.

This module provides classes for defining different types of concepts
that can be extracted from documents and aspects. Concepts represent
specific pieces of information to be identified and extracted by LLMs,
such as strings, numbers, boolean values, JSON objects, and ratings.

Each concept type has specific properties and behaviors tailored to
the kind of data it represents, including validation rules, extraction
methods, and reference handling. Concepts can be attached to documents
or aspects and can include examples, justifications, and references to
the source text.

class contextgem.public.concepts.StringConcept(**data)

   Bases: "_StringConcept"

   A concept model for string-based information extraction from
   documents and aspects.

   This class provides functionality for defining, extracting, and
   managing string data as conceptual entities within documents or
   aspects.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **examples** (*list**[**StringExample**]*) -- Example strings
        illustrating the concept usage.

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*StrictBool*) -- Whether this concept
        is restricted to a single extracted item. If True, only one
        item is extracted. Defaults to False (multiple extracted items
        are allowed). Note that with advanced LLMs, this constraint
        may not be strictly required, as they can often infer the
        appropriate number of items to extract from the concept's
        name, description, and type (e.g., "document title" vs "key
        findings").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **examples** (*list**[**_StringExample**]*)

   Example:
      String concept definition

         from contextgem import StringConcept, StringExample


         # Define a string concept for identifying contract party names
         # and their roles in the contract
         party_names_and_roles_concept = StringConcept(
             name="Party names and roles",
             description=(
                 "Names of all parties entering into the agreement and their contractual roles"
             ),
             examples=[
                 StringExample(
                     content="X (Client)",  # guidance regarding format
                 )
             ],
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   property extracted_items: list[_ExtractedItem]

      Provides access to extracted items.

      Returns:
         A list containing the extracted items as *_ExtractedItem*
         objects.

      Return type:
         list[_ExtractedItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.
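
      The round trip and the class-name check described above can be
      sketched with a standalone stand-in. This is not ContextGem's
      implementation: *SketchConcept*, *OtherSketchConcept*, and the
      "__class__" key are hypothetical names chosen for illustration,
      and only the standard library is used.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SketchConcept:
    """Hypothetical stand-in for a serializable concept (not a ContextGem class)."""

    name: str
    description: str

    def to_dict(self) -> dict:
        # Store the class name so that deserialization can verify it.
        # The "__class__" key is an assumption made for this sketch.
        return {"__class__": type(self).__name__, **asdict(self)}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, obj_dict: dict) -> "SketchConcept":
        fields = {k: v for k, v in obj_dict.items() if k != "__class__"}
        return cls(**fields)

    @classmethod
    def from_json(cls, json_string: str) -> "SketchConcept":
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class.
        if obj_dict.get("__class__") != cls.__name__:
            raise TypeError(
                f"Expected serialized {cls.__name__}, "
                f"got {obj_dict.get('__class__')!r}"
            )
        return cls.from_dict(obj_dict)


@dataclass
class OtherSketchConcept(SketchConcept):
    """Second hypothetical class, used only to trigger the class-name check."""


original = SketchConcept(
    name="Party names",
    description="Names of the contracting parties",
)
restored = SketchConcept.from_json(original.to_json())

try:
    OtherSketchConcept.from_json(original.to_json())
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
```

      In the sketch, *restored* compares equal to *original*, while
      loading the same JSON through the other class raises
      *TypeError*, which mirrors the documented behavior.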

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.
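
      The file-path contract documented for *from_disk* and *to_disk*
      (a '.json' suffix is required, *ValueError* otherwise,
      *RuntimeError* on I/O or parsing failures) can be illustrated
      with a standalone sketch. *save_json* and *load_json* are
      hypothetical helpers written for this example, not ContextGem
      functions, and the indentation setting is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def _check_json_suffix(file_path: Union[str, Path]) -> Path:
    # Mirror the documented rule: the path must end with '.json'.
    if not str(file_path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    return Path(file_path)


def save_json(data: dict, file_path: Union[str, Path]) -> None:
    """Hypothetical helper: write a dictionary as formatted UTF-8 JSON."""
    path = _check_json_suffix(file_path)
    try:
        path.write_text(
            json.dumps(data, indent=4, ensure_ascii=False),
            encoding="utf-8",
        )
    except OSError as exc:
        raise RuntimeError(f"Could not write {path}") from exc


def load_json(file_path: Union[str, Path]) -> dict:
    """Hypothetical helper: read a dictionary back from a JSON file."""
    path = _check_json_suffix(file_path)
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"Could not load {path}") from exc


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "concept.json"
    save_json({"name": "Payment amount"}, target)
    loaded = load_json(target)

try:
    save_json({}, "concept.txt")
    rejected = False
except ValueError:
    rejected = True
```

      Checking the suffix before touching the file system keeps the
      *ValueError* case separate from genuine I/O failures.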

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   examples: list[_StringExample]

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField

class contextgem.public.concepts.BooleanConcept(**data)

   Bases: "_BooleanConcept"

   A concept model for boolean (True/False) information extraction
   from documents and aspects.

   This class handles identification and extraction of boolean values
   that represent conceptual properties or attributes within content.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*StrictBool*) -- Whether this concept
        is restricted to a single extracted item. If True, only one
        item is extracted. Defaults to False (multiple extracted items
        are allowed). With advanced LLMs, this constraint may not be
        strictly required, as they can often infer the appropriate
        number of items to extract from the concept's name,
        description, and type (e.g., "contains confidential
        information" vs "compliance violations").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, *
        *BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

   Example:
      Boolean concept definition

         from contextgem import BooleanConcept


         # Create the concept with specific configuration
         has_confidentiality = BooleanConcept(
             name="Contains confidentiality clause",
             description="Determines whether the contract includes provisions requiring parties to maintain confidentiality",
             llm_role="reasoner_text",
             singular_occurrence=True,
             add_justifications=True,
             justification_depth="brief",
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.
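
      What "deep copy" implies for nested data can be shown with a
      standalone sketch. *ClonableSketch* is a hypothetical stand-in,
      not a ContextGem class; it only demonstrates that changes to a
      clone's nested values do not leak back into the original.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ClonableSketch:
    """Hypothetical stand-in with a nested, mutable attribute."""

    name: str
    custom_data: dict = field(default_factory=dict)

    def clone(self) -> "ClonableSketch":
        # deepcopy duplicates nested containers instead of sharing them.
        return copy.deepcopy(self)


original = ClonableSketch(
    name="Contains confidentiality clause",
    custom_data={"tags": ["legal"]},
)
duplicate = original.clone()

# Mutating the clone's nested list leaves the original untouched.
duplicate.custom_data["tags"].append("reviewed")
```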

   property extracted_items: list[_ExtractedItem]

      Provides access to extracted items.

      Returns:
         A list containing the extracted items as *_ExtractedItem*
         objects.

      Return type:
         list[_ExtractedItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.
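
      A ULID is a 128-bit identifier that is conventionally written as
      26 Crockford base32 characters. The check below is a standalone
      sketch: *looks_like_ulid* is a hypothetical helper, and it
      assumes the canonical text form. How ContextGem formats
      *unique_id* is not stated in this reference.

```python
import re

# Crockford base32 uses digits and uppercase letters, excluding I, L, O and U.
# The first character is at most 7 because 26 * 5 bits exceeds 128 bits.
ULID_PATTERN = re.compile(r"[0-7][0-9A-HJKMNP-TV-Z]{25}")


def looks_like_ulid(value: str) -> bool:
    """Return True if the string has the canonical ULID text shape."""
    return ULID_PATTERN.fullmatch(value.upper()) is not None


sample_ok = looks_like_ulid("01ARZ3NDEKTSV4RRFFQ69G5FAV")
sample_too_short = looks_like_ulid("01ARZ3NDEK")
sample_bad_char = looks_like_ulid("01ARZ3NDEKTSV4RRFFQ69G5FAU")
```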

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField

class contextgem.public.concepts.NumericalConcept(**data)

   Bases: "_NumericalConcept"

   A concept model for numerical information extraction from documents
   and aspects.

   This class handles identification and extraction of numeric values
   (integers, floats, or both) that represent conceptual measurements
   or quantities within content.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **numeric_type** (*Literal**[**"int"**, **"float"**,
        **"any"**]*) -- Type constraint for extracted numbers ("int",
        "float", or "any"). Defaults to "any" for auto-detection.

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*StrictBool*) -- Whether this concept
        is restricted to a single extracted item. If True, only one
        item is extracted. Defaults to False (multiple extracted items
        are allowed). With advanced LLMs, this constraint may not be
        strictly required, as they can often infer the appropriate
        number of items to extract from the concept's name,
        description, and type (e.g., "total revenue" vs "monthly sales
        figures").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, *
        *BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **numeric_type** (*Literal**[**'int'**, **'float'**,
        **'any'**]*)

   Example:
      Numerical concept definition

         from contextgem import NumericalConcept


         # Create concepts for different numerical values in the contract
         payment_amount = NumericalConcept(
             name="Payment amount",
             description="The monetary value to be paid according to the contract terms",
             numeric_type="float",
             llm_role="extractor_text",
             add_references=True,
             reference_depth="sentences",
         )

         payment_days = NumericalConcept(
             name="Payment term days",
             description="The number of days within which payment must be made",
             numeric_type="int",
             llm_role="extractor_text",
             add_justifications=True,
             justification_depth="balanced",
         )
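
   The effect of the "int", "float", and "any" options can be
   pictured with a standalone check. This is a sketch under
   assumptions, not ContextGem's validation logic:
   *matches_numeric_type* is a hypothetical helper, and it uses strict
   type checks (an integer does not satisfy "float"), which the
   library may handle differently.

```python
from typing import Literal, Union

NumericType = Literal["int", "float", "any"]


def matches_numeric_type(
    value: Union[int, float], numeric_type: NumericType = "any"
) -> bool:
    """Hypothetical check of a value against a numeric type constraint."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    if numeric_type == "int":
        return isinstance(value, int)
    if numeric_type == "float":
        return isinstance(value, float)
    if numeric_type == "any":
        return isinstance(value, (int, float))
    raise ValueError(f"Unknown numeric_type: {numeric_type!r}")


days_is_int = matches_numeric_type(30, "int")
amount_is_float = matches_numeric_type(1500.0, "float")
amount_is_int = matches_numeric_type(1500.0, "int")
either_is_any = matches_numeric_type(30, "any") and matches_numeric_type(1500.0, "any")
```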

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   property extracted_items: list[_ExtractedItem]

      Provides access to extracted items.

      Returns:
         A list containing the extracted items as *_ExtractedItem*
         objects.

      Return type:
         list[_ExtractedItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   numeric_type: Literal['int', 'float', 'any']

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField

class contextgem.public.concepts.RatingConcept(**data)

   Bases: "_RatingConcept"

   A concept model for rating-based information extraction with
   defined scale boundaries.

   This class handles identification and extraction of integer ratings
   that must fall within the boundaries of a specified rating scale.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **rating_scale** (*RatingScale** | **tuple**[**int**,
        **int**]*) -- The rating scale defining valid value
        boundaries. Can be either a RatingScale object (deprecated,
        will be removed in v1.0.0) or a tuple of (start, end)
        integers.

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*StrictBool*) -- Whether this concept
        is restricted to a single extracted item. If True, only one
        item is extracted. Defaults to False (multiple extracted items
        are allowed). With advanced LLMs, this constraint may not be
        strictly required, as they can often infer the appropriate
        number of items to extract from the concept's name,
        description, and type (e.g., "product rating score" vs
        "customer satisfaction ratings").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, *
        *BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **rating_scale** (*_RatingScale** |
        **tuple**[**Annotated**[**int**,
        **Strict**(**strict=True**)**]**, **Annotated**[**int**,
        **Strict**(**strict=True**)**]**]*)

   Example:
      Rating concept definition

         from contextgem import RatingConcept


         # Create a concept to rate the fairness of contract terms
         fairness_rating = RatingConcept(
             name="Contract fairness rating",
             description="Evaluation of how balanced and fair the contract terms are for all parties",
             rating_scale=(1, 5),
             llm_role="reasoner_text",
             add_justifications=True,
             justification_depth="comprehensive",
             justification_max_sents=10,
         )

         # Create a concept to rate the clarity of contract language
         clarity_rating = RatingConcept(
             name="Language clarity rating",
             description="Assessment of how clear and unambiguous the contract language is",
             rating_scale=(1, 10),
             llm_role="reasoner_text",
             add_justifications=True,
             justification_depth="balanced",
             justification_max_sents=3,
         )
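
   The requirement that an extracted rating falls within the
   (start, end) scale can be pictured with a standalone check. This is
   a sketch, not ContextGem's validation code: *validate_rating* is a
   hypothetical helper, and treating both boundaries as inclusive and
   requiring start < end are assumptions.

```python
from typing import Tuple


def validate_rating(value: int, rating_scale: Tuple[int, int]) -> int:
    """Hypothetical check that an integer rating lies on the given scale."""
    start, end = rating_scale
    if start >= end:
        raise ValueError("rating_scale start must be lower than end")
    # bool is a subclass of int, so reject it alongside non-integers.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("rating must be an integer")
    # Assumption: both ends of the scale are valid ratings.
    if not start <= value <= end:
        raise ValueError(f"rating {value} is outside the scale {start}..{end}")
    return value


accepted = validate_rating(4, (1, 5))

try:
    validate_rating(11, (1, 10))
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```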

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   property extracted_items: list[_IntegerItem]

      Gets the list of extracted rating items.

      Returns:
         List of extracted integer items representing ratings.

      Return type:
         list[_IntegerItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   rating_scale: _RatingScale | tuple[StrictInt, StrictInt]

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField

class contextgem.public.concepts.JsonObjectConcept(**data)

   Bases: "_JsonObjectConcept"

   A concept model for structured JSON object extraction from
   documents and aspects.

   This class handles identification and extraction of structured data
   in JSON format, with validation against a predefined schema
   structure.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **structure** (*type** | **dict**[**str**, **Any**]*) --

        JSON object schema as a class with type annotations or
        dictionary where keys are field names and values are type
        annotations. All dictionary keys must be strings. Supports
        generic aliases, union types, nested dictionaries for complex
        hierarchical structures, lists of dictionaries for array
        items, Literal types, and classes with type annotations
        (Pydantic models, dataclasses, etc.) for nested structures.
        All annotated types must be JSON-serializable. Examples:

        * Simple structure: "{"item": str, "amount": int | float}"

        * Nested structure: "{"item": str, "details": {"price": float,
          "quantity": int}}"

        * List of objects: "{"items": [{"name": str, "price":
          float}]}"

        * List of primitives: "{"names": [str], "scores": [int |
          float], "statuses": [Literal["active", "inactive"]]}"

        * List of classes: "{"addresses": [AddressModel], "users":
          [UserModel]}"

        * Literal values: "{"status": Literal["pending", "completed",
          "failed"]}"

        * With type annotated classes: "{"address": AddressModel}"
          where AddressModel can be a Pydantic model, dataclass, or
          any class with type annotations

        **Note**: For lists, you can use either generic syntax
        ("list[str]") or literal syntax ("[str]"). List instances
        support primitive types, unions, literals, and typed classes.
        Both "{"items": [ClassName]}" and "{"items": list[ClassName]}"
        are equivalent.

        **Note**: Class types cannot be used as dictionary keys or
        values. For example, "dict[str, Address]" is not allowed. Use
        alternative structures like nested objects or lists of objects
        instead.

        **Note**: When using classes that contain other classes as
        type hints, inherit from "JsonObjectClassStruct" in all parts
        of the class hierarchy, to ensure proper conversion of nested
        class hierarchies to dictionary representations for
        serialization.

        **Tip**: Keep the structure simple to avoid overloading the
        prompt.

      * **examples** (*list**[**JsonObjectExample**]*) -- Example JSON
        objects illustrating the concept usage.

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*StrictBool*) -- Whether this concept
        is restricted to having only one extracted item. If True, only
        a single item is extracted. Defaults to False
        (multiple extracted items are allowed). Note that with
        advanced LLMs, this constraint may not be strictly required as
        they can often infer the appropriate number of items to
        extract from the concept's name, description, and type (e.g.,
        "product specifications" vs "customer order details").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, *
        *BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **structure** (*type** | **dict**[**Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]**,
        **Any**]*)

      * **examples** (*list**[**_JsonObjectExample**]*)

   Example:
      JSON object concept definition

         from typing import Literal

         from contextgem import JsonObjectConcept


         # Define a JSON object concept for capturing address information
         address_info_concept = JsonObjectConcept(
             name="Address information",
             description=(
                 "Structured address data from text including street, "
                 "city, state, postal code, and country."
             ),
             structure={
                 "street": str | None,
                 "city": str | None,
                 "state": str | None,
                 "postal_code": str | None,
                 "country": str | None,
                 "address_type": Literal["residential", "business"] | None,
             },
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.
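
      The practical effect of a deep copy is that nested containers
      are duplicated too, so changes to the clone do not leak into the
      original. A minimal standard-library sketch, using a
      hypothetical "ConceptSketch" dataclass as a stand-in for a
      concept:

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ConceptSketch:
    """Hypothetical stand-in for a concept; not a ContextGem class."""

    name: str
    custom_data: dict = field(default_factory=dict)

    def clone(self) -> "ConceptSketch":
        # deepcopy duplicates nested containers as well as the object itself.
        return copy.deepcopy(self)


original = ConceptSketch(name="Address information", custom_data={"tags": ["legal"]})
duplicate = original.clone()
duplicate.custom_data["tags"].append("draft")  # only the clone changes
```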

   property extracted_items: list[_ExtractedItem]

      Provides access to extracted items.

      Returns:
         A list containing the extracted items as *_ExtractedItem*
         objects.

      Return type:
         list[_ExtractedItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.
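
      The class-name check described above can be pictured as
      follows. This is a sketch under stated assumptions: the classes
      and the "__class__" key are hypothetical, and the real
      serialized layout used by ContextGem is not shown in this
      reference.

```python
import json
from typing import Any


class SerializableSketch:
    """Hypothetical base class; ContextGem's internals may differ."""

    def __init__(self, name: str) -> None:
        self.name = name

    def to_dict(self) -> dict[str, Any]:
        # Store the class name next to the public attributes (assumed key).
        return {"__class__": type(self).__name__, "name": self.name}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, obj_dict: dict[str, Any]) -> "SerializableSketch":
        return cls(name=obj_dict["name"])

    @classmethod
    def from_json(cls, json_string: str) -> "SerializableSketch":
        obj_dict = json.loads(json_string)
        # Refuse data that was serialized from a different class.
        if obj_dict.get("__class__") != cls.__name__:
            raise TypeError(
                f"Expected {cls.__name__}, got {obj_dict.get('__class__')}"
            )
        return cls.from_dict(obj_dict)


class DateSketch(SerializableSketch):
    pass


class LabelSketch(SerializableSketch):
    pass


payload = DateSketch(name="Effective date").to_json()
restored = DateSketch.from_json(payload)  # same class: accepted
```

      Calling "LabelSketch.from_json(payload)" on the same payload
      raises TypeError in this sketch, because the stored class name
      does not match.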

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   structure: type | dict[NonEmptyStr, Any]

   examples: list[_JsonObjectExample]

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField

class contextgem.public.concepts.DateConcept(**data)

   Bases: "_DateConcept"

   A concept model for date object extraction from documents and
   aspects.

   This class handles identification and extraction of dates, with
   support for parsing string representations in a specified format
   into Python date objects.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*StrictBool*) -- Whether this concept
        is restricted to having only one extracted item. If True, only
        a single item is extracted. Defaults to False
        (multiple extracted items are allowed). Note that with
        advanced LLMs, this constraint may not be strictly required as
        they can often infer the appropriate number of items to
        extract from the concept's name, description, and type (e.g.,
        "contract signing date" vs "meeting dates").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, *
        *BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

   Example:
      Date concept definition

         from contextgem import DateConcept


         # Create a date concept to extract the effective date of the contract
         effective_date = DateConcept(
             name="Effective date",
             description="The effective date as specified in the contract",
             add_references=True,  # Include references to where dates were found
             singular_occurrence=True,  # Only extract one effective date per document
         )
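
   The class description mentions parsing string representations in
   a specified format into Python date objects. The sketch below
   shows that step in isolation with the standard library. The helper
   "parse_date" and the "%d-%m-%Y" format are assumptions made for
   illustration; this reference does not state which format ContextGem
   uses.

```python
from datetime import date, datetime


def parse_date(raw: str, fmt: str = "%d-%m-%Y") -> date:
    """Parse `raw` with `fmt` and return a date (hypothetical helper)."""
    # strptime raises ValueError when the string does not match the format
    # or names an impossible calendar date.
    return datetime.strptime(raw, fmt).date()


effective = parse_date("01-02-2025")  # day-month-year: 1 February 2025
iso_style = parse_date("2025-02-01", "%Y-%m-%d")  # same date, other format
```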

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   property extracted_items: list[_ExtractedItem]

      Provides access to extracted items.

      Returns:
         A list containing the extracted items as *_ExtractedItem*
         objects.

      Return type:
         list[_ExtractedItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField

class contextgem.public.concepts.LabelConcept(**data)

   Bases: "_LabelConcept"

   A concept model for label-based classification of documents and
   aspects.

   This class handles identification and classification using
   predefined labels, supporting both multi-class (single label
   selection) and multi-label (multiple label selection)
   classification approaches.

   **Note**: Behavior depends on "classification_type":

   * "multi_class": exactly one label is always returned for each
     extracted item. If none of the specific labels apply, include a
     catch-all label (e.g., ""other"", ""N/A"") among "labels" so the
     model can select it.

   * "multi_label": when none of the predefined labels apply, no
     extracted items may be returned (empty "extracted_items" list).
     This prevents forced classification when no appropriate label
     exists.

   Variables:
      * **name** (*str*) -- The name of the concept (non-empty string,
        stripped).

      * **description** (*str*) -- A brief description of the concept
        (non-empty string, stripped).

      * **labels** (*list**[**str**]*) -- List of predefined labels
        (non-empty strings, stripped) for classification. Must contain
        at least 2 unique labels.

      * **classification_type** (*ClassificationType*) --
        Classification mode - "multi_class" for single label
        selection, "multi_label" for multiple label selection.
        Defaults to "multi_class".

      * **llm_role** (*LLMRoleAny*) -- The role of the LLM responsible
        for extracting the concept ("extractor_text", "reasoner_text",
        "extractor_vision", "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **add_justifications** (*bool*) -- Whether to include
        justifications for extracted items.

      * **justification_depth** (*JustificationDepth*) --
        Justification detail level. Defaults to "brief".

      * **justification_max_sents** (*int*) -- Maximum sentences in
        justification. Defaults to 2.

      * **add_references** (*bool*) -- Whether to include source
        references for extracted items.

      * **reference_depth** (*ReferenceDepth*) -- Source reference
        granularity ("paragraphs" or "sentences"). Defaults to
        "paragraphs". Only relevant when references are added to
        extracted items. Affects the structure of "extracted_items".

      * **singular_occurrence** (*bool*) -- Whether this concept is
        restricted to having only one extracted item. If True, only a
        single item is extracted. Defaults to False
        (multiple extracted items are allowed). Note that with
        advanced LLMs, this constraint may not be strictly required as
        they can often infer the appropriate number of items to
        extract from the concept's name, description, and type (e.g.,
        "document type" vs "content topics").

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**, *
        *BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **add_justifications** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **justification_depth** (*Literal**[**'brief'**,
        **'balanced'**, **'comprehensive'**]*)

      * **justification_max_sents** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **name** (*Annotated**[**str**, **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **description** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **llm_role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **add_references** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **reference_depth** (*Literal**[**'paragraphs'**,
        **'sentences'**]*)

      * **singular_occurrence** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **labels** (*list**[**Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**,
        **pattern=None**)**]**]*)

      * **classification_type** (*Literal**[**'multi_class'**,
        **'multi_label'**]*)

   Example:
      Label concept definition

         from contextgem import LabelConcept


         # Multi-class classification: single label selection
         document_type_concept = LabelConcept(
             name="Document Type",
             description="Classify the type of legal document",
             labels=["NDA", "Consultancy Agreement", "Privacy Policy", "Other"],
             classification_type="multi_class",
             singular_occurrence=True,
         )

         # Multi-label classification: multiple label selection
         content_topics_concept = LabelConcept(
             name="Content Topics",
             description="Identify all relevant topics covered in the document",
             labels=["Finance", "Legal", "Technology", "HR", "Operations", "Marketing"],
             classification_type="multi_label",
             add_justifications=True,  # add justifications for the selected labels
             justification_depth="brief",
         )
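
   The difference between the two classification types can be
   summarized as a pair of checks. This is a standard-library sketch
   with hypothetical helper names ("check_labels", "check_selection");
   it reads "at least 2 unique labels" as "at least two distinct
   labels" and is not ContextGem's validation code.

```python
from typing import Literal

ClassificationType = Literal["multi_class", "multi_label"]


def check_labels(labels: list[str]) -> list[str]:
    """Strip labels and require at least two distinct, non-empty entries."""
    cleaned = [label.strip() for label in labels]
    if any(not label for label in cleaned):
        raise ValueError("labels must be non-empty strings")
    if len(set(cleaned)) < 2:
        raise ValueError("at least 2 unique labels are required")
    return cleaned


def check_selection(
    selected: list[str],
    labels: list[str],
    classification_type: ClassificationType = "multi_class",
) -> list[str]:
    """Validate the labels chosen for one item against the predefined set."""
    unknown = [label for label in selected if label not in labels]
    if unknown:
        raise ValueError(f"labels not in the predefined set: {unknown}")
    # multi_class: exactly one label, hence the advice to add a catch-all.
    if classification_type == "multi_class" and len(selected) != 1:
        raise ValueError("multi_class requires exactly one label")
    # multi_label: any number of labels; an empty selection means no item.
    return selected


labels = check_labels(["NDA", "Privacy Policy", " Other "])
single = check_selection(["Other"], labels, "multi_class")
none_apply = check_selection([], labels, "multi_label")
```

   In this sketch an empty selection is accepted for "multi_label"
   but rejected for "multi_class", which is why a catch-all label such
   as "Other" is useful in the multi-class case.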

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   property extracted_items: list[_LabelItem]

      Gets the list of extracted label items.

      Returns:
         List of extracted label items.

      Return type:
         list[_LabelItem]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   labels: list[NonEmptyStr]

   classification_type: ClassificationType

   name: NonEmptyStr

   description: NonEmptyStr

   llm_role: LLMRoleAny

   add_references: StrictBool

   reference_depth: ReferenceDepth

   singular_occurrence: StrictBool

   add_justifications: StrictBool

   justification_depth: JustificationDepth

   justification_max_sents: StrictInt

   custom_data: JSONDictField


# ==== api/examples ====

Examples
********

Module for handling example data in document processing.

This module provides classes for defining examples that can be used to
guide LLM extraction tasks. Examples serve as reference points for the
model to understand the expected format and content of extracted
information. The module supports different types of examples including
string-based examples and structured JSON object examples.

Examples can be attached to concepts to provide concrete illustrations
of the kind of information to be extracted, improving the accuracy and
consistency of LLM-based extraction processes.

class contextgem.public.examples.StringExample(**data)

   Bases: "_StringExample"

   Represents a string example that can be provided by users for
   certain extraction tasks.

   Variables:
      **content** (*str*) -- A non-empty string that holds the text
      content of the example.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **content** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

   Note:
      Examples are optional and can be used to guide LLM extraction
      tasks. They serve as reference points for the model to
      understand the expected format and content of extracted
      information. StringExample can be attached to a "StringConcept".

   Example:
      String example definition

         from contextgem import StringConcept, StringExample


         # Create string examples
         string_examples = [
             StringExample(content="X (Client)"),
             StringExample(content="Y (Supplier)"),
         ]

         # Attach string examples to a StringConcept
         string_concept = StringConcept(
             name="Contract party name and role",
             description="The name and role of the contract party",
             examples=string_examples,  # Attach the examples to the concept (optional)
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises [*ValidationError*][pydantic_core.ValidationError] if the
   input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.
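
      The class-name check described above can be pictured with a
      small standard-library sketch. This is not ContextGem's
      implementation: the "Note" class and the "__class__" key are
      hypothetical stand-ins chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Note:
    """Hypothetical stand-in for a serializable object."""

    content: str

    def to_json(self) -> str:
        # Store the class name next to the data so it can be checked on load.
        # The "__class__" key is an assumption made for this sketch.
        return json.dumps({"__class__": type(self).__name__, **asdict(self)})

    @classmethod
    def from_json(cls, json_string: str) -> "Note":
        obj_dict = json.loads(json_string)
        class_name = obj_dict.pop("__class__", None)
        # Refuse data that was serialized from a different class.
        if class_name != cls.__name__:
            raise TypeError(
                f"Expected serialized {cls.__name__}, got {class_name!r}"
            )
        return cls(**obj_dict)


# Round trip: serialize, then restore an equal instance.
restored = Note.from_json(Note(content="X (Client)").to_json())
```

      With the real classes, the equivalent round trip would be
      calling *to_json()* on an instance and passing the result to
      *from_json()* on the same class.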

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.
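
      The documented contract of *to_disk()* and *from_disk()* (a
      path ending in '.json', formatted UTF-8 JSON, "ValueError" for a
      wrong extension, "RuntimeError" for I/O or parsing failures) can
      be sketched with the standard library alone. The helper names
      "save_json" and "load_json" are hypothetical, and the sketch
      makes no claim about how ContextGem implements these methods.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Union


def save_json(data: dict[str, Any], file_path: Union[str, Path]) -> None:
    """Write a dict as formatted UTF-8 JSON (hypothetical helper)."""
    if not str(file_path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        Path(file_path).write_text(
            json.dumps(data, indent=4, ensure_ascii=False), encoding="utf-8"
        )
    except OSError as exc:
        raise RuntimeError(f"Could not write {file_path}") from exc


def load_json(file_path: Union[str, Path]) -> dict[str, Any]:
    """Read a dict back from a JSON file (hypothetical helper)."""
    if not str(file_path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        return json.loads(Path(file_path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"Could not load {file_path}") from exc


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.json"
    save_json({"content": "Y (Supplier)"}, target)
    loaded = load_json(target)
```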

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   content: NonEmptyStr

   custom_data: JSONDictField

class contextgem.public.examples.JsonObjectExample(**data)

   Bases: "_JsonObjectExample"

   Represents a JSON object example that can be provided by users for
   certain extraction tasks.

   Variables:
      **content** (*dict**[**str**, **Any**]*) -- A JSON-serializable
      dict with the minimum length of 1 that holds the content of the
      example.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **content** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

   Note:
      Examples are optional and can be used to guide LLM extraction
      tasks. They serve as reference points for the model to
      understand the expected format and content of extracted
      information. JsonObjectExample can be attached to a
      "JsonObjectConcept".

   Example:
      JSON object example definition

         from contextgem import JsonObjectConcept, JsonObjectExample


         # Create a JSON object example
         json_example = JsonObjectExample(
             content={
                 "name": "John Doe",
                 "education": "Bachelor's degree in Computer Science",
                 "skills": ["Python", "Machine Learning", "Data Analysis"],
                 "hobbies": ["Reading", "Traveling", "Gaming"],
             }
         )


         # Define a structure for JSON object concept
         class PersonInfo:
             name: str
             education: str
             skills: list[str]
             hobbies: list[str]


         # Also works as a dict with type hints, e.g.
         # PersonInfo = {
         #     "name": str,
         #     "education": str,
         #     "skills": list[str],
         #     "hobbies": list[str],
         # }

         # Attach JSON example to a JsonObjectConcept
         json_concept = JsonObjectConcept(
             name="Candidate info",
             description="Structured information about a job candidate",
             structure=PersonInfo,  # Define the expected structure
             examples=[json_example],  # Attach the example to the concept (optional)
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises [*ValidationError*][pydantic_core.ValidationError] if the
   input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   content: JSONDictField

   custom_data: JSONDictField


# ==== api/llms ====

LLMs
****

Module for handling processing logic using LLMs.

This module provides classes and utilities for interacting with LLMs
in document processing workflows. It includes functionality for
managing LLM configurations, handling API calls, processing text and
image inputs, tracking token usage and costs, and managing rate limits
for LLM requests.

The module supports various LLM providers through the litellm library,
enabling both text-only and multimodal (vision) capabilities. It
implements efficient asynchronous processing patterns and provides
detailed usage statistics for monitoring and cost management.

class contextgem.public.llms.DocumentLLMGroup(**data)

   Bases: "_DocumentLLMGroup"

   Represents a group of DocumentLLMs with unique roles for processing
   document content.

   This class manages multiple LLMs assigned to specific roles for
   text and vision processing. It ensures role compliance and
   facilitates extraction of aspects and concepts from documents.

   Variables:
      * **llms** (*list**[**DocumentLLM**]*) -- A list of DocumentLLM
        instances, each with a unique role (e.g., *extractor_text*,
        *reasoner_text*, *extractor_vision*, *reasoner_vision*). At
        least 2 instances with distinct roles are required.

      * **output_language** (*LanguageRequirement*) -- Language for
        produced output text (justifications, explanations). Values:
        "en" (always English) or "adapt" (matches document/image
        language). All LLMs in the group must share the same
        output_language setting. Defaults to "en". Applies only when
        DocumentLLMs' default system messages are used.

   Parameters:
      * **llms** (*list**[**_DocumentLLM**]*)

      * **output_language** (*Literal**[**'en'**, **'adapt'**]*)

   Note:
      Refer to the "DocumentLLM" class for more information on
      constructing LLMs for the group.

   Example:
      LLM group definition

         from contextgem import DocumentLLM, DocumentLLMGroup


         # Create a text extractor LLM with a fallback
         text_extractor = DocumentLLM(
             model="openai/gpt-4o-mini",
             api_key="your-openai-api-key",  # Replace with your actual API key
             role="extractor_text",
         )

         # Create a fallback LLM for the text extractor
         text_extractor_fallback = DocumentLLM(
             model="anthropic/claude-3-5-haiku",
             api_key="your-anthropic-api-key",  # Replace with your actual API key
             role="extractor_text",  # Must have the same role as the primary LLM
             is_fallback=True,
         )

         # Assign the fallback LLM to the primary text extractor
         text_extractor.fallback_llm = text_extractor_fallback

         # Create a text reasoner LLM
         text_reasoner = DocumentLLM(
             model="openai/o3-mini",
             api_key="your-openai-api-key",  # Replace with your actual API key
             role="reasoner_text",  # For more complex tasks that require reasoning
         )

         # Create a vision extractor LLM
         vision_extractor = DocumentLLM(
             model="openai/gpt-4o-mini",
             api_key="your-openai-api-key",  # Replace with your actual API key
             role="extractor_vision",  # For handling images
         )

         # Create a vision reasoner LLM
         vision_reasoner = DocumentLLM(
             model="openai/gpt-5-mini",
             api_key="your-openai-api-key",
             role="reasoner_vision",  # For more complex vision tasks that require reasoning
         )

         # Create a DocumentLLMGroup with all four LLMs
         llm_group = DocumentLLMGroup(
             llms=[text_extractor, text_reasoner, vision_extractor, vision_reasoner],
             output_language="en",  # All LLMs must have the same output language ("en" is default)
         )
         # This group will have 5 LLMs: four main ones, with different roles,
         # and one fallback LLM for a specific LLM. Each LLM can have a fallback LLM.

         # Get usage statistics for the whole group or for a specific role
         group_usage = llm_group.get_usage()
         text_extractor_usage = llm_group.get_usage(llm_role="extractor_text")

         # Get cost statistics for the whole group or for a specific role
         all_costs = llm_group.get_cost()
         text_extractor_cost = llm_group.get_cost(llm_role="extractor_text")

         # Reset usage and cost statistics for the whole group or for a specific role
         llm_group.reset_usage_and_cost()
         llm_group.reset_usage_and_cost(llm_role="extractor_text")

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises [*ValidationError*][pydantic_core.ValidationError] if the
   input data cannot be validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.
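
   The group constraints stated under Variables (at least two LLMs,
   one LLM per role, a shared output language) can be illustrated
   with a plain-Python sketch. "FakeLLM" and "check_group" are
   hypothetical names, and raising "ValueError" is an assumption; the
   real class performs its own validation when the model is created.

```python
from dataclasses import dataclass


@dataclass
class FakeLLM:
    """Hypothetical stand-in that carries only the fields being checked."""

    role: str
    output_language: str = "en"


def check_group(llms: list[FakeLLM], output_language: str = "en") -> None:
    """Raise ValueError if the list breaks the documented group constraints."""
    if len(llms) < 2:
        raise ValueError("A group needs at least 2 LLMs")
    roles = [llm.role for llm in llms]
    if len(set(roles)) != len(roles):
        raise ValueError("Each LLM in a group must have a unique role")
    if any(llm.output_language != output_language for llm in llms):
        raise ValueError("All LLMs must share the group's output language")


# Passes: two LLMs, distinct roles, same output language.
check_group([FakeLLM("extractor_text"), FakeLLM("reasoner_text")])
```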

   _eq_deserialized_llm_config(other)

      Custom config equality method to compare this _DocumentLLMGroup
      with a deserialized instance.

      Uses the *_eq_deserialized_llm_config* method of the
      _DocumentLLM class to compare each LLM in the group, including
      fallbacks, if any.

      Parameters:
         **other** (*_DocumentLLMGroup*) -- Another _DocumentLLMGroup
         instance to compare with

      Returns:
         True if the instances are equal, False otherwise

      Return type:
         bool

   extract_all(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts all aspects and concepts from a document and its
      aspects.

      This method performs comprehensive extraction by processing the
      document for aspects and concepts, then extracting concepts from
      each aspect. The operation can be configured for concurrent
      processing and customized extraction parameters.

      This is the synchronous version of *extract_all_async()*.

      Parameters:
         * **document** (*_Document*) -- The document to analyze.

         * **overwrite_existing** (*bool**, **optional*) -- Whether to
           overwrite already processed aspects and concepts with newly
           extracted information. Defaults to False.

         * **max_items_per_call** (*int**, **optional*) -- Maximum
           number of items with the same extraction params to process
           in each LLM call. Defaults to 0 (all items in one call). If
           concurrency is enabled, defaults to 1. For complex tasks,
           you should not set a high value, in order to avoid prompt
           overloading.

         * **use_concurrency** (*bool**, **optional*) -- If True,
           enables concurrent processing of multiple items.
           Concurrency can considerably reduce processing time, but
           may cause rate limit errors with LLM providers. Use this
           option when API rate limits allow for multiple concurrent
           requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int**,
           **optional*) -- Maximum paragraphs to include in a single
           LLM prompt. Defaults to 0 (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images). Relevant only for document-level
           concepts.

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         The document with extracted aspects and concepts.

      Return type:
         _Document
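
      The relationship between a synchronous method and its
      "_async" counterpart can be sketched as follows. This shows one
      common pattern built on "asyncio.run"; it is an assumption for
      illustration, not a description of ContextGem's internals, and
      both functions below are hypothetical stand-ins that operate on
      a plain string.

```python
import asyncio


async def extract_all_async(document: str) -> str:
    """Hypothetical coroutine standing in for an async extraction call."""
    await asyncio.sleep(0)  # placeholder for awaiting LLM API responses
    return document.upper()


def extract_all(document: str) -> str:
    """Synchronous wrapper that runs the coroutine to completion."""
    return asyncio.run(extract_all_async(document))


result = extract_all("contract text")
```

      Note that "asyncio.run" cannot be called while an event loop
      is already running in the same thread, so code that is already
      asynchronous would await the "_async" variant directly.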

   async extract_all_async(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Asynchronously extracts all aspects and concepts from a document
      and its aspects.

      This method performs comprehensive extraction by processing the
      document for aspects and concepts, then extracting concepts from
      each aspect. The operation can be configured for concurrent
      processing and customized extraction parameters.

      Parameters:
         * **document** (*_Document*) -- The document to analyze.

         * **overwrite_existing** (*bool**, **optional*) -- Whether to
           overwrite already processed aspects and concepts with newly
           extracted information. Defaults to False.

         * **max_items_per_call** (*int**, **optional*) -- Maximum
           number of items with the same extraction params to process
           in each LLM call. Defaults to 0 (all items in one call). If
           concurrency is enabled, defaults to 1. For complex tasks,
           you should not set a high value, in order to avoid prompt
           overloading.

         * **use_concurrency** (*bool**, **optional*) -- If True,
           enables concurrent processing of multiple items.
           Concurrency can considerably reduce processing time, but
           may cause rate limit errors with LLM providers. Use this
           option when API rate limits allow for multiple concurrent
           requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int**,
           **optional*) -- Maximum paragraphs to include in a single
           LLM prompt. Defaults to 0 (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images). Relevant only for document-level
           concepts.

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         The document with extracted aspects and concepts.

      Return type:
         _Document
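
      How *max_items_per_call* and *use_concurrency* interact can be
      pictured with a small sketch: 0 means a single batch, unless
      concurrency is enabled, in which case each item gets its own
      call. The functions "make_batches", "fake_llm_call", and
      "process" are hypothetical, and the scheduling shown (one
      "asyncio.gather" over all batches) is an assumption rather than
      the library's actual strategy.

```python
import asyncio


def make_batches(
    items: list[str], max_items_per_call: int, use_concurrency: bool
) -> list[list[str]]:
    """Split items into per-call batches following the documented defaults."""
    if max_items_per_call == 0:
        # 0 means "all items in one call", or 1 item per call with concurrency.
        size = 1 if use_concurrency else max(len(items), 1)
    else:
        size = max_items_per_call
    return [items[i : i + size] for i in range(0, len(items), size)]


async def fake_llm_call(batch: list[str]) -> list[str]:
    """Hypothetical stand-in for one LLM request."""
    await asyncio.sleep(0)
    return [f"extracted:{item}" for item in batch]


async def process(
    items: list[str], max_items_per_call: int = 0, use_concurrency: bool = False
) -> list[str]:
    batches = make_batches(items, max_items_per_call, use_concurrency)
    if use_concurrency:
        # Send all batches at once; gather returns results in input order.
        results = await asyncio.gather(*(fake_llm_call(b) for b in batches))
    else:
        # Send batches one after another.
        results = [await fake_llm_call(b) for b in batches]
    return [item for batch in results for item in batch]


extracted = asyncio.run(process(["a", "b", "c"], use_concurrency=True))
```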

   extract_aspects_from_document(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts aspects from the provided document using predefined
      LLMs.

      If an aspect instance has "extracted_items" populated, the
      "reference_paragraphs" field will be automatically populated
      from these items.

      This is the synchronous version of
      *extract_aspects_from_document_async()*.

      Parameters:
         * **document** (*_Document*) -- The document from which
           aspects are to be extracted.

         * **from_aspects** (*list**[**_Aspect**] **| **None*) --
           Existing aspects to use as a base for extraction. If None,
           uses all document's aspects.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed aspects with newly extracted information.
           Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process per LLM call.
           Defaults to 0 (all items in one call). If concurrency is
           enabled, defaults to 1 (each item processed separately).
           For complex tasks, you should not set a high value, in
           order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed _Aspect objects with extracted items.

      Return type:
         list[_Aspect]
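
      The effect of *raise_exception_on_extraction_error* can be
      sketched with the standard library: with True the error
      propagates, with False a warning is issued and no items are
      returned. "run_extraction" and "failing_call" are hypothetical
      names, and the warning text is an assumption.

```python
import warnings
from collections.abc import Callable


def run_extraction(
    call: Callable[[], list[str]],
    *,
    raise_exception_on_extraction_error: bool = True,
) -> list[str]:
    """Run an extraction callable, raising or warning on failure."""
    try:
        return call()
    except Exception as exc:
        if raise_exception_on_extraction_error:
            raise
        # Downgrade the failure to a warning and return no extracted items.
        warnings.warn(f"Extraction failed: {exc}", stacklevel=2)
        return []


def failing_call() -> list[str]:
    """Hypothetical extraction step that always fails."""
    raise ValueError("invalid data returned by the LLM")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    items = run_extraction(failing_call, raise_exception_on_extraction_error=False)
```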

   async extract_aspects_from_document_async(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Asynchronously extracts aspects from the provided document
      using predefined LLMs.

      If an aspect instance has "extracted_items" populated, the
      "reference_paragraphs" field will be automatically populated
      from these items.

      Parameters:
         * **document** (*_Document*) -- The document from which
           aspects are to be extracted.

         * **from_aspects** (*list**[**_Aspect**] **| **None*) --
           Existing aspects to use as a base for extraction. If None,
           uses all document's aspects.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed aspects with newly extracted information.
           Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process per LLM call.
           Defaults to 0 (all items in one call). If concurrency is
           enabled, defaults to 1. For complex tasks, you should not
           set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed _Aspect objects with extracted items.

      Return type:
         list[_Aspect]

   extract_concepts_from_aspect(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts concepts associated with a given aspect in a document.

      This method processes an aspect to extract related concepts
      using LLMs. If the aspect has not been previously processed, a
      ValueError is raised.

      This is the synchronous version of
      *extract_concepts_from_aspect_async()*.

      Parameters:
         * **aspect** (*_Aspect*) -- The aspect from which to extract
           concepts.

         * **document** (*_Document*) -- The document that contains
           the aspect.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           List of existing concepts to process. Defaults to None.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process in each LLM
           call. Defaults to 0 (all items in one call). If concurrency
           is enabled, defaults to 1. For complex tasks, you should
           not set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to include in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed concept objects.

      Return type:
         list[_Concept]

   async extract_concepts_from_aspect_async(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Asynchronously extracts concepts from a specified aspect using
      LLMs.

      This method processes an aspect to extract related concepts
      using LLMs. If the aspect has not been previously processed, a
      ValueError is raised.

      Parameters:
         * **aspect** (*_Aspect*) -- The aspect from which to extract
           concepts.

         * **document** (*_Document*) -- The document that contains
           the aspect.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           List of existing concepts to process. Defaults to None.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process in each LLM
           call. Defaults to 0 (all items in one call). If concurrency
           is enabled, defaults to 1. For complex tasks, you should
           not set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to include in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed concept objects.

      Return type:
         list[_Concept]
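
      The effect of "max_items_per_call" can be pictured as simple
      batching. The following is a minimal sketch using only the
      standard library; "split_into_calls" is a hypothetical helper
      written for illustration and is not part of ContextGem, and the
      framework's internal grouping may differ.

```python
from __future__ import annotations

from typing import Sequence, TypeVar

T = TypeVar("T")


def split_into_calls(items: Sequence[T], max_items_per_call: int = 0) -> list[list[T]]:
    """Group items into per-call batches (hypothetical helper, not ContextGem API)."""
    if max_items_per_call < 0:
        raise ValueError("max_items_per_call must be >= 0")
    if not items:
        return []
    if max_items_per_call == 0:
        # 0 means no limit: every item goes into a single call.
        return [list(items)]
    return [
        list(items[start : start + max_items_per_call])
        for start in range(0, len(items), max_items_per_call)
    ]


concepts = ["parties", "term", "governing_law", "liability_cap", "notice_period"]
single_call = split_into_calls(concepts)    # one batch holding all five names
batched = split_into_calls(concepts, 2)     # batches of at most two names
one_by_one = split_into_calls(concepts, 1)  # one item per call
```

      Smaller batches mean more LLM calls but shorter prompts, which
      is the trade-off the parameter description refers to.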

   extract_concepts_from_document(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts concepts from the provided document using predefined
      LLMs.

      This is the synchronous version of
      *extract_concepts_from_document_async()*.

      Parameters:
         * **document** (*_Document*) -- The document from which
           concepts are to be extracted.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           Existing concepts to use as a base for extraction. If None,
           uses all document's concepts.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum items with the
           same extraction params to process per LLM call. Defaults to
           0 (all items in a single call). If concurrency is enabled,
           defaults to 1 (each item processed separately). For complex
           tasks, you should not set a high value, in order to avoid
           prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed Concept objects with extracted items.

      Return type:
         list[_Concept]
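
      The two behaviours selected by
      "raise_exception_on_extraction_error" can be sketched as
      follows. This is an illustration only: "ExtractionError",
      "run_extraction" and "failing_call" are hypothetical names, and
      the exception types ContextGem actually raises are not shown
      here.

```python
import warnings


class ExtractionError(RuntimeError):
    """Hypothetical error standing in for an LLM API or validation failure."""


def run_extraction(call, *, raise_exception_on_extraction_error: bool = True) -> list:
    """Run `call`; on failure either re-raise or warn and return no items."""
    try:
        return call()
    except ExtractionError as exc:
        if raise_exception_on_extraction_error:
            raise
        warnings.warn(f"Extraction failed, returning no items: {exc}")
        return []


def failing_call() -> list:
    raise ExtractionError("LLM returned invalid data")


# With the flag off, the failure becomes a warning and an empty result.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    items = run_extraction(failing_call, raise_exception_on_extraction_error=False)
```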

   async extract_concepts_from_document_async(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts concepts from the provided document using predefined
      LLMs asynchronously.

      This method processes a document to extract concepts using
      configured LLMs.

      Parameters:
         * **document** (*_Document*) -- The document from which
           concepts are to be extracted.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           Existing concepts to use as a base for extraction. If None,
           uses all document's concepts.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process per LLM call.
           Defaults to 0 (all items in one call). If concurrency is
           enabled, defaults to 1. For complex tasks, you should not
           set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed Concept objects with extracted items.

      Return type:
         list[_Concept]
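
      The difference that "use_concurrency" makes can be sketched
      with plain "asyncio". The names "process_item" and
      "process_all" are hypothetical stand-ins for LLM calls and are
      not ContextGem functions; the real implementation also applies
      rate limiting, which this sketch omits.

```python
import asyncio


async def process_item(name: str) -> str:
    """Stand-in for one LLM call (hypothetical; sleeps instead of calling an API)."""
    await asyncio.sleep(0.01)
    return f"{name}: done"


async def process_all(names: list, *, use_concurrency: bool = False) -> list:
    if use_concurrency:
        # All calls are in flight at once; gather preserves the input order.
        return list(await asyncio.gather(*(process_item(n) for n in names)))
    # Sequential: each call finishes before the next one starts.
    return [await process_item(n) for n in names]


names = ["parties", "term", "governing_law"]
sequential = asyncio.run(process_all(names))
concurrent = asyncio.run(process_all(names, use_concurrency=True))
```

      Both modes return the same results in the same order; only the
      elapsed time and the number of simultaneous requests differ.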

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.
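
      The class-name check described above can be illustrated with a
      small stand-alone class. This is a sketch under assumptions: the
      "Settings" class and the "__class__" key are hypothetical, and
      ContextGem's actual serialized layout is not documented in this
      section.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Settings:
    """Hypothetical serializable class used only for this sketch."""

    model: str
    role: str = "extractor_text"

    def to_json(self) -> str:
        # Store the class name next to the data so it can be checked on load.
        return json.dumps({"__class__": type(self).__name__, "data": asdict(self)})

    @classmethod
    def from_dict(cls, obj_dict: dict) -> "Settings":
        return cls(**obj_dict)

    @classmethod
    def from_json(cls, json_string: str) -> "Settings":
        payload = json.loads(json_string)
        if payload.get("__class__") != cls.__name__:
            raise TypeError(
                f"Expected serialized {cls.__name__}, got {payload.get('__class__')!r}"
            )
        return cls.from_dict(payload["data"])


original = Settings(model="openai/gpt-4o-mini")
restored = Settings.from_json(original.to_json())
```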

   get_cost(llm_role=None)

      Retrieves the accumulated cost information of the LLMs in the
      group, filtered by the specified LLM role if provided.

      Parameters:
         **llm_role** (*str** | **None*) -- Optional; A string
         representing the role of the LLM to filter the cost data. If
         None, returns cost for all LLMs in the group.

      Returns:
         A list of cost statistics containers for the specified LLMs
         and their fallbacks.

      Return type:
         list[_LLMCostOutputContainer]

      Raises:
         **ValueError** -- If no LLM with the specified role exists in
         the group.

   get_usage(llm_role=None)

      Retrieves the usage information of the LLMs in the group,
      filtered by the specified LLM role if provided.

      Parameters:
         **llm_role** (*str** | **None*) -- Optional; A string
         representing the role of the LLM to filter the usage data. If
         None, returns usage for all LLMs in the group.

      Returns:
         A list of usage statistics containers for the specified LLMs
         and their fallbacks.

      Return type:
         list[_LLMUsageOutputContainer]

      Raises:
         **ValueError** -- If no LLM with the specified role exists in
         the group.
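
      Role-based filtering, as used by "get_usage()" and
      "get_cost()", can be sketched like this. "UsageRecord" and
      "filter_by_role" are hypothetical names for illustration; the
      real containers carry more fields than shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical stand-in for a per-LLM usage container."""

    role: str
    input_tokens: int
    output_tokens: int


def filter_by_role(records: list[UsageRecord], llm_role: str | None = None) -> list[UsageRecord]:
    if llm_role is None:
        # No role given: report every LLM in the group.
        return list(records)
    matches = [r for r in records if r.role == llm_role]
    if not matches:
        raise ValueError(f"No LLM with role {llm_role!r} exists in the group")
    return matches


records = [
    UsageRecord("extractor_text", input_tokens=1200, output_tokens=300),
    UsageRecord("reasoner_text", input_tokens=800, output_tokens=650),
]
all_usage = filter_by_role(records)
reasoner_usage = filter_by_role(records, "reasoner_text")
```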

   group_update_output_language(output_language)

      Updates the output language for all LLMs in the group.

      Parameters:
         **output_language** (*LanguageRequirement*) -- The new output
         language to set for all LLMs in the group.

      Returns:
         None

      Return type:
         None

   property is_group: bool

      Returns True indicating this is a group of LLMs.

      Returns:
         Always True for DocumentLLMGroup instances.

      Return type:
         bool

   property list_roles: list[Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision', 'extractor_multimodal', 'reasoner_multimodal']]

      Returns a list of all roles assigned to the LLMs in this group.

      Returns:
         A list of LLM role identifiers

      Return type:
         list[LLMRoleAny]

   reset_usage_and_cost(llm_role=None)

      Resets the usage and cost statistics for LLMs in the group.

      This method clears accumulated usage and cost data, which is
      useful when processing multiple documents sequentially and
      tracking metrics for each document separately.

      Parameters:
         **llm_role** (*str** | **None*) -- Optional; A string
         representing the role of the LLM to reset statistics for. If
         None, resets statistics for all LLMs in the group.

      Returns:
         None

      Return type:
         None

      Raises:
         **ValueError** -- If no LLM with the specified role exists in
         the group.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]
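
      The removal of credentials and statistics during serialization
      can be pictured with a small sketch. "LLMConfig" and its field
      names are hypothetical; which attributes ContextGem treats as
      credentials, and how it represents the removed values, is an
      assumption here.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class LLMConfig:
    """Hypothetical config object; the field names are illustrative only."""

    model: str
    api_key: str | None = None
    usage: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        data = asdict(self)
        # Drop secrets and run-time statistics before the data leaves the process.
        data["api_key"] = None
        data["usage"] = {}
        return data


config = LLMConfig(model="openai/gpt-4o-mini", api_key="secret", usage={"calls": 3})
serialized = config.to_dict()
```

      The live object keeps its key and statistics; only the
      serialized copy is stripped, so a deserialized instance needs
      its credentials set again before use.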

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Returns:
         None

      Return type:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.
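
      The path validation and error mapping described for
      "to_disk()" and "from_disk()" can be sketched with the standard
      library. The two module-level functions below are hypothetical
      helpers that operate on plain dictionaries, not the ContextGem
      methods themselves, and the indentation width is an assumption.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def to_disk(data: dict, file_path: str | Path) -> None:
    """Write `data` as formatted UTF-8 JSON (hypothetical helper)."""
    path = Path(file_path)
    if not str(path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        path.write_text(json.dumps(data, indent=4, ensure_ascii=False), encoding="utf-8")
    except OSError as exc:
        raise RuntimeError(f"Could not write {path}") from exc


def from_disk(file_path: str | Path) -> dict:
    """Read back a file written by `to_disk` (hypothetical helper)."""
    path = Path(file_path)
    if not str(path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"Could not load {path}") from exc


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "llm_config.json"
    to_disk({"model": "openai/gpt-4o-mini", "note": "naïve"}, target)
    loaded = from_disk(target)
```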

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   llms: list[_DocumentLLM]

   output_language: LanguageRequirement

class contextgem.public.llms.DocumentLLM(**data)

   Bases: "_DocumentLLM"

   Handles processing documents with a specific LLM.

   This class serves as an abstraction for interacting with an LLM. It
   provides functionality for querying the LLM with text or image
   inputs, and manages prompt preparation and token usage tracking.
   The class can be configured with different roles based on the
   document processing task.

   Variables:
      * **model** (*str*) -- Model identifier in format
        {model_provider}/{model_name}. See
        https://docs.litellm.ai/docs/providers for supported
        providers.

      * **deployment_id** (*str** | **None*) -- Deployment ID for the
        LLM. Primarily used with Azure OpenAI.

      * **api_key** (*str** | **None*) -- API key for LLM
        authentication. Not required for local models (e.g., Ollama).

      * **api_base** (*str** | **None*) -- Base URL of the API
        endpoint.

      * **api_version** (*str** | **None*) -- API version. Primarily
        used with Azure OpenAI.

      * **role** (*LLMRoleAny*) -- Role type for the LLM
        ("extractor_text", "reasoner_text", "extractor_vision",
        "reasoner_vision", "extractor_multimodal",
        "reasoner_multimodal"). Defaults to "extractor_text".

      * **system_message** (*str** | **None*) -- Preparatory system-
        level message to set context for LLM responses.

      * **temperature** (*float** | **None*) -- Sampling temperature
        (0.0 to 1.0) controlling response creativity. Lower values
        produce more predictable outputs, higher values generate more
        varied responses. Defaults to 0.3.

      * **max_tokens** (*int*) -- Maximum tokens allowed in the
        generated response. Defaults to 4096.

      * **max_completion_tokens** (*int*) -- Maximum token size for
        output completions in reasoning (CoT-capable) models. Defaults
        to 16000.

      * **reasoning_effort** (*ReasoningEffort** | **None*) -- The
        effort level for the LLM to reason about the input. Can be set
        to "minimal" (gpt-5 models only), "low", "medium", or "high".
        Relevant for reasoning (CoT-capable) models. Defaults to None.

      * **top_p** (*float** | **None*) -- Nucleus sampling value (0.0
        to 1.0) controlling output focus/randomness. Lower values make
        output more deterministic, higher values produce more diverse
        outputs. Defaults to 0.3.

      * **num_retries_failed_request** (*int*) -- Number of retries
        when LLM request fails. Defaults to 3.

      * **max_retries_failed_request** (*int*) -- LLM provider-
        specific retry count for failed requests. Defaults to 0.

      * **max_retries_invalid_data** (*int*) -- Number of retries when
        LLM returns invalid data. Defaults to 3.

      * **timeout** (*int*) -- Timeout in seconds for LLM API calls.
        Defaults to 120 seconds.

      * **pricing_details** (*LLMPricing** | **None*) -- LLMPricing
        object with pricing details for cost calculation. Defaults to
        None.

      * **auto_pricing** (*bool*) -- Enable automatic LLM cost
        calculation using genai-prices. Ignored when "pricing_details"
        is provided. Defaults to "False".

      * **auto_pricing_refresh** (*bool*) -- Whether genai-prices
        should auto-refresh its cached pricing data. Defaults to
        "False".

      * **is_fallback** (*bool*) -- Indicates whether the LLM is a
        fallback model. Defaults to False.

      * **fallback_llm** (*DocumentLLM** | **None*) -- DocumentLLM to
        use as fallback if current one fails. Must have the same role
        as the current LLM. Defaults to None.

      * **output_language** (*LanguageRequirement*) -- Language for
        produced output text (justifications, explanations). Can be
        "en" (English) or "adapt" (adapts to document/image language).
        Defaults to "en". Applies only when DocumentLLM's default
        system message is used.

      * **async_limiter** (*AsyncLimiter*) -- Controls frequency of
        async LLM API requests for concurrent tasks. Defaults to
        allowing 3 acquisitions per 10-second period to prevent rate
        limit issues. See https://github.com/mjpieters/aiolimiter for
        configuration details.

      * **seed** (*int** | **None*) -- Seed for random number
        generation to help produce more consistent outputs across
        multiple runs. When set to a specific integer value, the LLM
        will attempt to use this seed for sampling operations.
        However, deterministic output is still not guaranteed even
        with the same seed, as other factors may influence the model's
        response. Defaults to None.

   Parameters:
      * **model** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **deployment_id** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **api_key** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **api_base** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **api_version** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **role** (*Literal**[**'extractor_text'**,
        **'reasoner_text'**, **'extractor_vision'**,
        **'reasoner_vision'**, **'extractor_multimodal'**,
        **'reasoner_multimodal'**]*)

      * **system_message** (*str** | **None*)

      * **max_tokens** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **max_completion_tokens** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **reasoning_effort** (*Literal**[**'minimal'**, **'low'**,
        **'medium'**, **'high'**] **| **None*)

      * **num_retries_failed_request** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **max_retries_failed_request** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **max_retries_invalid_data** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **timeout** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **pricing_details** (*_LLMPricing** | **None*)

      * **is_fallback** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **fallback_llm** (*_DocumentLLM** | **None*)

      * **output_language** (*Literal**[**'en'**, **'adapt'**]*)

      * **temperature** (*Annotated**[**float**,
        **Strict**(**strict=True**)**] **| **None*)

      * **top_p** (*Annotated**[**float**,
        **Strict**(**strict=True**)**] **| **None*)

      * **seed** (*Annotated**[**int**, **Strict**(**strict=True**)**]
        **| **None*)

      * **tools** (*list**[**Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.valid
        ators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]**] **|
        **None*)

      * **tool_choice** (*str** | **Annotated**[**dict**[**str**,
        **Any**]**, **BeforeValidator**(**func=~contextgem.internal.t
        ypings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**] **| **None*)

      * **parallel_tool_calls** (*bool** | **None*)

      * **tool_max_rounds** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **auto_pricing** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

      * **auto_pricing_refresh** (*Annotated**[**bool**,
        **Strict**(**strict=True**)**]*)

   Note:

      * LLM groups
           Refer to the "DocumentLLMGroup" class for more information
           on constructing LLM groups, which are a collection of LLMs
           with unique roles, used for complex document processing
           tasks.

      * LLM role
           The "role" of an LLM is an abstraction to differentiate
           between tasks of different complexity. For example, if an
           aspect/concept is assigned "llm_role="extractor_text"", it
           means that the aspect/concept is extracted from the
           document using the LLM with the "role" set to
           "extractor_text". This helps to channel different tasks to
           different LLMs, ensuring that the task is handled by the
           most appropriate model. Usually, domain expertise is
           required to determine the most appropriate role for a
           specific aspect/concept. But for simple use cases, you can
           skip the role assignment completely, in which case the
           "role" will default to "extractor_text".

      * Explicit capability declaration
           Model vision capabilities are automatically detected using
           "litellm.supports_vision()". If this function does not
           correctly identify your model's capabilities, ContextGem
           will typically issue a warning, and you can explicitly
           declare the capability by setting "_supports_vision=True"
           on the LLM instance.
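
   The role mechanism described in the note can be pictured as a
   lookup from role name to model. The sketch below is illustrative
   only: "pick_llm", the task names, and the
   "example/reasoning-model" identifier are hypothetical, and the
   framework's own dispatch logic is not shown in this section.

```python
from __future__ import annotations

DEFAULT_ROLE = "extractor_text"

# Hypothetical group: one model identifier per role.
llms_by_role = {
    "extractor_text": "openai/gpt-4o-mini",
    "reasoner_text": "example/reasoning-model",
}


def pick_llm(group: dict[str, str], llm_role: str | None = None) -> str:
    """Return the model assigned to `llm_role`, or to the default role if unset."""
    role = llm_role or DEFAULT_ROLE
    if role not in group:
        raise ValueError(f"No LLM with role {role!r} exists in the group")
    return group[role]


# A task without an explicit role falls back to the default role.
tasks = [("contract_parties", None), ("risk_assessment", "reasoner_text")]
routing = {name: pick_llm(llms_by_role, role) for name, role in tasks}
```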

   Example:
      LLM definition

         from contextgem import DocumentLLM, LLMPricing


         # Create a single LLM for text extraction
         text_extractor = DocumentLLM(
             model="openai/gpt-4o-mini",
             api_key="your-api-key",  # Replace with your actual API key
             role="extractor_text",  # Role for text extraction
             pricing_details=LLMPricing(  # optional
                 input_per_1m_tokens=0.150, output_per_1m_tokens=0.600
             ),
             # or set `auto_pricing=True` to calculate costs automatically using genai-prices
         )

         # Create a fallback LLM in case the primary model fails
         fallback_text_extractor = DocumentLLM(
             model="anthropic/claude-3-7-sonnet",
             api_key="your-anthropic-api-key",  # Replace with your actual API key
             role="extractor_text",  # must be the same as the role of the primary LLM
             is_fallback=True,
             pricing_details=LLMPricing(  # optional
                 input_per_1m_tokens=3.00, output_per_1m_tokens=15.00
             ),
             # or set `auto_pricing=True` to calculate costs automatically using genai-prices
         )
         # Assign the fallback LLM to the primary LLM
         text_extractor.fallback_llm = fallback_text_extractor

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises "pydantic_core.ValidationError" if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   _eq_deserialized_llm_config(other)

      Custom config equality method to compare this _DocumentLLM with
      a deserialized instance.

      Compares the __dict__ of both instances and performs specific
      checks for certain attributes that require special handling.

      Note that, by default, the reconstructed deserialized
      _DocumentLLM will be only partially equal (==) to the original
      one, as the api credentials are redacted, and the attached
      prompt templates, async limiter, and async lock are not
      serialized and point to different objects in memory post-
      initialization. Also, usage and cost are reset by default pre-
      serialization.

      Parameters:
         **other** (*_DocumentLLM*) -- Another _DocumentLLM instance
         to compare with

      Returns:
         True if the instances are equal, False otherwise

      Return type:
         bool

   _update_default_prompt(prompt_path, prompt_type)

      For advanced users only!

      Update the default Jinja2 prompt template for the LLM.

      This method allows you to replace the built-in prompt templates
      with custom ones for specific extraction types. The framework
      uses these templates to guide the LLM in extracting structured
      information from documents.

      The custom prompt must be a valid Jinja2 template and include
      all the necessary variables that are present in the default
      prompt. Otherwise, the extraction may fail. Default prompts are
      located under "contextgem/internal/prompts/".

      IMPORTANT NOTES:

      The default prompts are complex and specifically designed for
      various steps of LLM extraction with the framework. Such prompts
      include the necessary instructions, template variables, nested
      structures and loops, etc.

      Only use custom prompts if you MUST have a deeper customization
      and adaptation of the default prompts to your specific use case.
      Otherwise, the default prompts should be sufficient for most use
      cases.

      Use at your own risk!

      Parameters:
         * **prompt_path** (*str** | **Path*) -- Path to the Jinja2
           template file (.j2 extension required)

         * **prompt_type** (*DefaultPromptType*) -- Type of prompt to
           update ("aspect" or "concept")

      Returns:
         None

      Return type:
         None

   property async_limiter: AsyncLimiter

      Gets the async rate limiter for this LLM.

      Returns:
         The AsyncLimiter instance controlling request rate limits.

      Return type:
         AsyncLimiter
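
      The idea of "3 acquisitions per 10-second period" can be
      illustrated with a simplified sliding-window limiter. This is a
      sketch, not the algorithm used by aiolimiter, and
      "WindowLimiter" is a hypothetical class; a fake clock is
      injected so the example runs instantly.

```python
import time
from collections import deque
from typing import Callable


class WindowLimiter:
    """Allow at most `max_rate` acquisitions per `time_period` seconds (sketch)."""

    def __init__(
        self,
        max_rate: int,
        time_period: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_rate = max_rate
        self.time_period = time_period
        self._clock = clock
        self._stamps = deque()  # timestamps of recent acquisitions

    def try_acquire(self) -> bool:
        now = self._clock()
        # Forget acquisitions that have left the window.
        while self._stamps and now - self._stamps[0] >= self.time_period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_rate:
            self._stamps.append(now)
            return True
        return False


fake_now = [0.0]  # mutable cell acting as a controllable clock
limiter = WindowLimiter(max_rate=3, time_period=10.0, clock=lambda: fake_now[0])
first_four = [limiter.try_acquire() for _ in range(4)]  # fourth request is refused
fake_now[0] = 10.0  # the window has passed
after_window = limiter.try_acquire()
```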

   chat(prompt, *, images=None, chat_session=None)

      Synchronously sends a prompt to the LLM and gets a response. For
      models supporting vision, attach images to the prompt if needed.

      This method allows direct interaction with the LLM by submitting
      your own prompt.

      Parameters:
         * **prompt** (*str*) -- The input prompt to send to the LLM

         * **images** (*list**[**Image**] **| **None*) -- Optional
           list of Image instances for vision queries

         * **chat_session** (*_ChatSession** | **None*) -- Optional
           stateful chat session to preserve and use history.

      Returns:
         The LLM's response

      Return type:
         str

      Raises:
         * **ValueError** -- If the prompt is empty or not a string

         * **ValueError** -- If images parameter is not a list of
           Image instances

         * **ValueError** -- If images are provided but the model
           doesn't support vision

         * **RuntimeError** -- If the LLM call fails and no fallback
           is available
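
      The validation and fallback behaviour listed under "Raises" can
      be sketched as follows. "chat_with_fallback" and the two
      stand-in callables are hypothetical; the sketch assumes any
      exception from the primary call triggers the fallback, which
      may be broader than what ContextGem does.

```python
from __future__ import annotations

from typing import Callable


def chat_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str] | None = None,
) -> str:
    """Try `primary`; on failure use `fallback`, or raise if there is none."""
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    try:
        return primary(prompt)
    except Exception as exc:
        if fallback is None:
            raise RuntimeError("LLM call failed and no fallback is available") from exc
        return fallback(prompt)


def broken_llm(prompt: str) -> str:
    raise ConnectionError("provider unavailable")


def backup_llm(prompt: str) -> str:
    return f"backup answer to: {prompt}"


answer = chat_with_fallback("Summarize the contract.", broken_llm, backup_llm)
```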

   async chat_async(prompt, *, images=None, chat_session=None)

      Asynchronously sends a prompt to the LLM and gets a response.
      For models supporting vision, attach images to the prompt if
      needed.

      This method allows direct interaction with the LLM by submitting
      your own prompt.

      Parameters:
         * **prompt** (*str*) -- The input prompt to send to the LLM

         * **images** (*list**[**Image**] **| **None*) -- Optional
           list of Image instances for vision queries

         * **chat_session** (*_ChatSession** | **None*) -- Optional
           stateful chat session to preserve and use history.

      Returns:
         The LLM's response

      Return type:
         str

      Raises:
         * **ValueError** -- If the prompt is empty or not a string

         * **ValueError** -- If images parameter is not a list of
           Image instances

         * **ValueError** -- If images are provided but the model
           doesn't support vision

         * **RuntimeError** -- If the LLM call fails and no fallback
           is available

   extract_all(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts all aspects and concepts from a document and its
      aspects.

      This method performs comprehensive extraction by processing the
      document for aspects and concepts, then extracting concepts from
      each aspect. The operation can be configured for concurrent
      processing and customized extraction parameters.

      This is the synchronous version of *extract_all_async()*.

      Parameters:
         * **document** (*_Document*) -- The document to analyze.

         * **overwrite_existing** (*bool**, **optional*) -- Whether to
           overwrite already processed aspects and concepts with newly
           extracted information. Defaults to False.

         * **max_items_per_call** (*int**, **optional*) -- Maximum
           number of items with the same extraction params to process
           in each LLM call. Defaults to 0 (all items in one call). If
           concurrency is enabled, defaults to 1. For complex tasks,
           you should not set a high value, in order to avoid prompt
           overloading.

         * **use_concurrency** (*bool**, **optional*) -- If True,
           enables concurrent processing of multiple items.
           Concurrency can considerably reduce processing time, but
           may cause rate limit errors with LLM providers. Use this
           option when API rate limits allow for multiple concurrent
           requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int**,
           **optional*) -- Maximum paragraphs to include in a single
           LLM prompt. Defaults to 0 (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images). Relevant only for document-level
           concepts.

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         The document with extracted aspects and concepts.

      Return type:
         _Document

   async extract_all_async(document, *, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Asynchronously extracts all aspects and concepts from a document
      and its aspects.

      This method performs comprehensive extraction by processing the
      document for aspects and concepts, then extracting concepts from
      each aspect. The operation can be configured for concurrent
      processing and customized extraction parameters.

      Parameters:
         * **document** (*_Document*) -- The document to analyze.

         * **overwrite_existing** (*bool**, **optional*) -- Whether to
           overwrite already processed aspects and concepts with newly
           extracted information. Defaults to False.

         * **max_items_per_call** (*int**, **optional*) -- Maximum
           number of items with the same extraction params to process
           in each LLM call. Defaults to 0 (all items in one call). If
           concurrency is enabled, defaults to 1. For complex tasks,
           you should not set a high value, in order to avoid prompt
           overloading.

         * **use_concurrency** (*bool**, **optional*) -- If True,
           enables concurrent processing of multiple items.
           Concurrency can considerably reduce processing time, but
           may cause rate limit errors with LLM providers. Use this
           option when API rate limits allow for multiple concurrent
           requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int**,
           **optional*) -- Maximum paragraphs to include in a single
           LLM prompt. Defaults to 0 (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images). Relevant only for document-level
           concepts.

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         The document with extracted aspects and concepts.

      Return type:
         _Document
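
The batching rule behind "max_items_per_call" can be pictured with a small sketch. The helper "split_into_calls" below is hypothetical (it is not part of ContextGem) and only illustrates the documented semantics: 0 places all items in one call, and a positive value caps the number of items per call.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_into_calls(items: Sequence[T], max_items_per_call: int = 0) -> list[list[T]]:
    """Group items into per-call batches (illustrative helper, not a ContextGem API)."""
    if max_items_per_call < 0:
        raise ValueError("max_items_per_call must be >= 0")
    if not items:
        return []
    if max_items_per_call == 0:
        # 0 means "no limit": every item is sent in a single call.
        return [list(items)]
    return [
        list(items[i : i + max_items_per_call])
        for i in range(0, len(items), max_items_per_call)
    ]


concepts = ["title", "parties", "term", "governing_law", "liability_cap"]
print(split_into_calls(concepts))     # one batch holding all five items
print(split_into_calls(concepts, 2))  # batches of at most two items
```

Smaller batches mean more LLM calls but shorter prompts, which is the trade-off the parameter description points at for complex tasks.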

   extract_aspects_from_document(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts aspects from the provided document using predefined
      LLMs.

      If an aspect instance has "extracted_items" populated, the
      "reference_paragraphs" field will be automatically populated
      from these items.

      This is the synchronous version of
      *extract_aspects_from_document_async()*.

      Parameters:
         * **document** (*_Document*) -- The document from which
           aspects are to be extracted.

         * **from_aspects** (*list**[**_Aspect**] **| **None*) --
           Existing aspects to use as a base for extraction. If None,
           uses all of the document's aspects.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed aspects with newly extracted information.
           Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum items with the
           same extraction params to process per LLM call. Defaults to
           0 (all items in a single call). For complex tasks, you
           should not set a high value, to avoid prompt overloading.
           If concurrency is enabled, defaults to 1 (each item
           processed separately).

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed _Aspect objects with extracted items.

      Return type:
         list[_Aspect]
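
Synchronous and asynchronous method pairs like this one are commonly related by running the coroutine to completion in an event loop. The sketch below shows that general pattern with hypothetical stand-in functions ("extract_async", "extract"); it is an assumption, not a confirmed description of how ContextGem implements its synchronous methods.

```python
import asyncio


async def extract_async(text: str) -> list[str]:
    """Stand-in coroutine; a real implementation would await an LLM API call."""
    await asyncio.sleep(0)  # yield control once, as a network call would
    return [word for word in text.split() if word.istitle()]


def extract(text: str) -> list[str]:
    """Synchronous wrapper: runs the coroutine to completion in a new event loop."""
    return asyncio.run(extract_async(text))


print(extract("Agreement between Acme and Globex"))
```

Inside an already running event loop (for example in a notebook or an async web handler), "asyncio.run()" raises a RuntimeError, so awaiting the "_async" variant directly is the usual choice there.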

   async extract_aspects_from_document_async(document, *, from_aspects=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts aspects from the provided document using predefined
      LLMs asynchronously.

      If an aspect instance has "extracted_items" populated, the
      "reference_paragraphs" field will be automatically populated
      from these items.

      Parameters:
         * **document** (*_Document*) -- The document from which
           aspects are to be extracted.

         * **from_aspects** (*list**[**_Aspect**] **| **None*) --
           Existing aspects to use as a base for extraction. If None,
           uses all of the document's aspects.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed aspects with newly extracted information.
           Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process per LLM call.
           Defaults to 0 (all items in one call). If concurrency is
           enabled, defaults to 1. For complex tasks, you should not
           set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed _Aspect objects with extracted items.

      Return type:
         list[_Aspect]
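
The trade-off described for "use_concurrency" (shorter wall-clock time versus pressure on provider rate limits) can be sketched with plain asyncio. "fake_llm_call", "process_all" and "max_concurrent" are hypothetical names, and capping in-flight requests with a semaphore is one common way to stay under a rate limit, not a description of ContextGem internals.

```python
import asyncio


async def fake_llm_call(item: str, limiter: asyncio.Semaphore) -> str:
    """Stand-in for one LLM request; the semaphore caps in-flight requests."""
    async with limiter:
        await asyncio.sleep(0.01)  # simulated network latency
        return f"processed:{item}"


async def process_all(items: list[str], max_concurrent: int = 2) -> list[str]:
    limiter = asyncio.Semaphore(max_concurrent)
    # gather() preserves input order even though the calls overlap in time.
    return await asyncio.gather(*(fake_llm_call(i, limiter) for i in items))


results = asyncio.run(process_all(["aspect_a", "aspect_b", "aspect_c"]))
print(results)
```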

   extract_concepts_from_aspect(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts concepts associated with a given aspect in a document.

      This method processes an aspect to extract related concepts
      using LLMs. If the aspect has not been previously processed, a
      ValueError is raised.

      This is the synchronous version of
      *extract_concepts_from_aspect_async()*.

      Parameters:
         * **aspect** (*_Aspect*) -- The aspect from which to extract
           concepts.

         * **document** (*_Document*) -- The document that contains
           the aspect.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           List of existing concepts to process. Defaults to None.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process in each LLM
           call. Defaults to 0 (all items in one call). If concurrency
           is enabled, defaults to 1. For complex tasks, you should
           not set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to include in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed concept objects.

      Return type:
         list[_Concept]

   async extract_concepts_from_aspect_async(aspect, document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Asynchronously extracts concepts from a specified aspect using
      LLMs.

      This method processes an aspect to extract related concepts
      using LLMs. If the aspect has not been previously processed, a
      ValueError is raised.

      Parameters:
         * **aspect** (*_Aspect*) -- The aspect from which to extract
           concepts.

         * **document** (*_Document*) -- The document that contains
           the aspect.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           List of existing concepts to process. Defaults to None.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process in each LLM
           call. Defaults to 0 (all items in one call). If concurrency
           is enabled, defaults to 1. For complex tasks, you should
           not set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to include in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed concept objects.

      Return type:
         list[_Concept]
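
The two behaviours of "raise_exception_on_extraction_error" can be sketched as follows. "run_extraction" and "ExtractionError" are hypothetical stand-ins, and surfacing the warning through "warnings.warn" is an assumption made for this illustration; the library may report warnings through a different channel.

```python
import warnings


class ExtractionError(RuntimeError):
    """Hypothetical error type for invalid LLM output or an API failure."""


def run_extraction(payload, raise_exception_on_extraction_error=True):
    """Return extracted items, or handle a failure according to the flag."""
    try:
        if not isinstance(payload, list):
            raise ExtractionError("LLM returned data in an unexpected format")
        return payload
    except ExtractionError as exc:
        if raise_exception_on_extraction_error:
            raise
        warnings.warn(f"Extraction failed: {exc}")
        return []  # in warning mode, no extracted items are returned


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    items = run_extraction("not a list", raise_exception_on_extraction_error=False)

print(items, len(caught))
```

With the flag left at its default of True, the same failing input propagates the exception to the caller instead.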

   extract_concepts_from_document(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts concepts from the provided document using predefined
      LLMs.

      This is the synchronous version of
      *extract_concepts_from_document_async()*.

      Parameters:
         * **document** (*_Document*) -- The document from which
           concepts are to be extracted.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           Existing concepts to use as a base for extraction. If None,
           uses all of the document's concepts.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum items with the
           same extraction params to process per LLM call. Defaults to
           0 (all items in a single call). For complex tasks, you
           should not set a high value, to avoid prompt overloading.
           If concurrency is enabled, defaults to 1 (each item
           processed separately).

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed Concept objects with extracted items.

      Return type:
         list[_Concept]

   async extract_concepts_from_document_async(document, *, from_concepts=None, overwrite_existing=False, max_items_per_call=0, use_concurrency=False, max_paragraphs_to_analyze_per_call=0, max_images_to_analyze_per_call=0, raise_exception_on_extraction_error=True)

      Extracts concepts from the provided document using predefined
      LLMs asynchronously.

      This method processes a document to extract concepts using
      configured LLMs.

      Parameters:
         * **document** (*_Document*) -- The document from which
           concepts are to be extracted.

         * **from_concepts** (*list**[**_Concept**] **| **None*) --
           Existing concepts to use as a base for extraction. If None,
           uses all of the document's concepts.

         * **overwrite_existing** (*bool*) -- Whether to overwrite
           already processed concepts with newly extracted
           information. Defaults to False.

         * **max_items_per_call** (*int*) -- Maximum number of items
           with the same extraction params to process per LLM call.
           Defaults to 0 (all items in one call). If concurrency is
           enabled, defaults to 1. For complex tasks, you should not
           set a high value, in order to avoid prompt overloading.

         * **use_concurrency** (*bool*) -- If True, enables concurrent
           processing of multiple items. Concurrency can considerably
           reduce processing time, but may cause rate limit errors
           with LLM providers. Use this option when API rate limits
           allow for multiple concurrent requests. Defaults to False.

         * **max_paragraphs_to_analyze_per_call** (*int*) -- Maximum
           paragraphs to analyze in a single LLM prompt. Defaults to 0
           (all paragraphs).

         * **max_images_to_analyze_per_call** (*int**, **optional*) --
           Maximum images to include in a single LLM prompt. Defaults
           to 0 (all images).

         * **raise_exception_on_extraction_error** (*bool**,
           **optional*) -- Whether to raise an exception if the
           extraction fails due to invalid data returned by an LLM or
           an error in the LLM API. If False, a warning will be issued
           instead, and no extracted items will be returned. Defaults
           to True.

      Returns:
         List of processed Concept objects with extracted items.

      Return type:
         list[_Concept]

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.
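
The class-name check described for *from_json* can be illustrated with a self-contained sketch. The "Pricing" class and the "class"/"data" layout of the serialized payload are assumptions made for this example; the actual serialized format used by ContextGem is not shown in this reference.

```python
import json


class Pricing:
    """Hypothetical serializable class used only for this illustration."""

    def __init__(self, input_price: float, output_price: float):
        self.input_price = input_price
        self.output_price = output_price

    def to_json(self) -> str:
        # Store the class name next to the data so it can be checked on load.
        return json.dumps({"class": type(self).__name__, "data": vars(self)})

    @classmethod
    def from_json(cls, json_string: str) -> "Pricing":
        obj = json.loads(json_string)
        if obj.get("class") != cls.__name__:
            raise TypeError(f"Expected {cls.__name__}, got {obj.get('class')}")
        return cls(**obj["data"])


restored = Pricing.from_json(Pricing(1.10, 4.40).to_json())
print(restored.input_price, restored.output_price)
```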

   get_cost()

      Retrieves the accumulated cost information of the LLM and its
      fallback LLM if configured.

      This method collects cost statistics for the current LLM
      instance and its fallback LLM (if configured), providing
      insights into API usage expenses.

      Returns:
         A list of cost statistics containers for the LLM and its
         fallback.

      Return type:
         list[_LLMCostOutputContainer]

   get_usage()

      Retrieves the usage information of the LLM and its fallback LLM
      if configured.

      This method collects token usage statistics for the current LLM
      instance and its fallback LLM (if configured), providing
      insights into API consumption.

      Returns:
         A list of usage statistics containers for the LLM and its
         fallback.

      Return type:
         list[_LLMUsageOutputContainer]

   property is_group: bool

      Returns False indicating this is a single LLM, not a group.

      Returns:
         Always False for DocumentLLM instances.

      Return type:
         bool

   property list_roles: list[Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision', 'extractor_multimodal', 'reasoner_multimodal']]

      Returns a list containing the role of this LLM.

      (For a single LLM, this returns a list with just one element:
      the LLM's role. For LLM groups, the property returns the roles
      of all LLMs in the group.)

      Returns:
         A list containing the role of this LLM.

      Return type:
         list[LLMRoleAny]

   reset_usage_and_cost()

      Resets the usage and cost statistics for the LLM and its
      fallback LLM (if configured).

      This method clears accumulated usage and cost data, which is
      useful when processing multiple documents sequentially and
      tracking metrics for each document separately.

      Returns:
         None

      Return type:
         None
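
The per-document tracking pattern mentioned above (read the statistics, then reset before the next document) can be sketched with a stand-in object. "FakeLLM" and its "process" method are hypothetical, its "get_usage" returns a plain dict rather than the usage containers documented here, and whitespace-separated words stand in for real token counts.

```python
class FakeLLM:
    """Hypothetical stand-in exposing only the counters discussed here."""

    def __init__(self) -> None:
        self._input_tokens = 0

    def process(self, text: str) -> None:
        # Whitespace-separated words stand in for real token counts.
        self._input_tokens += len(text.split())

    def get_usage(self) -> dict[str, int]:
        return {"input_tokens": self._input_tokens}

    def reset_usage_and_cost(self) -> None:
        self._input_tokens = 0


llm = FakeLLM()
documents = {"nda.txt": "This agreement is confidential", "memo.txt": "Short memo"}
usage_per_document = {}

for name, text in documents.items():
    llm.process(text)
    usage_per_document[name] = llm.get_usage()
    llm.reset_usage_and_cost()  # start the next document from zero

print(usage_per_document)
```

Without the reset call, the figures for the second document would also include the usage accumulated for the first one.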

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.
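
The path rule shared by *to_disk* and *from_disk* (the path must end with '.json', and the content is formatted UTF-8 JSON) can be sketched with the standard library. "save_json" and "load_json" are hypothetical helpers, and the dictionary contents are illustrative values rather than a real serialized instance.

```python
import json
import tempfile
from pathlib import Path


def save_json(data: dict, file_path) -> None:
    """Write data as formatted UTF-8 JSON; the path must end with '.json'."""
    path = Path(file_path)
    if path.suffix != ".json":
        raise ValueError("file_path must end with '.json'")
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def load_json(file_path) -> dict:
    """Read a JSON file back into a dictionary, with the same path check."""
    path = Path(file_path)
    if path.suffix != ".json":
        raise ValueError("file_path must end with '.json'")
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "llm_config.json"  # accepts str or Path
    save_json({"model": "openai/o3-mini", "timeout": 120}, target)
    loaded = load_json(target)

print(loaded)
```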

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   model: NonEmptyStr

   deployment_id: NonEmptyStr | None

   api_key: NonEmptyStr | None

   api_base: NonEmptyStr | None

   api_version: NonEmptyStr | None

   role: LLMRoleAny

   system_message: str | None

   max_tokens: StrictInt

   max_completion_tokens: StrictInt

   reasoning_effort: ReasoningEffort | None

   num_retries_failed_request: StrictInt

   max_retries_failed_request: StrictInt

   max_retries_invalid_data: StrictInt

   timeout: StrictInt

   pricing_details: _LLMPricing | None

   is_fallback: StrictBool

   fallback_llm: _DocumentLLM | None

   output_language: LanguageRequirement

   temperature: StrictFloat | None

   top_p: StrictFloat | None

   seed: StrictInt | None

   tools: list[JSONDictField] | None

   tool_choice: str | JSONDictField | None

   parallel_tool_calls: bool | None

   tool_max_rounds: StrictInt

   auto_pricing: StrictBool

   auto_pricing_refresh: StrictBool

class contextgem.public.llms.ChatSession(**data)

   Bases: "_ChatSession"

   Stateful chat session that preserves message history across turns.

   To be used as "chat_session=..." parameter for
   "DocumentLLM.chat(...)" or "DocumentLLM.chat_async(...)".

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   Parameters:
      **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
      **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
      **json_schema_input_type=PydanticUndefined**)**]*)

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   property messages: list[_Message]

      Returns the list of messages in the session.

      Returns:
         The list of messages in the session.

      Return type:
         list[_Message]

   reset()

      Clears conversation history by removing all messages.

      Returns:
         None

      Return type:
         None

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   custom_data: JSONDictField


# ==== api/data_models ====

Data models
***********

Module defining public data validation models.

class contextgem.public.data_models.LLMPricing(**data)

   Bases: "_LLMPricing"

   Represents the pricing details for an LLM.

   Defines the cost structure for processing input tokens and
   generating output tokens, with prices specified per million tokens.

   Variables:
      * **input_per_1m_tokens** (*StrictFloat*) -- The cost in
        currency units for processing 1M input tokens.

      * **output_per_1m_tokens** (*StrictFloat*) -- The cost in
        currency units for generating 1M output tokens.

   Parameters:
      * **input_per_1m_tokens** (*Annotated**[**float**,
        **Strict**(**strict=True**)**]*)

      * **output_per_1m_tokens** (*Annotated**[**float**,
        **Strict**(**strict=True**)**]*)

   Example:
      LLM pricing definition

         from contextgem import LLMPricing


         # Create a pricing model for an LLM (openai/o3-mini example)
         pricing = LLMPricing(
             input_per_1m_tokens=1.10,  # $1.10 per million input tokens
             output_per_1m_tokens=4.40,  # $4.40 per million output tokens
         )

         # LLMPricing objects are immutable
         try:
             pricing.input_per_1m_tokens = 0.7
         except ValueError as e:
             print(f"Error when trying to modify pricing: {e}")
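
Because prices are quoted per million tokens, the cost of a request is each token count divided by 1,000,000 and multiplied by the matching price. The function "estimate_cost" below is a hypothetical helper written for this arithmetic only; it is not a ContextGem API, and the token counts are made-up inputs.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_1m_tokens: float,
    output_per_1m_tokens: float,
) -> float:
    """Cost in currency units for one request, with prices quoted per 1M tokens."""
    input_cost = input_tokens / 1_000_000 * input_per_1m_tokens
    output_cost = output_tokens / 1_000_000 * output_per_1m_tokens
    return input_cost + output_cost


# 200,000 input tokens and 50,000 output tokens at $1.10 and $4.40 per 1M tokens:
# 0.2 * 1.10 + 0.05 * 4.40 = 0.22 + 0.22 = 0.44
cost = estimate_cost(200_000, 50_000, 1.10, 4.40)
print(round(cost, 2))
```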

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.
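
As a rough illustration of the behavior documented for "to_disk()" and "from_disk()", the sketch below reproduces the ".json" path check and the error mapping with the standard library only. The helper names "save_json" and "load_json" are hypothetical, and the exact conditions under which the framework raises "RuntimeError" are an assumption.

```python
import json
import tempfile
from pathlib import Path


def save_json(data: dict, file_path) -> None:
    """Write a dictionary as formatted UTF-8 JSON (illustrative helper)."""
    path = Path(file_path)
    if not str(path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        path.write_text(json.dumps(data, indent=4), encoding="utf-8")
    except OSError as exc:
        raise RuntimeError(f"Could not write {path}") from exc


def load_json(file_path) -> dict:
    """Read a JSON file back into a dictionary (illustrative helper)."""
    path = Path(file_path)
    if not str(path).endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"Could not load {path}") from exc


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "scale.json"
    save_json({"start": 1, "end": 5}, target)
    loaded = load_json(target)

    # A path with the wrong extension is rejected before any I/O happens.
    try:
        save_json({"start": 1, "end": 5}, Path(tmp_dir) / "scale.txt")
        suffix_error = None
    except ValueError as exc:
        suffix_error = exc
```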

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   input_per_1m_tokens: StrictFloat

   output_per_1m_tokens: StrictFloat

class contextgem.public.data_models.RatingScale(*, start=0, end=10)

   Bases: "_RatingScale"

   Represents a rating scale with defined minimum and maximum values.

   Deprecated since version 0.10.0: RatingScale is deprecated and will
   be removed in v1.0.0. Use a tuple of (start, end) integers instead,
   e.g. (1, 5) instead of RatingScale(start=1, end=5).

   This class defines a numerical scale for rating concepts, with
   configurable start and end values that determine the valid range
   for ratings.

   Variables:
      * **start** (*StrictInt*) -- The minimum value of the rating
        scale (inclusive). Must be greater than or equal to 0.

      * **end** (*StrictInt*) -- The maximum value of the rating scale
        (inclusive). Must be greater than 0.

   Parameters:
      * **start** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

      * **end** (*Annotated**[**int**,
        **Strict**(**strict=True**)**]*)

   Initialize RatingScale with deprecation warning.
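
When migrating from "RatingScale" to a plain tuple, it may help to check the documented constraints in your own code. The sketch below covers only the two rules stated above (start >= 0, end > 0) plus a strict integer check; "check_rating_scale" is a hypothetical helper, not part of the framework, and any further rules the framework enforces are not modeled here.

```python
def check_rating_scale(scale: tuple[int, int]) -> tuple[int, int]:
    """Validate a (start, end) tuple against the documented constraints."""
    start, end = scale
    for value in (start, end):
        # bool is a subclass of int, so exclude it to mimic strict integers.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("start and end must be integers")
    if start < 0:
        raise ValueError("start must be greater than or equal to 0")
    if end <= 0:
        raise ValueError("end must be greater than 0")
    return start, end


# Before (deprecated): RatingScale(start=1, end=5)
# After: a plain tuple of integers
scale = check_rating_scale((1, 5))
```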

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   start: StrictInt

   end: StrictInt


# ==== api/utils ====

Utility functions and classes
*****************************

Module defining public utility functions and classes of the framework.

contextgem.public.utils.image_to_base64(source)

   Converts an image to its Base64 encoded string representation.

   Helper function that can be used when constructing "Image" objects.

   Parameters:
      **source** (*str** | **Path** | **BinaryIO** | **bytes*) -- The
      image source - can be a file path (str or Path), file-like
      object (BytesIO, file handle, etc.), or raw bytes data.

   Returns:
      A Base64 encoded string representation of the image.

   Return type:
      str

   Raises:
      * **FileNotFoundError** -- If the image file path does not
        exist.

      * **OSError** -- If the image cannot be read.
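
For readers who want to see what the conversion amounts to, here is a minimal standard-library sketch that accepts the same kinds of source (path, bytes, file-like object). The function name "to_base64" is hypothetical and the dispatch logic is an assumption; the framework's own implementation may differ.

```python
import base64
import io
from pathlib import Path
from typing import BinaryIO, Union


def to_base64(source: Union[str, Path, BinaryIO, bytes]) -> str:
    """Encode image data from a path, raw bytes, or a file-like object."""
    if isinstance(source, (str, Path)):
        # Path.read_bytes raises FileNotFoundError for a missing file.
        data = Path(source).read_bytes()
    elif isinstance(source, (bytes, bytearray)):
        data = bytes(source)
    else:
        # Anything else is treated as a binary file-like object.
        data = source.read()
    return base64.b64encode(data).decode("ascii")


raw = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG signature, used as sample data
from_bytes = to_base64(raw)
from_buffer = to_base64(io.BytesIO(raw))
```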

   Example:
      >>> from pathlib import Path
      >>> import io
      >>>
      >>> # From file path
      >>> base64_str = image_to_base64("path/to/image.jpg")
      >>>
      >>> # From file handle
      >>> with open("image.png", "rb") as f:
      ...     base64_str = image_to_base64(f)
      >>>
      >>> # From bytes data
      >>> with open("image.webp", "rb") as f:
      ...     image_bytes = f.read()
      >>> base64_str = image_to_base64(image_bytes)
      >>>
      >>> # From BytesIO
      >>> buffer = io.BytesIO(image_bytes)
      >>> base64_str = image_to_base64(buffer)

contextgem.public.utils.create_image(source)

   Creates an Image instance from various image sources.

   This function automatically determines the MIME type and converts
   the image to base64 format using Pillow functionality. It supports
   common image formats including JPEG, PNG, and WebP.

   Parameters:
      **source** (*str** | **Path** | **PILImage.Image** |
      **BinaryIO** | **bytes*) -- The image source - can be a file
      path (str or Path), PIL Image object, file-like object (BytesIO,
      file handle, etc.), or raw bytes data.

   Returns:
      An Image instance with the appropriate MIME type and base64
      data.

   Return type:
      Image

   Raises:
      * **ValueError** -- If the image format is not supported or
        cannot be determined.

      * **FileNotFoundError** -- If the image file path does not
        exist.

      * **OSError** -- If the image cannot be opened or processed.
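
The documentation states that the MIME type is determined with Pillow. As a dependency-free illustration of the same idea, the sketch below swaps in a different technique: it inspects the leading "magic bytes" of the data. The function name "detect_mime_type" is hypothetical, and this is not how the framework does it.

```python
def detect_mime_type(data: bytes) -> str:
    """Guess the MIME type of JPEG, PNG, or WebP data from its signature."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("Unsupported or unrecognized image format")


png_type = detect_mime_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
jpeg_type = detect_mime_type(b"\xff\xd8\xff\xe0" + b"\x00" * 8)
webp_type = detect_mime_type(b"RIFF" + b"\x00" * 4 + b"WEBP")
```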

   Example:
      >>> from pathlib import Path
      >>> from PIL import Image as PILImage
      >>> import io
      >>>
      >>> # From file path
      >>> img = create_image("path/to/image.jpg")
      >>>
      >>> # From PIL Image object
      >>> pil_img = PILImage.open("path/to/image.png")
      >>> img = create_image(pil_img)
      >>>
      >>> # From file-like object
      >>> with open("image.jpg", "rb") as f:
      ...     img = create_image(f)
      >>>
      >>> # From bytes data
      >>> with open("image.png", "rb") as f:
      ...     image_bytes = f.read()
      >>> img = create_image(image_bytes)
      >>>
      >>> # From BytesIO
      >>> buffer = io.BytesIO(image_bytes)
      >>> img = create_image(buffer)

contextgem.public.utils.reload_logger_settings()

   Reloads logger settings from environment variables.

   This function should be called when environment variables related
   to logging have been changed after the module was imported. It
   re-reads the environment variables and reconfigures the logger
   accordingly.

   Returns:
      None

   Example:
      Reload logger settings

         import os

         from contextgem import reload_logger_settings


         # Initial logger settings are loaded from environment variables at import time

         # Change logger level to WARNING
         os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "WARNING"
         print("Setting logger level to WARNING")
         reload_logger_settings()
         # Now the logger will only show WARNING level and above messages

         # Disable the logger completely
         os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "OFF"
         print("Disabling the logger")
         reload_logger_settings()
         # Now the logger is disabled and won't show any messages

         # You can re-enable the logger by setting it back to a valid level
         # os.environ["CONTEXTGEM_LOGGER_LEVEL"] = "INFO"
         # reload_logger_settings()
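
The "read the environment at import time, re-read on demand" pattern can be sketched with the standard "logging" module. Everything here is an assumption made for illustration: the logger name "demo_app", the variable "DEMO_LOGGER_LEVEL", and the function "reload_demo_logger" are hypothetical, and the framework's logger may be built on a different logging library.

```python
import logging
import os

logger = logging.getLogger("demo_app")


def reload_demo_logger() -> None:
    """Re-read the level from the environment and reconfigure the logger."""
    level_name = os.environ.get("DEMO_LOGGER_LEVEL", "INFO").upper()
    if level_name == "OFF":
        # A disabled logger drops every message regardless of level.
        logger.disabled = True
        return
    logger.disabled = False
    logger.setLevel(level_name)  # setLevel accepts standard level names


os.environ["DEMO_LOGGER_LEVEL"] = "WARNING"
reload_demo_logger()
level_after_warning = logger.level

os.environ["DEMO_LOGGER_LEVEL"] = "OFF"
reload_demo_logger()
disabled_after_off = logger.disabled
```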

class contextgem.public.utils.JsonObjectClassStruct(*args, **kwargs)

   Bases: "_JsonObjectClassStruct"

   A base class that automatically converts class hierarchies to
   dictionary representations.

   This class enables the use of existing class hierarchies (such as
   dataclasses or Pydantic models) with nested type hints as a
   structure definition for JsonObjectConcept. When you need to use
   typed class hierarchies with JsonObjectConcept, inherit from this
   class in all parts of your class structure.

   Example:
      Using JsonObjectClassStruct for class hierarchies

         from dataclasses import dataclass

         from contextgem import JsonObjectClassStruct, JsonObjectConcept


         @dataclass
         class Address(JsonObjectClassStruct):
             street: str
             city: str
             country: str


         @dataclass
         class Contact(JsonObjectClassStruct):
             email: str
             phone: str
             address: Address


         @dataclass
         class Person(JsonObjectClassStruct):
             name: str
             age: int
             contact: Contact


         # Use the class structure with JsonObjectConcept
         # JsonObjectClassStruct enables automatic conversion of typed class hierarchies
         # into the dictionary structure required by JsonObjectConcept, preserving the
         # type information and nested relationships between classes.
         JsonObjectConcept(name="person", description="Person information", structure=Person)
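
To make the idea of "converting a class hierarchy to a dictionary structure" concrete, the sketch below walks the type hints of nested dataclasses and builds a nested dictionary. It uses the standard library only; "struct_to_dict" is a hypothetical helper, and the dictionary format that "JsonObjectConcept" actually expects may differ from this output.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints


def struct_to_dict(cls) -> dict:
    """Map each field name to its type, recursing into nested dataclasses."""
    hints = get_type_hints(cls)
    result = {}
    for field in fields(cls):
        hint = hints[field.name]
        result[field.name] = struct_to_dict(hint) if is_dataclass(hint) else hint
    return result


@dataclass
class Address:
    street: str
    city: str


@dataclass
class Person:
    name: str
    age: int
    address: Address


structure = struct_to_dict(Person)
# structure == {"name": str, "age": int, "address": {"street": str, "city": str}}
```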

   Replacement for "__new__" that blocks direct instantiation of the
   decorated class while allowing subclasses to instantiate normally.

   If invoked for the exact decorated class, an error is logged and
   "TypeError" is raised. For subclasses, the call is forwarded to the
   next "__new__" in the MRO, preserving base-class behavior (e.g.,
   Pydantic's "BaseModel.__new__").

   Parameters:
      * **inner_cls** -- The class being instantiated (decorated class
        or its subclass).

      * **args** -- Positional constructor arguments.

      * **kwargs** -- Keyword constructor arguments.

   Returns:
      A new instance when called for a subclass.

   Raises:
      **TypeError** -- When attempting to instantiate the decorated
      class directly.

   Return type:
      Any
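
The instantiation guard described above can be approximated in a few lines. This is a simplified sketch that overrides "__new__" directly rather than through a decorator; the class names "BaseStruct" and "Point" are hypothetical.

```python
class BaseStruct:
    """Base class that may be subclassed but not instantiated directly."""

    def __new__(cls, *args, **kwargs):
        if cls is BaseStruct:
            raise TypeError("BaseStruct cannot be instantiated directly; subclass it")
        # Forward to the next __new__ in the MRO for subclasses.
        return super().__new__(cls)


class Point(BaseStruct):
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y


point = Point(1, 2)  # subclasses instantiate normally

try:
    BaseStruct()
    direct_error = None
except TypeError as exc:
    direct_error = exc
```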


# ==== api/images ====

Images
******

Module for handling document images.

This module provides the Image class, which represents visual content
that can be attached to or fully represent a document. Images are
stored in base64-encoded format with specified MIME types to ensure
proper handling.

class contextgem.public.images.Image(**data)

   Bases: "_Image"

   Represents an image with specified MIME type and base64-encoded
   data. An image is typically attached to a document, or fully
   represents a document.

   The utility function "create_image()" from "contextgem.public.utils"
   can be used to create an Image instance from various sources: file
   paths, PIL Image objects, file-like objects, or raw bytes data.

   Variables:
      * **mime_type** (*Literal**[**"image/jpg"**, **"image/jpeg"**,
        **"image/png"**, **"image/webp"**]*) -- The MIME type of the
        image. This must be one of the predefined valid types
        ("image/jpg", "image/jpeg", "image/png", "image/webp").

      * **base64_data** (*str*) -- The base64-encoded data of the
        image. The util function "image_to_base64()" from
        "contextgem.public.utils" can be used to encode images to
        base64.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **mime_type** (*Literal**[**'image/jpg'**, **'image/jpeg'**,
        **'image/png'**, **'image/webp'**]*)

      * **base64_data** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

   Note:
      * Attached to documents:
           An image must be attached to a document. A document can
           have multiple images.

      * Extraction types:
           Only document-level concept extraction is supported for
           images. Use an LLM with the role "extractor_vision",
           "reasoner_vision", "extractor_multimodal", or
           "reasoner_multimodal" to extract concepts from images.

   Example:
      Image definition

         from pathlib import Path

         from contextgem import Document, Image, create_image, image_to_base64


         # Path is adapted for doc tests
         current_file = Path(__file__).resolve()
         root_path = current_file.parents[4]

         # Using the create_image utility function (recommended approach)
         image_path = root_path / "tests" / "images" / "invoices" / "invoice.jpg"
         jpg_image = create_image(
             image_path
         )  # Automatically detects MIME type and converts to base64

         # Using pre-encoded base64 data directly
         png_image = Image(
             mime_type="image/png",
             base64_data="base64-string",  # image as a base64 string
         )

         # Using a different supported image format with create_image
         webp_image = create_image(root_path / "tests" / "images" / "invoices" / "invoice.webp")

         # Alternative: Manual approach using image_to_base64 (when you need specific control)
         manual_image = Image(mime_type="image/jpeg", base64_data=image_to_base64(image_path))

         # Attaching an image to a document
         # Documents can contain both text and multiple images, or just images

         # Create a document with text content
         text_document = Document(
             raw_text="This is a document with an attached image that shows an invoice.",
             images=[jpg_image],
         )

         # Create a document with only image content (no text)
         image_only_document = Document(images=[jpg_image])

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises "pydantic_core.ValidationError" if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.
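
The practical consequence of a deep copy is that nested data in the clone can be changed without affecting the original. The sketch below shows this with "copy.deepcopy" on a plain dictionary; the dictionary contents are sample data, and whether "clone()" is implemented with "copy.deepcopy" is an assumption.

```python
import copy

original = {"mime_type": "image/png", "custom_data": {"tags": ["invoice"]}}

# A deep copy duplicates nested containers instead of sharing them.
cloned = copy.deepcopy(original)
cloned["custom_data"]["tags"].append("scan")

original_tags = original["custom_data"]["tags"]
cloned_tags = cloned["custom_data"]["tags"]
```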

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   mime_type: Literal['image/jpg', 'image/jpeg', 'image/png', 'image/webp']

   base64_data: NonEmptyStr

   custom_data: JSONDictField


# ==== api/paragraphs ====

Paragraphs
**********

Module for handling document paragraphs.

This module provides the Paragraph class, which represents a
structured segment of text within a document. Paragraphs serve as
containers for sentences and maintain the raw text content of the
segment they represent.

The module supports validation to ensure data integrity and provides
mechanisms to prevent inconsistencies during document analysis by
restricting certain attribute modifications after initial assignment.

class contextgem.public.paragraphs.Paragraph(**data)

   Bases: "_Paragraph"

   Represents a paragraph of a document with its raw text content and
   constituent sentences.

   Paragraphs are immutable text segments that can contain multiple
   sentences. Once sentences are assigned to a paragraph, they cannot
   be changed to maintain data integrity during analysis.

   Variables:
      * **raw_text** (*str*) -- The complete text content of the
        paragraph. This value is frozen after initialization.

      * **sentences** (*list**[**Sentence**]*) -- The individual
        sentences contained within the paragraph. Defaults to an empty
        list. Cannot be reassigned once populated.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **additional_context** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **raw_text** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

      * **sentences** (*list**[**_Sentence**]*)

   Note:
      Normally, you do not need to construct paragraphs manually, as
      they are populated automatically from the document's "raw_text"
      attribute. Only use this constructor for advanced use cases,
      such as when you have a custom paragraph segmentation tool.
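
As an example of what a custom paragraph segmentation step might look like, the sketch below splits text on blank lines using the standard library. "split_paragraphs" is a hypothetical helper and the blank-line rule is an assumption; each resulting string could then be passed as "raw_text" when constructing a "Paragraph".

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs on one or more blank lines."""
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty chunks and surrounding whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample_text = "First paragraph.\nStill first.\n\nSecond paragraph.\n\n\n"
paragraph_texts = split_paragraphs(sample_text)
```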

   Example:
      Paragraph definition

         from contextgem import Paragraph


         # Create a paragraph with raw text content
         contract_paragraph = Paragraph(
             raw_text=(
                 "This agreement is effective as of January 1, 2025. "
                 "All parties must comply with the terms outlined herein. "
                 "Failure to adhere to these terms may result in termination of the agreement."
             )
         )

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises "pydantic_core.ValidationError" if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   raw_text: NonEmptyStr

   sentences: list[_Sentence]

   additional_context: NonEmptyStr | None

   custom_data: JSONDictField


# ==== api/sentences ====

Sentences
*********

Module for handling document sentences.

This module provides the Sentence class, which represents a structured
unit of text within a document paragraph. Sentences are the
fundamental building blocks of text analysis, containing the raw text
content of individual statements.

The module supports validation to ensure data integrity and integrates
with the paragraph structure to maintain the hierarchical organization
of document content.

class contextgem.public.sentences.Sentence(**data)

   Bases: "_Sentence"

   Represents a sentence within a document paragraph.

   Sentences are immutable text units that serve as the fundamental
   building blocks for document analysis. The raw text content is
   preserved and cannot be modified after initialization to maintain
   data integrity.

   Variables:
      **raw_text** (*str*) -- The complete text content of the
      sentence. This value is frozen after initialization.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **additional_context** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]
        **| **None*)

      * **raw_text** (*Annotated**[**str**,
        **Strict**(**strict=True**)**,
        **StringConstraints**(**strip_whitespace=True**,
        **to_upper=None**, **to_lower=None**, **strict=None**,
        **min_length=1**, **max_length=None**, **pattern=None**)**]*)

   Note:
      Normally, you do not need to construct sentences manually, as
      they are populated automatically from the document's "raw_text" or
      "paragraphs" attributes. Only use this constructor for advanced
      use cases, such as when you have a custom paragraph/sentence
      segmentation tool.
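
A custom sentence segmentation step could be as simple as the naive rule sketched below, which splits after ".", "!" or "?" followed by whitespace. "split_sentences" is a hypothetical helper; the rule is an assumption and will mis-split abbreviations such as "e.g.", so a dedicated segmentation tool is preferable for real documents.

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Naively split a paragraph after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


sentence_texts = split_sentences(
    "This agreement is effective as of January 1, 2025. "
    "All parties must comply with the terms outlined herein."
)
```

Each resulting string could then be used as "raw_text" for a "Sentence".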

   Example:
      Sentence definition

         from contextgem import Sentence


         # Create a sentence with raw text content
         sentence = Sentence(raw_text="This is a simple sentence.")

         # Sentences are immutable - their content cannot be changed after creation
         try:
             sentence.raw_text = "Attempting to modify the sentence."
         except ValueError as e:
             print(f"Error when trying to modify sentence: {e}")

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises "pydantic_core.ValidationError" if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str
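The serialization methods above compose into a round trip: *to_dict()*
feeds *to_json()*, which feeds *to_disk()*, and loading runs the same
chain in reverse. The following is a minimal standard-library sketch of
that pattern, not ContextGem's implementation. The class
"SerializableSketch" and the "__class__" key are hypothetical names
chosen for illustration only.

```python
import json
import tempfile
from pathlib import Path


class SerializableSketch:
    """Hypothetical stand-in class; not part of ContextGem."""

    def __init__(self, raw_text: str):
        self.raw_text = raw_text

    def to_dict(self) -> dict:
        # Store the class name so that loading can validate it later.
        # The "__class__" key is an assumption made for this sketch.
        return {"__class__": type(self).__name__, "raw_text": self.raw_text}

    def to_json(self) -> str:
        return json.dumps(self.to_dict(), ensure_ascii=False, indent=4)

    def to_disk(self, file_path) -> None:
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("File path must end with '.json'")
        path.write_text(self.to_json(), encoding="utf-8")

    @classmethod
    def from_dict(cls, obj_dict: dict):
        return cls(raw_text=obj_dict["raw_text"])

    @classmethod
    def from_json(cls, json_string: str):
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class.
        if obj_dict.get("__class__") != cls.__name__:
            raise TypeError("Class name in serialized data does not match")
        return cls.from_dict(obj_dict)

    @classmethod
    def from_disk(cls, file_path):
        path = Path(file_path)
        if path.suffix != ".json":
            raise ValueError("File path must end with '.json'")
        try:
            return cls.from_json(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            raise RuntimeError("Deserialization failed") from exc


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "sentence.json"
    SerializableSketch("This is a sentence.").to_disk(target)
    restored = SerializableSketch.from_disk(target)

print(restored.raw_text)
```

In this sketch the suffix check runs before any file access, so a wrong
file extension fails fast with "ValueError" rather than with an I/O
error.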

   property unique_id: str

      Returns the ULID of the instance.

   raw_text: NonEmptyStr

   additional_context: NonEmptyStr | None

   custom_data: JSONDictField


# ==== api/pipelines ====

Pipelines
*********

Module for handling document processing pipelines.

This module provides the ExtractionPipeline class, which represents a
reusable collection of pre-defined aspects and concepts that can be
assigned to documents. Pipelines enable standardized document analysis
by packaging common extraction patterns into reusable units.

Pipelines serve as templates for document processing, allowing
consistent application of the same analysis approach across multiple
documents. They encapsulate both the structural organization (aspects)
and the specific information to extract (concepts) in a single,
assignable object.

class contextgem.public.pipelines.ExtractionPipeline(**data)

   Bases: "_ExtractionPipeline"

   Represents a reusable collection of predefined aspects and concepts
   for document analysis.

   Extraction pipelines serve as templates that can be assigned to
   multiple documents, ensuring consistent application of the same
   analysis criteria. They package common extraction patterns into
   reusable units, allowing for standardized document processing.

   Variables:
      * **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
        extract from documents. Aspects represent structural
        categories of information. Defaults to an empty list.

      * **concepts** (*list**[**_Concept**]*) -- A list of concepts to
        identify within documents. Concepts represent specific
        information elements to extract. Defaults to an empty list.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

   Note:
      A pipeline is a reusable configuration of extraction steps. You
      can use the same pipeline to extract data from multiple
      documents.

   Example:
      Extraction pipeline definition

         from contextgem import (
             Aspect,
             BooleanConcept,
             DateConcept,
             Document,
             ExtractionPipeline,
             StringConcept,
         )


         # Create a pipeline for NDA (Non-Disclosure Agreement) review
         nda_pipeline = ExtractionPipeline(
             aspects=[
                 Aspect(
                     name="Confidential information",
                     description="Clauses defining the confidential information",
                 ),
                 Aspect(
                     name="Exclusions",
                     description="Clauses defining exclusions from confidential information",
                 ),
                 Aspect(
                     name="Obligations",
                     description="Clauses defining confidentiality obligations",
                 ),
                 Aspect(
                     name="Liability",
                     description="Clauses defining liability for breach of the agreement",
                 ),
                 # ... Add more aspects as needed
             ],
             concepts=[
                 StringConcept(
                     name="Anomaly",
                     description="Anomaly in the contract, e.g. out-of-context or nonsensical clauses",
                     llm_role="reasoner_text",
                     add_references=True,  # Add references to the source text
                     reference_depth="sentences",  # Reference to the sentence level
                     add_justifications=True,  # Add justifications for the anomaly
                     justification_depth="balanced",  # Balanced level of detail in the justification
                     justification_max_sents=5,  # Maximum number of sentences in the justification
                 ),
                 BooleanConcept(
                     name="Is mutual",
                     description="Whether the NDA is mutual (bidirectional) or one-way",
                     singular_occurrence=True,
                     llm_role="reasoner_text",  # Use the reasoner role for this concept
                 ),
                 DateConcept(
                     name="Effective date",
                     description="The date when the NDA agreement becomes effective",
                     singular_occurrence=True,
                 ),
                 StringConcept(
                     name="Term",
                     description="The term of the NDA",
                 ),
                 StringConcept(
                     name="Governing law",
                     description="The governing law of the agreement",
                     singular_occurrence=True,
                 ),
                 # ... Add more concepts as needed
             ],
         )

         # Assign the pipeline to the NDA document
         nda_document = Document(raw_text="[NDA text]")
         nda_document.assign_pipeline(nda_pipeline)

         # Now the document is ready for processing with the NDA review pipeline!
         # The document can be processed to extract the defined aspects and concepts

         # Extract all aspects and concepts from the NDA using an LLM group
         # whose LLMs have the roles "extractor_text" and "reasoner_text".
         # llm_group.extract_all(nda_document)

   Create a new model by parsing and validating input data from
   keyword arguments.

   Raises *pydantic_core.ValidationError* if the input data cannot be
   validated to form a valid model.

   *self* is explicitly positional-only to allow *self* as a field
   name.

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and
      returns the updated instance. This method ensures that the
      provided aspects are deeply copied to avoid any unintended state
      modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
         be added. Each aspect is deeply copied to ensure the original
         list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts* attribute
      of the instance. This method ensures that the provided list of
      concepts is deep-copied to prevent unintended side effects from
      modifying the input list outside of this method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts
         to be added. It will be deep-copied before being added to the
         instance's *concepts* attribute.

      Returns:
         Returns the instance itself after the modification.

      Return type:
         Self
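The copy-on-add behavior described for *add_aspects* and *add_concepts*
can be illustrated with a small standard-library sketch. It is not
ContextGem's implementation: "ConceptSketch" and "PipelineSketch" are
hypothetical stand-ins, and the "extracted_items" field is an assumed
piece of mutable state used only to make the effect visible.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class ConceptSketch:
    # Hypothetical stand-in for a reusable concept definition.
    name: str
    extracted_items: list = field(default_factory=list)


class PipelineSketch:
    # Hypothetical stand-in for a pipeline-like container.
    def __init__(self):
        self.concepts = []

    def add_concepts(self, concepts):
        # Store deep copies so later changes to the stored concepts
        # never leak back into the caller's original objects.
        self.concepts.extend(deepcopy(concept) for concept in concepts)
        return self  # returning self allows method chaining


shared = ConceptSketch(name="Term")
pipeline = PipelineSketch().add_concepts([shared])

# Mutate the copy held by the pipeline; the original stays untouched.
pipeline.concepts[0].extracted_items.append("2 years")
print(shared.extracted_items)  # []
```

Because the container holds copies, one concept definition can be added
to several containers without their state interfering.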

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   get_aspect_by_name(name)

      Finds and returns an aspect with the specified name from the
      list of available aspects, if the instance has an *aspects*
      attribute.

      Parameters:
         **name** (*str*) -- The name of the aspect to find.

      Returns:
         The aspect with the specified name.

      Return type:
         _Aspect

      Raises:
         **ValueError** -- If no aspect with the specified name is
         found.

   get_aspects_by_names(names)

      Retrieve a list of _Aspect objects corresponding to the provided
      list of names.

      Parameters:
         **names** ("list"["str"]) -- List of aspect names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Aspect objects corresponding to provided names.

      Return type:
         list[_Aspect]

   get_concept_by_name(name)

      Retrieves a concept from the list of concepts based on the
      provided name, if the instance has a *concepts* attribute.

      Parameters:
         **name** (*str*) -- The name of the concept to search for.

      Returns:
         The *_Concept* object with the specified name.

      Return type:
         _Concept

      Raises:
         **ValueError** -- If no concept with the specified name is
         found.

   get_concepts_by_names(names)

      Retrieve a list of _Concept objects corresponding to the
      provided list of names.

      Parameters:
         **names** ("list"["str"]) -- List of concept names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Concept objects corresponding to provided names.

      Return type:
         list[_Concept]
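The name-based lookups above can be sketched as follows. This is an
illustrative approximation, not ContextGem's code: "ConceptSketch" and
both functions are hypothetical, and the behavior of the multi-name
lookup for an unknown name is an assumption, since the documentation
above only states the "ValueError" for the single-name methods.

```python
from dataclasses import dataclass


@dataclass
class ConceptSketch:
    # Hypothetical stand-in for a concept; only the name matters here.
    name: str


def get_concept_by_name(concepts, name):
    # Return the first concept whose name matches, else raise ValueError.
    for concept in concepts:
        if concept.name == name:
            return concept
    raise ValueError(f"Concept with name {name!r} not found")


def get_concepts_by_names(concepts, names):
    # Assumption: each name is resolved with the single-name lookup,
    # so an unknown name raises ValueError here as well.
    return [get_concept_by_name(concepts, name) for name in names]


concepts = [ConceptSketch("Term"), ConceptSketch("Governing law")]
found = get_concepts_by_names(concepts, ["Governing law", "Term"])
print([concept.name for concept in found])  # ['Governing law', 'Term']
```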

   property llm_roles: set[str]

      A set of LLM roles associated with the object's aspects and
      concepts.

      Returns:
         A set containing unique LLM roles gathered from aspects and
         concepts.

      Return type:
         set[str]
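A property of this kind can be sketched with a set comprehension. The
classes below are hypothetical stand-ins, not ContextGem classes. The
sketch assumes that every aspect and concept carries an "llm_role"
string with "extractor_text" as its default, and it only looks at
top-level items; whether the real property also visits nested items is
not stated above.

```python
from dataclasses import dataclass, field


@dataclass
class ItemSketch:
    # Hypothetical stand-in for an aspect or a concept.
    name: str
    llm_role: str = "extractor_text"  # assumed default role


@dataclass
class PipelineSketch:
    aspects: list = field(default_factory=list)
    concepts: list = field(default_factory=list)

    @property
    def llm_roles(self) -> set:
        # A set removes duplicates when several items share one role.
        return {item.llm_role for item in [*self.aspects, *self.concepts]}


pipeline = PipelineSketch(
    aspects=[ItemSketch("Liability")],
    concepts=[
        ItemSketch("Anomaly", llm_role="reasoner_text"),
        ItemSketch("Is mutual", llm_role="reasoner_text"),
        ItemSketch("Term"),
    ],
)
print(sorted(pipeline.llm_roles))  # ['extractor_text', 'reasoner_text']
```

A set like this shows which roles an LLM or LLM group needs to cover
before the items can be extracted.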

   remove_all_aspects()

      Removes all aspects from the instance and returns the updated
      instance.

      This method clears the *aspects* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all aspects removed

   remove_all_concepts()

      Removes all concepts from the instance and returns the updated
      instance.

      This method clears the *concepts* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all concepts removed
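Because both methods return the same instance, calls can be chained.
The following sketch shows the pattern with a hypothetical
"PipelineSketch" class; it is not ContextGem's implementation and uses
plain strings in place of real aspects and concepts.

```python
class PipelineSketch:
    # Hypothetical stand-in for a pipeline-like container.
    def __init__(self, aspects=None, concepts=None):
        self.aspects = list(aspects or [])
        self.concepts = list(concepts or [])

    def remove_all_aspects(self):
        self.aspects = []  # reset to an empty list
        return self  # same instance, which enables chaining

    def remove_all_concepts(self):
        self.concepts = []
        return self


pipeline = PipelineSketch(aspects=["Liability"], concepts=["Term"])
same = pipeline.remove_all_aspects().remove_all_concepts()
print(same is pipeline, pipeline.aspects, pipeline.concepts)  # True [] []
```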

   remove_all_instances()

      Removes all assigned instances from the object and resets them
      as empty lists. Returns the modified instance.

      Returns:
         The modified object with all assigned instances removed.

      Return type:
         Self

   remove_aspect_by_name(name)

      Removes an aspect from the assigned aspects by its name.

      Parameters:
         **name** (*str*) -- The name of the aspect to be removed

      Returns:
         Updated instance with the aspect removed.

      Return type:
         Self

   remove_aspects_by_names(names)

      Removes multiple aspects from an object based on the provided
      list of names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of names identifying
         the aspects to be removed.

      Returns:
         The updated object after the specified aspects have been
         removed.

      Return type:
         Self

   remove_concept_by_name(name)

      Removes a concept from the assigned concepts by its name.

      Parameters:
         **name** (*str*) -- The name of the concept to be removed

      Returns:
         Updated instance with the concept removed.

      Return type:
         Self

   remove_concepts_by_names(names)

      Removes concepts from the object by their names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of concept names to
         be removed.

      Returns:
         Returns the updated instance after removing the specified
         concepts.

      Return type:
         Self

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   custom_data: JSONDictField

class contextgem.public.pipelines.DocumentPipeline(**data)

   Bases: "_DocumentPipeline"

   Deprecated wrapper for ExtractionPipeline.

   Deprecated since version 0.14.1: DocumentPipeline is deprecated and
   will be removed in v1.0.0. Use ExtractionPipeline instead.

   This class was renamed to ExtractionPipeline to better reflect its
   purpose and scope:

   * **Clearer semantics**: "ExtractionPipeline" explicitly describes
     what the pipeline does

   * **Consistency**: Aligns with the framework's naming conventions
     for extraction-focused components

   **Migration**: Simply replace "DocumentPipeline" with
   "ExtractionPipeline" in your imports. All functionality remains
   identical.

   Initialize DocumentPipeline with deprecation warning.

   Parameters:
      * **custom_data** (*Annotated**[**dict**[**str**, **Any**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_is_json_dict**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **aspects** (*Annotated**[**Sequence**[**_Aspect**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)

      * **concepts** (*Annotated**[**Sequence**[**_Concept**]**,
        **BeforeValidator**(**func=~contextgem.internal.typings.validators._validate_sequence_is_list**,
        **json_schema_input_type=PydanticUndefined**)**]*)
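A deprecated wrapper of this kind is commonly written as a thin
subclass that emits a warning on initialization. The sketch below
illustrates the pattern with the standard "warnings" module. It is not
ContextGem's code: both class names are hypothetical, and the use of
"DeprecationWarning" as the warning category is an assumption.

```python
import warnings


class ExtractionPipelineSketch:
    # Hypothetical stand-in for the renamed class.
    def __init__(self, aspects=None, concepts=None):
        self.aspects = list(aspects or [])
        self.concepts = list(concepts or [])


class DocumentPipelineSketch(ExtractionPipelineSketch):
    # Hypothetical deprecated alias: same behavior, plus a warning.
    def __init__(self, **data):
        warnings.warn(
            "DocumentPipeline is deprecated and will be removed in v1.0.0. "
            "Use ExtractionPipeline instead.",
            DeprecationWarning,  # assumed warning category
            stacklevel=2,  # point the warning at the caller's line
        )
        super().__init__(**data)


# Capture the warning to show that the old name still works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = DocumentPipelineSketch(aspects=["Liability"])

print(caught[0].category.__name__, legacy.aspects)
```

Running code that still uses the old name with warnings enabled is one
way to find the imports that need to be updated.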

   add_aspects(aspects)

      Adds aspects to the existing aspects list of an instance and
      returns the updated instance. This method ensures that the
      provided aspects are deeply copied to avoid any unintended state
      modification of the original reusable aspects.

      Parameters:
         **aspects** (*list**[**_Aspect**]*) -- A list of aspects to
         be added. Each aspect is deeply copied to ensure the original
         list remains unaltered.

      Returns:
         Updated instance containing the newly added aspects.

      Return type:
         Self

   add_concepts(concepts)

      Adds a list of new concepts to the existing *concepts* attribute
      of the instance. This method ensures that the provided list of
      concepts is deep-copied to prevent unintended side effects from
      modifying the input list outside of this method.

      Parameters:
         **concepts** (*list**[**_Concept**]*) -- A list of concepts
         to be added. It will be deep-copied before being added to the
         instance's *concepts* attribute.

      Returns:
         Returns the instance itself after the modification.

      Return type:
         Self

   clone()

      Creates and returns a deep copy of the current instance.

      Return type:
         "typing.Self"

      Returns:
         A deep copy of the current instance.

   classmethod from_dict(obj_dict)

      Reconstructs an instance of the class from a dictionary
      representation.

      This method deserializes a dictionary containing the object's
      attributes and values into a new instance of the class. It
      handles complex nested structures like aspects, concepts, and
      extracted items, properly reconstructing each component.

      Parameters:
         **obj_dict** (*dict**[**str**, **Any**]*) -- Dictionary
         containing the serialized object data.

      Returns:
         A new instance of the class with restored attributes.

      Return type:
         Self

   classmethod from_disk(file_path)

      Loads an instance of the class from a JSON file stored on disk.

      This method reads the JSON content from the specified file path
      and deserializes it into an instance of the class using the
      *from_json* method.

      Parameters:
         **file_path** (*str** | **Path*) -- Path to the JSON file to
         load (must end with '.json'). Can be a string or a Path
         object.

      Returns:
         An instance of the class populated with the data from the
         file.

      Return type:
         Self

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If deserialization fails.

   classmethod from_json(json_string)

      Creates an instance of the class from a JSON string
      representation.

      This method deserializes the provided JSON string into a
      dictionary and uses the *from_dict* method to construct the
      class instance. It validates that the class name in the
      serialized data matches the current class.

      Parameters:
         **json_string** (*str*) -- JSON string containing the
         serialized object data.

      Returns:
         A new instance of the class with restored state.

      Return type:
         Self

      Raises:
         **TypeError** -- If the class name in the serialized data
         doesn't match.

   get_aspect_by_name(name)

      Finds and returns an aspect with the specified name from the
      list of available aspects, if the instance has an *aspects*
      attribute.

      Parameters:
         **name** (*str*) -- The name of the aspect to find.

      Returns:
         The aspect with the specified name.

      Return type:
         _Aspect

      Raises:
         **ValueError** -- If no aspect with the specified name is
         found.

   get_aspects_by_names(names)

      Retrieve a list of _Aspect objects corresponding to the provided
      list of names.

      Parameters:
         **names** ("list"["str"]) -- List of aspect names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Aspect objects corresponding to provided names.

      Return type:
         list[_Aspect]

   get_concept_by_name(name)

      Retrieves a concept from the list of concepts based on the
      provided name, if the instance has a *concepts* attribute.

      Parameters:
         **name** (*str*) -- The name of the concept to search for.

      Returns:
         The *_Concept* object with the specified name.

      Return type:
         _Concept

      Raises:
         **ValueError** -- If no concept with the specified name is
         found.

   get_concepts_by_names(names)

      Retrieve a list of _Concept objects corresponding to the
      provided list of names.

      Parameters:
         **names** ("list"["str"]) -- List of concept names to
         retrieve. The names must be provided as a list of strings.

      Returns:
         A list of _Concept objects corresponding to provided names.

      Return type:
         list[_Concept]

   property llm_roles: set[str]

      A set of LLM roles associated with the object's aspects and
      concepts.

      Returns:
         A set containing unique LLM roles gathered from aspects and
         concepts.

      Return type:
         set[str]

   remove_all_aspects()

      Removes all aspects from the instance and returns the updated
      instance.

      This method clears the *aspects* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all aspects removed

   remove_all_concepts()

      Removes all concepts from the instance and returns the updated
      instance.

      This method clears the *concepts* attribute of the instance by
      resetting it to an empty list. It returns the same instance,
      allowing for method chaining.

      Return type:
         "typing.Self"

      Returns:
         The updated instance with all concepts removed

   remove_all_instances()

      Removes all assigned instances from the object and resets them
      as empty lists. Returns the modified instance.

      Returns:
         The modified object with all assigned instances removed.

      Return type:
         Self

   remove_aspect_by_name(name)

      Removes an aspect from the assigned aspects by its name.

      Parameters:
         **name** (*str*) -- The name of the aspect to be removed

      Returns:
         Updated instance with the aspect removed.

      Return type:
         Self

   remove_aspects_by_names(names)

      Removes multiple aspects from an object based on the provided
      list of names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of names identifying
         the aspects to be removed.

      Returns:
         The updated object after the specified aspects have been
         removed.

      Return type:
         Self

   remove_concept_by_name(name)

      Removes a concept from the assigned concepts by its name.

      Parameters:
         **name** (*str*) -- The name of the concept to be removed

      Returns:
         Updated instance with the concept removed.

      Return type:
         Self

   remove_concepts_by_names(names)

      Removes concepts from the object by their names.

      Parameters:
         **names** (*list**[**str**]*) -- A list of concept names to
         be removed.

      Returns:
         Returns the updated instance after removing the specified
         concepts.

      Return type:
         Self

   to_dict()

      Transforms the current object into a dictionary representation.

      Converts the object to a dictionary that includes:

      * All public attributes

      * Special handling for specific public and private attributes

      When an LLM or LLM group is serialized, its API credentials and
      usage/cost stats are removed.

      Returns:
         A dictionary representation of the current object with all
         necessary data for serialization

      Return type:
         dict[str, Any]

   to_disk(file_path)

      Saves the serialized instance to a JSON file at the specified
      path.

      This method converts the instance to a dictionary representation
      using *to_dict()*, then writes it to disk as a formatted JSON
      file with UTF-8 encoding.

      Parameters:
         **file_path** (*str** | **Path*) -- Path where the JSON file
         should be saved (must end with '.json'). Can be a string or a
         Path object.

      Return type:
         "None"

      Returns:
         None

      Raises:
         * **ValueError** -- If the file path doesn't end with
           '.json'.

         * **RuntimeError** -- If there's an error during the file
           writing process.

   to_json()

      Converts the object to its JSON string representation.

      Serializes the object into a JSON-formatted string using the
      dictionary representation provided by the *to_dict()* method.

      Returns:
         A JSON string representation of the object.

      Return type:
         str

   property unique_id: str

      Returns the ULID of the instance.

   aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]

   concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]

   custom_data: JSONDictField


# ==== api/decorators ====

Decorators
**********

Public decorators for extending or integrating with the framework.

This module contains decorators that are part of the public API and
intended for end users to apply to their own functions or classes.

contextgem.public.decorators.register_tool(func, /)

   Registers a function as a tool handler for LLM chat with tools.

   Validates that the function has an inspectable signature and
   accepts keyword arguments (no positional-only parameters). Marks
   the function so the runtime can recognize and call it by name.

   Parameters:
      **func** (*ToolHandler*) -- A callable to be used as a tool
      handler.

   Returns:
      The same function, marked as a registered tool.

   Return type:
      ToolHandler

   Raises:
      * **TypeError** -- If the provided object is not callable.

      * **ValueError** -- If the signature cannot be inspected or has
        positional-only parameters, or if the function name is empty.
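
The checks described above can be approximated with the standard
"inspect" module. This is a minimal sketch under stated assumptions, not
the actual implementation of "register_tool": the name
"register_tool_sketch", the marker attribute "_is_tool_sketch", and the
example handler are all hypothetical.

```python
import inspect


def register_tool_sketch(func, /):
    # Hypothetical re-creation of the documented validation steps.
    if not callable(func):
        raise TypeError("Tool handler must be callable")
    if not getattr(func, "__name__", ""):
        raise ValueError("Tool handler must have a non-empty name")
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError) as exc:
        raise ValueError("Tool handler signature cannot be inspected") from exc
    for parameter in signature.parameters.values():
        # Positional-only parameters cannot be passed by keyword.
        if parameter.kind is inspect.Parameter.POSITIONAL_ONLY:
            raise ValueError(
                f"Parameter {parameter.name!r} is positional-only"
            )
    func._is_tool_sketch = True  # hypothetical marker attribute
    return func  # the same function object, now marked


@register_tool_sketch
def lookup_contract_status(contract_id: str) -> str:
    # Example handler; every argument can be passed by keyword.
    return f"Contract {contract_id}: active"


print(lookup_contract_status(contract_id="NDA-001"))
print(lookup_contract_status._is_tool_sketch)  # True
```

Rejecting positional-only parameters matters when a runtime calls tool
handlers with keyword arguments built from the arguments the LLM
supplies.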