Dealing with Long Documents#
ContextGem offers specialized configuration options for efficiently processing lengthy documents.
✂️ Segmentation Approach#
Unlike many systems that rely on chunking (e.g. RAG), ContextGem intelligently segments documents into natural semantic units like paragraphs and sentences. This preserves the contextual integrity of the content while allowing you to configure:
Maximum number of paragraphs per LLM call
Maximum number of aspects/concepts to analyze per LLM call
Maximum number of images per LLM call (if the document contains images)
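The batching arithmetic behind these limits can be sketched in plain Python (an illustration only; `batch_paragraphs` is a hypothetical helper, not part of ContextGem's API):

```python
from math import ceil


def batch_paragraphs(paragraphs: list[str], max_per_call: int) -> list[list[str]]:
    """Split a document's paragraphs into batches, one batch per LLM call."""
    return [
        paragraphs[i : i + max_per_call]
        for i in range(0, len(paragraphs), max_per_call)
    ]


# A 120-paragraph document with a 50-paragraph cap needs ceil(120 / 50) = 3 calls
paragraphs = [f"Paragraph {n}" for n in range(120)]
batches = batch_paragraphs(paragraphs, max_per_call=50)
assert len(batches) == ceil(120 / 50)
```

Lower caps mean more (but smaller and more focused) LLM calls; the same arithmetic applies to the aspect/concept and image limits.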
⚙️ Effective Optimization Strategies#
🔄 Use Long-Context Models: Select models with large context windows. (See Choosing the Right LLM(s) for guidance.)
📏 Limit Paragraphs Per Call: This reduces each prompt’s length and keeps the analysis focused.
🔢 Limit Aspects/Concepts Per Call: Process a smaller number of aspects or concepts in each LLM call, preventing prompt overloading.
⚡ Optional: Enable Concurrency: Enable running extractions concurrently if your API setup permits. This will reduce the overall processing time. (See Optimizing for Speed for guidance on configuring concurrency.)
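Why concurrency reduces wall-clock time can be sketched with plain `asyncio` (an illustrative pattern, not ContextGem's internal implementation; `call_llm` is a hypothetical stand-in for one extraction call):

```python
import asyncio


async def call_llm(batch_id: int) -> str:
    # Stand-in for one LLM extraction call (e.g. one batch of paragraphs)
    await asyncio.sleep(0.1)  # simulated network latency
    return f"result for batch {batch_id}"


async def extract_sequential(n_batches: int) -> list[str]:
    # One call after another: total time ~= n_batches * latency
    return [await call_llm(i) for i in range(n_batches)]


async def extract_concurrent(n_batches: int) -> list[str]:
    # All calls in flight at once: total time ~= latency of a single call
    return await asyncio.gather(*(call_llm(i) for i in range(n_batches)))


results = asyncio.run(extract_concurrent(4))
```

`asyncio.gather` preserves input order, so concurrent results line up with their batches; mind your provider's rate limits when raising concurrency.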
Since each use case has unique requirements, experiment with different configurations to find your optimal setup.
# Example of configuring LLM extraction to process long documents

import os

from contextgem import Document, DocumentLLM

# Define document
long_doc = Document(
    raw_text="long_document_text",
)

# ... attach aspects/concepts to the document ...

# Define and configure LLM
llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
)

# Extract data from document with specific configuration options
long_doc = llm.extract_all(
    long_doc,
    max_paragraphs_to_analyze_per_call=50,  # limit the number of paragraphs to analyze in an individual LLM call
    max_items_per_call=2,  # limit the number of aspects/concepts to analyze in an individual LLM call
    use_concurrency=True,  # optional: enable concurrent extractions
)

# ... use the extracted data ...