Dealing with Long Documents#
ContextGem offers specialized configuration options for efficiently processing lengthy documents.
✂️ Segmentation Approach#
Unlike many systems that rely on chunking (e.g. RAG), ContextGem intelligently segments documents into natural semantic units like paragraphs and sentences. This preserves the contextual integrity of the content while allowing you to configure:
Maximum number of paragraphs per LLM call
Maximum number of aspects/concepts to analyze per LLM call
Maximum number of images per LLM call (if the document contains images)
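The batching arithmetic behind these limits can be sketched in plain Python (an illustration only; `batch_paragraphs` is a hypothetical helper, not part of ContextGem's API):

```python
from math import ceil


def batch_paragraphs(paragraphs: list[str], max_per_call: int) -> list[list[str]]:
    """Split a document's paragraphs into batches, one batch per LLM call."""
    return [
        paragraphs[i : i + max_per_call]
        for i in range(0, len(paragraphs), max_per_call)
    ]


# A 120-paragraph document with a 50-paragraph cap needs ceil(120 / 50) = 3 calls
paragraphs = [f"Paragraph {n}" for n in range(120)]
batches = batch_paragraphs(paragraphs, max_per_call=50)
assert len(batches) == ceil(120 / 50)
```

Lower caps mean more (but smaller and more focused) LLM calls; the same arithmetic applies to the aspect/concept and image limits.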
⚙️ Effective Optimization Strategies#
🔄 Use Long-Context Models: Select models with large context windows. (See Choosing the Right LLM(s) for guidance.)
📏 Limit Paragraphs Per Call: This reduces each prompt’s length and keeps the analysis focused.
🔢 Limit Aspects/Concepts Per Call: Process a smaller number of aspects or concepts in each LLM call, preventing prompt overloading.
⚡ Optional: Enable Concurrency: Enable running extractions concurrently if your API setup permits. This will reduce the overall processing time. (See Optimizing for Speed for guidance on configuring concurrency.)
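Why concurrency reduces wall-clock time can be sketched with plain `asyncio` (an illustrative pattern, not ContextGem's internal implementation; `call_llm` is a hypothetical stand-in for one extraction call):

```python
import asyncio


async def call_llm(batch_id: int) -> str:
    # Stand-in for one LLM extraction call (e.g. one batch of paragraphs)
    await asyncio.sleep(0.1)  # simulated network latency
    return f"result for batch {batch_id}"


async def extract_sequential(n_batches: int) -> list[str]:
    # One call after another: total time ~= n_batches * latency
    return [await call_llm(i) for i in range(n_batches)]


async def extract_concurrent(n_batches: int) -> list[str]:
    # All calls in flight at once: total time ~= latency of a single call
    return await asyncio.gather(*(call_llm(i) for i in range(n_batches)))


results = asyncio.run(extract_concurrent(4))
```

`asyncio.gather` preserves input order, so concurrent results line up with their batches; mind your provider's rate limits when raising concurrency.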
Since each use case has unique requirements, experiment with different configurations to find your optimal setup.
# Example of configuring LLM extraction to process long documents

import os

from contextgem import Document, DocumentLLM

# Define document
long_doc = Document(
    raw_text="long_document_text",
)

# ... attach aspects/concepts to the document ...

# Define and configure LLM
llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
)

# Extract data from document with specific configuration options
long_doc = llm.extract_all(
    long_doc,
    max_paragraphs_to_analyze_per_call=50,  # limit the number of paragraphs to analyze in an individual LLM call
    max_items_per_call=2,  # limit the number of aspects/concepts to analyze in an individual LLM call
    use_concurrency=True,  # optional: enable concurrent extractions
)

# ... use the extracted data ...