Optimizing for Accuracy
When accuracy is paramount, ContextGem offers several techniques to improve extraction quality, some of them straightforward:
- Use a Capable LLM: Choose a powerful LLM model for extraction.
- Use Larger Segmentation Models: Select a larger SaT model for intelligent segmentation of paragraphs or sentences, to ensure the highest segmentation accuracy in complex documents (e.g. contracts).
- Provide Examples: For the most complex concepts, add examples to guide the LLM's extraction format and style.
- Request Justifications: For the most complex aspects/concepts, enable justifications to understand the LLM's reasoning and instruct the LLM to "think" when giving an answer.
- Limit Paragraphs Per Call: This reduces each prompt's length and ensures a more focused analysis.
- Limit Aspects/Concepts Per Call: Process a smaller number of aspects or concepts in each LLM call, preventing prompt overloading.
- Use a Fallback LLM: Configure a fallback LLM to retry failed extractions with a different model.
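The two per-call limits above amount to simple batching: instead of sending the whole document and every aspect/concept in one prompt, the work is split into smaller, more focused LLM calls. A minimal sketch of that idea (not ContextGem's internal implementation — the function name and shapes here are illustrative only):

```python
def batch_paragraphs(paragraphs: list[str], max_per_call: int) -> list[list[str]]:
    """Split paragraphs into consecutive batches, one batch per LLM call."""
    return [
        paragraphs[i : i + max_per_call]
        for i in range(0, len(paragraphs), max_per_call)
    ]

# A 65-paragraph document with a limit of 30 paragraphs per call
# yields 3 calls (30 + 30 + 5), each with a shorter, more focused prompt.
paragraphs = [f"Paragraph {i}" for i in range(65)]
batches = batch_paragraphs(paragraphs, max_per_call=30)
print(len(batches))  # 3
```

Limiting aspects/concepts per call works the same way along the other axis: each batch of paragraphs is analyzed once per (group of) aspects/concepts, trading more LLM calls for less crowded prompts.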
# Example of optimizing extraction for accuracy

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Define document
doc = Document(
    raw_text="Non-Disclosure Agreement...",
    sat_model_id="sat-6l-sm",  # default is "sat-3l-sm"
    paragraph_segmentation_mode="sat",  # default is "newlines"
    # sentence segmentation mode is always "sat", as other approaches proved to be less accurate
)

# Define document concepts
doc.concepts = [
    StringConcept(
        name="Title",  # a very simple concept, just an example for testing purposes
        description="Title of the document",
        add_justifications=True,  # enable justifications
        justification_depth="brief",  # default
        examples=[
            StringExample(
                content="Supplier Agreement",
            )
        ],
    ),
    # ... add other concepts ...
]
# ... attach other aspects/concepts to the document ...

# Define and configure LLM
llm = DocumentLLM(
    model="openai/gpt-4o",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
    fallback_llm=DocumentLLM(
        model="openai/gpt-4-turbo",
        api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
        is_fallback=True,
    ),  # configure a fallback LLM
)

# Extract data from the document with specific configuration options
doc = llm.extract_all(
    doc,
    max_paragraphs_to_analyze_per_call=30,  # limit the number of paragraphs to analyze in an individual LLM call
    max_items_per_call=1,  # limit the number of aspects/concepts to analyze in an individual LLM call
    use_concurrency=True,  # optional: enable concurrent extractions
)

# ... use the extracted data ...