Optimizing for Accuracy

When accuracy is paramount, ContextGem offers several techniques to improve extraction quality, some more obvious than others:

  • 🚀 Use a Capable LLM: Choose a powerful model for extraction.

  • 🪄 Use Larger Segmentation Models: Select a larger SaT model for intelligent segmentation of paragraphs or sentences, to ensure the highest segmentation accuracy in complex documents (e.g. contracts).

  • 💡 Provide Examples: For the most complex concepts, add examples to guide the LLM's extraction format and style.

  • 🧠 Request Justifications: For the most complex aspects/concepts, enable justifications to understand the LLM's reasoning and instruct the LLM to "think" before giving an answer.

  • 📏 Limit Paragraphs Per Call: Analyze fewer paragraphs in each LLM call to reduce each prompt's length and ensure a more focused analysis.

  • 🔢 Limit Aspects/Concepts Per Call: Process a smaller number of aspects or concepts in each LLM call, preventing prompt overloading.

  • 🔄 Use a Fallback LLM: Configure a fallback LLM to retry failed extractions with a different model.

Example of optimizing extraction for accuracy#
# Example of optimizing extraction for accuracy

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Define document
doc = Document(
    raw_text="Non-Disclosure Agreement...",
    sat_model_id="sat-6l-sm",  # default is "sat-3l-sm"
    paragraph_segmentation_mode="sat",  # default is "newlines"
    # sentence segmentation mode is always "sat", as other approaches proved to be less accurate
)

# Define document concepts
doc.concepts = [
    StringConcept(
        name="Title",  # A very simple concept, just an example for testing purposes
        description="Title of the document",
        add_justifications=True,  # enable justifications
        justification_depth="brief",  # default
        examples=[
            StringExample(
                content="Supplier Agreement",
            )
        ],
    ),
    # ... add other concepts ...
]

# ... attach other aspects/concepts to the document ...

# Define and configure LLM
llm = DocumentLLM(
    model="openai/gpt-4o",
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
    fallback_llm=DocumentLLM(
        model="openai/gpt-4-turbo",
        api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),
        is_fallback=True,
    ),  # configure a fallback LLM
)

# Extract data from document with specific configuration options
doc = llm.extract_all(
    doc,
    max_paragraphs_to_analyze_per_call=30,  # limit the number of paragraphs to analyze in an individual LLM call
    max_items_per_call=1,  # limit the number of aspects/concepts to analyze in an individual LLM call
    use_concurrency=True,  # optional: enable concurrent extractions
)

# ... use the extracted data ...