Quickstart examples

This guide walks through complete, self-contained examples that show how to extract data from a document using ContextGem.

🔄 Extraction Process

ContextGem follows a simple extraction process:

1. Create a Document instance with your content.
2. Define Aspect instances for sections of interest.
3. Define concept instances (StringConcept, BooleanConcept, NumericalConcept, DateConcept, JsonObjectConcept, RatingConcept) for specific data points to extract, and attach them to an Aspect (for aspect context) or a Document (for document context).
4. Use DocumentLLM or DocumentLLMGroup to perform the extraction.
5. Access the extracted data in the document object.

📋 Aspect Extraction from Document

Tip

Aspect extraction is useful for identifying and extracting specific sections or topics from documents. Common use cases include:

Extracting specific clauses from legal contracts
Identifying specific sections from financial reports
Isolating relevant topics from research papers
Extracting product features from technical documentation

# Quick Start Example - Extracting an aspect from a document

import os

from contextgem import Aspect, Document, DocumentLLM


# Example document instance
# Document content is shortened for brevity
doc = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "This agreement is governed by the laws of Norway...\n"
    ),
)

# Define an aspect with optional concept(s), using natural language
doc_aspect = Aspect(
    name="Governing law",
    description="Clauses defining the governing law of the agreement",
    reference_depth="sentences",
)

# Add aspects to the document
doc.add_aspects([doc_aspect])
# (add more aspects to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
)

# Extract information from the document
extracted_aspects = llm.extract_aspects_from_document(doc)
# or use the async version: await llm.extract_aspects_from_document_async(doc)

# Access extracted information
print("Governing law aspect:")
print(
    extracted_aspects[0].extracted_items
)  # extracted aspect items with references to sentences
# or doc.get_aspect_by_name("Governing law").extracted_items
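
In the example above, `os.environ.get("CONTEXTGEM_OPENAI_API_KEY")` returns `None` when the variable is not set, so a missing key may only surface later, when the provider rejects the request. A minimal sketch of a fail-fast check using only the standard library; the helper name `require_env` is hypothetical and not part of ContextGem:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (hypothetical): api_key = require_env("CONTEXTGEM_OPENAI_API_KEY")
```

The returned string can then be passed as `api_key` when creating the `DocumentLLM`.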

🌳 Extracting Aspect with Sub-Aspects

Tip

Sub-aspect extraction helps organize complex topics into logical components. Common use cases include:

Breaking down termination clauses in employment contracts into company rights, employee rights, and severance terms
Dividing financial report sections into revenue streams, expenses, and forecasts
Organizing product specifications into technical details, compatibility, and maintenance requirements

# Quick Start Example - Extracting an aspect with sub-aspects

import os

from contextgem import Aspect, Document, DocumentLLM


# Sample document (content shortened for brevity)
contract_text = """
EMPLOYMENT AGREEMENT
...
8. TERMINATION
8.1 Termination by the Company. The Company may terminate the Employee's employment for Cause at any time upon written notice. 
"Cause" shall mean: (i) Employee's material breach of this Agreement; (ii) Employee's conviction of a felony; or 
(iii) Employee's willful misconduct that causes material harm to the Company.
8.2 Termination by the Employee. The Employee may terminate employment for Good Reason upon 30 days' written notice to the Company. 
"Good Reason" shall mean a material reduction in Employee's base salary or a material diminution in Employee's duties.
8.3 Severance. If the Employee's employment is terminated by the Company without Cause or by the Employee for Good Reason, 
the Employee shall be entitled to receive severance pay equal to six (6) months of the Employee's base salary.
...
"""

doc = Document(raw_text=contract_text)

# Define termination aspect with practical sub-aspects
termination_aspect = Aspect(
    name="Termination",
    description="Provisions related to the termination of employment",
    aspects=[  # assign sub-aspects (optional)
        Aspect(
            name="Company Termination Rights",
            description="Conditions under which the company can terminate employment",
        ),
        Aspect(
            name="Employee Termination Rights",
            description="Conditions under which the employee can terminate employment",
        ),
        Aspect(
            name="Severance Terms",
            description="Compensation or benefits provided upon termination",
        ),
    ],
)

# Add the aspect to the document. Sub-aspects are added with the parent aspect.
doc.add_aspects([termination_aspect])
# (add more aspects to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get(
        "CONTEXTGEM_OPENAI_API_KEY"
    ),  # your API key for the LLM provider
)

# Extract all information from the document
doc = llm.extract_all(doc)

# Get results with references in the document object
print("\nTermination aspect:\n")
termination_aspect = doc.get_aspect_by_name("Termination")
for sub_aspect in termination_aspect.aspects:
    print(sub_aspect.name)
    for item in sub_aspect.extracted_items:
        print(item.value)
    print("\n")
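
The loop above reads one level of sub-aspects through the `name` and `aspects` attributes. As a sketch, the same traversal can be written recursively so that the depth is not hard-coded. The helper `walk_aspects` is hypothetical, and the `SimpleNamespace` objects are stand-ins that only mimic those two attributes so the snippet runs without an LLM call:

```python
from types import SimpleNamespace


def walk_aspects(aspects, depth=0):
    """Yield (depth, name) for each aspect and, recursively, its sub-aspects."""
    for aspect in aspects:
        yield depth, aspect.name
        yield from walk_aspects(getattr(aspect, "aspects", []), depth + 1)


# Stand-in objects exposing the same `name` and `aspects` attributes used above
tree = [
    SimpleNamespace(
        name="Termination",
        aspects=[
            SimpleNamespace(name="Company Termination Rights", aspects=[]),
            SimpleNamespace(name="Employee Termination Rights", aspects=[]),
            SimpleNamespace(name="Severance Terms", aspects=[]),
        ],
    )
]

for depth, name in walk_aspects(tree):
    print("  " * depth + name)  # indent sub-aspects under their parent
```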

🔍 Concept Extraction from Aspect

Tip

Concept extraction from aspects helps identify specific data points within already extracted sections or topics. Common use cases include:

Extracting payment amounts from a contract’s payment terms
Extracting the liability cap from a contract’s liability section
Isolating timelines from delivery terms
Extracting a list of features from a product description
Identifying programming languages from a CV’s experience section

# Quick Start Example - Extracting a concept from an aspect

import os

from contextgem import Aspect, Document, DocumentLLM, StringConcept, StringExample


# Example document instance
# Document content is shortened for brevity
doc = Document(
    raw_text=(
        "Employment Agreement\n"
        "This agreement between TechCorp Inc. (Employer) and Jane Smith (Employee)...\n"
        "The employment shall commence on January 15, 2023 and continue until terminated...\n"
        "The Employee shall work as a Senior Software Engineer reporting to the CTO...\n"
        "The Employee shall receive an annual salary of $120,000 paid monthly...\n"
        "The Employee is entitled to 20 days of paid vacation per year...\n"
        "The Employee agrees to a notice period of 30 days for resignation...\n"
        "This agreement is governed by the laws of California...\n"
    ),
)

# Define an aspect with a specific concept, using natural language
doc_aspect = Aspect(
    name="Compensation",
    description="Clauses defining the compensation and benefits for the employee",
    reference_depth="sentences",
)

# Define a concept within the aspect
aspect_concept = StringConcept(
    name="Annual Salary",
    description="The annual base salary amount specified in the employment agreement",
    examples=[  # optional
        StringExample(
            content="$X per year",  # guidance regarding format
        )
    ],
    add_references=True,
    reference_depth="sentences",
)

# Add the concept to the aspect
doc_aspect.add_concepts([aspect_concept])
# (add more concepts to the aspect, if needed)

# Add the aspect to the document
doc.add_aspects([doc_aspect])
# (add more aspects to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
)

# Extract information from the document
doc = llm.extract_all(doc)
# or use the async version: await llm.extract_all_async(doc)

# Access extracted information in the document object
print("Compensation aspect:")
print(
    doc.get_aspect_by_name("Compensation").extracted_items
)  # extracted aspect items with references to sentences
print("Annual Salary concept:")
print(
    doc.get_aspect_by_name("Compensation")
    .get_concept_by_name("Annual Salary")
    .extracted_items
)  # extracted concept items with references to sentences
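
The comments above mention async variants such as `extract_all_async`. Coroutines have to be awaited inside an event loop, which in a plain script usually means `asyncio.run`. A minimal sketch of that calling pattern follows; `StubLLM` is a hypothetical stand-in (not a ContextGem class) so the snippet runs without an API key. With the real library you would await `llm.extract_all_async(doc)` in its place:

```python
import asyncio


class StubLLM:
    """Hypothetical stand-in for an LLM client exposing an async extraction method."""

    async def extract_all_async(self, doc: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
        return f"processed: {doc}"


async def process_documents(llm: StubLLM, docs: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(llm.extract_all_async(d) for d in docs))


def main() -> list[str]:
    return asyncio.run(process_documents(StubLLM(), ["contract A", "contract B"]))


if __name__ == "__main__":
    print(main())
```

Running several extractions through `asyncio.gather` is one way to overlap the waiting time of independent requests; whether that is appropriate depends on your provider's rate limits.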

📝 Concept Extraction from Document (text)

Tip

Concept extraction from text documents locates specific information directly from text. Common use cases include:

Extracting anomalies from entire legal documents
Identifying financial figures across multiple report sections
Extracting citations and references from academic papers
Identifying product specifications from technical manuals
Extracting contact information from business documents

# Quick Start Example - Extracting a concept from a document

import os

from contextgem import Document, DocumentLLM, JsonObjectConcept


# Example document instance
# Document content is shortened for brevity
doc = Document(
    raw_text=(
        "Statement of Work\n"
        "Project: Cloud Migration Initiative\n"
        "Client: Acme Corporation\n"
        "Contractor: TechSolutions Inc.\n\n"
        "Project Timeline:\n"
        "Start Date: March 1, 2025\n"
        "End Date: August 31, 2025\n\n"
        "Deliverables:\n"
        "1. Infrastructure assessment report (Due: March 15, 2025)\n"
        "2. Migration strategy document (Due: April 10, 2025)\n"
        "3. Test environment setup (Due: May 20, 2025)\n"
        "4. Production migration (Due: July 15, 2025)\n"
        "5. Post-migration support (Due: August 31, 2025)\n\n"
        "Budget: $250,000\n"
        "Payment Schedule: 20% upfront, 30% at midpoint, 50% upon completion\n"
    ),
)

# Define a document-level concept using e.g. JsonObjectConcept
# This will extract structured data from the entire document
doc_concept = JsonObjectConcept(
    name="Project Details",
    description="Key project information including timeline, deliverables, and budget",
    structure={
        "project_name": str,
        "client": str,
        "contractor": str,
        "budget": str,
        "payment_terms": str,
    },  # simply use a dictionary with type hints (including generic aliases and union types)
    add_references=True,
    reference_depth="paragraphs",
)

# Add the concept to the document
doc.add_concepts([doc_concept])
# (add more concepts to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
)

# Extract information from the document
extracted_concepts = llm.extract_concepts_from_document(doc)
# or use the async version: await llm.extract_concepts_from_document_async(doc)

# Access extracted information
print("Project Details:")
print(
    extracted_concepts[0].extracted_items
)  # extracted concept items with references to paragraphs
# Or doc.get_concept_by_name("Project Details").extracted_items
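
The `structure` argument above is a plain dictionary that maps field names to Python types. To illustrate what such a mapping expresses, here is a minimal standard-library sketch that checks a dictionary against a flat structure of this kind. The helper `matches_structure` is hypothetical and is not how ContextGem validates results; the sample values are copied from the example document text, not taken from an actual extraction run:

```python
def matches_structure(data: dict, structure: dict) -> bool:
    """Return True if `data` has exactly the keys of `structure` with matching value types.

    Handles only flat structures with plain classes (str, int, ...),
    not generic aliases or union types.
    """
    if data.keys() != structure.keys():
        return False
    return all(isinstance(data[key], expected) for key, expected in structure.items())


structure = {
    "project_name": str,
    "client": str,
    "contractor": str,
    "budget": str,
    "payment_terms": str,
}

# Hand-written sample, using values that appear in the example document
sample = {
    "project_name": "Cloud Migration Initiative",
    "client": "Acme Corporation",
    "contractor": "TechSolutions Inc.",
    "budget": "$250,000",
    "payment_terms": "20% upfront, 30% at midpoint, 50% upon completion",
}

print(matches_structure(sample, structure))
```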

🖼️ Concept Extraction from Document (vision)

Tip

Concept extraction using vision capabilities processes documents with complex layouts or images. Common use cases include:

Extracting data from scanned contracts or receipts
Identifying information from charts and graphs in reports
Identifying visual product features from marketing materials

# Quick Start Example - Extracting a concept from a document with an image

import os
from pathlib import Path

from contextgem import Document, DocumentLLM, Image, NumericalConcept, image_to_base64


# Path adapted for testing; point image_path at your own image file instead
current_file = Path(__file__).resolve()
root_path = current_file.parents[4]
image_path = root_path / "tests" / "images" / "invoices" / "invoice.jpg"

# Create an image instance
doc_image = Image(mime_type="image/jpg", base64_data=image_to_base64(image_path))

# Example document instance holding only the image
doc = Document(
    images=[doc_image],  # may contain multiple images
)

# Define a concept to extract the invoice total amount
doc_concept = NumericalConcept(
    name="Invoice Total",
    description="The total amount to be paid as shown on the invoice",
    numeric_type="float",
    llm_role="extractor_vision",  # use vision model
)

# Add concept to the document
doc.add_concepts([doc_concept])
# (add more concepts to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # Using a model with vision capabilities
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
    role="extractor_vision",  # mark LLM as vision model
)

# Extract information from the document
extracted_concepts = llm.extract_concepts_from_document(doc)
# or use async version: await llm.extract_concepts_from_document_async(doc)

# Access extracted information
print("Invoice Total:")
print(extracted_concepts[0].extracted_items)  # extracted concept items
# or doc.get_concept_by_name("Invoice Total").extracted_items
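
The example above passes base64-encoded image data to `Image`. As a rough illustration of what that encoding step involves, the sketch below reads a file and base64-encodes its bytes using only the standard library. `encode_file_to_base64` is a hypothetical helper; ContextGem's own `image_to_base64` may differ in its details:

```python
import base64
from pathlib import Path


def encode_file_to_base64(path: Path) -> str:
    """Read a file as bytes and return its contents as a base64-encoded ASCII string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")
```

Base64 turns arbitrary binary data into plain text, which is what allows an image to travel inside a text-based API request.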

💬 Lightweight LLM Chat Interface

Note

While ContextGem is primarily designed for advanced structured data extraction, it also provides a lightweight, unified interface for interacting with LLMs in natural language, across both text and vision, with built-in fallback support.

# Using LLMs for chat (text + vision), with fallback LLM support

import os

from contextgem import DocumentLLM


# from contextgem import Image  # needed only when passing images to chat()

main_model = DocumentLLM(
    model="openai/gpt-4o",  # or another provider/model
    api_key=os.getenv("CONTEXTGEM_OPENAI_API_KEY"),  # your API key for the LLM provider
)

# Optional: fallback LLM
fallback_model = DocumentLLM(
    model="openai/gpt-4o-mini",  # or another provider/model
    api_key=os.getenv("CONTEXTGEM_OPENAI_API_KEY"),  # your API key for the LLM provider
    is_fallback=True,
)
main_model.fallback_llm = fallback_model

response = main_model.chat(
    "Hello",
    # images=[Image(...)]
)
# or `response = await main_model.chat_async(...)`

print(response)
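
Conceptually, a fallback setup means: try the main model first and, if that call fails, send the same request to the fallback model. The sketch below illustrates that control flow with plain callables. `chat_with_fallback` and the two stub functions are hypothetical and do not reflect ContextGem's internal implementation (for example, any retries it may perform before falling back):

```python
from typing import Callable


def chat_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Call `primary`; if it raises, call `fallback` with the same prompt."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)


def failing_model(prompt: str) -> str:
    raise TimeoutError("primary model unavailable")  # simulated failure


def backup_model(prompt: str) -> str:
    return f"backup reply to: {prompt}"


print(chat_with_fallback("Hello", failing_model, backup_model))
# prints: backup reply to: Hello
```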