Quickstart examples

This guide walks through basic extraction with ContextGem. Each example below is complete and self-contained, and shows how to extract data from a document.

🔄 Extraction Process

ContextGem follows a simple extraction process:

  1. Create a Document instance with your content

  2. Define Aspect instances for sections of interest

  3. Define concept instances (StringConcept, BooleanConcept, NumericalConcept, DateConcept, JsonObjectConcept, RatingConcept) for the specific data points to extract, and attach each one to an Aspect (to extract from that aspect's context) or to the Document (to extract from the whole document)

  4. Use DocumentLLM or DocumentLLMGroup to perform the extraction

  5. Access the extracted data in the document object

📋 Aspect Extraction from Document

Tip

Aspect extraction is useful for identifying and extracting specific sections or topics from documents. Common use cases include:

  • Extracting specific clauses from legal contracts

  • Identifying specific sections from financial reports

  • Isolating relevant topics from research papers

  • Extracting product features from technical documentation

# Quick Start Example - Extracting an aspect from a document

import os

from contextgem import Aspect, Document, DocumentLLM

# Example document instance
# Document content is shortened for brevity
doc = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "This agreement is governed by the laws of Norway...\n"
    ),
)

# Define an aspect with optional concept(s), using natural language
doc_aspect = Aspect(
    name="Governing law",
    description="Clauses defining the governing law of the agreement",
    reference_depth="sentences",
)

# Add aspects to the document
doc.add_aspects([doc_aspect])
# (add more aspects to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
)

# Extract information from the document
extracted_aspects = llm.extract_aspects_from_document(doc)
# or use the async version: await llm.extract_aspects_from_document_async(doc)

# Access extracted information
print("Governing law aspect:")
print(
    extracted_aspects[0].extracted_items
)  # extracted aspect items with references to sentences
# or doc.get_aspect_by_name("Governing law").extracted_items
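The examples in this guide read the API key with `os.environ.get(...)`, which returns `None` when the variable is unset, so a missing key may only surface later as a provider error. A minimal sketch of a fail-fast alternative follows; `require_env` is a hypothetical helper name used for illustration and is not part of ContextGem.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Usage (hypothetical): pass the result as api_key when creating the LLM
# api_key = require_env("CONTEXTGEM_OPENAI_API_KEY")
```

Raising early keeps the error message close to its cause instead of deferring it to the first extraction call.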

🌳 Extracting Aspect with Sub-Aspects

Tip

Sub-aspect extraction helps organize complex topics into logical components. Common use cases include:

  • Breaking down termination clauses in employment contracts into company rights, employee rights, and severance terms

  • Dividing financial report sections into revenue streams, expenses, and forecasts

  • Organizing product specifications into technical details, compatibility, and maintenance requirements

# Quick Start Example - Extracting an aspect with sub-aspects

import os

from contextgem import Aspect, Document, DocumentLLM

# Sample document (content shortened for brevity)
contract_text = """
EMPLOYMENT AGREEMENT
...
8. TERMINATION
8.1 Termination by the Company. The Company may terminate the Employee's employment for Cause at any time upon written notice. 
"Cause" shall mean: (i) Employee's material breach of this Agreement; (ii) Employee's conviction of a felony; or 
(iii) Employee's willful misconduct that causes material harm to the Company.
8.2 Termination by the Employee. The Employee may terminate employment for Good Reason upon 30 days' written notice to the Company. 
"Good Reason" shall mean a material reduction in Employee's base salary or a material diminution in Employee's duties.
8.3 Severance. If the Employee's employment is terminated by the Company without Cause or by the Employee for Good Reason, 
the Employee shall be entitled to receive severance pay equal to six (6) months of the Employee's base salary.
...
"""

doc = Document(raw_text=contract_text)

# Define termination aspect with practical sub-aspects
termination_aspect = Aspect(
    name="Termination",
    description="Provisions related to the termination of employment",
    aspects=[  # assign sub-aspects (optional)
        Aspect(
            name="Company Termination Rights",
            description="Conditions under which the company can terminate employment",
        ),
        Aspect(
            name="Employee Termination Rights",
            description="Conditions under which the employee can terminate employment",
        ),
        Aspect(
            name="Severance Terms",
            description="Compensation or benefits provided upon termination",
        ),
    ],
)

# Add the aspect to the document. Sub-aspects are added with the parent aspect.
doc.add_aspects([termination_aspect])
# (add more aspects to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get(
        "CONTEXTGEM_OPENAI_API_KEY"
    ),  # your API key of the LLM provider
)

# Extract all information from the document
doc = llm.extract_all(doc)

# Get results with references in the document object
print("\nTermination aspect:\n")
termination_aspect = doc.get_aspect_by_name("Termination")
for sub_aspect in termination_aspect.aspects:
    print(sub_aspect.name)
    for item in sub_aspect.extracted_items:
        print(item.value)
    print("\n")
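The loop above prints the sub-aspects only. A small recursive helper can render the whole tree, including any items extracted for the parent aspect itself. The sketch below relies only on the attributes already used in the example (`name`, `aspects`, `extracted_items`, and `value` on each item); `aspect_outline` is a hypothetical helper, not a ContextGem function.

```python
def aspect_outline(aspect, depth=0):
    """Return indented text lines for an aspect, its items, and its sub-aspects."""
    pad = "  " * depth
    lines = [f"{pad}{aspect.name}"]
    # Items extracted for this aspect are listed one level below its name
    for item in aspect.extracted_items:
        lines.append(f"{pad}  - {item.value}")
    # Recurse into sub-aspects, indenting one more level
    for sub_aspect in aspect.aspects:
        lines.extend(aspect_outline(sub_aspect, depth + 1))
    return lines


# Usage (after extraction):
# print("\n".join(aspect_outline(doc.get_aspect_by_name("Termination"))))
```

Returning a list of lines rather than printing directly makes the output easy to log, join, or compare in a test.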

🔍 Concept Extraction from Aspect

Tip

Concept extraction from aspects helps identify specific data points within already extracted sections or topics. Common use cases include:

  • Extracting payment amounts from a contract’s payment terms

  • Extracting the liability cap from a contract’s liability section

  • Isolating timelines from delivery terms

  • Extracting a list of features from a product description

  • Identifying programming languages from a CV’s experience section

# Quick Start Example - Extracting a concept from an aspect

import os

from contextgem import Aspect, Document, DocumentLLM, StringConcept, StringExample

# Example document instance
# Document content is shortened for brevity
doc = Document(
    raw_text=(
        "Employment Agreement\n"
        "This agreement between TechCorp Inc. (Employer) and Jane Smith (Employee)...\n"
        "The employment shall commence on January 15, 2023 and continue until terminated...\n"
        "The Employee shall work as a Senior Software Engineer reporting to the CTO...\n"
        "The Employee shall receive an annual salary of $120,000 paid monthly...\n"
        "The Employee is entitled to 20 days of paid vacation per year...\n"
        "The Employee agrees to a notice period of 30 days for resignation...\n"
        "This agreement is governed by the laws of California...\n"
    ),
)

# Define an aspect with a specific concept, using natural language
doc_aspect = Aspect(
    name="Compensation",
    description="Clauses defining the compensation and benefits for the employee",
    reference_depth="sentences",
)

# Define a concept within the aspect
aspect_concept = StringConcept(
    name="Annual Salary",
    description="The annual base salary amount specified in the employment agreement",
    examples=[  # optional
        StringExample(
            content="$X per year",  # guidance regarding format
        )
    ],
    add_references=True,
    reference_depth="sentences",
)

# Add the concept to the aspect
doc_aspect.add_concepts([aspect_concept])
# (add more concepts to the aspect, if needed)

# Add the aspect to the document
doc.add_aspects([doc_aspect])
# (add more aspects to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
)

# Extract information from the document
doc = llm.extract_all(doc)
# or use the async version: await llm.extract_all_async(doc)

# Access extracted information in the document object
print("Compensation aspect:")
print(
    doc.get_aspect_by_name("Compensation").extracted_items
)  # extracted aspect items with references to sentences
print("Annual Salary concept:")
print(
    doc.get_aspect_by_name("Compensation")
    .get_concept_by_name("Annual Salary")
    .extracted_items
)  # extracted concept items with references to sentences
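The example's comment mentions `extract_all_async`. In a plain script there is no running event loop, so the coroutine has to be driven by something such as `asyncio.run`. The following is a minimal sketch, assuming `extract_all_async` is awaitable and takes the document as its argument like the synchronous `extract_all`; both function names below are hypothetical wrappers, not part of ContextGem.

```python
import asyncio


async def extract_all_in_coroutine(llm, doc):
    """Await the async extraction call and return its result."""
    return await llm.extract_all_async(doc)


def run_extraction(llm, doc):
    """Drive the coroutine from synchronous code, e.g. a plain script."""
    return asyncio.run(extract_all_in_coroutine(llm, doc))
```

Inside an application that already runs an event loop (for example, an async web handler), awaiting `extract_all_in_coroutine` directly would be the appropriate choice, since `asyncio.run` cannot be called from a running loop.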

📝 Concept Extraction from Document (text)

Tip

Concept extraction from text documents locates specific information directly from text. Common use cases include:

  • Extracting anomalies from entire legal documents

  • Identifying financial figures across multiple report sections

  • Extracting citations and references from academic papers

  • Identifying product specifications from technical manuals

  • Extracting contact information from business documents

# Quick Start Example - Extracting a concept from a document

import os

from contextgem import Document, DocumentLLM, JsonObjectConcept, JsonObjectExample

# Example document instance
# Document content is shortened for brevity
doc = Document(
    raw_text=(
        "Statement of Work\n"
        "Project: Cloud Migration Initiative\n"
        "Client: Acme Corporation\n"
        "Contractor: TechSolutions Inc.\n\n"
        "Project Timeline:\n"
        "Start Date: March 1, 2025\n"
        "End Date: August 31, 2025\n\n"
        "Deliverables:\n"
        "1. Infrastructure assessment report (Due: March 15, 2025)\n"
        "2. Migration strategy document (Due: April 10, 2025)\n"
        "3. Test environment setup (Due: May 20, 2025)\n"
        "4. Production migration (Due: July 15, 2025)\n"
        "5. Post-migration support (Due: August 31, 2025)\n\n"
        "Budget: $250,000\n"
        "Payment Schedule: 20% upfront, 30% at midpoint, 50% upon completion\n"
    ),
)

# Define a document-level concept using e.g. JsonObjectConcept
# This will extract structured data from the entire document
doc_concept = JsonObjectConcept(
    name="Project Details",
    description="Key project information including timeline, deliverables, and budget",
    structure={
        "project_name": str,
        "client": str,
        "contractor": str,
        "budget": str,
        "payment_terms": str,
    },  # simply use a dictionary with type hints (including generic aliases and union types)
    add_references=True,
    reference_depth="paragraphs",
)

# Add the concept to the document
doc.add_concepts([doc_concept])
# (add more concepts to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or any other LLM from e.g. Anthropic, etc.
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
)

# Extract information from the document
extracted_concepts = llm.extract_concepts_from_document(doc)
# or use the async version: await llm.extract_concepts_from_document_async(doc)

# Access extracted information
print("Project Details:")
print(
    extracted_concepts[0].extracted_items
)  # extracted concept items with references to paragraphs
# or doc.get_concept_by_name("Project Details").extracted_items
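The `structure` dictionary in the example maps each field name to a Python type. To illustrate what that implies for a result, the sketch below checks a plain dictionary against such a mapping. It handles only simple classes such as `str` (not generic aliases or union types), and `matches_structure` is a hypothetical helper, not ContextGem's own validation. The candidate is hand-written from the Statement of Work text above, not actual model output.

```python
def matches_structure(value: dict, structure: dict) -> bool:
    """Check that `value` has exactly the keys of `structure`, each with the expected simple type."""
    if set(value) != set(structure):
        return False
    return all(isinstance(value[key], expected) for key, expected in structure.items())


structure = {"project_name": str, "client": str, "budget": str}
candidate = {
    "project_name": "Cloud Migration Initiative",
    "client": "Acme Corporation",
    "budget": "$250,000",
}
print(matches_structure(candidate, structure))  # True
```

A check like this can serve as a quick sanity test in downstream code before the extracted fields are stored or displayed.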

🖼️ Concept Extraction from Document (vision)

Tip

Concept extraction using vision capabilities processes documents with complex layouts or images. Common use cases include:

  • Extracting data from scanned contracts or receipts

  • Identifying information from charts and graphs in reports

  • Identifying visual product features from marketing materials

# Quick Start Example - Extracting a concept from a document with an image

import os
from pathlib import Path

from contextgem import Document, DocumentLLM, Image, NumericalConcept, image_to_base64

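# Path to the sample invoice image (adapted to the project's test layout;
# point image_path at your own file when running this elsewhere)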
current_file = Path(__file__).resolve()
root_path = current_file.parents[4]
image_path = root_path / "tests" / "invoices" / "invoice.jpg"

# Create an image instance
doc_image = Image(mime_type="image/jpg", base64_data=image_to_base64(image_path))

# Example document instance holding only the image
doc = Document(
    images=[doc_image],  # may contain multiple images
)

# Define a concept to extract the invoice total amount
doc_concept = NumericalConcept(
    name="Invoice Total",
    description="The total amount to be paid as shown on the invoice",
    numeric_type="float",
    llm_role="extractor_vision",  # use vision model
)

# Add concept to the document
doc.add_concepts([doc_concept])
# (add more concepts to the document, if needed)

# Create an LLM for extraction
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # Using a model with vision capabilities
    api_key=os.environ.get("CONTEXTGEM_OPENAI_API_KEY"),  # your API key
    role="extractor_vision",  # mark LLM as vision model
)

# Extract information from the document
extracted_concepts = llm.extract_concepts_from_document(doc)
# or use async version: await llm.extract_concepts_from_document_async(doc)

# Access extracted information
print("Invoice Total:")
print(extracted_concepts[0].extracted_items)  # extracted concept items
# or doc.get_concept_by_name("Invoice Total").extracted_items
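The `Image` instance in this example is built from base64 data. For readers who want to see what such a payload is, here is a standard-library sketch that encodes a file's bytes as a base64 string. That `image_to_base64` produces an equivalent string is an assumption; `file_to_base64` is a hypothetical stand-in, not the library function.

```python
import base64
from pathlib import Path


def file_to_base64(path) -> str:
    """Read a file's raw bytes and return them as an ASCII base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

Base64 turns arbitrary bytes into plain ASCII text, which is why image data in this form can travel inside text-based API requests.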