StringConcept#

StringConcept is a versatile concept type in ContextGem that extracts text-based information from documents, ranging from simple data fields to complex analytical insights.

📝 Overview#

StringConcept is used when you need to extract text values from documents, including:

  • Simple fields: names, titles, descriptions, identifiers

  • Complex analyses: conclusions, assessments, recommendations, summaries

  • Detected elements: anomalies, patterns, key findings, critical insights

This concept type is flexible enough to extract both factual information and interpretive content that requires a deeper understanding of the document.

💻 Usage Example#

Here’s a simple example of how to use StringConcept to extract a person’s name from a document:

# ContextGem: StringConcept Extraction

import os

from contextgem import Document, DocumentLLM, StringConcept

# Create a Document object from text
doc = Document(raw_text="My name is John Smith and I am 30 years old.")

# Define a StringConcept to extract a person's name
name_concept = StringConcept(
    name="Person name",
    description="Full name of the person",
)

# Attach the concept to the document
doc.add_concepts([name_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
name_concept = llm.extract_concepts_from_document(doc)[0]

# Get the extracted value
print(name_concept.extracted_items[0].value)  # Output: "John Smith"
# Or access the extracted value from the document object
print(doc.concepts[0].extracted_items[0].value)  # Output: "John Smith"
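
Indexing extracted_items[0] assumes the LLM found at least one match. The sketch below shows one possible way to read the first value more defensively. first_value is a hypothetical helper, not part of ContextGem, and the SimpleNamespace objects only stand in for an extracted concept so that the snippet runs without an API key. It also assumes that a concept with no matches exposes an empty extracted_items list.

```python
from types import SimpleNamespace


def first_value(concept, default=None):
    """Return the value of the first extracted item, or `default` if there is none.

    Hypothetical helper (not part of ContextGem). Relies only on the
    `.extracted_items` list and the `.value` attribute described on this page.
    """
    items = concept.extracted_items
    return items[0].value if items else default


# Stand-ins for extracted concepts, used here instead of a real LLM call
found = SimpleNamespace(extracted_items=[SimpleNamespace(value="John Smith")])
missing = SimpleNamespace(extracted_items=[])

print(first_value(found))
print(first_value(missing, default="<not found>"))
```

The same helper should work on a real concept returned by llm.extract_concepts_from_document(doc), since it touches nothing beyond extracted_items and value.
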

⚙️ Parameters#

When creating a StringConcept, you can specify the following parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | A unique name identifier for the concept. |
| description | str | A clear description of what the concept represents and what should be extracted. |
| examples | list[StringExample] | Optional. Example values that help the LLM better understand what to extract and the expected format (e.g., “Party Name (Role)” format for contract parties). This additional guidance helps improve extraction accuracy and consistency. |
| llm_role | str | The role of the LLM responsible for extracting the concept. Available values: "extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision". Defaults to "extractor_text". For more details, see 🏷️ LLM Roles. |
| add_justifications | bool | Whether to include justifications for extracted items (defaults to False). Justifications explain why the LLM extracted specific values and the reasoning behind the extraction, which is especially useful for complex extractions or when debugging results. |
| justification_depth | str | Justification detail level. Available values: "brief", "balanced", "comprehensive". Defaults to "brief". |
| justification_max_sents | int | Maximum number of sentences in a justification (defaults to 2). |
| add_references | bool | Whether to include source references for extracted items (defaults to False). References indicate the specific locations in the document where the information was either directly found or from which it was inferred, helping to trace extracted values back to their source content even when the extraction involves reasoning or interpretation. |
| reference_depth | str | Source reference granularity. Available values: "paragraphs", "sentences". Defaults to "paragraphs". |
| singular_occurrence | bool | Whether this concept is restricted to a single extracted item. If True, only one item will be extracted. Defaults to False (multiple extracted items are allowed). This is particularly relevant when it might be unclear to the LLM whether to treat the concept as a single item or to extract multiple items, for example when extracting the total amount of payments in a contract, where payments might be mentioned in different parts of the document but you only want the final total. Note that with advanced LLMs, this constraint may not be strictly required, as they can often infer the appropriate number of items to extract from the concept’s name, description, and type (e.g., “document title” vs “key findings”). |
| custom_data | dict | Optional. Dictionary for storing any additional data that you want to associate with the concept. This data must be JSON-serializable. It is not used for extraction but can be useful for custom processing or downstream tasks. Defaults to an empty dictionary. |
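
As a rough illustration of how these parameters combine, the sketch below collects keyword arguments for a concept that should yield a single value, and checks up front that custom_data is JSON-serializable, as required above. is_json_serializable, the concept name, and the custom_data keys are illustrative assumptions. The final StringConcept(**concept_kwargs) call is left as a comment so that the snippet runs without ContextGem installed.

```python
import json


def is_json_serializable(data):
    """Return True if `data` can be encoded with the standard json module.

    Hypothetical helper (not part of ContextGem).
    """
    try:
        json.dumps(data)
    except (TypeError, ValueError):
        return False
    return True


# Parameter names come from the table above; the values are illustrative
concept_kwargs = {
    "name": "Total payment amount",
    "description": "Final total of all payments due under the contract",
    "llm_role": "extractor_text",
    "add_justifications": True,
    "justification_depth": "brief",
    "justification_max_sents": 2,
    "add_references": True,
    "reference_depth": "paragraphs",
    "singular_occurrence": True,  # only the final total is wanted
    "custom_data": {"category": "financial", "review_required": True},
}

if not is_json_serializable(concept_kwargs["custom_data"]):
    raise ValueError("custom_data must be JSON-serializable")

# With ContextGem installed, the arguments would be passed on like this:
# from contextgem import StringConcept
# concept = StringConcept(**concept_kwargs)
```

Checking custom_data before constructing the concept is a design choice of this sketch: it surfaces values such as sets or open file handles early, with a message that names the offending parameter.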

🚀 Advanced Usage#

✏️ Adding Examples#

You can add examples to improve the extraction accuracy and set the expected format for a StringConcept:

# ContextGem: StringConcept Extraction with Examples

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Create a Document object from text
contract_text = """
SERVICE AGREEMENT
This Service Agreement (the "Agreement") is entered into as of January 15, 2025 by and between:
XYZ Innovations Inc., a Delaware corporation with offices at 123 Tech Avenue, San Francisco, CA 
("Provider"), and
Omega Enterprises LLC, a New York limited liability company with offices at 456 Business Plaza, 
New York, NY ("Customer").
"""
doc = Document(raw_text=contract_text)

# Create a StringConcept for extracting parties and their roles
parties_concept = StringConcept(
    name="Contract parties",
    description="Names of parties and their roles in the contract",
    examples=[
        StringExample(content="Acme Corporation (Supplier)"),
        StringExample(content="TechGroup Inc. (Client)"),
    ],  # add examples providing additional guidance to the LLM
)

# Attach the concept to the document
doc.add_concepts([parties_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
parties_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted parties and their roles
print("Extracted parties and roles:")
for item in parties_concept.extracted_items:
    print(f"- {item.value}")

# Expected output:
# - XYZ Innovations Inc. (Provider)
# - Omega Enterprises LLC (Customer)
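
Because the examples steer the LLM toward a “Party Name (Role)” format, downstream code can split each value into its two parts. The sketch below is one possible way to do that. split_party_and_role and PARTY_PATTERN are hypothetical names, not part of ContextGem, and the code does not rely on the extracted values always following the format, since an LLM does not guarantee it.

```python
import re

# Matches "<party name> (<role>)", with the role in the trailing parentheses
PARTY_PATTERN = re.compile(r"^(?P<party>.+?)\s*\((?P<role>[^()]+)\)\s*$")


def split_party_and_role(value):
    """Split a "Party Name (Role)" string into (party, role).

    Hypothetical helper (not part of ContextGem). Returns (value, None)
    when the string does not follow the expected format.
    """
    match = PARTY_PATTERN.match(value)
    if match is None:
        return value, None
    return match.group("party"), match.group("role").strip()


# Values taken from the expected output above
for value in ["XYZ Innovations Inc. (Provider)", "Omega Enterprises LLC (Customer)"]:
    print(split_party_and_role(value))
```

Returning (value, None) instead of raising keeps the loop over extracted_items running when a single value deviates from the format.
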

🔍 References and Justifications for Extraction#

You can configure a StringConcept to include justifications and references. Justifications explain the reasoning behind extracted values, which is especially helpful for complex or inferred information such as conclusions or assessments. References point to the specific parts of the document that informed the extraction:

# ContextGem: StringConcept Extraction with References and Justifications

import os

from contextgem import Document, DocumentLLM, StringConcept

# Sample document text containing financial information
financial_text = """
2024 Financial Performance Summary

Revenue increased to $120 million in fiscal year 2024, representing 15% growth compared to the previous year. This growth was primarily driven by the expansion of our enterprise client base and the successful launch of our premium service tier.

The Board has recommended a dividend of $1.25 per share, which will be payable to shareholders of record as of March 15, 2025.
"""

# Create a Document from the text
doc = Document(raw_text=financial_text)

# Create a StringConcept with justifications and references enabled
key_figures_concept = StringConcept(
    name="Financial key figures",
    description="Important financial metrics and figures mentioned in the report",
    add_justifications=True,  # enable justifications to understand extraction reasoning
    justification_depth="balanced",
    justification_max_sents=3,  # allow up to 3 sentences for each justification
    add_references=True,  # include references to source text
    reference_depth="sentences",  # reference specific sentences rather than paragraphs
)

# Attach the concept to the document
doc.add_concepts([key_figures_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4o-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept
key_figures_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted items with justifications and references
print("Extracted financial key figures:")
for item in key_figures_concept.extracted_items:
    print(f"\nFigure: {item.value}")
    print(f"Justification: {item.justification}")
    print("Source references:")
    for sent in item.reference_sentences:
        print(f"- {sent.raw_text}")

📊 Extracted Items#

When a StringConcept is extracted, it is populated with a list of extracted items accessible through the .extracted_items property. Each item is an instance of the _StringItem class with the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| value | str | The extracted text string |
| justification | str | Explanation of why this string was extracted (only if add_justifications=True) |
| reference_paragraphs | list[Paragraph] | List of paragraph objects that informed the extraction (only if add_references=True) |
| reference_sentences | list[Sentence] | List of sentence objects that informed the extraction (only if add_references=True and reference_depth="sentences") |
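
For logging or export, it can help to flatten each extracted item into a plain dictionary. The sketch below is a minimal illustration under stated assumptions: item_to_record is a hypothetical helper, not part of ContextGem; the SimpleNamespace objects stand in for real extracted items and hold illustrative values rather than real LLM output; and paragraph objects are assumed to expose raw_text in the same way as the sentence objects used in the example above.

```python
from types import SimpleNamespace


def _raw_texts(objects):
    """Collect raw_text from a list of reference objects (None counts as empty)."""
    return [obj.raw_text for obj in (objects or [])]


def item_to_record(item):
    """Flatten one extracted item into a plain dict.

    Hypothetical helper (not part of ContextGem). Falls back to None or an
    empty list when justifications or references were not requested.
    """
    return {
        "value": item.value,
        "justification": getattr(item, "justification", None),
        "paragraphs": _raw_texts(getattr(item, "reference_paragraphs", None)),
        "sentences": _raw_texts(getattr(item, "reference_sentences", None)),
    }


# Stand-in item with illustrative values, used instead of a real LLM call
sentence = SimpleNamespace(raw_text="The Board has recommended a dividend of $1.25 per share.")
item = SimpleNamespace(
    value="Dividend of $1.25 per share",
    justification="Stated as the dividend recommended by the Board.",
    reference_paragraphs=[],
    reference_sentences=[sentence],
)

record = item_to_record(item)
print(record["value"])
print(record["sentences"])
```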

💡 Best Practices#

Here are some best practices to optimize your use of StringConcept:

  • Provide a clear and specific description that helps the LLM understand exactly what to extract.

  • Include examples (using StringExample) to improve extraction accuracy and demonstrate the expected format (e.g., “Party Name (Role)” for contract parties or “Revenue: $X million” for financial figures).

  • Enable justifications (using add_justifications=True) when you need to see why the LLM extracted certain values.

  • Enable references (using add_references=True) when you need to trace back to where in the document the information was found or understand what evidence informed extracted values (especially for inferred information).

  • When relevant, restrict extraction to a single item (using singular_occurrence=True). This helps when it might be unclear to the LLM whether to treat the concept as a single item or to extract multiple items, for example when extracting the total amount of payments in a contract, where payments might be mentioned in different parts of the document but you only want the final total.