StringConcept

StringConcept is a versatile concept type in ContextGem that extracts text-based information from documents, ranging from simple data fields to complex analytical insights.

📝 Overview

StringConcept is used when you need to extract text values from documents, including:

Simple fields: names, titles, descriptions, identifiers
Complex analyses: conclusions, assessments, recommendations, summaries
Detected elements: anomalies, patterns, key findings, critical insights

This concept type offers flexibility to extract both factual information and interpretive content that requires advanced understanding.

💻 Usage Example

Here’s a simple example of how to use StringConcept to extract a person’s name from a document:

# ContextGem: StringConcept Extraction

import os

from contextgem import Document, DocumentLLM, StringConcept

# Create a Document object from text
doc = Document(raw_text="My name is John Smith and I am 30 years old.")

# Define a StringConcept to extract a person's name
name_concept = StringConcept(
    name="Person name",
    description="Full name of the person",
)

# Attach the concept to the document
doc.add_concepts([name_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
name_concept = llm.extract_concepts_from_document(doc)[0]

# Get the extracted value
print(name_concept.extracted_items[0].value)  # Output: "John Smith"
# Or access the extracted value from the document object
print(doc.concepts[0].extracted_items[0].value)  # Output: "John Smith"

⚙️ Parameters

When creating a StringConcept, you can specify the following parameters:

Parameter	Type	Description
`name`	str	A unique name identifier for the concept
`description`	str	A clear description of what the concept represents and what should be extracted
`examples`	list[`StringExample`]	Optional. Example values that help the LLM better understand what to extract and the expected format (e.g., “Party Name (Role)” format for contract parties). This additional guidance helps improve extraction accuracy and consistency.
`llm_role`	str	The role of the LLM responsible for extracting the concept. Available values: `"extractor_text"`, `"reasoner_text"`, `"extractor_vision"`, `"reasoner_vision"`. Defaults to `"extractor_text"`. For more details, see 🏷️ LLM Roles.
`add_justifications`	bool	Whether to include justifications for extracted items (defaults to `False`). Justifications provide explanations of why the LLM extracted specific values and the reasoning behind the extraction, which is especially useful for complex extractions or when debugging results.
`justification_depth`	str	Justification detail level. Available values: `"brief"`, `"balanced"`, `"comprehensive"`. Defaults to `"brief"`
`justification_max_sents`	int	Maximum sentences in a justification (defaults to `2`)
`add_references`	bool	Whether to include source references for extracted items (defaults to `False`). References indicate the specific locations in the document where the information was either directly found or from which it was inferred, helping to trace back extracted values to their source content even when the extraction involves reasoning or interpretation.
`reference_depth`	str	Source reference granularity. Available values: `"paragraphs"`, `"sentences"`. Defaults to `"paragraphs"`
`singular_occurrence`	bool	Whether the concept is restricted to a single extracted item. If `True`, only a single item is extracted. Defaults to `False` (multiple extracted items are allowed). This is particularly relevant when it may be unclear to the LLM whether to treat the concept as a single item or to extract multiple items, for example, when extracting the total amount of payments in a contract, where payments might be mentioned in different parts of the document but you only want the final total. Note that with advanced LLMs, this constraint may not be strictly required, as they can often infer the appropriate number of items to extract from the concept’s name, description, and type (e.g., “document title” vs “key findings”).
`custom_data`	dict	Optional. Dictionary for storing any additional data that you want to associate with the concept. The data must be JSON-serializable. It is not used for extraction but can be useful for custom processing or downstream tasks. Defaults to an empty dictionary.
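
Because `custom_data` must be JSON-serializable, it can help to check a dictionary before passing it to a concept. The sketch below uses only the standard library; `ensure_json_serializable` is a hypothetical helper, not part of ContextGem, and ContextGem's own validation may apply different rules than `json.dumps`.

```python
import json


def ensure_json_serializable(custom_data: dict) -> dict:
    """Return custom_data unchanged if json.dumps accepts it, else raise ValueError.

    Hypothetical helper for illustration; not part of the ContextGem API.
    """
    try:
        json.dumps(custom_data)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"custom_data is not JSON-serializable: {exc}") from exc
    return custom_data


# Strings, numbers, lists, and nested dicts pass the check
metadata = ensure_json_serializable(
    {"category": "contract", "priority": 1, "tags": ["legal"]}
)

# A set cannot be encoded as JSON, so the check rejects it
try:
    ensure_json_serializable({"tags": {"legal", "finance"}})
except ValueError as err:
    print(err)
```

A dictionary that passes the check can then be supplied as `custom_data=metadata` when creating the StringConcept.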

🚀 Advanced Usage

✏️ Adding Examples

You can add examples to improve the extraction accuracy and set the expected format for a StringConcept:

# ContextGem: StringConcept Extraction with Examples

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Create a Document object from text
contract_text = """
SERVICE AGREEMENT
This Service Agreement (the "Agreement") is entered into as of January 15, 2025 by and between:
XYZ Innovations Inc., a Delaware corporation with offices at 123 Tech Avenue, San Francisco, CA 
("Provider"), and
Omega Enterprises LLC, a New York limited liability company with offices at 456 Business Plaza, 
New York, NY ("Customer").
"""
doc = Document(raw_text=contract_text)

# Create a StringConcept for extracting parties and their roles
parties_concept = StringConcept(
    name="Contract parties",
    description="Names of parties and their roles in the contract",
    examples=[
        StringExample(content="Acme Corporation (Supplier)"),
        StringExample(content="TechGroup Inc. (Client)"),
    ],  # add examples providing additional guidance to the LLM
)

# Attach the concept to the document
doc.add_concepts([parties_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
parties_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted parties and their roles
print("Extracted parties and roles:")
for item in parties_concept.extracted_items:
    print(f"- {item.value}")

# Expected output:
# Extracted parties and roles:
# - XYZ Innovations Inc. (Provider)
# - Omega Enterprises LLC (Customer)
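
When the examples steer the LLM toward a "Party Name (Role)" format, the extracted strings can be split into structured fields afterwards. The following is a minimal sketch using a regular expression; `split_party` and `PARTY_PATTERN` are hypothetical names introduced for illustration, and the pattern assumes the role is the final parenthesized group in the value. Since LLM output is not guaranteed to follow the format, the function returns `None` for values that do not match.

```python
import re

# Name, optional whitespace, then a trailing "(Role)" group
PARTY_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<role>[^()]+)\)\s*$")


def split_party(value: str):
    """Split 'Party Name (Role)' into (name, role); return None if the format differs."""
    match = PARTY_PATTERN.match(value.strip())
    if match is None:
        return None
    return match.group("name"), match.group("role")


print(split_party("XYZ Innovations Inc. (Provider)"))
# ('XYZ Innovations Inc.', 'Provider')
print(split_party("Omega Enterprises LLC"))
# None
```

In practice, this would be applied to each `item.value` in `parties_concept.extracted_items`.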

🔍 References and Justifications for Extraction

You can configure a StringConcept to include justifications and references. Justifications help explain the reasoning behind extracted values, especially for complex or inferred information like conclusions or assessments, while references point to the specific parts of the document that informed the extraction:

# ContextGem: StringConcept Extraction with References and Justifications

import os

from contextgem import Document, DocumentLLM, StringConcept

# Sample document text containing financial information
financial_text = """
2024 Financial Performance Summary

Revenue increased to $120 million in fiscal year 2024, representing 15% growth compared to the previous year. This growth was primarily driven by the expansion of our enterprise client base and the successful launch of our premium service tier.

The Board has recommended a dividend of $1.25 per share, which will be payable to shareholders of record as of March 15, 2025.
"""

# Create a Document from the text
doc = Document(raw_text=financial_text)

# Create a StringConcept with justifications and references enabled
key_figures_concept = StringConcept(
    name="Financial key figures",
    description="Important financial metrics and figures mentioned in the report",
    add_justifications=True,  # enable justifications to understand extraction reasoning
    justification_depth="balanced",
    justification_max_sents=3,  # allow up to 3 sentences for each justification
    add_references=True,  # include references to source text
    reference_depth="sentences",  # reference specific sentences rather than paragraphs
)

# Attach the concept to the document
doc.add_concepts([key_figures_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4o-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept
key_figures_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted items with justifications and references
print("Extracted financial key figures:")
for item in key_figures_concept.extracted_items:
    print(f"\nFigure: {item.value}")
    print(f"Justification: {item.justification}")
    print("Source references:")
    for sent in item.reference_sentences:
        print(f"- {sent.raw_text}")

📊 Extracted Items

When a StringConcept is extracted, it is populated with a list of extracted items accessible through the .extracted_items property. Each item is an instance of the _StringItem class with the following attributes:

Attribute	Type	Description
`value`	str	The extracted text string
`justification`	str	Explanation of why this string was extracted (only if `add_justifications=True`)
`reference_paragraphs`	list[`Paragraph`]	List of paragraph objects that informed the extraction (only if `add_references=True`)
`reference_sentences`	list[`Sentence`]	List of sentence objects that informed the extraction (only if `add_references=True` and `reference_depth="sentences"`)
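
For logging or export, the attributes above can be flattened into plain dictionaries. The sketch below is one possible approach, not a ContextGem utility: `item_to_dict` is a hypothetical helper, it assumes that paragraph objects expose `raw_text` in the same way sentence objects do, and it treats missing or empty justification and reference attributes as "not requested". The demo object is a stand-in built with `SimpleNamespace`, not real extraction output.

```python
from types import SimpleNamespace


def item_to_dict(item) -> dict:
    """Flatten one extracted item into a dict of plain strings and lists."""
    return {
        "value": item.value,
        # Only populated when add_justifications=True
        "justification": getattr(item, "justification", None),
        # Only populated when add_references=True
        "paragraphs": [
            p.raw_text for p in getattr(item, "reference_paragraphs", None) or []
        ],
        # Only populated when add_references=True and reference_depth="sentences"
        "sentences": [
            s.raw_text for s in getattr(item, "reference_sentences", None) or []
        ],
    }


# Stand-in item that mimics the documented attributes (illustration only)
demo_item = SimpleNamespace(
    value="$120 million",
    justification="Placeholder justification text.",
    reference_paragraphs=[],
    reference_sentences=[
        SimpleNamespace(
            raw_text="Revenue increased to $120 million in fiscal year 2024, "
            "representing 15% growth compared to the previous year."
        )
    ],
)
print(item_to_dict(demo_item))
```

With a real concept, the same function could be mapped over `concept.extracted_items` to build a list of records.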

💡 Best Practices

Here are some best practices to optimize your use of StringConcept:

Provide a clear and specific description that helps the LLM understand exactly what to extract.
Include examples (using StringExample) to improve extraction accuracy and demonstrate the expected format (e.g., “Party Name (Role)” for contract parties or “Revenue: $X million” for financial figures).
Enable justifications (using add_justifications=True) when you need to see why the LLM extracted certain values.
Enable references (using add_references=True) when you need to trace back to where in the document the information was found or understand what evidence informed extracted values (especially for inferred information).
When relevant, restrict extraction to a single item (using singular_occurrence=True). This is particularly useful when it may be unclear to the LLM whether to treat the concept as a single item or to extract multiple items, for example, when extracting the total amount of payments in a contract, where payments might be mentioned in different parts of the document but you only want the final total.
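
When a concept is meant to yield a single item, downstream code can read it defensively instead of indexing `extracted_items[0]` directly, which fails if nothing was extracted. The sketch below shows one way to do this; `single_value` is a hypothetical helper, not part of ContextGem, and the demo concept is a `SimpleNamespace` stand-in that only mimics the `extracted_items` and `value` attributes.

```python
from types import SimpleNamespace


def single_value(concept, default=None):
    """Return the only extracted value of a concept, or default if nothing was extracted."""
    items = concept.extracted_items
    if not items:
        return default
    if len(items) > 1:
        # Unexpected for a concept defined with singular_occurrence=True
        raise ValueError(f"expected at most one extracted item, got {len(items)}")
    return items[0].value


# Stand-in concept for illustration only
demo_concept = SimpleNamespace(
    extracted_items=[SimpleNamespace(value="John Smith")]
)
print(single_value(demo_concept))
# John Smith
print(single_value(SimpleNamespace(extracted_items=[]), default="(not found)"))
# (not found)
```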