StringConcept
`StringConcept` is a versatile concept type in ContextGem that extracts text-based information from documents, ranging from simple data fields to complex analytical insights.
📝 Overview
`StringConcept` is used when you need to extract text values from documents, including:
- Simple fields: names, titles, descriptions, identifiers
- Complex analyses: conclusions, assessments, recommendations, summaries
- Detected elements: anomalies, patterns, key findings, critical insights
This concept type offers flexibility to extract both factual information and interpretive content that requires advanced understanding.
💻 Usage Example
Here’s a simple example of how to use `StringConcept` to extract a person’s name from a document:
# ContextGem: StringConcept Extraction

import os

from contextgem import Document, DocumentLLM, StringConcept

# Create a Document object from text
doc = Document(raw_text="My name is John Smith and I am 30 years old.")

# Define a StringConcept to extract a person's name
name_concept = StringConcept(
    name="Person name",
    description="Full name of the person",
)

# Attach the concept to the document
doc.add_concepts([name_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
name_concept = llm.extract_concepts_from_document(doc)[0]

# Get the extracted value
print(name_concept.extracted_items[0].value)  # Output: "John Smith"

# Or access the extracted value from the document object
print(doc.concepts[0].extracted_items[0].value)  # Output: "John Smith"
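Indexing `extracted_items[0]` assumes that at least one item was extracted; if the list is empty, Python raises `IndexError`. Below is a minimal sketch of a defensive accessor, assuming only that a concept exposes an `extracted_items` list whose items have a `value` attribute, as in the example above. The `first_value` helper and the `SimpleNamespace` stand-ins are hypothetical and not part of ContextGem:

```python
from types import SimpleNamespace


def first_value(concept, default=None):
    """Return the value of the first extracted item, or `default` if none exist."""
    items = getattr(concept, "extracted_items", None) or []
    return items[0].value if items else default


# Stand-in objects that mimic the attributes used above (no LLM call is made)
found = SimpleNamespace(extracted_items=[SimpleNamespace(value="John Smith")])
empty = SimpleNamespace(extracted_items=[])

print(first_value(found))             # John Smith
print(first_value(empty, "unknown"))  # unknown
```

In real code, the object returned by `llm.extract_concepts_from_document(doc)[0]` could be passed to the helper in place of the stand-ins.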
⚙️ Parameters
When creating a `StringConcept`, you can specify the following parameters:
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | A unique name identifier for the concept |
| `description` | `str` | A clear description of what the concept represents and what should be extracted |
| `examples` | `list[StringExample]` | Optional. Example values that help the LLM better understand what to extract and the expected format (e.g., “Party Name (Role)” format for contract parties). This additional guidance helps improve extraction accuracy and consistency. |
| `llm_role` | `str` | The role of the LLM responsible for extracting the concept |
| `add_justifications` | `bool` | Whether to include justifications for extracted items |
| `justification_depth` | `str` | Justification detail level (for example, `"balanced"`) |
| `justification_max_sents` | `int` | Maximum number of sentences in a justification |
| `add_references` | `bool` | Whether to include source references for extracted items |
| `reference_depth` | `str` | Source reference granularity (for example, `"sentences"` rather than paragraphs) |
| `singular_occurrence` | `bool` | Whether this concept is restricted to having only one extracted item |
| `custom_data` | `dict` | Optional. Dictionary for storing any additional data that you want to associate with the concept. This data must be JSON-serializable. It is not used for extraction but can be useful for custom processing or downstream tasks. Defaults to an empty dictionary. |
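Because the additional-data dictionary described in the last row must be JSON-serializable, it can help to check a candidate dictionary before attaching it to a concept. The following is a minimal sketch using only the standard library; `is_json_serializable` is a hypothetical helper, the sample dictionaries are illustrative, and ContextGem may apply its own validation that differs from this check:

```python
import json


def is_json_serializable(data):
    """Return True if `data` can be encoded with the standard json module."""
    try:
        json.dumps(data)
    except (TypeError, ValueError):
        return False
    return True


# Illustrative dictionaries (not taken from the ContextGem documentation)
ok_data = {"owner": "legal-team", "priority": 2, "tags": ["contract", "parties"]}
bad_data = {"reviewers": {"alice", "bob"}}  # a set cannot be encoded as JSON

print(is_json_serializable(ok_data))   # True
print(is_json_serializable(bad_data))  # False
```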
🚀 Advanced Usage
✏️ Adding Examples
You can add examples to improve the extraction accuracy and set the expected format for a `StringConcept`:
# ContextGem: StringConcept Extraction with Examples

import os

from contextgem import Document, DocumentLLM, StringConcept, StringExample

# Create a Document object from text
contract_text = """
SERVICE AGREEMENT
This Service Agreement (the "Agreement") is entered into as of January 15, 2025 by and between:
XYZ Innovations Inc., a Delaware corporation with offices at 123 Tech Avenue, San Francisco, CA
("Provider"), and
Omega Enterprises LLC, a New York limited liability company with offices at 456 Business Plaza,
New York, NY ("Customer").
"""
doc = Document(raw_text=contract_text)

# Create a StringConcept for extracting parties and their roles
parties_concept = StringConcept(
    name="Contract parties",
    description="Names of parties and their roles in the contract",
    examples=[
        StringExample(content="Acme Corporation (Supplier)"),
        StringExample(content="TechGroup Inc. (Client)"),
    ],  # add examples providing additional guidance to the LLM
)

# Attach the concept to the document
doc.add_concepts([parties_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
parties_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted parties and their roles
print("Extracted parties and roles:")
for item in parties_concept.extracted_items:
    print(f"- {item.value}")

# Expected output:
# - XYZ Innovations Inc. (Provider)
# - Omega Enterprises LLC (Customer)
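One practical benefit of steering the LLM toward a consistent “Party Name (Role)” format is that the extracted strings become easy to post-process. The sketch below splits such strings into a name and a role with a regular expression from the standard library. `split_party` and `PARTY_PATTERN` are hypothetical names introduced for illustration, and the input values are the expected outputs shown above rather than live LLM results:

```python
import re

# Matches "<name> (<role>)", where the role is the final parenthesized group
PARTY_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<role>[^()]+)\)\s*$")


def split_party(value):
    """Split 'Party Name (Role)' into (name, role); role is None if absent."""
    match = PARTY_PATTERN.match(value.strip())
    if match is None:
        return value.strip(), None
    return match.group("name"), match.group("role").strip()


values = ["XYZ Innovations Inc. (Provider)", "Omega Enterprises LLC (Customer)"]
for value in values:
    print(split_party(value))
```

Values that do not follow the format are returned unchanged with a role of `None`, so malformed items can be flagged rather than silently dropped.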
🔍 References and Justifications for Extraction
You can configure a `StringConcept` to include justifications and references. Justifications help explain the reasoning behind extracted values, especially for complex or inferred information like conclusions or assessments, while references point to the specific parts of the document that informed the extraction:
# ContextGem: StringConcept Extraction with References and Justifications

import os

from contextgem import Document, DocumentLLM, StringConcept

# Sample document text containing financial information
financial_text = """
2024 Financial Performance Summary
Revenue increased to $120 million in fiscal year 2024, representing 15% growth compared to the previous year. This growth was primarily driven by the expansion of our enterprise client base and the successful launch of our premium service tier.
The Board has recommended a dividend of $1.25 per share, which will be payable to shareholders of record as of March 15, 2025.
"""

# Create a Document from the text
doc = Document(raw_text=financial_text)

# Create a StringConcept with justifications and references enabled
key_figures_concept = StringConcept(
    name="Financial key figures",
    description="Important financial metrics and figures mentioned in the report",
    add_justifications=True,  # enable justifications to understand extraction reasoning
    justification_depth="balanced",
    justification_max_sents=3,  # allow up to 3 sentences for each justification
    add_references=True,  # include references to source text
    reference_depth="sentences",  # reference specific sentences rather than paragraphs
)

# Attach the concept to the document
doc.add_concepts([key_figures_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4o-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept
key_figures_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted items with justifications and references
print("Extracted financial key figures:")
for item in key_figures_concept.extracted_items:
    print(f"\nFigure: {item.value}")
    print(f"Justification: {item.justification}")
    print("Source references:")
    for sent in item.reference_sentences:
        print(f"- {sent.raw_text}")
📊 Extracted Items
When a `StringConcept` is extracted, it is populated with a list of extracted items accessible through the `.extracted_items` property. Each item is an instance of the `_StringItem` class with the following attributes:
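| Attribute | Type | Description |
|---|---|---|
| `value` | `str` | The extracted text string |
| `justification` | `str` | Explanation of why this string was extracted (only if `add_justifications=True`) |
| `reference_paragraphs` | `list[Paragraph]` | List of paragraph objects that informed the extraction (only if `add_references=True`) |
| `reference_sentences` | `list[Sentence]` | List of sentence objects that informed the extraction (only if `add_references=True` and `reference_depth="sentences"`) |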
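For downstream processing, extracted items can be flattened into plain dictionaries. The following is a minimal sketch that relies only on the attributes used in the earlier example (`value`, `justification`, `reference_sentences`, and `raw_text` on each sentence). `item_to_dict` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in with illustrative values, not a real `_StringItem` or actual LLM output:

```python
import json
from types import SimpleNamespace


def item_to_dict(item):
    """Convert an extracted item into a plain, JSON-friendly dictionary."""
    sentences = getattr(item, "reference_sentences", None) or []
    return {
        "value": item.value,
        "justification": getattr(item, "justification", None),
        "reference_sentences": [sent.raw_text for sent in sentences],
    }


# Stand-in item mimicking the attributes described above (illustrative values)
sample_item = SimpleNamespace(
    value="Revenue: $120 million",
    justification="States the total revenue for fiscal year 2024.",
    reference_sentences=[
        SimpleNamespace(raw_text="Revenue increased to $120 million in fiscal year 2024.")
    ],
)

record = item_to_dict(sample_item)
print(json.dumps(record, indent=2))
```

Using `getattr` with a fallback keeps the helper usable when justifications or references were not enabled for the concept.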
💡 Best Practices
Here are some best practices to optimize your use of `StringConcept`:
- Provide a clear and specific description that helps the LLM understand exactly what to extract.
- Include examples (using `StringExample`) to improve extraction accuracy and demonstrate the expected format (e.g., “Party Name (Role)” for contract parties or “Revenue: $X million” for financial figures).
- Enable justifications (using `add_justifications=True`) when you need to see why the LLM extracted certain values.
- Enable references (using `add_references=True`) when you need to trace where in the document the information was found or understand what evidence informed extracted values (especially for inferred information).
- When relevant, enforce single-item extraction (using `singular_occurrence=True`). This is particularly useful when it might be unclear to the LLM whether to treat the concept as a single item or extract multiple items. For example, when extracting the total amount of payments in a contract, payments might be mentioned in different parts of the document, but you only want the final total.