NumericalConcept

NumericalConcept is a specialized concept type that extracts, calculates, or derives numerical values (integers, floats, or both) from document content.

📝 Overview

NumericalConcept enables powerful numerical data extraction and analysis from documents, such as:

  • Direct extraction: retrieving explicitly stated values like prices, percentages, dates, or measurements

  • Calculated values: computing sums, averages, growth rates, or other derived metrics

  • Quantitative assessments: determining counts, frequencies, totals, or numerical scores

The concept can work with integers, floating-point numbers, or both types based on your configuration.

💻 Usage Example

Here’s a simple example of how to use NumericalConcept to extract a price from a document:

```python
# ContextGem: NumericalConcept Extraction

import os

from contextgem import Document, DocumentLLM, NumericalConcept

# Create a Document object from text
doc = Document(
    raw_text="The latest smartphone model costs $899.99 and will be available next week."
)

# Define a NumericalConcept to extract the price
price_concept = NumericalConcept(
    name="Product price",
    description="The price of the product",
    numeric_type="float",  # We expect a decimal price
)

# Attach the concept to the document
doc.add_concepts([price_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
price_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted value
print(price_concept.extracted_items[0].value)  # Output: 899.99
# Or access the extracted value from the document object
print(doc.concepts[0].extracted_items[0].value)  # Output: 899.99
```

⚙️ Parameters

When creating a NumericalConcept, you can specify the following parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | A unique name identifier for the concept |
| description | str | A clear description of what numerical value to extract, which can include explicit values to find, calculations to perform, or quantitative assessments to derive from the document content |
| numeric_type | str | The type of numerical values to extract. Available values: "int", "float", "any". Defaults to "any". When "any" is specified, the system will automatically determine whether to use an integer or floating-point representation based on the extracted value, choosing the most appropriate type for each numerical item. |
| llm_role | str | The role of the LLM responsible for extracting the concept. Available values: "extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision". Defaults to "extractor_text". For more details, see 🏷️ LLM Roles. |
| add_justifications | bool | Whether to include justifications for extracted items (defaults to False). Justifications provide explanations of why the LLM extracted specific numerical values and the reasoning behind the extraction, which is especially useful for complex calculations, inferred values, or when debugging results. |
| justification_depth | str | Justification detail level. Available values: "brief", "balanced", "comprehensive". Defaults to "brief" |
| justification_max_sents | int | Maximum sentences in a justification (defaults to 2) |
| add_references | bool | Whether to include source references for extracted items (defaults to False). References indicate the specific locations in the document where the numerical values were either directly found or from which they were calculated or inferred, helping to trace back extracted values to their source content even when the extraction involves complex calculations or mathematical reasoning. |
| reference_depth | str | Source reference granularity. Available values: "paragraphs", "sentences". Defaults to "paragraphs" |
| singular_occurrence | bool | Whether this concept is restricted to having only one extracted item. If True, only a single numerical value will be extracted. Defaults to False (multiple numerical values are allowed). For numerical concepts, this parameter is particularly useful when you want to extract a single specific value rather than identifying multiple numerical values throughout the document. This helps distinguish between single-value concepts versus multi-value concepts (e.g., “total contract value” vs “all payment amounts”). Note that with advanced LLMs, this constraint may not be required as they can often infer the appropriate number of items to extract from the concept’s name, description, and type. |
| custom_data | dict | Optional. Dictionary for storing any additional data that you want to associate with the concept. This data must be JSON-serializable. This data is not used for extraction but can be useful for custom processing or downstream tasks. Defaults to an empty dictionary. |
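
After extraction, a quick downstream check can confirm that a value is consistent with the configured numeric_type. The helper below is a hypothetical sketch and not part of ContextGem: the name matches_numeric_type and its rules (for example, rejecting bool, which Python treats as a subclass of int) are assumptions made for illustration.

```python
def matches_numeric_type(value, numeric_type="any"):
    """Return True if `value` is consistent with a `numeric_type` setting.

    Hypothetical helper for downstream validation; not a ContextGem API.
    """
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, bool):
        return False
    if numeric_type == "int":
        return isinstance(value, int)
    if numeric_type == "float":
        return isinstance(value, float)
    if numeric_type == "any":
        return isinstance(value, (int, float))
    raise ValueError(f"Unknown numeric_type: {numeric_type!r}")


print(matches_numeric_type(899.99, "float"))  # True
print(matches_numeric_type(899.99, "int"))    # False
print(matches_numeric_type(450, "any"))       # True
```

A check like this can be applied to each item's value before it is passed to code that expects a particular numeric type.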

🚀 Advanced Usage

🔍 References and Justifications for Extraction

You can configure a NumericalConcept to include justifications and references. Justifications explain the reasoning behind each extracted value. References point to the parts of the document where a value was found, or from which it was calculated or inferred, so you can trace it back to its source even when the extraction involves calculations or mathematical reasoning:

```python
# ContextGem: NumericalConcept Extraction with References and Justifications

import os

from contextgem import Document, DocumentLLM, NumericalConcept

# Document with values that require calculation/inference
report_text = """
Quarterly Sales Report - Q2 2023

Product A: Sold 450 units at $75 each
Product B: Sold 320 units at $125 each
Product C: Sold 180 units at $95 each

Marketing expenses: $28,500
Operating costs: $42,700
"""

# Create a Document from the text
doc = Document(raw_text=report_text)

# Create a NumericalConcept for total revenue
total_revenue_concept = NumericalConcept(
    name="Total quarterly revenue",
    description="The total revenue calculated by multiplying units sold by their price",
    add_justifications=True,
    justification_depth="comprehensive",  # Detailed justification to show calculation steps
    justification_max_sents=4,  # Maximum number of sentences for justification
    add_references=True,
    reference_depth="paragraphs",  # Reference specific paragraphs
    singular_occurrence=True,  # Ensure that the data is merged into a single item
)

# Attach the concept to the document
doc.add_concepts([total_revenue_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/o4-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept
total_revenue_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted inferred value with justification
print("Calculated total quarterly revenue:")
for item in total_revenue_concept.extracted_items:
    print(f"\nTotal Revenue: {item.value}")
    print(f"Calculation Justification: {item.justification}")
    print("Source references:")
    for para in item.reference_paragraphs:
        print(f"- {para.raw_text}")
```
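
Because the model derives this value rather than copying it from the text, it can help to check the result independently. The snippet below recomputes the expected revenue from the figures in the sample report using plain Python. The variable names are illustrative, and comparing the result against the extracted item is left to the reader.

```python
# Units sold and unit prices, copied from the sample report above
sales = {
    "Product A": (450, 75),
    "Product B": (320, 125),
    "Product C": (180, 95),
}

# Revenue per product = units sold * unit price
revenue_by_product = {
    name: units * price for name, (units, price) in sales.items()
}
expected_total = sum(revenue_by_product.values())

print(revenue_by_product)  # {'Product A': 33750, 'Product B': 40000, 'Product C': 17100}
print(expected_total)  # 90850
```

Marketing expenses and operating costs are left out, since the concept description asks only for units sold multiplied by their price.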

📊 Extracted Items

When a NumericalConcept is extracted, it is populated with a list of extracted items accessible through the .extracted_items property. Each item is an instance of the _NumericalItem class with the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| value | int or float | The extracted numerical value, either an integer or floating-point number depending on the numeric_type setting |
| justification | str | Explanation of why this numerical value was extracted (only if add_justifications=True) |
| reference_paragraphs | list[Paragraph] | List of paragraph objects where the numerical value was found or from which it was calculated or inferred (only if add_references=True) |
| reference_sentences | list[Sentence] | List of sentence objects where the numerical value was found or from which it was calculated or inferred (only if add_references=True and reference_depth="sentences") |
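
These attributes make it straightforward to turn extracted items into plain records for logging or export. The sketch below uses SimpleNamespace stand-ins that mimic only the value and justification attributes, so it runs without an LLM call. The helper items_to_records is hypothetical rather than a ContextGem function, and the sample values and justification text are made up for illustration.

```python
import json
from types import SimpleNamespace


def items_to_records(concept_name, items):
    """Convert extracted items into JSON-serializable dicts (hypothetical helper)."""
    return [
        {
            "concept": concept_name,
            "value": item.value,
            # Fall back to None when the stand-in has no justification
            "justification": getattr(item, "justification", None),
        }
        for item in items
    ]


# Stand-ins for extracted items; real items come from concept.extracted_items
items = [
    SimpleNamespace(value=899.99, justification="Price stated in the text."),
    SimpleNamespace(value=450),
]

records = items_to_records("Product price", items)
print(json.dumps(records, indent=2))
```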

💡 Best Practices

Here are some best practices to optimize your use of NumericalConcept:

  • Provide a clear and specific description that helps the LLM understand exactly what numerical values to extract, using precise and unambiguous language in your concept names and descriptions. For numerical concepts, be explicit about the exact values you’re seeking (e.g., “the total contract value in USD” rather than just “contract value”). Avoid vague terms that could lead to incorrect extractions—for example, use “quarterly revenue figures in millions” instead of “revenue numbers” to ensure consistent and accurate extractions.

  • Use the appropriate numeric_type based on what you expect to extract or calculate:

    • Use "int" for counts, quantities, or whole numbers

    • Use "float" for prices, measurements, or values that may have decimal points

    • Use "any" when you’re not sure or need to extract both types

  • Break down complex numerical extractions into multiple simpler numerical concepts when appropriate. Instead of one concept extracting “all financial metrics,” consider separate concepts for “revenue figures,” “expense amounts,” and “profit margins.” This provides more structured data and makes it easier to process the results for specific purposes.

  • Enable justifications (using add_justifications=True) when you need to understand the reasoning behind the LLM’s numerical extractions, especially when calculations or conversions are involved.

  • Enable references (using add_references=True) when you need to trace back to specific parts of the document that contained the numerical values or were used to calculate derived values.

  • Use singular_occurrence=True to enforce only a single numerical value extraction. This is particularly useful for concepts that should yield a unique value, such as “total contract value” or “effective interest rate,” rather than identifying multiple numerical values throughout the document.
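
As a sketch of the breakdown suggested above, the snippet below describes three narrower concepts as keyword-argument dictionaries, using only parameters documented on this page. The names and descriptions are illustrative assumptions. Each dictionary could be passed to NumericalConcept(**spec) and attached with doc.add_concepts(...), as in the usage example.

```python
# Keyword arguments for three focused concepts, replacing one broad
# "all financial metrics" concept. Names and descriptions are illustrative.
concept_specs = [
    {
        "name": "Revenue figures",
        "description": "Quarterly revenue figures in millions of USD",
        "numeric_type": "float",
    },
    {
        "name": "Expense amounts",
        "description": "Individual expense amounts in USD",
        "numeric_type": "float",
    },
    {
        "name": "Profit margins",
        "description": "Profit margins expressed as percentages",
        "numeric_type": "float",
    },
]

# Values accepted by numeric_type, as listed in the parameters table
ALLOWED_NUMERIC_TYPES = {"int", "float", "any"}

# Basic sanity checks before building the concepts
names = [spec["name"] for spec in concept_specs]
names_are_unique = len(names) == len(set(names))
types_are_valid = all(
    spec["numeric_type"] in ALLOWED_NUMERIC_TYPES for spec in concept_specs
)

print(names)
print(names_are_unique, types_are_valid)  # True True
```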