Concepts

Module for handling concepts at aspect and document levels.

This module provides classes for defining different types of concepts that can be extracted from documents and aspects. Concepts represent specific pieces of information to be identified and extracted by LLMs, such as strings, numbers, boolean values, JSON objects, and ratings.

Each concept type has specific properties and behaviors tailored to the kind of data it represents, including validation rules, extraction methods, and reference handling. Concepts can be attached to documents or aspects and can include examples, justifications, and references to the source text.

class contextgem.public.concepts.StringConcept(**data)#

Bases: _Concept

A concept model for string-based information extraction from documents and aspects.

This class provides functionality for defining, extracting, and managing string data as conceptual entities within documents or aspects.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
examples (list[StringExample]) – Example strings illustrating the concept usage.
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to a single extracted item. If True, only one item is extracted. Defaults to False (multiple extracted items are allowed). With advanced LLMs, this constraint may not be strictly required, as they can often infer the appropriate number of items from the concept’s name, description, and type (e.g., “document title” vs. “key findings”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])
examples (list[StringExample])

Example:

String concept definition#

from contextgem import StringConcept, StringExample


# Define a string concept for identifying contract party names
# and their roles in the contract
party_names_and_roles_concept = StringConcept(
    name="Party names and roles",
    description=(
        "Names of all parties entering into the agreement and their contractual roles"
    ),
    examples=[
        StringExample(
            content="X (Client)",  # guidance regarding format
        )
    ],
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

examples: list[StringExample]#

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.
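
As a rough illustration of what a deep copy implies, the sketch below applies the standard library's copy.deepcopy to a hypothetical stand-in class. SketchConcept is not a ContextGem class, and this is not necessarily how clone() works internally; it only shows that nested data on the copy is independent of the original.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SketchConcept:
    """Hypothetical stand-in for a concept; not a ContextGem class."""

    name: str
    examples: list[str] = field(default_factory=list)

    def clone(self) -> "SketchConcept":
        # Deep copy: nested containers are duplicated, not shared
        return copy.deepcopy(self)


original = SketchConcept(name="Party names and roles", examples=["X (Client)"])
duplicate = original.clone()

# Mutating the copy's nested list leaves the original list unchanged
duplicate.examples.append("Y (Supplier)")
```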

property extracted_items: list[_ExtractedItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _ExtractedItem objects.
Return type:: list[_ExtractedItem]

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str
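
The relationship between to_dict(), to_json(), from_json(), and from_dict() can be pictured as a round trip. The sketch below is a minimal stand-in written with the standard library: SketchConcept and the "class_name" key are assumptions made for illustration and do not reflect ContextGem's actual serialized layout.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SketchConcept:
    """Hypothetical stand-in for a concept; not a ContextGem class."""

    name: str
    description: str

    def to_dict(self) -> dict:
        # Record the class name so deserialization can verify it later
        return {"class_name": type(self).__name__, **asdict(self)}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, obj_dict: dict) -> "SketchConcept":
        fields = {k: v for k, v in obj_dict.items() if k != "class_name"}
        return cls(**fields)

    @classmethod
    def from_json(cls, json_string: str) -> "SketchConcept":
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class
        if obj_dict.get("class_name") != cls.__name__:
            raise TypeError("Serialized class name does not match")
        return cls.from_dict(obj_dict)


concept = SketchConcept(name="Payment amount", description="Amount due")
restored = SketchConcept.from_json(concept.to_json())
```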

property unique_id: str#

Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#

class contextgem.public.concepts.BooleanConcept(**data)#

Bases: _Concept

A concept model for boolean (True/False) information extraction from documents and aspects.

This class handles identification and extraction of boolean values that represent conceptual properties or attributes within content.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to a single extracted item. If True, only one item is extracted. Defaults to False (multiple extracted items are allowed). With advanced LLMs, this constraint may not be strictly required, as they can often infer the appropriate number of items from the concept’s name, description, and type (e.g., “contains confidential information” vs. “compliance violations”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])

Example:

Boolean concept definition#

from contextgem import BooleanConcept


# Create the concept with specific configuration
has_confidentiality = BooleanConcept(
    name="Contains confidentiality clause",
    description="Determines whether the contract includes provisions requiring parties to maintain confidentiality",
    llm_role="reasoner_text",
    singular_occurrence=True,
    add_justifications=True,
    justification_depth="brief",
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.

property extracted_items: list[_ExtractedItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _ExtractedItem objects.
Return type:: list[_ExtractedItem]

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str

property unique_id: str#

Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#

class contextgem.public.concepts.NumericalConcept(**data)#

Bases: _Concept

A concept model for numerical information extraction from documents and aspects.

This class handles identification and extraction of numeric values (integers, floats, or both) that represent conceptual measurements or quantities within content.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
numeric_type (Literal["int", "float", "any"]) – Type constraint for extracted numbers (“int”, “float”, or “any”). Defaults to “any” for auto-detection.
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to a single extracted item. If True, only one item is extracted. Defaults to False (multiple extracted items are allowed). With advanced LLMs, this constraint may not be strictly required, as they can often infer the appropriate number of items from the concept’s name, description, and type (e.g., “total revenue” vs. “monthly sales figures”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])
numeric_type (Literal['int', 'float', 'any'])

Example:

Numerical concept definition#

from contextgem import NumericalConcept


# Create concepts for different numerical values in the contract
payment_amount = NumericalConcept(
    name="Payment amount",
    description="The monetary value to be paid according to the contract terms",
    numeric_type="float",
    llm_role="extractor_text",
    add_references=True,
    reference_depth="sentences",
)

payment_days = NumericalConcept(
    name="Payment term days",
    description="The number of days within which payment must be made",
    numeric_type="int",
    llm_role="extractor_text",
    add_justifications=True,
    justification_depth="balanced",
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

numeric_type: Literal['int', 'float', 'any']#
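
To make the three numeric_type options concrete, here is a small standalone sketch of how such a constraint could be checked. The function matches_numeric_type is hypothetical, and the strict treatment of "float" (an int does not count as a float) is an assumption; ContextGem's own validation may behave differently.

```python
from typing import Literal

NumericType = Literal["int", "float", "any"]


def matches_numeric_type(value: object, numeric_type: NumericType) -> bool:
    """Hypothetical check; ContextGem's own validation may differ."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, bool):
        return False
    if numeric_type == "int":
        return isinstance(value, int)
    if numeric_type == "float":
        return isinstance(value, float)
    # "any" accepts both integers and floats
    return isinstance(value, (int, float))


checks = [
    matches_numeric_type(30, "int"),
    matches_numeric_type(30, "float"),
    matches_numeric_type(1500.0, "float"),
    matches_numeric_type(1500.0, "any"),
]
```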

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.

property extracted_items: list[_ExtractedItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _ExtractedItem objects.
Return type:: list[_ExtractedItem]

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str

property unique_id: str#

Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#

class contextgem.public.concepts.RatingConcept(**data)#

Bases: _Concept

A concept model for rating-based information extraction with defined scale boundaries.

This class handles identification and extraction of integer ratings that must fall within the boundaries of a specified rating scale.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
rating_scale (RatingScale | tuple[int, int]) – The rating scale defining valid value boundaries. Can be either a RatingScale object (deprecated, will be removed in v1.0.0) or a tuple of (start, end) integers.
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to a single extracted item. If True, only one item is extracted. Defaults to False (multiple extracted items are allowed). With advanced LLMs, this constraint may not be strictly required, as they can often infer the appropriate number of items from the concept’s name, description, and type (e.g., “product rating score” vs. “customer satisfaction ratings”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])
rating_scale (RatingScale | tuple[Annotated[int, Strict(strict=True)], Annotated[int, Strict(strict=True)]])

Example:

Rating concept definition#

from contextgem import RatingConcept


# Create a concept to rate the fairness of contract terms
fairness_rating = RatingConcept(
    name="Contract fairness rating",
    description="Evaluation of how balanced and fair the contract terms are for all parties",
    rating_scale=(1, 5),
    llm_role="reasoner_text",
    add_justifications=True,
    justification_depth="comprehensive",
    justification_max_sents=10,
)

# Create a concept to rate the clarity of contract language
clarity_rating = RatingConcept(
    name="Language clarity rating",
    description="Assessment of how clear and unambiguous the contract language is",
    rating_scale=(1, 10),
    llm_role="reasoner_text",
    add_justifications=True,
    justification_depth="balanced",
    justification_max_sents=3,
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

rating_scale: RatingScale | tuple[StrictInt, StrictInt]#
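
The requirement that an extracted rating falls within the scale boundaries can be sketched as a simple range check on a (start, end) tuple. The helper is_valid_rating is hypothetical, and treating both boundaries as inclusive is an assumption rather than something stated in this reference.

```python
def is_valid_rating(value: int, rating_scale: tuple[int, int]) -> bool:
    """Hypothetical check; assumes both scale boundaries are inclusive."""
    start, end = rating_scale
    # Ratings are integers; bool is excluded because it subclasses int
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return start <= value <= end


fairness_scale = (1, 5)
results = [is_valid_rating(v, fairness_scale) for v in (0, 1, 3, 5, 6)]
```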

property extracted_items: list[_IntegerItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _IntegerItem objects.
Return type:: list[_IntegerItem]

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str

property unique_id: str#

Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#

class contextgem.public.concepts.JsonObjectConcept(**data)#

Bases: _Concept

A concept model for structured JSON object extraction from documents and aspects.

This class handles identification and extraction of structured data in JSON format, with validation against a predefined schema structure.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
structure (type | dict[str, Any]) – JSON object schema, given either as a class with type annotations or as a dictionary where keys are field names and values are type annotations. All dictionary keys must be strings. Supports generic aliases, union types, nested dictionaries for complex hierarchical structures, lists of dictionaries for array items, Literal types, and classes with type annotations (Pydantic models, dataclasses, etc.) for nested structures. All annotated types must be JSON-serializable. Examples:
- Simple structure: {"item": str, "amount": int | float}
- Nested structure: {"item": str, "details": {"price": float, "quantity": int}}
- List of objects: {"items": [{"name": str, "price": float}]}
- List of primitives: {"names": [str], "scores": [int | float], "statuses": [Literal["active", "inactive"]]}
- List of classes: {"addresses": [AddressModel], "users": [UserModel]}
- Literal values: {"status": Literal["pending", "completed", "failed"]}
- With type annotated classes: {"address": AddressModel} where AddressModel can be a Pydantic model, dataclass, or any class with type annotations
Note: For lists, you can use either generic syntax (list[str]) or literal syntax ([str]). List instances support primitive types, unions, literals, and typed classes. Both {"items": [ClassName]} and {"items": list[ClassName]} are equivalent.

Note: Class types cannot be used as dictionary keys or values. For example, dict[str, Address] is not allowed. Use alternative structures like nested objects or lists of objects instead.

Note: When using classes that contain other classes as type hints, inherit from JsonObjectClassStruct in all parts of the class hierarchy, to ensure proper conversion of nested class hierarchies to dictionary representations for serialization.

Tip: Keep the structure simple to avoid overloading the prompt.
examples (list[JsonObjectExample]) – Example JSON objects illustrating the concept usage.
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept’s name, description, and type (e.g., “product specifications” vs “customer order details”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])
structure (type | dict[Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)], Any])
examples (list[JsonObjectExample])

Example:

JSON object concept definition#

from typing import Literal

from contextgem import JsonObjectConcept


# Define a JSON object concept for capturing address information
address_info_concept = JsonObjectConcept(
    name="Address information",
    description=(
        "Structured address data from text including street, "
        "city, state, postal code, and country."
    ),
    structure={
        "street": str | None,
        "city": str | None,
        "state": str | None,
        "postal_code": str | None,
        "country": str | None,
        "address_type": Literal["residential", "business"] | None,
    },
)
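
The structure forms described above can mix plain dictionaries with type-annotated classes. As a rough illustration of how an annotated class corresponds to the dictionary form, the stdlib-only sketch below expands a dataclass into a nested dictionary of type annotations. The helper class_to_structure and the Address and Customer classes are hypothetical names chosen for this example; this is not ContextGem's own conversion logic, which the note above ties to JsonObjectClassStruct.

```python
from dataclasses import dataclass, is_dataclass
from typing import Any, Literal, get_type_hints


@dataclass
class Address:
    street: str
    city: str
    kind: Literal["residential", "business"]


@dataclass
class Customer:
    name: str
    address: Address  # nested class used as a type hint


def class_to_structure(cls: type) -> dict[str, Any]:
    """Expand a type-annotated class into a dict of field name -> annotation.

    Hypothetical helper for illustration only.
    """
    structure: dict[str, Any] = {}
    for field_name, annotation in get_type_hints(cls).items():
        if isinstance(annotation, type) and is_dataclass(annotation):
            # Recurse so that nested classes become nested dictionaries
            structure[field_name] = class_to_structure(annotation)
        else:
            # Primitives, unions, and Literal types are kept as-is
            structure[field_name] = annotation
    return structure


customer_structure = class_to_structure(Customer)
print(customer_structure)
```

The resulting dictionary has the same shape as the "Nested structure" form listed above: a string-keyed dictionary whose "address" value is itself a dictionary of annotations.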

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

structure: type | dict[NonEmptyStr, Any]#

examples: list[JsonObjectExample]#

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.

property extracted_items: list[_ExtractedItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _ExtractedItem objects.
Return type:: list[_ExtractedItem]

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.
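
The relationship between to_dict(), to_json(), from_dict(), and from_json() can be pictured with a small stdlib-only sketch. SketchConcept is a hypothetical stand-in class, and storing the class name under a "__class__" key is an assumption made for this example; the actual serialized layout used by ContextGem is not described in this section.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class SketchConcept:
    """Hypothetical stand-in for a serializable concept (not a ContextGem class)."""

    name: str
    description: str

    def to_dict(self) -> dict[str, Any]:
        # Store the class name next to the data so it can be checked on load
        return {"__class__": type(self).__name__, **asdict(self)}

    def to_json(self) -> str:
        # JSON output is built from the dictionary representation
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, obj_dict: dict[str, Any]) -> "SketchConcept":
        # Drop the bookkeeping key before rebuilding the instance
        fields = {k: v for k, v in obj_dict.items() if k != "__class__"}
        return cls(**fields)

    @classmethod
    def from_json(cls, json_string: str) -> "SketchConcept":
        obj_dict = json.loads(json_string)
        # Reject data that was serialized from a different class
        if obj_dict.get("__class__") != cls.__name__:
            raise TypeError(
                f"Expected {cls.__name__}, got {obj_dict.get('__class__')}"
            )
        return cls.from_dict(obj_dict)


original = SketchConcept(name="Effective date", description="Contract effective date")
restored = SketchConcept.from_json(original.to_json())
```

In this sketch, restored is a new object that compares equal to original, and loading JSON produced by a differently named class raises TypeError, mirroring the behavior documented for from_json.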

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str

property unique_id: str#: Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#

class contextgem.public.concepts.DateConcept(**data)[source]#

Bases: _Concept

A concept model for date object extraction from documents and aspects.

This class handles identification and extraction of dates, with support for parsing string representations in a specified format into Python date objects.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept’s name, description, and type (e.g., “contract signing date” vs “meeting dates”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])

Example:

Date concept definition#

from contextgem import DateConcept


# Create a date concept to extract the effective date of the contract
effective_date = DateConcept(
    name="Effective date",
    description="The effective date as specified in the contract",
    add_references=True,  # Include references to where dates were found
    singular_occurrence=True,  # Only extract one effective date per document
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.

property extracted_items: list[_ExtractedItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _ExtractedItem objects.
Return type:: list[_ExtractedItem]

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str

property unique_id: str#: Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#

class contextgem.public.concepts.LabelConcept(**data)[source]#

Bases: _Concept

A concept model for label-based classification of documents and aspects.

This class handles identification and classification using predefined labels, supporting both multi-class (single label selection) and multi-label (multiple label selection) classification approaches.

Note: When none of the predefined labels apply to the content being classified, no extracted items will be returned (empty extracted_items list). This ensures that only valid, predefined labels are selected and prevents forced classification when no appropriate label exists.

Variables:

name (str) – The name of the concept (non-empty string, stripped).
description (str) – A brief description of the concept (non-empty string, stripped).
labels (list[str]) – List of predefined labels for classification. Must contain at least 2 unique labels.
classification_type (ClassificationType) – Classification mode - “multi_class” for single label selection, “multi_label” for multiple label selection. Defaults to “multi_class”.
llm_role (LLMRoleAny) – The role of the LLM responsible for extracting the concept (“extractor_text”, “reasoner_text”, “extractor_vision”, “reasoner_vision”). Defaults to “extractor_text”.
add_justifications (bool) – Whether to include justifications for extracted items.
justification_depth (JustificationDepth) – Justification detail level. Defaults to “brief”.
justification_max_sents (int) – Maximum sentences in justification. Defaults to 2.
add_references (bool) – Whether to include source references for extracted items.
reference_depth (ReferenceDepth) – Source reference granularity (“paragraphs” or “sentences”). Defaults to “paragraphs”. Only relevant when references are added to extracted items. Affects the structure of extracted_items.
singular_occurrence (StrictBool) – Whether this concept is restricted to having only one extracted item. If True, only a single extracted item will be extracted. Defaults to False (multiple extracted items are allowed). Note that with advanced LLMs, this constraint may not be strictly required as they can often infer the appropriate number of items to extract from the concept’s name, description, and type (e.g., “document type” vs “content topics”).

Parameters:

custom_data (dict)
add_justifications (Annotated[bool, Strict(strict=True)])
justification_depth (Literal['brief', 'balanced', 'comprehensive'])
justification_max_sents (Annotated[int, Strict(strict=True)])
name (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
description (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)])
llm_role (Literal['extractor_text', 'reasoner_text', 'extractor_vision', 'reasoner_vision'])
add_references (Annotated[bool, Strict(strict=True)])
reference_depth (Literal['paragraphs', 'sentences'])
singular_occurrence (Annotated[bool, Strict(strict=True)])
labels (list[Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)]])
classification_type (Literal['multi_class', 'multi_label'])

Example:

Label concept definition#

from contextgem import LabelConcept


# Multi-class classification: single label selection
document_type_concept = LabelConcept(
    name="Document Type",
    description="Classify the type of legal document",
    labels=["NDA", "Consultancy Agreement", "Privacy Policy", "Other"],
    classification_type="multi_class",
    singular_occurrence=True,
)

# Multi-label classification: multiple label selection
content_topics_concept = LabelConcept(
    name="Content Topics",
    description="Identify all relevant topics covered in the document",
    labels=["Finance", "Legal", "Technology", "HR", "Operations", "Marketing"],
    classification_type="multi_label",
    add_justifications=True,  # add justifications for the selected labels
    justification_depth="brief",
)
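
The difference between the two classification types, and the rule that an empty result is valid when no label applies, can be sketched as a small stdlib-only check. The function check_selection is a hypothetical helper written for this example; it is not part of ContextGem and does not reflect how the library validates LLM output internally.

```python
from typing import Literal


def check_selection(
    selected: list[str],
    labels: list[str],
    classification_type: Literal["multi_class", "multi_label"],
) -> list[str]:
    """Check a label selection against the predefined labels.

    Hypothetical helper for illustration only.
    """
    # Only predefined labels may be selected
    unknown = [label for label in selected if label not in labels]
    if unknown:
        raise ValueError(f"Labels not in the predefined set: {unknown}")
    # multi_class means single label selection
    if classification_type == "multi_class" and len(selected) > 1:
        raise ValueError("multi_class allows at most one label")
    # An empty selection is valid in both modes: no label applied
    return list(selected)


doc_labels = ["NDA", "Consultancy Agreement", "Privacy Policy", "Other"]
single = check_selection(["NDA"], doc_labels, "multi_class")
several = check_selection(["NDA", "Other"], doc_labels, "multi_label")
nothing = check_selection([], doc_labels, "multi_class")
```

Including a catch-all label such as "Other" in the predefined list, as the multi-class example above does, is one way to reduce how often the empty result occurs.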

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

labels: list[NonEmptyStr]#

classification_type: ClassificationType#

property extracted_items: list[_LabelItem]#

Provides access to extracted items.

Returns:: A list containing the extracted items as _LabelItem objects.
Return type:: list[_LabelItem]

clone()#

Creates and returns a deep copy of the current instance.

Return type:: typing.Self
Returns:: A deep copy of the current instance.

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters:: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns:: A new instance of the class with restored attributes.
Return type:: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters:: json_string (str) – JSON string containing the serialized object data.
Returns:: A new instance of the class with restored state.
Return type:: Self
Raises:: TypeError – If the class name in the serialized data doesn’t match.

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns:: A dictionary representation of the current object with all necessary data for serialization
Return type:: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns:: A JSON string representation of the object.
Return type:: str

property unique_id: str#: Returns the ULID of the instance.

name: NonEmptyStr#

description: NonEmptyStr#

llm_role: LLMRoleAny#

add_references: StrictBool#

reference_depth: ReferenceDepth#

singular_occurrence: StrictBool#

add_justifications: StrictBool#

justification_depth: JustificationDepth#

justification_max_sents: StrictInt#

custom_data: dict#
