Documents#

Module for handling documents.

This module provides the Document class, which represents a structured or unstructured file containing written or visual content. Documents can be processed to extract information, analyze content, and organize data into paragraphs, sentences, aspects, and concepts.

class contextgem.public.documents.Document(**data)#

Bases: _Document

Represents a document containing textual and visual content for analysis.

A document serves as the primary container for content analysis within the ContextGem framework, enabling complex document understanding and information extraction workflows.

Variables:

raw_text (str | None) – The main text of the document as a single string. Defaults to None.
paragraphs (list[Paragraph]) – List of Paragraph instances in consecutive order as they appear in the document. Defaults to an empty list.
images (list[Image]) – List of Image instances attached to or representing the document. Defaults to an empty list.
aspects (list[Aspect]) – List of aspects associated with the document for focused analysis. Validated to ensure unique names and descriptions. Defaults to an empty list.
concepts (list[_Concept]) – List of concepts associated with the document for information extraction. Validated to ensure unique names and descriptions. Defaults to an empty list.
paragraph_segmentation_mode (Literal["newlines", "sat"]) – Mode for paragraph segmentation. When set to “newlines”, the text is split on newlines. When set to “sat”, a SaT (Segment Any Text, https://arxiv.org/abs/2406.16678) model is used. Defaults to “newlines”.
sat_model_id (SaTModelId) – SaT model ID for paragraph/sentence segmentation or a local path to a SaT model. For model IDs, defaults to “sat-3l-sm”. See segment-any-text/wtpsplit for the list of available models. For local paths, provide either a string path or a Path object pointing to the directory containing the SaT model.
pre_segment_sentences (bool) – Whether to pre-segment sentences during Document initialization. When False (default), sentence segmentation is deferred until sentences are actually needed, improving initialization performance. When True, sentences are segmented immediately during Document creation using the SaT model.
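
The deferred behavior described for pre_segment_sentences can be pictured with a small stdlib-only sketch. This is not ContextGem's implementation: LazyParagraph and the regex splitter are hypothetical stand-ins (the library uses a SaT model), and only the lazy-versus-eager pattern is the point.

```python
import re
from functools import cached_property


class LazyParagraph:
    """Hypothetical stand-in for a paragraph; not ContextGem's Paragraph class."""

    def __init__(self, raw_text: str, pre_segment_sentences: bool = False):
        self.raw_text = raw_text
        if pre_segment_sentences:
            # Eager mode: touch the property so the work happens at construction time.
            _ = self.sentences

    @cached_property
    def sentences(self) -> list[str]:
        # Naive regex splitter, used only for illustration (assumption, not SaT).
        parts = re.split(r"(?<=[.!?])\s+", self.raw_text.strip())
        return [part for part in parts if part]


lazy = LazyParagraph("First sentence. Second sentence.")
print("sentences" in lazy.__dict__)  # False: nothing has been segmented yet
print(lazy.sentences)                # ['First sentence.', 'Second sentence.']
print("sentences" in lazy.__dict__)  # True: the result is now cached
```

With the default (False), construction stays cheap and the segmentation cost is paid once, on first access.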

Parameters:

custom_data (dict)
raw_text (Annotated[str, Strict(strict=True), StringConstraints(strip_whitespace=True, to_upper=None, to_lower=None, strict=None, min_length=1, max_length=None, pattern=None)] | None)
paragraphs (list[_Paragraph])
images (list[_Image])
aspects (Annotated[Sequence[_Aspect], BeforeValidator(func=~contextgem.internal.typings.validators._validate_sequence_is_list, json_schema_input_type=PydanticUndefined)])
concepts (Annotated[Sequence[_Concept], BeforeValidator(func=~contextgem.internal.typings.validators._validate_sequence_is_list, json_schema_input_type=PydanticUndefined)])
paragraph_segmentation_mode (Literal['newlines', 'sat'])
sat_model_id (Literal['sat-1l', 'sat-1l-sm', 'sat-3l', 'sat-3l-sm', 'sat-6l', 'sat-6l-sm', 'sat-9l', 'sat-12l', 'sat-12l-sm'] | str | ~pathlib._local.Path)
pre_segment_sentences (bool)

Note:

Normally, you do not need to construct or populate paragraphs manually, as they are populated automatically from the document’s raw_text attribute. Pass paragraphs to the constructor only for advanced use cases, such as when you have a custom paragraph segmentation tool.

Example:

Document definition#

from pathlib import Path

from contextgem import Document, Paragraph, create_image


# Create a document with raw text content
contract_document = Document(
    raw_text=(
        "...This agreement is effective as of January 1, 2025.\n\n"
        "All parties must comply with the terms outlined herein. The terms include "
        "monthly reporting requirements and quarterly performance reviews.\n\n"
        "Failure to adhere to these terms may result in termination of the agreement. "
        "Additionally, any breach of confidentiality will be subject to penalties as "
        "described in this agreement.\n\n"
        "This agreement shall remain in force for a period of three (3) years unless "
        "otherwise terminated according to the provisions stated above..."
    ),
    paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
)

# Create a document with more advanced paragraph segmentation using a SaT model
report_document = Document(
    raw_text=(
        "Executive Summary "
        "This report outlines our quarterly performance. "
        "Revenue increased by [15%] compared to the previous quarter.\n\n"
        "Customer satisfaction metrics show positive trends across all regions..."
    ),
    paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
    sat_model_id="sat-3l-sm",  # Specify which SaT model to use
)

# Create a document with predefined paragraphs, e.g. when you use a custom
# paragraph segmentation tool
document_from_paragraphs = Document(
    paragraphs=[
        Paragraph(raw_text="This is the first paragraph."),
        Paragraph(raw_text="This is the second paragraph with more content."),
        Paragraph(raw_text="Final paragraph concluding the document."),
        # ...
    ]
)

# Create document with images

# Path is adapted for doc tests
current_file = Path(__file__).resolve()
root_path = current_file.parents[4]
image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

# Create a document with only images (no text)
image_document = Document(
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ]
)

# Create a document with both text and images
mixed_document = Document(
    raw_text="This document contains both text and visual elements.",
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ],
)
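
As a rough picture of what the default "newlines" mode does, the sketch below splits text on runs of newline characters. The exact splitting rules of the library are not spelled out in this reference, so split_on_newlines is a hypothetical helper built on that assumption, not ContextGem code.

```python
import re


def split_on_newlines(raw_text: str) -> list[str]:
    """Hypothetical helper: split text into paragraphs on runs of newlines."""
    chunks = re.split(r"\n+", raw_text)
    # Drop empty chunks and trim surrounding whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = (
    "This agreement is effective as of January 1, 2025.\n\n"
    "All parties must comply with the terms outlined herein.\n\n"
    "Failure to adhere to these terms may result in termination."
)
paragraphs = split_on_newlines(sample)
print(len(paragraphs))  # 3
```

Text without reliable line breaks gains nothing from this approach, which is the case the "sat" mode is meant to cover.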

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

add_aspects(aspects)#

Adds aspects to the existing aspects list of an instance and returns the updated instance. This method ensures that the provided aspects are deeply copied to avoid any unintended state modification of the original reusable aspects.

Parameters: aspects (list[_Aspect]) – A list of aspects to be added. Each aspect is deeply copied to ensure the original list remains unaltered.
Returns: Updated instance containing the newly added aspects.
Return type: Self
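
The copy-on-add contract can be illustrated with a minimal stdlib-only sketch. SketchAspect and SketchContainer are hypothetical classes written for this illustration; they only mimic the documented behavior (deep copy, then return the instance) and are not the library's code.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SketchAspect:
    """Hypothetical aspect-like object; not ContextGem's Aspect class."""

    name: str
    extracted_items: list[str] = field(default_factory=list)


class SketchContainer:
    """Hypothetical container mimicking the documented add_aspects() contract."""

    def __init__(self):
        self.aspects: list[SketchAspect] = []

    def add_aspects(self, aspects: list[SketchAspect]) -> "SketchContainer":
        # Store deep copies so later changes to the stored aspects
        # never leak back into the caller's reusable objects.
        self.aspects.extend(copy.deepcopy(aspect) for aspect in aspects)
        return self  # returning self enables method chaining


reusable = SketchAspect(name="Termination")
container = SketchContainer().add_aspects([reusable])
container.aspects[0].extracted_items.append("clause 7")
print(reusable.extracted_items)  # []: the original aspect is untouched
```

Because the stored object is a copy, the same aspect definition can be attached to many documents without results from one leaking into another.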

add_concepts(concepts)#

Adds a list of new concepts to the existing concepts attribute of the instance. This method ensures that the provided list of concepts is deep-copied to prevent unintended side effects from modifying the input list outside of this method.

Parameters: concepts (list[_Concept]) – A list of concepts to be added. It will be deep-copied before being added to the instance’s concepts attribute.
Returns: The instance itself after the modification.
Return type: Self

assign_pipeline(pipeline, overwrite_existing=False)#

Assigns a given pipeline to the document. The method deep-copies the input pipeline to prevent any modifications to the state of aspects or concepts in the original pipeline. If the aspects or concepts are already associated with the document, an error is raised unless the overwrite_existing parameter is explicitly set to True.

Parameters:

pipeline (_ExtractionPipeline | _DocumentPipeline) – The ExtractionPipeline (or deprecated DocumentPipeline) object to attach to the document.
overwrite_existing (bool) – A boolean flag. If set to True, any existing aspects and concepts assigned to the document will be overwritten by the new pipeline. Defaults to False.

Return type:

typing.Self

Returns:

Returns the current instance of the document after assigning the pipeline.
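
The overwrite_existing rule can be sketched as follows. SketchPipeline and SketchDocument are hypothetical stand-ins, and the use of RuntimeError is an assumption: the reference only states that "an error is raised" and does not name the exception type.

```python
import copy


class SketchPipeline:
    """Hypothetical pipeline holding reusable aspects and concepts."""

    def __init__(self, aspects, concepts):
        self.aspects = list(aspects)
        self.concepts = list(concepts)


class SketchDocument:
    """Hypothetical document mimicking the documented assign_pipeline() contract."""

    def __init__(self):
        self.aspects = []
        self.concepts = []

    def assign_pipeline(self, pipeline, overwrite_existing=False):
        if (self.aspects or self.concepts) and not overwrite_existing:
            # Exception type is an assumption; the reference only says "an error".
            raise RuntimeError("Document already has aspects or concepts assigned.")
        # Deep-copy so the pipeline can be reused on other documents unchanged.
        self.aspects = copy.deepcopy(pipeline.aspects)
        self.concepts = copy.deepcopy(pipeline.concepts)
        return self


pipeline = SketchPipeline(
    aspects=[{"name": "Payment terms"}],
    concepts=[{"name": "Total amount"}],
)
doc = SketchDocument().assign_pipeline(pipeline)

try:
    doc.assign_pipeline(pipeline)  # second assignment without the flag
except RuntimeError as exc:
    print(f"Rejected: {exc}")

doc.assign_pipeline(pipeline, overwrite_existing=True)  # explicit overwrite succeeds
```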

clone()#

Creates and returns a deep copy of the current instance.

Return type: typing.Self
Returns: A deep copy of the current instance.

classmethod from_dict(obj_dict)#

Reconstructs an instance of the class from a dictionary representation.

This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.

Parameters: obj_dict (dict[str, Any]) – Dictionary containing the serialized object data.
Returns: A new instance of the class with restored attributes.
Return type: Self

classmethod from_disk(file_path)#

Loads an instance of the class from a JSON file stored on disk.

This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.

Parameters:

file_path (str | Path) – Path to the JSON file to load (must end with ‘.json’). Can be a string or a Path object.

Returns:

An instance of the class populated with the data from the file.

Return type:

Self

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If deserialization fails.

classmethod from_json(json_string)#

Creates an instance of the class from a JSON string representation.

This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.

Parameters: json_string (str) – JSON string containing the serialized object data.
Returns: A new instance of the class with restored state.
Return type: Self
Raises: TypeError – If the class name in the serialized data doesn’t match.
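
The class-name check can be pictured with the sketch below. It is a simplified illustration, not the library's serializer: the classes are hypothetical, and the "class_name" key is an assumed field name chosen for the example.

```python
import json


class SketchSerializable:
    """Hypothetical base class; the "class_name" key is an assumed field name."""

    def __init__(self, raw_text: str):
        self.raw_text = raw_text

    def to_json(self) -> str:
        payload = {"class_name": type(self).__name__, "raw_text": self.raw_text}
        return json.dumps(payload)

    @classmethod
    def from_json(cls, json_string: str):
        data = json.loads(json_string)
        # Refuse to build an instance from data serialized by another class.
        if data.get("class_name") != cls.__name__:
            raise TypeError(
                f"Expected data for {cls.__name__}, got {data.get('class_name')!r}."
            )
        return cls(raw_text=data["raw_text"])


class SketchDocument(SketchSerializable):
    pass


class SketchImage(SketchSerializable):
    pass


serialized = SketchDocument("Sample text").to_json()
restored = SketchDocument.from_json(serialized)
print(restored.raw_text)  # Sample text

try:
    SketchImage.from_json(serialized)  # wrong class for this payload
except TypeError as exc:
    print(f"Rejected: {exc}")
```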

get_aspect_by_name(name)#

Finds and returns the aspect with the specified name from the list of available aspects, if the instance has an aspects attribute.

Parameters: name (str) – The name of the aspect to find.
Returns: The aspect with the specified name.
Return type: _Aspect
Raises: ValueError – If no aspect with the specified name is found.
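
The lookup-or-raise behavior amounts to a linear search by name. The sketch below uses a hypothetical NamedItem class and get_by_name helper to show the pattern; it is an assumption about the shape of the logic, not the library's source.

```python
from dataclasses import dataclass


@dataclass
class NamedItem:
    """Hypothetical stand-in for an aspect or concept with a unique name."""

    name: str


def get_by_name(items: list[NamedItem], name: str) -> NamedItem:
    """Return the first item whose name matches, or raise ValueError."""
    for item in items:
        if item.name == name:
            return item
    raise ValueError(f"No item named {name!r} found.")


items = [NamedItem("Liability"), NamedItem("Termination")]
print(get_by_name(items, "Termination").name)  # Termination

try:
    get_by_name(items, "Indemnity")
except ValueError as exc:
    print(f"Lookup failed: {exc}")
```

Since a missing name raises rather than returning None, callers that are unsure whether a name exists need to catch ValueError.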

get_aspects_by_names(names)#

Retrieve a list of _Aspect objects corresponding to the provided list of names.

Parameters: names (list[str]) – List of aspect names to retrieve. The names must be provided as a list of strings.
Returns: A list of _Aspect objects corresponding to provided names.
Return type: list[_Aspect]

get_concept_by_name(name)#

Retrieves a concept from the list of concepts based on the provided name, if the instance has a concepts attribute.

Parameters: name (str) – The name of the concept to search for.
Returns: The _Concept object with the specified name.
Return type: _Concept
Raises: ValueError – If no concept with the specified name is found.

get_concepts_by_names(names)#

Retrieve a list of _Concept objects corresponding to the provided list of names.

Parameters: names (list[str]) – List of concept names to retrieve. The names must be provided as a list of strings.
Returns: A list of _Concept objects corresponding to provided names.
Return type: list[_Concept]

property llm_roles: set[str]#

A set of LLM roles associated with the object’s aspects and concepts.

Returns: A set containing unique LLM roles gathered from aspects and concepts.
Return type: set[str]
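
Gathering roles into a set removes duplicates automatically. The sketch below uses plain dictionaries with an assumed "llm_role" key and illustrative role strings; neither is taken from this reference, and nested items are ignored for simplicity.

```python
# Hypothetical aspect and concept records; the "llm_role" key and the role
# strings are illustrative assumptions, not values confirmed by this page.
aspects = [
    {"name": "Liability", "llm_role": "extractor_text"},
    {"name": "Risks", "llm_role": "reasoner_text"},
]
concepts = [
    {"name": "Effective date", "llm_role": "extractor_text"},
]

# A set comprehension collapses repeated roles into unique entries.
llm_roles = {item["llm_role"] for item in aspects + concepts}
print(sorted(llm_roles))  # ['extractor_text', 'reasoner_text']
```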

remove_all_aspects()#

Removes all aspects from the instance and returns the updated instance.

This method clears the aspects attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining.

Return type: typing.Self
Returns: The updated instance with all aspects removed.

remove_all_concepts()#

Removes all concepts from the instance and returns the updated instance.

This method clears the concepts attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining.

Return type: typing.Self
Returns: The updated instance with all concepts removed.

remove_all_instances()#

Removes all assigned instances from the object and resets them as empty lists. Returns the modified instance.

Returns: The modified object with all assigned instances removed.
Return type: Self

remove_aspect_by_name(name)#

Removes an aspect from the assigned aspects by its name.

Parameters: name (str) – The name of the aspect to be removed.
Returns: Updated instance with the aspect removed.
Return type: Self

remove_aspects_by_names(names)#

Removes multiple aspects from an object based on the provided list of names.

Parameters: names (list[str]) – A list of names identifying the aspects to be removed.
Returns: The updated object after the specified aspects have been removed.
Return type: Self

remove_concept_by_name(name)#

Removes a concept from the assigned concepts by its name.

Parameters: name (str) – The name of the concept to be removed.
Returns: Updated instance with the concept removed.
Return type: Self

remove_concepts_by_names(names)#

Removes concepts from the object by their names.

Parameters: names (list[str]) – A list of concept names to be removed.
Returns: The updated instance after removing the specified concepts.
Return type: Self

property sentences: list[_Sentence]#

Provides access to all sentences within the paragraphs of the document by flattening and combining sentences from each paragraph into a single list.

Returns: A list of _Sentence objects that are contained within all paragraphs.
Return type: list[_Sentence]
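
Flattening per-paragraph sentence lists into one list can be done as in the sketch below. Plain strings stand in for _Sentence objects here, and the nested list is a hypothetical substitute for the document's paragraphs.

```python
from itertools import chain

# Each inner list holds the sentences of one paragraph, in document order.
sentences_per_paragraph = [
    ["This agreement is effective as of January 1, 2025."],
    ["All parties must comply.", "The terms include monthly reporting."],
]

# chain.from_iterable concatenates the inner lists while preserving order.
all_sentences = list(chain.from_iterable(sentences_per_paragraph))
print(len(all_sentences))  # 3
```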

to_dict()#

Transforms the current object into a dictionary representation.

Converts the object to a dictionary that includes:

- All public attributes
- Special handling for specific public and private attributes

When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.

Returns: A dictionary representation of the current object with all necessary data for serialization.
Return type: dict[str, Any]

to_disk(file_path)#

Saves the serialized instance to a JSON file at the specified path.

This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.

Parameters:

file_path (str | Path) – Path where the JSON file should be saved (must end with ‘.json’). Can be a string or a Path object.

Return type:

None

Returns:

None

Raises:

ValueError – If the file path doesn’t end with ‘.json’.
RuntimeError – If there’s an error during the file writing process.
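
The save and load contract (a '.json' suffix check, UTF-8 encoding, and errors wrapped in RuntimeError) can be sketched with the standard library alone. save_json and load_json are hypothetical helpers that operate on plain dictionaries; they are not the library's to_disk and from_disk methods.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def save_json(data: dict, file_path: str | Path) -> None:
    """Hypothetical helper mirroring the documented to_disk() contract."""
    path = Path(file_path)
    if path.suffix != ".json":
        raise ValueError("file_path must end with '.json'")
    try:
        text = json.dumps(data, indent=4, ensure_ascii=False)
        path.write_text(text, encoding="utf-8")
    except OSError as exc:
        raise RuntimeError(f"Could not write {path}") from exc


def load_json(file_path: str | Path) -> dict:
    """Hypothetical helper mirroring the documented from_disk() contract."""
    path = Path(file_path)
    if path.suffix != ".json":
        raise ValueError("file_path must end with '.json'")
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"Could not load {path}") from exc


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "document.json"
    save_json({"raw_text": "Hello"}, target)
    restored = load_json(target)

print(restored)  # {'raw_text': 'Hello'}
```

Validating the suffix before touching the file system means a mistyped path fails fast with ValueError instead of producing a file with the wrong extension.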

to_json()#

Converts the object to its JSON string representation.

Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.

Returns: A JSON string representation of the object.
Return type: str

property unique_id: str#

Returns the ULID of the instance.

raw_text: NonEmptyStr | None#

paragraphs: list[_Paragraph]#

images: list[_Image]#

aspects: Annotated[Sequence[_Aspect], BeforeValidator(_validate_sequence_is_list)]#

concepts: Annotated[Sequence[_Concept], BeforeValidator(_validate_sequence_is_list)]#

paragraph_segmentation_mode: Literal['newlines', 'sat']#

sat_model_id: SaTModelId#

pre_segment_sentences: bool#

custom_data: dict#
