Pipelines#
Module for handling document processing pipelines.
This module provides the DocumentPipeline class, which represents a reusable collection of predefined aspects and concepts that can be assigned to documents. Pipelines standardize document analysis by packaging common extraction patterns into reusable units.
Pipelines serve as templates for document processing, allowing consistent application of the same analysis approach across multiple documents. They encapsulate both the structural organization (aspects) and the specific information to extract (concepts) in a single, assignable object.
- class contextgem.public.pipelines.DocumentPipeline(**data)#
Bases:
_AssignedInstancesProcessor
Represents a reusable collection of predefined aspects and concepts for document analysis.
Document pipelines serve as templates that can be assigned to multiple documents, ensuring consistent application of the same analysis criteria. They package common extraction patterns into reusable units, allowing for standardized document processing.
- Variables:
aspects – A list of aspects to extract from documents. Aspects represent structural categories of information. Defaults to an empty list.
concepts – A list of concepts to identify within documents. Concepts represent specific information elements to extract. Defaults to an empty list.
- Parameters:
- Note:
A pipeline is a reusable configuration of extraction steps. You can use the same pipeline to extract data from multiple documents.
- Example:
- Document pipeline definition#
from contextgem import (
    Aspect,
    BooleanConcept,
    DateConcept,
    Document,
    DocumentPipeline,
    StringConcept,
)

# Create a pipeline for NDA (Non-Disclosure Agreement) review
nda_pipeline = DocumentPipeline(
    aspects=[
        Aspect(
            name="Confidential information",
            description="Clauses defining the confidential information",
        ),
        Aspect(
            name="Exclusions",
            description="Clauses defining exclusions from confidential information",
        ),
        Aspect(
            name="Obligations",
            description="Clauses defining confidentiality obligations",
        ),
        Aspect(
            name="Liability",
            description="Clauses defining liability for breach of the agreement",
        ),
        # ... Add more aspects as needed
    ],
    concepts=[
        StringConcept(
            name="Anomaly",
            description="Anomaly in the contract, e.g. out-of-context or nonsensical clauses",
            llm_role="reasoner_text",
            add_references=True,  # Add references to the source text
            reference_depth="sentences",  # Reference to the sentence level
            add_justifications=True,  # Add justifications for the anomaly
            justification_depth="balanced",  # Balanced level of detail in the justification
            justification_max_sents=5,  # Maximum number of sentences in the justification
        ),
        BooleanConcept(
            name="Is mutual",
            description="Whether the NDA is mutual (bidirectional) or one-way",
            singular_occurrence=True,
            llm_role="reasoner_text",  # Use the reasoner role for this concept
        ),
        DateConcept(
            name="Effective date",
            description="The date when the NDA agreement becomes effective",
            singular_occurrence=True,
        ),
        StringConcept(
            name="Term",
            description="The term of the NDA",
        ),
        StringConcept(
            name="Governing law",
            description="The governing law of the agreement",
            singular_occurrence=True,
        ),
        # ... Add more concepts as needed
    ],
)

# Assign the pipeline to the NDA document
nda_document = Document(raw_text="[NDA text]")
nda_document.assign_pipeline(nda_pipeline)

# The document is now ready for processing with the NDA review pipeline.
# To extract all the defined aspects and concepts, use an LLM group
# that includes LLMs with the roles "extractor_text" and "reasoner_text":
# llm_group.extract_all(nda_document)
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- aspects: list[Aspect]#
- concepts: list[_Concept]#
- add_aspects(aspects)#
Adds aspects to the existing aspects list of an instance and returns the updated instance. This method ensures that the provided aspects are deeply copied to avoid any unintended state modification of the original reusable aspects.
- add_concepts(concepts)#
Adds a list of new concepts to the existing concepts attribute of the instance. This method ensures that the provided list of concepts is deep-copied to prevent unintended side effects from modifying the input list outside of this method.
- Parameters:
concepts (list[_Concept]) – A list of concepts to be added. It will be deep-copied before being added to the instance’s concepts attribute.
- Returns:
Returns the instance itself after the modification.
- Return type:
Self
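The deep-copy behavior described for add_aspects() and add_concepts() can be illustrated with a small stand-in. The sketch below does not use ContextGem: MiniConcept and MiniPipeline are hypothetical names invented for illustration, and the real implementation may differ (for example, in how it validates input).

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class MiniConcept:
    """Hypothetical stand-in for a concept; not a ContextGem class."""

    name: str
    description: str


@dataclass
class MiniPipeline:
    """Hypothetical stand-in for a pipeline; not a ContextGem class."""

    concepts: list[MiniConcept] = field(default_factory=list)

    def add_concepts(self, concepts: list[MiniConcept]) -> MiniPipeline:
        # Deep-copy the input so later changes to the caller's objects
        # do not leak into the pipeline.
        self.concepts.extend(copy.deepcopy(concepts))
        return self  # returning the instance allows method chaining


term = MiniConcept(name="Term", description="The term of the NDA")
pipeline = MiniPipeline().add_concepts([term])

# Mutating the original object leaves the pipeline's copy untouched.
term.description = "changed after assignment"
stored = pipeline.concepts[0]
```

Because the pipeline holds its own copies, the same concept definitions can be reused in several pipelines without one pipeline's state affecting another.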
- clone()#
Creates and returns a deep copy of the current instance.
- Return type:
typing.Self
- Returns:
A deep copy of the current instance.
- classmethod from_dict(obj_dict)#
Reconstructs an instance of the class from a dictionary representation.
This method deserializes a dictionary containing the object’s attributes and values into a new instance of the class. It handles complex nested structures like aspects, concepts, and extracted items, properly reconstructing each component.
- classmethod from_disk(file_path)#
Loads an instance of the class from a JSON file stored on disk.
This method reads the JSON content from the specified file path and deserializes it into an instance of the class using the from_json method.
- Parameters:
file_path (str) – Path to the JSON file to load (must end with ‘.json’).
- Returns:
An instance of the class populated with the data from the file.
- Return type:
Self
- Raises:
ValueError – If the file path doesn’t end with ‘.json’.
OSError – If there’s an error reading the file.
RuntimeError – If deserialization fails.
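The file-extension check and JSON round trip described for from_disk() and to_disk() can be sketched with the standard library alone. The helpers save_json_file and load_json_file below are hypothetical and only mirror the documented contract (a path ending in '.json', UTF-8 encoding); they are not the ContextGem implementation.

```python
import json
import os
import tempfile


def save_json_file(data: dict, file_path: str) -> None:
    # Hypothetical helper: reject any path that does not end with '.json'.
    if not file_path.endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


def load_json_file(file_path: str) -> dict:
    # Hypothetical helper: apply the same suffix check before reading.
    if not file_path.endswith(".json"):
        raise ValueError("file_path must end with '.json'")
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)


pipeline_data = {"aspects": [{"name": "Liability"}], "concepts": []}

# Write to a temporary directory and read the data back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nda_pipeline.json")
    save_json_file(pipeline_data, path)
    restored = load_json_file(path)

# A path with the wrong extension is rejected before any file access.
try:
    load_json_file("nda_pipeline.txt")
    rejected = False
except ValueError:
    rejected = True
```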
- classmethod from_json(json_string)#
Creates an instance of the class from a JSON string representation.
This method deserializes the provided JSON string into a dictionary and uses the from_dict method to construct the class instance. It validates that the class name in the serialized data matches the current class.
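The class-name check mentioned above can be pictured as follows. This is a minimal sketch under stated assumptions: MiniPipeline and MiniDocument are hypothetical classes, and the "class_name" key is invented for illustration; the key and error type that ContextGem actually uses may differ.

```python
import json


class MiniPipeline:
    """Hypothetical stand-in for a serializable pipeline; not a ContextGem class."""

    def __init__(self, custom_data: dict):
        self.custom_data = custom_data

    def to_json(self) -> str:
        # Store the class name next to the data. The key "class_name" is an
        # assumption made for this sketch.
        return json.dumps(
            {"class_name": type(self).__name__, "custom_data": self.custom_data}
        )

    @classmethod
    def from_json(cls, json_string: str) -> "MiniPipeline":
        obj_dict = json.loads(json_string)
        # Reject payloads that were serialized from a different class.
        if obj_dict.get("class_name") != cls.__name__:
            raise ValueError(
                f"Expected data for {cls.__name__}, got {obj_dict.get('class_name')!r}"
            )
        return cls(custom_data=obj_dict["custom_data"])


class MiniDocument(MiniPipeline):
    """Second hypothetical class, used only to trigger the mismatch."""


# Round trip within the same class succeeds.
restored = MiniPipeline.from_json(MiniPipeline({"owner": "legal"}).to_json())

# Loading data serialized from another class is rejected.
try:
    MiniPipeline.from_json(MiniDocument({"owner": "legal"}).to_json())
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```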
- get_aspect_by_name(name)#
Finds and returns the aspect with the specified name from the list of available aspects, if the instance has an aspects attribute.
- Parameters:
name (str) – The name of the aspect to find.
- Returns:
The aspect with the specified name.
- Return type:
- Raises:
ValueError – If no aspect with the specified name is found.
- get_aspects_by_names(names)#
Retrieve a list of Aspect objects corresponding to the provided list of names.
- get_concept_by_name(name)#
Retrieves a concept from the list of concepts based on the provided name, if the instance has a concepts attribute.
- Parameters:
name (str) – The name of the concept to search for.
- Returns:
The _Concept object with the specified name.
- Return type:
_Concept
- Raises:
ValueError – If no concept with the specified name is found.
- get_concepts_by_names(names)#
Retrieve a list of _Concept objects corresponding to the provided list of names.
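The lookup-by-name behavior, including the ValueError raised for an unknown name, can be sketched with plain functions. MiniConcept and both helpers below are hypothetical stand-ins rather than ContextGem code; in particular, raising for a missing name in the plural lookup is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class MiniConcept:
    """Hypothetical stand-in for a concept; not a ContextGem class."""

    name: str


def find_concept_by_name(concepts: list, name: str) -> MiniConcept:
    # Linear scan; return the first concept whose name matches exactly.
    for concept in concepts:
        if concept.name == name:
            return concept
    raise ValueError(f"Concept with name '{name}' not found.")


def find_concepts_by_names(concepts: list, names: list) -> list:
    # Resolve each name in order, so the result follows the order of `names`.
    return [find_concept_by_name(concepts, name) for name in names]


concepts = [MiniConcept("Term"), MiniConcept("Governing law")]
found = find_concepts_by_names(concepts, ["Governing law", "Term"])

try:
    find_concept_by_name(concepts, "Effective date")
    missing_raised = False
except ValueError:
    missing_raised = True
```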
- remove_all_aspects()#
Removes all aspects from the instance and returns the updated instance.
This method clears the aspects attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining.
- Return type:
typing.Self
- Returns:
The updated instance with all aspects removed.
- remove_all_concepts()#
Removes all concepts from the instance and returns the updated instance.
This method clears the concepts attribute of the instance by resetting it to an empty list. It returns the same instance, allowing for method chaining.
- Return type:
typing.Self
- Returns:
The updated instance with all concepts removed.
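Because remove_all_aspects() and remove_all_concepts() return the same instance, calls can be chained. The sketch below shows that pattern with a hypothetical MiniPipeline class that stores plain strings; it is an illustration of the documented return behavior, not the ContextGem implementation.

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline; not a ContextGem class."""

    def __init__(self, aspects=None, concepts=None):
        self.aspects = list(aspects or [])
        self.concepts = list(concepts or [])

    def remove_all_aspects(self):
        # Reset to an empty list and return the same instance.
        self.aspects = []
        return self

    def remove_all_concepts(self):
        # Reset to an empty list and return the same instance.
        self.concepts = []
        return self


pipeline = MiniPipeline(aspects=["Liability"], concepts=["Term"])

# Each call returns the pipeline itself, so the calls compose in one expression.
result = pipeline.remove_all_aspects().remove_all_concepts()
```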
- remove_all_instances()#
Removes all assigned instances from the object and resets them as empty lists. Returns the modified instance.
- Returns:
The modified object with all assigned instances removed.
- Return type:
Self
- remove_aspect_by_name(name)#
Removes an aspect from the assigned aspects by its name.
- Parameters:
name (str) – The name of the aspect to be removed.
- Returns:
Updated instance with the aspect removed.
- Return type:
Self
- remove_aspects_by_names(names)#
Removes multiple aspects from an object based on the provided list of names.
- remove_concept_by_name(name)#
Removes a concept from the assigned concepts by its name.
- Parameters:
name (str) – The name of the concept to be removed.
- Returns:
Updated instance with the concept removed.
- Return type:
Self
- remove_concepts_by_names(names)#
Removes concepts from the object by their names.
- to_dict()#
Transforms the current object into a dictionary representation.
Converts the object to a dictionary that includes:
- All public attributes
- Special handling for specific public and private attributes
When an LLM or LLM group is serialized, its API credentials and usage/cost stats are removed.
- to_disk(file_path)#
Saves the serialized instance to a JSON file at the specified path.
This method converts the instance to a dictionary representation using to_dict(), then writes it to disk as a formatted JSON file with UTF-8 encoding.
- Parameters:
file_path (str) – Path where the JSON file should be saved (must end with ‘.json’).
- Return type:
- Returns:
None
- Raises:
ValueError – If the file path doesn’t end with ‘.json’.
IOError – If there’s an error during the file writing process.
- to_json()#
Converts the object to its JSON string representation.
Serializes the object into a JSON-formatted string using the dictionary representation provided by the to_dict() method.
- Returns:
A JSON string representation of the object.
- Return type:
- custom_data: dict#