Serializing objects and results#

ContextGem provides multiple serialization methods to preserve your document processing pipeline components and results. These methods enable you to save your work, transfer data between systems, or integrate with other applications.

All extracted data is preserved in the serialized objects, so fully processed documents can be restored later without re-running extraction.

💾 Serialization Methods#

The following ContextGem objects support serialization:

  • Document - Contains document content and extracted information

  • DocumentPipeline - Defines extraction structure and logic

  • DocumentLLM - Stores LLM configuration for document processing

Each object supports three serialization methods, shown in the brief sketch after this list:

  • to_json() - Converts the object to a JSON string for cross-platform compatibility

  • to_dict() - Converts the object to a Python dictionary for in-memory operations

  • to_disk(file_path) - Saves the object directly to disk at the specified path
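
For instance, a processed document can be serialized in any of these forms. The following is a minimal sketch: doc is assumed to be an already-extracted Document (like the one produced in the example below), and the output file name is illustrative.

# Minimal sketch (illustrative names): serializing a processed Document
doc_json = doc.to_json()  # JSON string, e.g. for APIs or cross-platform storage
doc_dict = doc.to_dict()  # Python dict, e.g. for in-memory handling
doc.to_disk("processed_doc.json")  # write the object directly to a file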

🔄 Deserialization Methods#

To reconstruct objects from their serialized forms, use the corresponding class methods, illustrated in the sketch after this list:

  • from_json(json_string) - Creates an object from a JSON string

  • from_dict(dict_object) - Creates an object from a Python dictionary

  • from_disk(file_path) - Loads an object from a file on disk
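
Continuing the sketch above (again with illustrative names, and assuming the doc_json string, doc_dict dictionary, and file produced there), a Document can be reconstructed from any of these forms:

# Minimal sketch (illustrative names): reconstructing a Document
doc_from_json = Document.from_json(doc_json)
doc_from_dict = Document.from_dict(doc_dict)
doc_from_file = Document.from_disk("processed_doc.json")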

📝 Example Usage#

# Example of serializing and deserializing ContextGem document,
# document pipeline, and LLM config.

import os
from pathlib import Path

from contextgem import (
    Aspect,
    BooleanConcept,
    Document,
    DocumentLLM,
    DocumentPipeline,
    DocxConverter,
    StringConcept,
)

# Create a document object
converter = DocxConverter()
docx_path = str(
    Path(__file__).resolve().parents[4]
    / "tests"
    / "docx_files"
    / "en_nda_with_anomalies.docx"
)  # your file path here (Path adapted for testing)
doc = converter.convert(docx_path, strict_mode=True)

# Create a document pipeline
document_pipeline = DocumentPipeline(
    aspects=[
        Aspect(
            name="Categories of confidential information",
            description="Clauses describing confidential information covered by the NDA",
            concepts=[
                StringConcept(
                    name="Types of disclosure",
                    description="Types of disclosure of confidential information",
                ),
                # ...
            ],
        ),
        # ...
    ],
    concepts=[
        BooleanConcept(
            name="Is mutual",
            description="Whether the NDA is mutual (both parties act as discloser/recipient)",
            add_justifications=True,
        ),
        # ...
    ],
)

# Attach the pipeline to the document
doc.assign_pipeline(document_pipeline)

# Configure a document LLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract data from the document
doc = llm.extract_all(doc)

# Serialize the LLM config, pipeline and document
llm_config_json = llm.to_json()  # or to_dict() / to_disk()
document_pipeline_json = document_pipeline.to_json()  # or to_dict() / to_disk()
processed_doc_json = doc.to_json()  # or to_dict() / to_disk()

# Deserialize the LLM config, pipeline and document
llm_deserialized = DocumentLLM.from_json(
    llm_config_json
)  # or from_dict() / from_disk()
document_pipeline_deserialized = DocumentPipeline.from_json(
    document_pipeline_json
)  # or from_dict() / from_disk()
processed_doc_deserialized = Document.from_json(
    processed_doc_json
)  # or from_dict() / from_disk()

# All extracted data is preserved!
assert processed_doc_deserialized.aspects[0].concepts[0].extracted_items

🚀 Use Cases#

  • Caching Results: Save processed documents to avoid repeating expensive LLM calls (see the sketch after this list)

  • Transfer Between Systems: Export results from one environment and import them into another

  • API Integration: Convert objects to JSON for API responses

  • Workflow Persistence: Save pipeline configurations for later reuse
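
The caching pattern, for example, could look like the following sketch. It reuses the llm and doc objects from the example above; the cache directory and file name are assumptions for illustration, not a prescribed layout.

from pathlib import Path

from contextgem import Document

# Hypothetical cache location (illustrative)
cache_path = Path("cache") / "processed_doc.json"

if cache_path.exists():
    # Reuse previously extracted results instead of repeating LLM calls
    doc = Document.from_disk(str(cache_path))
else:
    # Run extraction once, then persist the fully processed document
    doc = llm.extract_all(doc)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    doc.to_disk(str(cache_path))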