Serializing objects and results#

ContextGem provides multiple serialization methods to preserve your document processing pipeline components and results. These methods enable you to save your work, transfer data between systems, or integrate with other applications.

All extracted data is preserved in the serialized objects, so fully processed documents can be restored later without re-running extraction.

💾 Serialization Methods#

The following ContextGem objects support serialization:

  • Document - Contains document content and extracted information

  • DocumentPipeline - Defines extraction structure and logic

  • DocumentLLM - Stores LLM configuration for document processing

Each object supports three serialization methods, shown in the brief sketch after this list:

  • to_json() - Converts the object to a JSON string for cross-platform compatibility

  • to_dict() - Converts the object to a Python dictionary for in-memory operations

  • to_disk(file_path) - Saves the object directly to disk at the specified path
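
For instance, a processed document can be serialized in any of these forms. The following is a minimal sketch: doc is assumed to be an already-extracted Document (like the one produced in the example below), and the output file name is illustrative.

# Minimal sketch (illustrative names): serializing a processed Document
doc_json = doc.to_json()  # JSON string, e.g. for APIs or cross-platform storage
doc_dict = doc.to_dict()  # Python dict, e.g. for in-memory handling
doc.to_disk("processed_doc.json")  # write the object directly to a file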

🔄 Deserialization Methods#

To reconstruct objects from their serialized forms, use the corresponding class methods, illustrated in the sketch after this list:

  • from_json(json_string) - Creates an object from a JSON string

  • from_dict(dict_object) - Creates an object from a Python dictionary

  • from_disk(file_path) - Loads an object from a file on disk
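
Continuing the sketch above (again with illustrative names, and assuming the doc_json string, doc_dict dictionary, and file produced there), a Document can be reconstructed from any of these forms:

# Minimal sketch (illustrative names): reconstructing a Document
doc_from_json = Document.from_json(doc_json)
doc_from_dict = Document.from_dict(doc_dict)
doc_from_file = Document.from_disk("processed_doc.json")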

📝 Example Usage#

# Example of serializing and deserializing ContextGem document,
# document pipeline, and LLM config.

import os
from pathlib import Path

from contextgem import (
    Aspect,
    BooleanConcept,
    Document,
    DocumentLLM,
    DocumentPipeline,
    DocxConverter,
    StringConcept,
)

# Create a document object
converter = DocxConverter()
docx_path = str(
    Path(__file__).resolve().parents[4]
    / "tests"
    / "docx_files"
    / "en_nda_with_anomalies.docx"
)  # your file path here (Path adapted for testing)
doc = converter.convert(docx_path, strict_mode=True)

# Create a document pipeline
document_pipeline = DocumentPipeline(
    aspects=[
        Aspect(
            name="Categories of confidential information",
            description="Clauses describing confidential information covered by the NDA",
            concepts=[
                StringConcept(
                    name="Types of disclosure",
                    description="Types of disclosure of confidential information",
                ),
                # ...
            ],
        ),
        # ...
    ],
    concepts=[
        BooleanConcept(
            name="Is mutual",
            description="Whether the NDA is mutual (both parties act as discloser/recipient)",
            add_justifications=True,
        ),
        # ...
    ],
)

# Attach the pipeline to the document
doc.assign_pipeline(document_pipeline)

# Configure a document LLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract data from the document
doc = llm.extract_all(doc)

# Serialize the LLM config, pipeline and document
llm_config_json = llm.to_json()  # or to_dict() / to_disk()
document_pipeline_json = document_pipeline.to_json()  # or to_dict() / to_disk()
processed_doc_json = doc.to_json()  # or to_dict() / to_disk()

# Deserialize the LLM config, pipeline and document
llm_deserialized = DocumentLLM.from_json(
    llm_config_json
)  # or from_dict() / from_disk()
document_pipeline_deserialized = DocumentPipeline.from_json(
    document_pipeline_json
)  # or from_dict() / from_disk()
processed_doc_deserialized = Document.from_json(
    processed_doc_json
)  # or from_dict() / from_disk()

# All extracted data is preserved!
assert processed_doc_deserialized.aspects[0].concepts[0].extracted_items

🚀 Use Cases#

  • Caching Results: Save processed documents to avoid repeating expensive LLM calls (see the sketch after this list)

  • Transfer Between Systems: Export results from one environment and import them into another

  • API Integration: Convert objects to JSON for API responses

  • Workflow Persistence: Save pipeline configurations for later reuse
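
The caching pattern, for example, could look like the following sketch. It reuses the llm and doc objects from the example above; the cache directory and file name are assumptions for illustration, not a prescribed layout.

from pathlib import Path

from contextgem import Document

# Hypothetical cache location (illustrative)
cache_path = Path("cache") / "processed_doc.json"

if cache_path.exists():
    # Reuse previously extracted results instead of repeating LLM calls
    doc = Document.from_disk(str(cache_path))
else:
    # Run extraction once, then persist the fully processed document
    doc = llm.extract_all(doc)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    doc.to_disk(str(cache_path))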