Serializing objects and results#
ContextGem provides multiple serialization methods to preserve your document processing pipeline components and results. These methods enable you to save your work, transfer data between systems, or integrate with other applications.
When using serialization, all extracted data is preserved in the serialized objects.
💾 Serialization Methods#
The following ContextGem objects support serialization:
Document - Contains document content and extracted information
DocumentPipeline - Defines extraction structure and logic
DocumentLLM - Stores LLM configuration for document processing
Each object supports three serialization methods:
to_json() - Converts the object to a JSON string for cross-platform compatibility
to_dict() - Converts the object to a Python dictionary for in-memory operations
to_disk(file_path) - Saves the object directly to disk at the specified path
🔄 Deserialization Methods#
To reconstruct objects from their serialized forms, use the corresponding class methods:
from_json(json_string) - Creates an object from a JSON string
from_dict(dict_object) - Creates an object from a Python dictionary
from_disk(file_path) - Loads an object from a file on disk
📝 Example Usage#
# Example of serializing and deserializing ContextGem document,
# document pipeline, and LLM config.

import os
from pathlib import Path

from contextgem import (
    Aspect,
    BooleanConcept,
    Document,
    DocumentLLM,
    DocumentPipeline,
    DocxConverter,
    StringConcept,
)

# Create a document object
converter = DocxConverter()
docx_path = str(
    Path(__file__).resolve().parents[4]
    / "tests"
    / "docx_files"
    / "en_nda_with_anomalies.docx"
)  # your file path here (Path adapted for testing)
doc = converter.convert(docx_path, strict_mode=True)

# Create a document pipeline
document_pipeline = DocumentPipeline(
    aspects=[
        Aspect(
            name="Categories of confidential information",
            description="Clauses describing confidential information covered by the NDA",
            concepts=[
                StringConcept(
                    name="Types of disclosure",
                    description="Types of disclosure of confidential information",
                ),
                # ...
            ],
        ),
        # ...
    ],
    concepts=[
        BooleanConcept(
            name="Is mutual",
            description="Whether the NDA is mutual (both parties act as discloser/recipient)",
            add_justifications=True,
        ),
        # ...
    ],
)

# Attach the pipeline to the document
doc.assign_pipeline(document_pipeline)

# Configure a document LLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract data from the document
doc = llm.extract_all(doc)

# Serialize the LLM config, pipeline and document
llm_config_json = llm.to_json()  # or to_dict() / to_disk()
document_pipeline_json = document_pipeline.to_json()  # or to_dict() / to_disk()
processed_doc_json = doc.to_json()  # or to_dict() / to_disk()

# Deserialize the LLM config, pipeline and document
llm_deserialized = DocumentLLM.from_json(
    llm_config_json
)  # or from_dict() / from_disk()
document_pipeline_deserialized = DocumentPipeline.from_json(
    document_pipeline_json
)  # or from_dict() / from_disk()
processed_doc_deserialized = Document.from_json(
    processed_doc_json
)  # or from_dict() / from_disk()

# All extracted data is preserved!
assert processed_doc_deserialized.aspects[0].concepts[0].extracted_items
🚀 Use Cases#
Caching Results: Save processed documents to avoid repeating expensive LLM calls
Transfer Between Systems: Export results from one environment and import in another
API Integration: Convert objects to JSON for API responses
Workflow Persistence: Save pipeline configurations for later reuse
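The caching use case follows a simple load-or-compute pattern: check for a serialized result on disk before paying for extraction again. A minimal sketch of that pattern, using stdlib json and a stand-in for the extraction call (the cache path and fake_extraction helper are illustrative, not part of the ContextGem API):

```python
import json
import tempfile
from pathlib import Path


def get_processed_doc(cache_path: Path, run_extraction):
    """Return the cached serialized result if present; otherwise extract and cache."""
    if cache_path.exists():
        # Cheap path: deserialize the previously saved result
        return json.loads(cache_path.read_text())
    # Expensive path: run extraction once, then persist the result
    result = run_extraction()
    cache_path.write_text(json.dumps(result))
    return result


# Stand-in for an expensive LLM extraction call
calls = []


def fake_extraction():
    calls.append(1)
    return {"aspects": ["Categories of confidential information"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "processed_doc.json"
    first = get_processed_doc(cache, fake_extraction)
    second = get_processed_doc(cache, fake_extraction)  # served from cache

assert first == second
assert len(calls) == 1  # extraction ran only once
```

With ContextGem objects, to_disk(file_path) and from_disk(file_path) play the save and load roles directly.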