JsonObjectConcept

JsonObjectConcept is a concept type that extracts structured data from documents as JSON objects conforming to a schema you define, so the results can be organized and processed consistently downstream.

📝 Overview

JsonObjectConcept is used when you need to extract complex, structured information from unstructured text, including:

Nested data structures: objects with multiple fields, hierarchical information, and related attributes
Standardized formats: consistent data extraction following predefined schemas for reliable downstream processing
Complex entity extraction: comprehensive extraction of entities with multiple attributes and relationships

This concept type offers the flexibility to define precise schemas that match your data requirements, ensuring that extracted information maintains structural integrity and relationships between different data elements.

💻 Usage Example

Here’s a simple example of how to use JsonObjectConcept to extract product information:

# ContextGem: JsonObjectConcept Extraction

import os
from pprint import pprint
from typing import Literal

from contextgem import Document, DocumentLLM, JsonObjectConcept

# Define product information text
product_text = """
Product: Smart Fitness Watch X7
Price: $199.99
Features: Heart rate monitoring, GPS tracking, Sleep analysis
Battery Life: 5 days
Water Resistance: IP68
Available Colors: Black, Silver, Blue
Customer Rating: 4.5/5
"""

# Create a Document object from text
doc = Document(raw_text=product_text)

# Define a JsonObjectConcept with a structure for product information
product_concept = JsonObjectConcept(
    name="Product Information",
    description="Extract detailed product information including name, price, features, and specifications",
    structure={
        "name": str,
        "price": float,
        "features": list[str],
        "specifications": {
            "battery_life": str,
            "water_resistance": Literal["IP67", "IP68", "IPX7", "Not water resistant"],
        },
        "available_colors": list[str],
        "customer_rating": float,
    },
)

# Attach the concept to the document
doc.add_concepts([product_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
product_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted structured data
extracted_product = product_concept.extracted_items[0].value
pprint(extracted_product)
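
Once extracted, the value can be handled like any other Python data. The sketch below assumes the extracted value is a plain dictionary shaped like the `structure` above. The `extracted_product` literal is a hand-written stand-in for illustration, not actual model output, and the variable names `water_resistance`, `payload`, and `restored` are introduced only for this example.

```python
import json

# Hypothetical stand-in for product_concept.extracted_items[0].value;
# hand-written for illustration, not actual model output
extracted_product = {
    "name": "Smart Fitness Watch X7",
    "price": 199.99,
    "features": ["Heart rate monitoring", "GPS tracking", "Sleep analysis"],
    "specifications": {"battery_life": "5 days", "water_resistance": "IP68"},
    "available_colors": ["Black", "Silver", "Blue"],
    "customer_rating": 4.5,
}

# Nested fields are reached with ordinary dictionary indexing
water_resistance = extracted_product["specifications"]["water_resistance"]

# Serialize for storage or for passing to another service
payload = json.dumps(extracted_product, indent=2)

# Round-trip the JSON text back into a dictionary
restored = json.loads(payload)
```

Because the structure only uses JSON-compatible types (strings, numbers, lists, nested objects), the value should serialize without a custom encoder.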

⚙️ Parameters

When creating a JsonObjectConcept, you can specify the following parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | str | A unique name identifier for the concept. |
| `description` | str | A clear description of what the concept represents and what should be extracted. |
| `structure` | type \| dict[str, Any] | JSON object schema defining the data structure to be extracted. Can be specified as a Python class with type annotations or a dictionary with field names as keys and their corresponding types as values. This schema can represent simple flat structures or complex nested hierarchies with multiple levels of organization. The LLM will attempt to extract data that conforms to this structure, enabling precise and consistent extraction of complex information patterns. |
| `examples` | List[`JsonObjectExample`] | Optional. Example JSON objects illustrating the concept usage. Such examples must conform to the `structure` schema. Examples significantly improve extraction accuracy by showing the LLM concrete instances of the expected output format and content patterns. This is particularly valuable for complex schemas with nested structures or when there are specific formatting conventions that should be followed (e.g., how dates, identifiers, or specialized fields should be represented). Examples also help clarify how to handle edge cases or ambiguous information in the source document. |
| `llm_role` | str | The role of the LLM responsible for extracting the concept. Available values: `"extractor_text"`, `"reasoner_text"`, `"extractor_vision"`, `"reasoner_vision"`. Defaults to `"extractor_text"`. For more details, see 🏷️ LLM Roles. |
| `add_justifications` | bool | Whether to include justifications for extracted items (defaults to `False`). Justifications provide explanations of why the LLM extracted specific JSON structures and the reasoning behind field values. This is especially valuable for complex structures where the extraction process involves inference or when multiple data points must be synthesized. For example, a justification might explain how the LLM determined a product’s category based on various features mentioned across different paragraphs, or why certain optional fields were populated or left empty based on available information in the document. |
| `justification_depth` | str | Justification detail level. Available values: `"brief"`, `"balanced"`, `"comprehensive"`. Defaults to `"brief"`. |
| `justification_max_sents` | int | Maximum sentences in a justification (defaults to `2`). |
| `add_references` | bool | Whether to include source references for extracted items (defaults to `False`). References indicate the specific locations in the document that informed the extraction of the JSON structure. This is particularly valuable for complex objects where field values may be calculated or inferred from multiple scattered pieces of information throughout the document. References help trace back extracted values to their source evidence, validate the extraction reasoning, and understand which parts of the document contributed to the synthesis of structured data, especially for fields requiring interpretation, not only direct extraction. |
| `reference_depth` | str | Source reference granularity. Available values: `"paragraphs"`, `"sentences"`. Defaults to `"paragraphs"`. |
| `singular_occurrence` | bool | Whether this concept is restricted to having only one extracted item. If `True`, only a single JSON object will be extracted. Defaults to `False` (multiple JSON objects are allowed). For JSON object concepts, this parameter is particularly useful when you want to extract a comprehensive structured representation of a single entity (e.g., “product specifications” or “company profile”) rather than multiple instances of structured data scattered throughout the document. This is especially valuable when extracting complex nested objects that aggregate information from different parts of the document into a cohesive whole. Note that with advanced LLMs, this constraint may not be required as they can often infer the appropriate number of objects to extract from the concept’s name, description, and schema structure. |
| `custom_data` | dict | Optional. Dictionary for storing any additional data that you want to associate with the concept. This data must be JSON-serializable. This data is not used for extraction but can be useful for custom processing or downstream tasks. Defaults to an empty dictionary. |

🏗️ Defining Structure

The structure parameter defines the schema for the data you want to extract. JsonObjectConcept uses Pydantic models internally to validate all structures, ensuring type safety and data integrity. You can define this structure using either dictionaries or classes. Dictionary-based definitions provide a simpler abstraction for defining JSON object structures, while still benefiting from Pydantic’s robust validation system under the hood.
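
To make the idea of checking a value against a structure concrete, here is a minimal standard-library sketch. It is not ContextGem's implementation (the library relies on Pydantic for validation); the `conforms` helper and the sample `structure`, `good`, and `bad` values are hypothetical and exist only to illustrate how plain types, `Literal` choices, typed lists, and nested dictionaries could each be checked.

```python
from typing import Any, Literal, get_args, get_origin


def conforms(value: Any, spec: Any) -> bool:
    """Return True if value matches spec (a type, Literal, list[...] or nested dict)."""
    if isinstance(spec, dict):
        # Nested object: same keys, and every field must conform
        return (
            isinstance(value, dict)
            and value.keys() == spec.keys()
            and all(conforms(value[key], spec[key]) for key in spec)
        )
    origin = get_origin(spec)
    if origin is Literal:
        # Literal[...]: the value must be one of the listed choices
        return value in get_args(spec)
    if origin is list:
        # list[T]: every element must conform to T
        (item_spec,) = get_args(spec)
        return isinstance(value, list) and all(conforms(v, item_spec) for v in value)
    if spec is float:
        # Accept ints for float fields, but never bools
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, spec)


structure = {
    "name": str,
    "price": float,
    "specifications": {"water_resistance": Literal["IP67", "IP68"]},
    "available_colors": list[str],
}
good = {
    "name": "Watch",
    "price": 199.99,
    "specifications": {"water_resistance": "IP68"},
    "available_colors": ["Black"],
}
bad = dict(good, price="199.99")  # price given as a string instead of a number

print(conforms(good, structure))  # True
print(conforms(bad, structure))   # False
```

A real validator such as Pydantic also reports which field failed and why; this sketch only returns a yes/no answer.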

You can define the structure in several ways:

Using a dictionary with type annotations:

from contextgem import JsonObjectConcept

product_info_concept = JsonObjectConcept(
    name="Product Information",
    description="Product details",
    structure={
        "name": str,
        "price": float,
        "is_available": bool,
        "ratings": list[float],
    },
)

Using nested dictionaries for complex structures:

from contextgem import JsonObjectConcept

device_config_concept = JsonObjectConcept(
    name="Device Configuration",
    description="Configuration details for a networked device",
    structure={
        "device": {"id": str, "type": str, "model": str},
        "network": {"ip_address": str, "subnet_mask": str, "gateway": str},
        "settings": {"enabled": bool, "mode": str},
    },
)

Using a Python class with type annotations:

Dictionary structures are the simplest way to define a JSON schema, but a class definition may fit your codebase style better. The following example uses a Pydantic model:

from pydantic import BaseModel

from contextgem import JsonObjectConcept


# Use a Pydantic model to define the structure of the JSON object
class ProductSpec(BaseModel):
    name: str
    version: str
    features: list[str]


product_spec_concept = JsonObjectConcept(
    name="Product Specification",
    description="Technical specifications for a product",
    structure=ProductSpec,
)

Using nested classes for complex structures:

Hierarchical data can already be expressed with nested dictionaries, but you can also use nested class definitions. This object-oriented style may align better with your existing codebase, especially if your application code already works with dataclasses or Pydantic models.

When using nested class definitions, all classes in the structure must inherit from the JsonObjectClassStruct utility class to enable automatic conversion of the whole class hierarchy to a dictionary structure:

from dataclasses import dataclass

from contextgem import JsonObjectConcept
from contextgem.public.utils import JsonObjectClassStruct

# Use dataclasses to define the structure of the JSON object


# All classes in the nested class structure must inherit from JsonObjectClassStruct
# to enable automatic conversion of the class hierarchy to a dictionary structure
# for JsonObjectConcept
@dataclass
class Location(JsonObjectClassStruct):
    latitude: float
    longitude: float
    altitude: float


@dataclass
class Sensor(JsonObjectClassStruct):
    id: str
    type: str
    location: Location  # reference to another class
    active: bool


@dataclass
class SensorNetwork(JsonObjectClassStruct):
    network_id: str
    primary_sensor: Sensor  # reference to another class
    backup_sensors: list[Sensor]  # list of another class


sensor_network_concept = JsonObjectConcept(
    name="IoT Sensor Network",
    description="Configuration for a network of IoT sensors",
    structure=SensorNetwork,  # nested class structure
)
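
To illustrate what converting a class hierarchy to a dictionary structure can look like, the following standard-library sketch walks a dataclass hierarchy and builds a nested dictionary of field types. It is not how `JsonObjectClassStruct` is implemented; the `to_dict_structure` helper is hypothetical, the classes are plain dataclasses that mirror the example above, and representing a list of nested objects as a one-element list is a choice made for this sketch.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, get_args, get_origin, get_type_hints


def to_dict_structure(cls: type) -> dict[str, Any]:
    """Recursively convert a dataclass type into a nested dict of field types."""
    hints = get_type_hints(cls)
    structure: dict[str, Any] = {}
    for f in fields(cls):
        hint = hints[f.name]
        if get_origin(hint) is list and is_dataclass(get_args(hint)[0]):
            # list[SomeDataclass] -> one-element list holding the nested structure
            structure[f.name] = [to_dict_structure(get_args(hint)[0])]
        elif is_dataclass(hint):
            # Reference to another dataclass -> nested dictionary
            structure[f.name] = to_dict_structure(hint)
        else:
            # Plain type (str, float, list[str], ...) is kept as-is
            structure[f.name] = hint
    return structure


@dataclass
class Location:
    latitude: float
    longitude: float
    altitude: float


@dataclass
class Sensor:
    id: str
    type: str
    location: Location
    active: bool


@dataclass
class SensorNetwork:
    network_id: str
    primary_sensor: Sensor
    backup_sensors: list[Sensor]


structure = to_dict_structure(SensorNetwork)
print(structure["primary_sensor"]["location"])
```

The result has the same shape as a hand-written nested dictionary structure, which is why the two styles can be treated as interchangeable ways of describing one schema.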

🚀 Advanced Usage

✏️ Adding Examples

You can provide examples of structured JSON objects to improve extraction accuracy, especially for complex schemas or when there might be ambiguity in how to organize or format the extracted information:

# ContextGem: JsonObjectConcept Extraction with Examples

import os
from pprint import pprint

from contextgem import Document, DocumentLLM, JsonObjectConcept, JsonObjectExample

# Document object with ambiguous medical report text
medical_report = """
PATIENT ASSESSMENT
Date: March 15, 2023
Patient: John Doe (ID: 12345)

Vital Signs:
BP: 125/82 mmHg
HR: 72 bpm
Temp: 98.6°F
SpO2: 98%

Chief Complaint:
Patient presents with persistent cough for 2 weeks, mild fever in evenings (up to 100.4°F), and fatigue. 
No shortness of breath. Patient reports recent travel to Southeast Asia 3 weeks ago.

Assessment:
Physical examination shows slight wheezing in upper right lung. No signs of pneumonia on chest X-ray.
WBC slightly elevated at 11,500. Patient appears in stable condition but fatigued.

Impression:
1. Acute bronchitis, likely viral
2. Rule out early TB given travel history
3. Fatigue, likely secondary to infection

Plan:
- Rest for 5 days
- Symptomatic treatment with over-the-counter cough suppressant
- Follow-up in 1 week
- TB test ordered

Dr. Sarah Johnson, MD
"""
doc = Document(raw_text=medical_report)

# Create a JsonObjectConcept for extracting medical assessment data
# Without examples, the LLM might struggle with ambiguous fields or formatting variations
medical_assessment_concept = JsonObjectConcept(
    name="Medical Assessment",
    description="Key information from a patient medical assessment",
    structure={
        "patient": {
            "id": str,
            "vital_signs": {
                "blood_pressure": str,
                "heart_rate": int,
                "temperature": float,
                "oxygen_saturation": int,
            },
        },
        "clinical": {
            "symptoms": list[str],
            "diagnosis": list[str],
            "travel_history": bool,
        },
        "treatment": {"recommendations": list[str], "follow_up_days": int},
    },
    # Examples provide helpful guidance on how to:
    # 1. Map data from unstructured text to structured fields
    # 2. Handle formatting variations (BP as "120/80" vs separate systolic/diastolic)
    # 3. Extract implicit information (converting "SpO2: 98%" to just 98)
    examples=[
        JsonObjectExample(
            content={
                "patient": {
                    "id": "87654",
                    "vital_signs": {
                        "blood_pressure": "130/85",
                        "heart_rate": 68,
                        "temperature": 98.2,
                        "oxygen_saturation": 99,
                    },
                },
                "clinical": {
                    "symptoms": ["headache", "dizziness", "nausea"],
                    "diagnosis": ["Migraine", "Dehydration"],
                    "travel_history": False,
                },
                "treatment": {
                    "recommendations": [
                        "Hydration",
                        "Pain medication",
                        "Dark room rest",
                    ],
                    "follow_up_days": 14,
                },
            }
        ),
        JsonObjectExample(
            content={
                "patient": {
                    "id": "23456",
                    "vital_signs": {
                        "blood_pressure": "145/92",
                        "heart_rate": 88,
                        "temperature": 100.8,
                        "oxygen_saturation": 96,
                    },
                },
                "clinical": {
                    "symptoms": ["sore throat", "cough", "fever"],
                    "diagnosis": ["Strep throat", "Pharyngitis"],
                    "travel_history": True,
                },
                "treatment": {
                    "recommendations": ["Antibiotics", "Throat lozenges", "Rest"],
                    "follow_up_days": 7,
                },
            }
        ),
    ],
)

# Attach the concept to the document
doc.add_concepts([medical_assessment_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept from the document
medical_assessment_concept = llm.extract_concepts_from_document(doc)[0]

# Print the extracted medical assessment
print("Extracted medical assessment:")
assessment = medical_assessment_concept.extracted_items[0].value
pprint(assessment)

🔍 References and Justifications for Extraction

You can configure a JsonObjectConcept to include justifications and references, which provide transparency into the extraction process. Justifications explain the reasoning behind the extracted values, while references point to the specific parts of the document that were used as sources for the extraction:

# ContextGem: JsonObjectConcept Extraction with References and Justifications

import os
from pprint import pprint
from typing import Literal

from contextgem import Document, DocumentLLM, JsonObjectConcept

# Sample document text containing a customer complaint
customer_complaint = """
CUSTOMER COMPLAINT #CR-2023-0472
Date: November 15, 2023
Customer: Sarah Johnson

Description:
I purchased the Ultra Premium Blender (Model XJ-5000) from your online store on October 3, 2023. The product was delivered on October 10, 2023. After using it only 5 times, the motor started making loud grinding noises and then completely stopped working on November 12.

I've tried troubleshooting using the manual, including checking for obstructions and resetting the device, but nothing has resolved the issue. I expected much better quality given the premium price point ($249.99) and the 5-year warranty advertised.

I've been a loyal customer for over 7 years and have purchased several kitchen appliances from your company. This is the first time I've experienced such a significant quality issue. I would like a replacement unit or a full refund.

Previous interactions:
- Spoke with customer service representative Alex on Nov 13 (Ref #CS-98721)
- Was told to submit this formal complaint after troubleshooting was unsuccessful
- No resolution offered during initial call

Contact: sarah.johnson@example.com | (555) 123-4567
"""

# Create a Document from the text
doc = Document(raw_text=customer_complaint)

# Create a JsonObjectConcept with justifications and references enabled
complaint_analysis_concept = JsonObjectConcept(
    name="Complaint analysis",
    description="Detailed analysis of a customer complaint",
    structure={
        "issue_type": Literal[
            "product defect",
            "delivery problem",
            "billing error",
            "service issue",
            "other",
        ],
        "warranty_applicable": bool,
        "severity": Literal["low", "medium", "high", "critical"],
        "customer_loyalty_status": Literal["new", "regular", "loyal", "premium"],
        "recommended_resolution": Literal[
            "replacement", "refund", "repair", "partial refund", "other"
        ],
        "priority_level": Literal["low", "standard", "high", "urgent"],
        "expected_business_impact": Literal["minimal", "moderate", "significant"],
    },
    add_justifications=True,
    justification_depth="comprehensive",  # provide detailed justifications
    justification_max_sents=10,  # provide up to 10 sentences for each justification
    add_references=True,
    reference_depth="sentences",  # provide references to the sentences in the document
)

# Attach the concept to the document
doc.add_concepts([complaint_analysis_concept])

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract the concept
complaint_analysis_concept = llm.extract_concepts_from_document(doc)[0]

# Get the extracted complaint analysis
complaint_analysis_item = complaint_analysis_concept.extracted_items[0]

# Print the structured analysis
print("Complaint Analysis\n")
pprint(complaint_analysis_item.value)

print("\nJustification:")
print(complaint_analysis_item.justification)

# Print key source references
print("\nReferences:")
for sent in complaint_analysis_item.reference_sentences:
    print(f"- {sent.raw_text}")
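
For auditing, it can help to keep the value, its justification, and its source sentences together in one record. The sketch below assumes only the attributes used in the example above (`value`, `justification`, `reference_sentences`, and `raw_text`). The `Sentence` and `Item` classes are hypothetical stand-ins so the snippet runs without an LLM call, the sample data is hand-written, and `audit_record` is a helper introduced for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sentence:
    """Hypothetical stand-in for a referenced sentence."""
    raw_text: str


@dataclass
class Item:
    """Hypothetical stand-in for an extracted item."""
    value: dict[str, Any]
    justification: str
    reference_sentences: list[Sentence] = field(default_factory=list)


def audit_record(item: Any) -> dict[str, Any]:
    """Bundle an extracted value with its justification and source sentences."""
    return {
        "value": item.value,
        "justification": item.justification,
        "sources": [sent.raw_text for sent in item.reference_sentences],
    }


# Hand-written sample data for illustration, not actual model output
item = Item(
    value={"issue_type": "product defect", "warranty_applicable": True},
    justification="The motor failed after five uses, within the warranty period.",
    reference_sentences=[Sentence("After using it only 5 times, the motor stopped working.")],
)

record = audit_record(item)
print(json.dumps(record, indent=2))
```

Storing the record as JSON keeps the evidence next to the decision, so a reviewer can check each extracted field without rerunning the extraction.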

💡 Best Practices

Keep your JSON structures simple yet comprehensive: include only the fields your use case needs, so that the LLM prompt is not overloaded.
Include realistic examples (using JsonObjectExample) that precisely match your schema to guide extraction, especially for ambiguous or specialized data formats.
Provide detailed descriptions for your JsonObjectConcept that specify exactly what structured data to extract and how fields should be interpreted.
For complex JSON objects, use nested dictionaries or class hierarchies to organize related fields logically.
Enable justifications (using add_justifications=True) when interpretation rationale is important, especially for extractions that involve judgment or qualitative assessment, such as sentiment analysis (positive/negative), priority assignment (high/medium/low), or data categorization where the LLM must make interpretive decisions rather than extract explicit facts.
Enable references (using add_references=True) when you need to verify the document source of extracted values for compliance or verification purposes. This is especially valuable when the LLM is not just directly extracting explicit text, but also interpreting or inferring information from context. For example, in legal document analysis where traceability of information is essential for auditing or validation, references help track both explicit statements and the implicit information the model has derived from them.
Use singular_occurrence=True when you expect exactly one instance of the structured data in the document (e.g., a single product specification, one patient medical record, or a unique customer complaint). This is useful for documents with a clear singular focus. Conversely, omit this parameter (False is the default) when you need to extract multiple instances of the same structure from a document, such as multiple product listings in a catalog, several patient records in a hospital report, or various customer complaints in a feedback compilation.
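
Whichever setting you choose, downstream code has to handle one or many extracted items. The following sketch shows one way to make that expectation explicit. The `extracted_values` helper is hypothetical, and the `SimpleNamespace` objects are stand-ins that expose only the `extracted_items` and `value` attributes used in the examples above.

```python
from types import SimpleNamespace
from typing import Any


def extracted_values(concept: Any, expect_single: bool = False) -> Any:
    """Collect item values; with expect_single, return the one value or fail loudly."""
    values = [item.value for item in concept.extracted_items]
    if not expect_single:
        return values
    if len(values) != 1:
        raise ValueError(f"expected exactly one item, got {len(values)}")
    return values[0]


# Hypothetical stand-ins for concepts after extraction
single = SimpleNamespace(
    extracted_items=[SimpleNamespace(value={"name": "Blender XJ-5000"})]
)
catalog = SimpleNamespace(
    extracted_items=[
        SimpleNamespace(value={"name": "Blender"}),
        SimpleNamespace(value={"name": "Toaster"}),
    ]
)

spec = extracted_values(single, expect_single=True)  # one dictionary
listings = extracted_values(catalog)  # list of dictionaries
```

Raising an error when a single item was expected but several arrived surfaces a mismatch between the concept definition and the document early, instead of silently using the first item.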