JsonObjectConcept#
JsonObjectConcept is a concept type that extracts structured data from documents in the form of JSON objects, which makes the extracted information easier to organize and retrieve.
📝 Overview#
JsonObjectConcept is used when you need to extract complex, structured information from unstructured text, including:
Nested data structures: objects with multiple fields, hierarchical information, and related attributes
Standardized formats: consistent data extraction following predefined schemas for reliable downstream processing
Complex entity extraction: comprehensive extraction of entities with multiple attributes and relationships
This concept type offers the flexibility to define precise schemas that match your data requirements, ensuring that extracted information maintains structural integrity and relationships between different data elements.
💻 Usage Example#
Here’s a simple example of how to use JsonObjectConcept to extract product information:
# ContextGem: JsonObjectConcept Extraction
import os
from pprint import pprint
from typing import Literal
from contextgem import Document, DocumentLLM, JsonObjectConcept
# Define product information text
product_text = """
Product: Smart Fitness Watch X7
Price: $199.99
Features: Heart rate monitoring, GPS tracking, Sleep analysis
Battery Life: 5 days
Water Resistance: IP68
Available Colors: Black, Silver, Blue
Customer Rating: 4.5/5
"""
# Create a Document object from text
doc = Document(raw_text=product_text)
# Define a JsonObjectConcept with a structure for product information
product_concept = JsonObjectConcept(
    name="Product Information",
    description="Extract detailed product information including name, price, features, and specifications",
    structure={
        "name": str,
        "price": float,
        "features": list[str],
        "specifications": {
            "battery_life": str,
            "water_resistance": Literal["IP67", "IP68", "IPX7", "Not water resistant"],
        },
        "available_colors": list[str],
        "customer_rating": float,
    },
)
# Attach the concept to the document
doc.add_concepts([product_concept])
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# Extract the concept from the document
product_concept = llm.extract_concepts_from_document(doc)[0]
# Print the extracted structured data
extracted_product = product_concept.extracted_items[0].value
pprint(extracted_product)
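Once extraction has run, the value can be handled like any other nested Python data. The following is a minimal sketch, assuming the extracted value is a plain dictionary that mirrors the structure defined above. `sample_value` is a hand-written stand-in rather than real model output, and `summarize_product` is a hypothetical helper introduced only for illustration.

```python
import json

# Hypothetical stand-in for `product_concept.extracted_items[0].value`.
# Real output depends on the LLM and may differ.
sample_value = {
    "name": "Smart Fitness Watch X7",
    "price": 199.99,
    "features": ["Heart rate monitoring", "GPS tracking", "Sleep analysis"],
    "specifications": {"battery_life": "5 days", "water_resistance": "IP68"},
    "available_colors": ["Black", "Silver", "Blue"],
    "customer_rating": 4.5,
}


def summarize_product(value: dict) -> str:
    """Build a one-line summary from a product dictionary (illustrative helper)."""
    specs = value["specifications"]
    return (
        f"{value['name']} (${value['price']:.2f}), "
        f"{len(value['features'])} features, "
        f"water resistance: {specs['water_resistance']}"
    )


summary = summarize_product(sample_value)
print(summary)

# The dictionary holds only JSON-compatible types, so it serializes directly.
as_json = json.dumps(sample_value, indent=2)
round_tripped = json.loads(as_json)
```

Because the nested "specifications" object is an ordinary dictionary, downstream code can read it with normal key access and store or transmit the whole value as JSON.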
⚙️ Parameters#
When creating a JsonObjectConcept, you can specify the following parameters:
Parameter | Type | Description |
---|---|---|
name | str | A unique name identifier for the concept |
description | str | A clear description of what the concept represents and what should be extracted |
structure | type or dict[str, Any] | JSON object schema defining the data structure to be extracted. Can be specified as a Python class with type annotations or a dictionary with field names as keys and their corresponding types as values. This schema can represent simple flat structures or complex nested hierarchies with multiple levels of organization. The LLM will attempt to extract data that conforms to this structure, enabling precise and consistent extraction of complex information patterns. |
examples | List[ | Optional. Example JSON objects illustrating the concept usage. Such examples must conform to the |
llm_role | str | The role of the LLM responsible for extracting the concept. Available values: |
add_justifications | bool | Whether to include justifications for extracted items (defaults to |
justification_depth | str | Justification detail level. Available values: |
justification_max_sents | int | Maximum sentences in a justification (defaults to |
add_references | bool | Whether to include source references for extracted items (defaults to |
reference_depth | str | Source reference granularity. Available values: |
singular_occurrence | bool | Whether this concept is restricted to having only one extracted item. If |
custom_data | dict | Optional. Dictionary for storing any additional data that you want to associate with the concept. This data must be JSON-serializable. This data is not used for extraction but can be useful for custom processing or downstream tasks. Defaults to an empty dictionary. |
🏗️ Defining Structure#
The structure parameter defines the schema for the data you want to extract. JsonObjectConcept uses Pydantic models internally to validate all structures, ensuring type safety and data integrity. You can define this structure using either dictionaries or classes. Dictionary-based definitions provide a simpler abstraction for defining JSON object structures, while still benefiting from Pydantic’s robust validation system under the hood.
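To make the idea of validating a value against a dictionary structure concrete, here is a minimal standard-library sketch. It is not the library's implementation (which, as noted above, relies on Pydantic). The `conforms` function is a hypothetical helper that only handles plain types, `Literal`, `list[...]`, and nested dictionaries.

```python
from typing import Any, Literal, get_args, get_origin


def conforms(value: Any, spec: Any) -> bool:
    """Return True if `value` matches `spec` (illustrative checker only)."""
    if isinstance(spec, dict):
        # Nested object: same keys, and each value matches its own spec.
        return (
            isinstance(value, dict)
            and value.keys() == spec.keys()
            and all(conforms(value[key], spec[key]) for key in spec)
        )
    origin = get_origin(spec)
    if origin is Literal:
        # Literal[...] restricts the field to a fixed set of values.
        return value in get_args(spec)
    if origin is list:
        (item_spec,) = get_args(spec)
        return isinstance(value, list) and all(conforms(v, item_spec) for v in value)
    if spec is float:
        # Accept ints for float fields, but never bools.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, spec)


structure = {
    "name": str,
    "price": float,
    "features": list[str],
    "specifications": {
        "battery_life": str,
        "water_resistance": Literal["IP67", "IP68", "IPX7", "Not water resistant"],
    },
}

good = {
    "name": "Smart Fitness Watch X7",
    "price": 199.99,
    "features": ["GPS tracking", "Sleep analysis"],
    "specifications": {"battery_life": "5 days", "water_resistance": "IP68"},
}
bad = {**good, "price": "199.99"}  # price given as a string instead of a number

print(conforms(good, structure))  # True
print(conforms(bad, structure))   # False
```

The sketch treats every key as required and rejects extra keys. That is a simplification chosen for brevity, not a statement about how the library handles optional fields.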
You can define the structure in several ways:
Using a dictionary with type annotations:
from contextgem import JsonObjectConcept
product_info_concept = JsonObjectConcept(
    name="Product Information",
    description="Product details",
    structure={
        "name": str,
        "price": float,
        "is_available": bool,
        "ratings": list[float],
    },
)
Using nested dictionaries for complex structures:
from contextgem import JsonObjectConcept
device_config_concept = JsonObjectConcept(
    name="Device Configuration",
    description="Configuration details for a networked device",
    structure={
        "device": {"id": str, "type": str, "model": str},
        "network": {"ip_address": str, "subnet_mask": str, "gateway": str},
        "settings": {"enabled": bool, "mode": str},
    },
)
Using a Python class with type annotations:
While dictionary structures provide the simplest way to define JSON schemas, you may prefer to use class definitions if that better fits your codebase style. You can define your structure using a Python class with type annotations:
from pydantic import BaseModel
from contextgem import JsonObjectConcept
# Use a Pydantic model to define the structure of the JSON object
class ProductSpec(BaseModel):
    name: str
    version: str
    features: list[str]

product_spec_concept = JsonObjectConcept(
    name="Product Specification",
    description="Technical specifications for a product",
    structure=ProductSpec,
)
Using nested classes for complex structures:
If you prefer to use class definitions for hierarchical data structures (already supported by dictionary structures), you can use nested class definitions. This approach offers a more object-oriented style that may better align with your existing codebase, especially when working with dataclasses or Pydantic models in your application code.
When using nested class definitions, all classes in the structure must inherit from the JsonObjectClassStruct utility class to enable automatic conversion of the whole class hierarchy to a dictionary structure:
from dataclasses import dataclass
from contextgem import JsonObjectConcept
from contextgem.public.utils import JsonObjectClassStruct
# Use dataclasses to define the structure of the JSON object
# All classes in the nested class structure must inherit from JsonObjectClassStruct
# to enable automatic conversion of the class hierarchy to a dictionary structure
# for JsonObjectConcept
@dataclass
class Location(JsonObjectClassStruct):
    latitude: float
    longitude: float
    altitude: float

@dataclass
class Sensor(JsonObjectClassStruct):
    id: str
    type: str
    location: Location  # reference to another class
    active: bool

@dataclass
class SensorNetwork(JsonObjectClassStruct):
    network_id: str
    primary_sensor: Sensor  # reference to another class
    backup_sensors: list[Sensor]  # list of another class

sensor_network_concept = JsonObjectConcept(
    name="IoT Sensor Network",
    description="Configuration for a network of IoT sensors",
    structure=SensorNetwork,  # nested class structure
)
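The conversion of a class hierarchy to a dictionary structure can be pictured with a small standard-library sketch. This is an illustration of the general idea only, not how JsonObjectClassStruct is implemented. `to_dict_structure` is a hypothetical helper, the classes here are plain dataclasses, and representing a list of objects as a one-element Python list is this sketch's own convention.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, get_args, get_origin, get_type_hints


def to_dict_structure(cls: type) -> dict[str, Any]:
    """Recursively map a dataclass to {field_name: type or nested dict}."""
    hints = get_type_hints(cls)
    return {f.name: _convert(hints[f.name]) for f in fields(cls)}


def _convert(tp: Any) -> Any:
    if is_dataclass(tp):
        # A field typed as another dataclass becomes a nested dictionary.
        return to_dict_structure(tp)
    if get_origin(tp) is list:
        (item,) = get_args(tp)
        if is_dataclass(item):
            # Sketch-only convention: a list of objects becomes [nested_dict].
            return [to_dict_structure(item)]
    # Plain types such as str, float, or list[str] are kept unchanged.
    return tp


@dataclass
class Location:
    latitude: float
    longitude: float
    altitude: float


@dataclass
class Sensor:
    id: str
    type: str
    location: Location
    active: bool


@dataclass
class SensorNetwork:
    network_id: str
    primary_sensor: Sensor
    backup_sensors: list[Sensor]


network_structure = to_dict_structure(SensorNetwork)
print(network_structure["primary_sensor"]["location"])
```

Seen this way, the nested class form and the nested dictionary form describe the same schema. Choosing between them is mostly a matter of which style fits the surrounding codebase.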
🚀 Advanced Usage#
✏️ Adding Examples#
You can provide examples of structured JSON objects to improve extraction accuracy, especially for complex schemas or when there might be ambiguity in how to organize or format the extracted information:
# ContextGem: JsonObjectConcept Extraction with Examples
import os
from pprint import pprint
from contextgem import Document, DocumentLLM, JsonObjectConcept, JsonObjectExample
# Document object with ambiguous medical report text
medical_report = """
PATIENT ASSESSMENT
Date: March 15, 2023
Patient: John Doe (ID: 12345)
Vital Signs:
BP: 125/82 mmHg
HR: 72 bpm
Temp: 98.6°F
SpO2: 98%
Chief Complaint:
Patient presents with persistent cough for 2 weeks, mild fever in evenings (up to 100.4°F), and fatigue.
No shortness of breath. Patient reports recent travel to Southeast Asia 3 weeks ago.
Assessment:
Physical examination shows slight wheezing in upper right lung. No signs of pneumonia on chest X-ray.
WBC slightly elevated at 11,500. Patient appears in stable condition but fatigued.
Impression:
1. Acute bronchitis, likely viral
2. Rule out early TB given travel history
3. Fatigue, likely secondary to infection
Plan:
- Rest for 5 days
- Symptomatic treatment with over-the-counter cough suppressant
- Follow-up in 1 week
- TB test ordered
Dr. Sarah Johnson, MD
"""
doc = Document(raw_text=medical_report)
# Create a JsonObjectConcept for extracting medical assessment data
# Without examples, the LLM might struggle with ambiguous fields or formatting variations
medical_assessment_concept = JsonObjectConcept(
    name="Medical Assessment",
    description="Key information from a patient medical assessment",
    structure={
        "patient": {
            "id": str,
            "vital_signs": {
                "blood_pressure": str,
                "heart_rate": int,
                "temperature": float,
                "oxygen_saturation": int,
            },
        },
        "clinical": {
            "symptoms": list[str],
            "diagnosis": list[str],
            "travel_history": bool,
        },
        "treatment": {"recommendations": list[str], "follow_up_days": int},
    },
    # Examples provide helpful guidance on how to:
    # 1. Map data from unstructured text to structured fields
    # 2. Handle formatting variations (BP as "120/80" vs separate systolic/diastolic)
    # 3. Extract implicit information (converting "SpO2: 98%" to just 98)
    examples=[
        JsonObjectExample(
            content={
                "patient": {
                    "id": "87654",
                    "vital_signs": {
                        "blood_pressure": "130/85",
                        "heart_rate": 68,
                        "temperature": 98.2,
                        "oxygen_saturation": 99,
                    },
                },
                "clinical": {
                    "symptoms": ["headache", "dizziness", "nausea"],
                    "diagnosis": ["Migraine", "Dehydration"],
                    "travel_history": False,
                },
                "treatment": {
                    "recommendations": [
                        "Hydration",
                        "Pain medication",
                        "Dark room rest",
                    ],
                    "follow_up_days": 14,
                },
            }
        ),
        JsonObjectExample(
            content={
                "patient": {
                    "id": "23456",
                    "vital_signs": {
                        "blood_pressure": "145/92",
                        "heart_rate": 88,
                        "temperature": 100.8,
                        "oxygen_saturation": 96,
                    },
                },
                "clinical": {
                    "symptoms": ["sore throat", "cough", "fever"],
                    "diagnosis": ["Strep throat", "Pharyngitis"],
                    "travel_history": True,
                },
                "treatment": {
                    "recommendations": ["Antibiotics", "Throat lozenges", "Rest"],
                    "follow_up_days": 7,
                },
            }
        ),
    ],
)
# Attach the concept to the document
doc.add_concepts([medical_assessment_concept])
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# Extract the concept from the document
medical_assessment_concept = llm.extract_concepts_from_document(doc)[0]
# Print the extracted medical assessment
print("Extracted medical assessment:")
assessment = medical_assessment_concept.extracted_items[0].value
pprint(assessment)
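Since examples are meant to follow the same layout as the structure, a quick pre-flight check on key names can catch typos before any LLM call is made. The sketch below is a hypothetical helper, independent of whatever validation the library itself performs. It compares key layout only and does not check value types.

```python
def key_paths(obj: dict, prefix: str = "") -> set[str]:
    """Collect dotted key paths of a nested dictionary."""
    paths = set()
    for key, val in obj.items():
        path = f"{prefix}{key}"
        paths.add(path)
        if isinstance(val, dict):
            paths |= key_paths(val, prefix=f"{path}.")
    return paths


def layout_diff(structure: dict, example: dict) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) key paths of `example` relative to `structure`."""
    expected, actual = key_paths(structure), key_paths(example)
    return sorted(expected - actual), sorted(actual - expected)


# Trimmed-down version of the medical structure, for illustration.
structure = {
    "patient": {
        "id": str,
        "vital_signs": {"blood_pressure": str, "heart_rate": int},
    },
    "treatment": {"recommendations": list[str], "follow_up_days": int},
}

# Example content with a deliberate mistake: "pulse" instead of "heart_rate".
example_content = {
    "patient": {
        "id": "87654",
        "vital_signs": {"blood_pressure": "130/85", "pulse": 68},
    },
    "treatment": {"recommendations": ["Hydration"], "follow_up_days": 14},
}

missing, unexpected = layout_diff(structure, example_content)
print("missing:", missing)
print("unexpected:", unexpected)
```

Reporting dotted paths makes it easy to see exactly which nested field in an example drifted from the schema.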
🔍 References and Justifications for Extraction#
You can configure a JsonObjectConcept to include justifications and references, which provide transparency into the extraction process. Justifications explain the reasoning behind the extracted values, while references point to the specific parts of the document that were used as sources for the extraction:
# ContextGem: JsonObjectConcept Extraction with References and Justifications
import os
from pprint import pprint
from typing import Literal
from contextgem import Document, DocumentLLM, JsonObjectConcept
# Sample document text containing a customer complaint
customer_complaint = """
CUSTOMER COMPLAINT #CR-2023-0472
Date: November 15, 2023
Customer: Sarah Johnson
Description:
I purchased the Ultra Premium Blender (Model XJ-5000) from your online store on October 3, 2023. The product was delivered on October 10, 2023. After using it only 5 times, the motor started making loud grinding noises and then completely stopped working on November 12.
I've tried troubleshooting using the manual, including checking for obstructions and resetting the device, but nothing has resolved the issue. I expected much better quality given the premium price point ($249.99) and the 5-year warranty advertised.
I've been a loyal customer for over 7 years and have purchased several kitchen appliances from your company. This is the first time I've experienced such a significant quality issue. I would like a replacement unit or a full refund.
Previous interactions:
- Spoke with customer service representative Alex on Nov 13 (Ref #CS-98721)
- Was told to submit this formal complaint after troubleshooting was unsuccessful
- No resolution offered during initial call
Contact: sarah.johnson@example.com | (555) 123-4567
"""
# Create a Document from the text
doc = Document(raw_text=customer_complaint)
# Create a JsonObjectConcept with justifications and references enabled
complaint_analysis_concept = JsonObjectConcept(
    name="Complaint analysis",
    description="Detailed analysis of a customer complaint",
    structure={
        "issue_type": Literal[
            "product defect",
            "delivery problem",
            "billing error",
            "service issue",
            "other",
        ],
        "warranty_applicable": bool,
        "severity": Literal["low", "medium", "high", "critical"],
        "customer_loyalty_status": Literal["new", "regular", "loyal", "premium"],
        "recommended_resolution": Literal[
            "replacement", "refund", "repair", "partial refund", "other"
        ],
        "priority_level": Literal["low", "standard", "high", "urgent"],
        "expected_business_impact": Literal["minimal", "moderate", "significant"],
    },
    add_justifications=True,
    justification_depth="comprehensive",  # provide detailed justifications
    justification_max_sents=10,  # provide up to 10 sentences for each justification
    add_references=True,
    reference_depth="sentences",  # provide references to the sentences in the document
)
# Attach the concept to the document
doc.add_concepts([complaint_analysis_concept])
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# Extract the concept
complaint_analysis_concept = llm.extract_concepts_from_document(doc)[0]
# Get the extracted complaint analysis
complaint_analysis_item = complaint_analysis_concept.extracted_items[0]
# Print the structured analysis
print("Complaint Analysis\n")
pprint(complaint_analysis_item.value)
print("\nJustification:")
print(complaint_analysis_item.justification)
# Print key source references
print("\nReferences:")
for sent in complaint_analysis_item.reference_sentences:
    print(f"- {sent.raw_text}")
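For compliance or review workflows, it can help to bundle the value, its justification, and its source sentences into one record. The sketch below assumes only the attributes used in the example above (`value`, `justification`, `reference_sentences`, `raw_text`). `to_audit_record` is a hypothetical helper, and `stub_item` is a hand-written stand-in, not real extraction output.

```python
import json
from types import SimpleNamespace


def to_audit_record(item) -> dict:
    """Flatten an extracted item into a JSON-serializable audit record."""
    return {
        "value": item.value,
        "justification": item.justification,
        "references": [sent.raw_text for sent in item.reference_sentences],
    }


# Hypothetical stand-in that mimics the attributes of an extracted item.
stub_item = SimpleNamespace(
    value={"issue_type": "product defect", "recommended_resolution": "replacement"},
    justification="The motor stopped working after limited use.",
    reference_sentences=[
        SimpleNamespace(raw_text="I would like a replacement unit or a full refund."),
    ],
)

record = to_audit_record(stub_item)
print(json.dumps(record, indent=2))
```

Keeping the references next to the value in a single record lets a reviewer trace each decision back to the source text without re-running the extraction.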
💡 Best Practices#
Keep your JSON structures simple yet comprehensive, focusing on the essential fields needed for your use case to avoid LLM prompt overloading.
Include realistic examples (using JsonObjectExample) that precisely match your schema to guide extraction, especially for ambiguous or specialized data formats.
Provide detailed descriptions for your JsonObjectConcept that specify exactly what structured data to extract and how fields should be interpreted.
For complex JSON objects, use nested dictionaries or class hierarchies to organize related fields logically.
Enable justifications (using add_justifications=True) when interpretation rationale is important, especially for extractions that involve judgment or qualitative assessment, such as sentiment analysis (positive/negative), priority assignment (high/medium/low), or data categorization where the LLM must make interpretive decisions rather than extract explicit facts.
Enable references (using add_references=True) when you need to verify the document source of extracted values for compliance or verification purposes. This is especially valuable when the LLM is not just directly extracting explicit text, but also interpreting or inferring information from context. For example, in legal document analysis where traceability of information is essential for auditing or validation, references help track both explicit statements and the implicit information the model has derived from them.
Use singular_occurrence=True when you expect exactly one instance of the structured data in the document (e.g., a single product specification, one patient medical record, or a unique customer complaint). This is useful for documents with a clear singular focus. Conversely, omit this parameter (False is the default) when you need to extract multiple instances of the same structure from a document, such as multiple product listings in a catalog, several patient records in a hospital report, or various customer complaints in a feedback compilation.