Extraction Methods
This guide documents the extraction methods provided by the DocumentLLM and DocumentLLMGroup classes for extracting aspects and concepts from documents using large language models.
📄🧠 Complete Document Processing
extract_all()
Performs comprehensive extraction by processing a Document for all Aspect and _Concept instances. This is the most commonly used method for complete document analysis.
Note
See supported concept types in Supported Concepts. All public concept types inherit from the internal _Concept base class.
Method Signature:
def extract_all(
    self,
    document: Document,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
    max_images_to_analyze_per_call: int = 0,
) -> Document
Note
An async equivalent extract_all_async() is also available.
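As a rough illustration of the calling pattern for the async variant, the sketch below awaits a coroutine method for several documents at once. StubLLM and its dictionary-based documents are hypothetical stand-ins invented for this example; they are not ContextGem classes and make no LLM calls.

```python
import asyncio


class StubLLM:
    """Hypothetical stand-in for DocumentLLM; it makes no real LLM calls."""

    async def extract_all_async(self, document: dict) -> dict:
        # Simulate non-blocking I/O such as a request to an LLM API.
        await asyncio.sleep(0)
        document["processed"] = True
        return document  # the same object that was passed in


async def main() -> list[dict]:
    llm = StubLLM()
    docs = [{"name": "a"}, {"name": "b"}]
    # Awaiting the coroutines together lets their waiting periods overlap.
    return await asyncio.gather(*(llm.extract_all_async(d) for d in docs))


results = asyncio.run(main())
print([d["processed"] for d in results])
```

With the real library, the same pattern would presumably apply to `extract_all_async()` called on a configured DocumentLLM inside an `async` function.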
Parameters:
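Parameter | Type | Default | Description |
---|---|---|---|
document | Document | Required | The document with attached Aspect and _Concept instances. |
overwrite_existing | bool | False | Whether to overwrite already processed aspects and concepts with newly extracted information. |
max_items_per_call | int | 0 | Maximum number of Aspect or _Concept instances to process in a single LLM call (single LLM prompt). |
use_concurrency | bool | False | Enable concurrent processing of multiple Aspect and _Concept instances. |
max_paragraphs_to_analyze_per_call | int | 0 | Maximum paragraphs to include in a single LLM call (single LLM prompt). |
max_images_to_analyze_per_call | int | 0 | Maximum images to include in a single LLM call (single LLM prompt). |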
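The per-call limits above can be pictured as a batching step. The following is a minimal sketch, not the library's implementation: split_into_calls is a hypothetical helper, and it assumes that a limit of 0 means "no limit", which the default values in the signature suggest but this page does not state.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_into_calls(items: Sequence[T], max_per_call: int = 0) -> list[list[T]]:
    """Group items into per-call batches; 0 is assumed to mean 'no limit'."""
    if max_per_call < 0:
        raise ValueError("max_per_call must be >= 0")
    if max_per_call == 0:
        # Everything goes into a single call (or no call if there is nothing).
        return [list(items)] if items else []
    return [
        list(items[i:i + max_per_call])
        for i in range(0, len(items), max_per_call)
    ]


paragraphs = [f"paragraph {n}" for n in range(1, 8)]
print(len(split_into_calls(paragraphs)))  # all 7 paragraphs in one batch
print([len(batch) for batch in split_into_calls(paragraphs, 3)])
```

Smaller batches mean more LLM calls, each with a shorter prompt, which is the trade-off these parameters expose.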
Return Value:
Returns the same Document instance passed as input, but with all attached Aspect and _Concept instances populated with their extracted items. The document’s aspects and concepts will have their extracted_items field populated with the extracted information, and if applicable, reference_paragraphs / reference_sentences will be set based on the extraction parameters. The exact structure of references depends on the reference_depth setting of each aspect and concept.
Example Usage:
# ContextGem: Extracting All Aspects and Concepts from Document
import os
from contextgem import Aspect, Document, DocumentLLM, StringConcept
# Sample text content
text_content = """
John Smith is a 30-year-old software engineer working at TechCorp.
He has 5 years of experience in Python development and leads a team of 8 developers.
His annual salary is $95,000 and he graduated from MIT with a Computer Science degree.
"""
# Create a Document object from text
doc = Document(raw_text=text_content)
# Define aspects and concepts directly on the document
doc.aspects = [
    Aspect(
        name="Professional Information",
        description="Information about the person's career, job, and work experience",
    )
]
doc.concepts = [
    StringConcept(
        name="Person name",
        description="Full name of the person",
    )
]
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# Extract all aspects and concepts from the document
processed_doc = llm.extract_all(doc)
# Access extracted aspect information
aspect = processed_doc.aspects[0]
print(f"Aspect: {aspect.name}")
print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")
# Access extracted concept information
concept = processed_doc.concepts[0]
print(f"Concept: {concept.name}")
print(f"Extracted value: {concept.extracted_items[0].value}")
📄 Aspect Extraction Methods
extract_aspects_from_document()
Extracts Aspect instances from a Document.
Method Signature:
def extract_aspects_from_document(
    self,
    document: Document,
    from_aspects: Optional[list[Aspect]] = None,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
) -> list[Aspect]
Note
An async equivalent extract_aspects_from_document_async() is also available.
Parameters:
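Parameter | Type | Default | Description |
---|---|---|---|
document | Document | Required | The document with attached Aspect instances. |
from_aspects | Optional[list[Aspect]] | None | Specific aspects to extract from the document. If None, all aspects attached to the document are extracted. |
overwrite_existing | bool | False | Whether to overwrite already processed aspects with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
max_items_per_call | int | 0 | Maximum number of Aspect instances to process in a single LLM call (single LLM prompt). |
use_concurrency | bool | False | Enable concurrent processing of multiple Aspect instances. |
max_paragraphs_to_analyze_per_call | int | 0 | Maximum paragraphs to include in a single LLM call (single LLM prompt). |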
Return Value:
Returns a list of Aspect instances that were processed during extraction. If from_aspects was specified, returns only those aspects; otherwise returns all aspects attached to the document. Each aspect in the returned list will have its extracted_items field populated with the extracted information, and its reference_paragraphs field will always be set. The reference_sentences field will only be populated when the aspect’s reference_depth is set to "sentences".
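The reference rule described above (paragraph references always set, sentence references only at "sentences" depth) can be sketched with a small stand-in. StubAspect and finest_references are hypothetical names for this illustration, and the "paragraphs" default for reference_depth is an assumption rather than something this page states.

```python
from dataclasses import dataclass, field


@dataclass
class StubAspect:
    """Hypothetical stand-in for Aspect; only the fields discussed above."""

    reference_depth: str = "paragraphs"  # assumed alternative to "sentences"
    reference_paragraphs: list[str] = field(default_factory=list)
    reference_sentences: list[str] = field(default_factory=list)


def finest_references(aspect: StubAspect) -> list[str]:
    """Return sentence-level references when requested, else paragraphs."""
    if aspect.reference_depth == "sentences":
        return aspect.reference_sentences
    return aspect.reference_paragraphs


coarse = StubAspect(
    reference_paragraphs=["TechCorp reported $50 million in revenue for 2023."]
)
fine = StubAspect(
    reference_depth="sentences",
    reference_paragraphs=["TechCorp reported $50 million in revenue for 2023. It grew 25%."],
    reference_sentences=["TechCorp reported $50 million in revenue for 2023."],
)
print(len(finest_references(coarse)), len(finest_references(fine)))
```

Code that consumes extraction results may want a similar check before reading reference_sentences, since that field is only populated at sentence depth.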
Example Usage:
# ContextGem: Extracting Aspects from Documents
import os
from contextgem import Aspect, Document, DocumentLLM
# Sample text content
text_content = """
TechCorp is a leading software development company founded in 2015 with headquarters in San Francisco.
The company specializes in cloud-based solutions and has grown to 500 employees across 12 countries.
Their flagship product, CloudManager Pro, serves over 10,000 enterprise clients worldwide.
TechCorp reported $50 million in revenue for 2023, representing a 25% growth from the previous year.
The company is known for its innovative AI-powered analytics platform and excellent customer support.
They recently expanded into the European market and plan to launch three new products in 2024.
"""
# Create a Document object from text
doc = Document(raw_text=text_content)
# Define aspects to extract from the document
doc.aspects = [
    Aspect(
        name="Company Overview",
        description="Basic information about the company, founding, location, and size",
    ),
    Aspect(
        name="Financial Performance",
        description="Revenue, growth metrics, and financial indicators",
    ),
    Aspect(
        name="Products and Services",
        description="Information about the company's products, services, and offerings",
    ),
]
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# Extract aspects from the document
extracted_aspects = llm.extract_aspects_from_document(doc)
# Access extracted aspect information
for aspect in extracted_aspects:
    print(f"Aspect: {aspect.name}")
    print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")
    print("---")
🧠 Concept Extraction Methods
extract_concepts_from_document()
Extracts _Concept instances from a Document object.
Note
See supported concept types in Supported Concepts. All public concept types inherit from the internal _Concept base class.
Method Signature:
def extract_concepts_from_document(
    self,
    document: Document,
    from_concepts: Optional[list[_Concept]] = None,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
    max_images_to_analyze_per_call: int = 0,
) -> list[_Concept]
Note
An async equivalent extract_concepts_from_document_async() is also available.
Parameters:
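Parameter | Type | Default | Description |
---|---|---|---|
document | Document | Required | The document from which concepts are to be extracted. |
from_concepts | Optional[list[_Concept]] | None | Specific concepts to extract from the document. If None, all concepts attached to the document are extracted. |
overwrite_existing | bool | False | Whether to overwrite already processed concepts with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
max_items_per_call | int | 0 | Maximum number of _Concept instances to process in a single LLM call (single LLM prompt). |
use_concurrency | bool | False | Enable concurrent processing of multiple _Concept instances. |
max_paragraphs_to_analyze_per_call | int | 0 | Maximum paragraphs to include in a single LLM call (single LLM prompt). |
max_images_to_analyze_per_call | int | 0 | Maximum images to include in a single LLM call (single LLM prompt). |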
Return Value:
Returns a list of _Concept instances that were processed during extraction. If from_concepts was specified, returns only those concepts; otherwise returns all concepts attached to the document. Each concept in the returned list will have its extracted_items field populated with the extracted information, and if applicable, reference_paragraphs / reference_sentences will be set based on the extraction parameters.
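The selection rule for from_concepts described above can be sketched as a small function over concept names. select_targets is a hypothetical helper for illustration only, and the check that requested items must already be attached is an assumption of this sketch, not documented library behavior.

```python
from typing import Optional


def select_targets(
    attached: list[str], requested: Optional[list[str]] = None
) -> list[str]:
    """Return the names to process: the requested subset, or everything attached."""
    if requested is None:
        return list(attached)
    # Assumption for this sketch: requested items must already be attached.
    unknown = [name for name in requested if name not in attached]
    if unknown:
        raise ValueError(f"Not attached to the document: {unknown}")
    return list(requested)


attached = ["Company Name", "CEO Name", "Employee Count", "Annual Revenue"]
print(select_targets(attached))
print(select_targets(attached, ["CEO Name"]))
```

Passing a subset is a way to reprocess one concept without touching the others.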
Example Usage:
# ContextGem: Extracting Concepts Directly from Documents
import os
from contextgem import Document, DocumentLLM, NumericalConcept, StringConcept
# Sample text content
text_content = """
GreenTech Solutions is an environmental technology company founded in 2018 in Portland, Oregon.
The company develops sustainable energy solutions and has 75 employees working remotely across the United States.
Their primary product, EcoMonitor, helps businesses track carbon emissions and has been adopted by 2,500 organizations.
GreenTech Solutions reported strong financial performance with $8.5 million in revenue for 2024.
The company's CEO, Sarah Johnson, announced plans to achieve carbon neutrality by 2025.
They recently opened a new research facility in Seattle and hired 20 additional engineers.
"""
# Create a Document object from text
doc = Document(raw_text=text_content)
# Define concepts to extract from the document
doc.concepts = [
    StringConcept(
        name="Company Name",
        description="Full name of the company",
    ),
    StringConcept(
        name="CEO Name",
        description="Full name of the company's CEO",
    ),
    NumericalConcept(
        name="Employee Count",
        description="Total number of employees at the company",
        numeric_type="int",
    ),
    StringConcept(
        name="Annual Revenue",
        description="Company's total revenue for the year",
    ),
]
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# Extract concepts from the document
extracted_concepts = llm.extract_concepts_from_document(doc)
# Access extracted concept information
print("Concepts extracted from document:")
for concept in extracted_concepts:
    print(f"  {concept.name}: {[item.value for item in concept.extracted_items]}")
extract_concepts_from_aspect()
Extracts _Concept instances associated with a given Aspect in a Document.
The aspect must be processed before its concepts can be extracted. Aspect extraction identifies the context (text segments) in the document that matches the aspect’s description, and concept extraction then runs only on that context, so concepts are identified within the scope of the aspect.
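The ordering requirement above can be sketched as a guard that refuses to build concept context from an unprocessed aspect. StubAspect, its is_processed flag, and concept_context are hypothetical names for this illustration; how the library actually tracks processing state is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class StubAspect:
    """Hypothetical stand-in for Aspect; not the ContextGem class."""

    name: str
    reference_paragraphs: list[str] = field(default_factory=list)
    is_processed: bool = False  # assumed flag, for illustration only


def concept_context(aspect: StubAspect) -> str:
    """Return the text that concept extraction would be scoped to."""
    if not aspect.is_processed:
        raise ValueError(f"Aspect '{aspect.name}' must be extracted first")
    return "\n".join(aspect.reference_paragraphs)


aspect = StubAspect(name="Financial Performance")
try:
    concept_context(aspect)
except ValueError as exc:
    print(exc)

# Simulate aspect extraction populating the relevant context.
aspect.reference_paragraphs = [
    "DataFlow Systems announced $12 million in annual revenue for 2024."
]
aspect.is_processed = True
print(concept_context(aspect))
```

In practice this means calling aspect extraction first and passing the processed aspect on to concept extraction.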
Note
See supported concept types in Supported Concepts. All public concept types inherit from the internal _Concept base class.
Method Signature:
def extract_concepts_from_aspect(
    self,
    aspect: Aspect,
    document: Document,
    from_concepts: Optional[list[_Concept]] = None,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
) -> list[_Concept]
Note
An async equivalent extract_concepts_from_aspect_async() is also available.
Parameters:
Parameter | Type | Default | Description |
---|---|---|---|
aspect | Aspect | Required | The aspect from which to extract concepts. Must be previously processed through aspect extraction before concepts can be extracted. |
document | Document | Required | The document that contains the aspect with the attached concepts to be extracted. |
from_concepts | Optional[list[_Concept]] | None | Specific concepts to extract from the aspect. If None, all concepts attached to the aspect are extracted. |
overwrite_existing | bool | False | Whether to overwrite already processed concepts with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
max_items_per_call | int | 0 | Maximum number of _Concept instances to process in a single LLM call (single LLM prompt). |
use_concurrency | bool | False | Enable concurrent processing of multiple _Concept instances. |
max_paragraphs_to_analyze_per_call | int | 0 | Maximum number of the aspect’s paragraphs to analyze in a single LLM call (single LLM prompt). |
Return Value:
Returns a list of _Concept instances that were processed during extraction from the specified aspect. If from_concepts was specified, returns only those concepts; otherwise returns all concepts attached to the aspect. Each concept in the returned list will have its extracted_items field populated with the extracted information, and if applicable, reference_paragraphs / reference_sentences will be set based on the extraction parameters.
Example Usage:
# ContextGem: Extracting Concepts from Specific Aspects
import os
from contextgem import Aspect, Document, DocumentLLM, NumericalConcept, StringConcept
# Sample text content
text_content = """
DataFlow Systems is an innovative fintech startup that was established in 2020 in Austin, Texas.
The company has rapidly grown to 150 employees and operates in 8 major cities across North America.
DataFlow's core platform, FinanceStream, is used by more than 5,000 small businesses for automated accounting.
In their latest financial report, DataFlow Systems announced $12 million in annual revenue for 2024.
This represents an impressive 40% increase compared to their 2023 performance.
The company has secured $25 million in Series B funding and plans to expand internationally next year.
"""
# Create a Document object from text
doc = Document(raw_text=text_content)
# Define an aspect to extract from the document
financial_aspect = Aspect(
    name="Financial Performance",
    description="Revenue, growth metrics, and financial indicators",
)
# Add concepts to the aspect
financial_aspect.concepts = [
    StringConcept(
        name="Annual Revenue",
        description="Total revenue reported for the year",
    ),
    NumericalConcept(
        name="Growth Rate",
        description="Percentage growth rate compared to previous year",
        numeric_type="float",
    ),
    NumericalConcept(
        name="Revenue Year",
        description="The year for which revenue is reported",
    ),
]
# Attach the aspect to the document
doc.aspects = [financial_aspect]
# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)
# First, extract the aspect from the document (required before concept extraction)
extracted_aspects = llm.extract_aspects_from_document(doc)
financial_aspect = extracted_aspects[0]
# Extract concepts from the specific aspect
extracted_concepts = llm.extract_concepts_from_aspect(financial_aspect, doc)
# Access extracted concepts for the aspect
print(f"Aspect: {financial_aspect.name}")
print(f"Extracted items: {[item.value for item in financial_aspect.extracted_items]}")
print("\nConcepts extracted from this aspect:")
for concept in extracted_concepts:
    print(f"  {concept.name}: {[item.value for item in concept.extracted_items]}")