Extraction Methods

This guide documents the extraction methods provided by the DocumentLLM and DocumentLLMGroup classes for extracting aspects and concepts from documents using large language models.


📄🧠 Complete Document Processing

extract_all()

Performs comprehensive extraction by processing a Document for all Aspect and _Concept instances. This is the most commonly used method for complete document analysis.

Note

See supported concept types in Supported Concepts. All public concept types inherit from the internal _Concept base class.

Method Signature:

def extract_all(
    self,
    document: Document,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
    max_images_to_analyze_per_call: int = 0,
) -> Document

Note

An async equivalent extract_all_async() is also available.
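
The async variant is mostly useful when several documents are processed from the same event loop. The sketch below shows that pattern with a stand-in coroutine only: fake_extract_all_async and process_many are hypothetical names, and in real code the stand-in would be replaced by awaiting extract_all_async() on a configured DocumentLLM.

```python
import asyncio


async def fake_extract_all_async(document: str) -> str:
    """Hypothetical stand-in for awaiting extract_all_async() on a DocumentLLM."""
    await asyncio.sleep(0.01)  # simulates waiting on the LLM API
    return f"processed:{document}"


async def process_many(documents: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(*(fake_extract_all_async(d) for d in documents))


results = asyncio.run(process_many(["doc-a", "doc-b", "doc-c"]))
print(results)  # ['processed:doc-a', 'processed:doc-b', 'processed:doc-c']
```

Because gather() preserves input order, each result can be matched back to its document by position.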

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document | Document | Required | The document with attached Aspect and/or _Concept instances to extract. |
| overwrite_existing | bool | False | Whether to overwrite already processed Aspect and _Concept instances with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
| max_items_per_call | int | 0 | Maximum number of Aspect and/or _Concept instances with the same extraction parameters to process in a single LLM call (single LLM prompt). 0 means all aspect and/or concept instances with the same extraction parameters are processed in one call. This is particularly useful for complex tasks or long documents, as it prevents prompt overloading and lets the LLM focus on a smaller set of extraction tasks at once. |
| use_concurrency | bool | False | Enables concurrent processing of multiple Aspect and/or _Concept instances. This can significantly reduce processing time by executing multiple extraction tasks in parallel, which is especially beneficial for documents with many aspects and concepts. However, it might cause rate limit errors with LLM providers. When enabled, adjust the async_limiter on your DocumentLLM to control request frequency (default is 3 acquisitions per 10 seconds). For optimal results, combine with max_items_per_call=1 to maximize concurrency, although this increases LLM API costs because each aspect/concept is processed in a separate LLM call (LLM prompt). See Optimizing for Speed for examples of concurrency configuration. |
| max_paragraphs_to_analyze_per_call | int | 0 | Maximum number of paragraphs to include in a single LLM call (single LLM prompt). 0 means all paragraphs. This parameter is crucial when working with long documents that exceed the LLM’s context window. By limiting the number of paragraphs per call, you ensure the LLM processes the document in manageable segments while maintaining semantic coherence. This prevents token limit errors and often improves extraction quality by allowing the model to focus on smaller portions of text at a time. For more details on handling long documents, see Dealing with Long Documents. |
| max_images_to_analyze_per_call | int | 0 | Maximum number of Image instances to analyze in a single LLM call (single LLM prompt). 0 means all images. This parameter is crucial when working with documents containing multiple images that might exceed the LLM’s context window. By limiting the number of images per call, you ensure the LLM processes the document’s visual content in manageable batches. Relevant only when extracting document-level concepts from document images. See 🖼️ Concept Extraction from Document (vision) for an example of extracting concepts from document images. |
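
The max_items_per_call description can be pictured as a two-step plan: group items that share extraction parameters, then split each group into per-call batches. The sketch below is an illustration of that reading, not the library's implementation; plan_calls, the item names, and the parameter labels are all hypothetical.

```python
from collections import defaultdict
from collections.abc import Hashable, Iterable


def plan_calls(
    items: Iterable[tuple[str, Hashable]], max_items_per_call: int = 0
) -> list[list[str]]:
    """Group item names by extraction params, then split each group into batches."""
    groups: dict[Hashable, list[str]] = defaultdict(list)
    for name, params in items:
        groups[params].append(name)

    calls: list[list[str]] = []
    for names in groups.values():
        # 0 means "no limit": the whole group goes into a single call
        size = max_items_per_call or len(names)
        calls.extend(names[i : i + size] for i in range(0, len(names), size))
    return calls


# Hypothetical items: (name, label standing in for its extraction params)
items = [
    ("Person name", "default"),
    ("Job title", "default"),
    ("Employer", "default"),
    ("Salary", "with_justifications"),
]
print(plan_calls(items))
# [['Person name', 'Job title', 'Employer'], ['Salary']]
print(plan_calls(items, max_items_per_call=2))
# [['Person name', 'Job title'], ['Employer'], ['Salary']]
```

Under this reading, max_items_per_call=1 yields one call per item, which is the configuration the use_concurrency description recommends for maximum concurrency.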


Return Value:

Returns the same Document instance passed as input, but with all attached Aspect and _Concept instances populated with their extracted items. The document’s aspects and concepts will have their extracted_items field populated with the extracted information, and if applicable, reference_paragraphs/reference_sentences will be set based on the extraction parameters. The exact structure of references depends on the reference_depth setting of each aspect and concept.

Example Usage:

Extracting all aspects and concepts from a document
# ContextGem: Extracting All Aspects and Concepts from Document

import os

from contextgem import Aspect, Document, DocumentLLM, StringConcept

# Sample text content
text_content = """
John Smith is a 30-year-old software engineer working at TechCorp. 
He has 5 years of experience in Python development and leads a team of 8 developers.
His annual salary is $95,000 and he graduated from MIT with a Computer Science degree.
"""

# Create a Document object from text
doc = Document(raw_text=text_content)

# Define aspects and concepts directly on the document
doc.aspects = [
    Aspect(
        name="Professional Information",
        description="Information about the person's career, job, and work experience",
    )
]

doc.concepts = [
    StringConcept(
        name="Person name",
        description="Full name of the person",
    )
]

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract all aspects and concepts from the document
processed_doc = llm.extract_all(doc)

# Access extracted aspect information
aspect = processed_doc.aspects[0]
print(f"Aspect: {aspect.name}")
print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")

# Access extracted concept information
concept = processed_doc.concepts[0]
print(f"Concept: {concept.name}")
print(f"Extracted value: {concept.extracted_items[0].value}")

📄 Aspect Extraction Methods

extract_aspects_from_document()

Extracts Aspect instances from a Document.

Method Signature:

def extract_aspects_from_document(
    self,
    document: Document,
    from_aspects: Optional[list[Aspect]] = None,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
) -> list[Aspect]

Note

An async equivalent extract_aspects_from_document_async() is also available.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document | Document | Required | The document with attached Aspect instances to be extracted. |
| from_aspects | Optional[list[Aspect]] | None | Specific aspects to extract from the document. If None, extracts all aspects attached to the document. This allows you to selectively process only certain aspects rather than the entire set. |
| overwrite_existing | bool | False | Whether to overwrite already processed aspects with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
| max_items_per_call | int | 0 | Maximum number of Aspect instances with the same extraction parameters to process in a single LLM call (single LLM prompt). 0 means all aspect instances with the same extraction parameters are processed in one call. This is particularly useful for complex tasks or long documents, as it prevents prompt overloading and lets the LLM focus on a smaller set of extraction tasks at once. |
| use_concurrency | bool | False | Enables concurrent processing of multiple Aspect instances. This can significantly reduce processing time by executing multiple extraction tasks concurrently, which is especially beneficial for documents with many aspects. However, it might cause rate limit errors with LLM providers. When enabled, adjust the async_limiter on your DocumentLLM to control request frequency (default is 3 acquisitions per 10 seconds). For optimal results, combine with max_items_per_call=1 to maximize concurrency, although this increases LLM API costs because each aspect is processed in a separate LLM call (LLM prompt). See Optimizing for Speed for examples of concurrency configuration. |
| max_paragraphs_to_analyze_per_call | int | 0 | Maximum number of paragraphs to include in a single LLM call (single LLM prompt). 0 means all paragraphs. This parameter is crucial when working with long documents that exceed the LLM’s context window. By limiting the number of paragraphs per call, you ensure the LLM processes the document in manageable segments while maintaining semantic coherence. This prevents token limit errors and often improves extraction quality by allowing the model to focus on smaller portions of text at a time. For more details on handling long documents, see Dealing with Long Documents. |
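
To make the effect of max_paragraphs_to_analyze_per_call concrete, the sketch below splits a document's paragraphs into consecutive chunks, one chunk per LLM call. It assumes simple fixed-size, in-order chunking, which is an assumption rather than documented behavior; chunk_paragraphs and the sample text are hypothetical.

```python
def chunk_paragraphs(
    paragraphs: list[str], max_paragraphs_per_call: int = 0
) -> list[list[str]]:
    """Split paragraphs into consecutive chunks; 0 keeps everything in one chunk."""
    if max_paragraphs_per_call < 0:
        raise ValueError("max_paragraphs_per_call must be >= 0")
    if not paragraphs:
        return []
    size = max_paragraphs_per_call or len(paragraphs)
    # Consecutive slices keep neighbouring paragraphs together in the same call
    return [paragraphs[i : i + size] for i in range(0, len(paragraphs), size)]


# Made-up sample text with paragraphs separated by blank lines
text = "Intro.\n\nFounding and location.\n\nProducts.\n\nRevenue.\n\nOutlook."
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

chunks = chunk_paragraphs(paragraphs, max_paragraphs_per_call=2)
for number, chunk in enumerate(chunks, start=1):
    print(f"call {number}: {chunk}")
# call 1: ['Intro.', 'Founding and location.']
# call 2: ['Products.', 'Revenue.']
# call 3: ['Outlook.']
```

Smaller chunks keep each prompt well inside the context window, at the price of more calls per document.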


Return Value:

Returns a list of Aspect instances that were processed during extraction. If from_aspects was specified, returns only those aspects; otherwise returns all aspects attached to the document. Each aspect in the returned list will have its extracted_items field populated with the extracted information, and its reference_paragraphs field will always be set. The reference_sentences field will only be populated when the aspect’s reference_depth is set to "sentences".

Example Usage:

Extracting aspects from a document
# ContextGem: Extracting Aspects from Documents

import os

from contextgem import Aspect, Document, DocumentLLM

# Sample text content
text_content = """
TechCorp is a leading software development company founded in 2015 with headquarters in San Francisco.
The company specializes in cloud-based solutions and has grown to 500 employees across 12 countries.
Their flagship product, CloudManager Pro, serves over 10,000 enterprise clients worldwide.
TechCorp reported $50 million in revenue for 2023, representing a 25% growth from the previous year.
The company is known for its innovative AI-powered analytics platform and excellent customer support.
They recently expanded into the European market and plan to launch three new products in 2024.
"""

# Create a Document object from text
doc = Document(raw_text=text_content)

# Define aspects to extract from the document
doc.aspects = [
    Aspect(
        name="Company Overview",
        description="Basic information about the company, founding, location, and size",
    ),
    Aspect(
        name="Financial Performance",
        description="Revenue, growth metrics, and financial indicators",
    ),
    Aspect(
        name="Products and Services",
        description="Information about the company's products, services, and offerings",
    ),
]

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1-mini",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract aspects from the document
extracted_aspects = llm.extract_aspects_from_document(doc)

# Access extracted aspect information
for aspect in extracted_aspects:
    print(f"Aspect: {aspect.name}")
    print(f"Extracted items: {[item.value for item in aspect.extracted_items]}")
    print("---")

🧠 Concept Extraction Methods

extract_concepts_from_document()

Extracts _Concept instances from a Document object.

Note

See supported concept types in Supported Concepts. All public concept types inherit from the internal _Concept base class.

Method Signature:

def extract_concepts_from_document(
    self,
    document: Document,
    from_concepts: Optional[list[_Concept]] = None,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
    max_images_to_analyze_per_call: int = 0,
) -> list[_Concept]

Note

An async equivalent extract_concepts_from_document_async() is also available.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document | Document | Required | The document from which concepts are to be extracted. |
| from_concepts | Optional[list[_Concept]] | None | Specific concepts to extract from the document. If None, extracts all concepts attached to the document. This allows you to selectively process only certain concepts rather than the entire set. |
| overwrite_existing | bool | False | Whether to overwrite already processed concepts with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
| max_items_per_call | int | 0 | Maximum number of _Concept instances with the same extraction parameters to process in a single LLM call (single LLM prompt). 0 means all concept instances with the same extraction parameters are processed in one call. This is particularly useful for complex tasks or long documents, as it prevents prompt overloading and lets the LLM focus on a smaller set of extraction tasks at once. |
| use_concurrency | bool | False | Enables concurrent processing of multiple _Concept instances. This can significantly reduce processing time by executing multiple extraction tasks concurrently, which is especially beneficial for documents with many concepts. However, it might cause rate limit errors with LLM providers. When enabled, adjust the async_limiter on your DocumentLLM to control request frequency (default is 3 acquisitions per 10 seconds). For optimal results, combine with max_items_per_call=1 to maximize concurrency, although this increases LLM API costs because each concept is processed in a separate LLM call (LLM prompt). See Optimizing for Speed for examples of concurrency configuration. |
| max_paragraphs_to_analyze_per_call | int | 0 | Maximum number of paragraphs to include in a single LLM call (single LLM prompt). 0 means all paragraphs. This parameter is crucial when working with long documents that exceed the LLM’s context window. By limiting the number of paragraphs per call, you ensure the LLM processes the document in manageable segments while maintaining semantic coherence. |
| max_images_to_analyze_per_call | int | 0 | Maximum number of images to include in a single LLM call (single LLM prompt). 0 means all images. This parameter is crucial when extracting concepts from documents with multiple images using vision-capable LLMs. It helps prevent overwhelming the model with too many visual inputs at once, manages token usage more effectively, and enables more focused concept extraction from visual content. See 🖼️ Concept Extraction from Document (vision) for an example of extracting concepts from document images. |
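
The default limit of 3 acquisitions per 10 seconds puts a floor on how quickly concurrent calls can start. The sketch below estimates that floor under the assumption of a leaky-bucket limiter (a burst of max_rate calls, then one slot freed every time_period / max_rate seconds). The limiter model, the function earliest_start_times, and its parameter names are assumptions for illustration; network latency and LLM response time are ignored.

```python
def earliest_start_times(
    n_calls: int, max_rate: int = 3, time_period: float = 10.0
) -> list[float]:
    """Earliest start time in seconds of each call under a leaky-bucket limiter."""
    interval = time_period / max_rate  # time for one slot to free up
    # The first max_rate calls start immediately; later calls queue one interval apart
    return [max(0, k - max_rate) * interval for k in range(1, n_calls + 1)]


starts = earliest_start_times(6)
print([round(t, 2) for t in starts])
# [0.0, 0.0, 0.0, 3.33, 6.67, 10.0]
```

With max_items_per_call=1 and six concepts, the last call could not start before roughly 10 seconds under these defaults, which is why the limiter is usually raised alongside use_concurrency when the provider's rate limits allow it.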


Return Value:

Returns a list of _Concept instances that were processed during extraction. If from_concepts was specified, returns only those concepts; otherwise returns all concepts attached to the document. Each concept in the returned list will have its extracted_items field populated with the extracted information, and if applicable, reference_paragraphs/reference_sentences will be set based on the extraction parameters.

Example Usage:

Extracting concepts from a document
# ContextGem: Extracting Concepts Directly from Documents

import os

from contextgem import Document, DocumentLLM, NumericalConcept, StringConcept

# Sample text content
text_content = """
GreenTech Solutions is an environmental technology company founded in 2018 in Portland, Oregon.
The company develops sustainable energy solutions and has 75 employees working remotely across the United States.
Their primary product, EcoMonitor, helps businesses track carbon emissions and has been adopted by 2,500 organizations.
GreenTech Solutions reported strong financial performance with $8.5 million in revenue for 2024.
The company's CEO, Sarah Johnson, announced plans to achieve carbon neutrality by 2025.
They recently opened a new research facility in Seattle and hired 20 additional engineers.
"""

# Create a Document object from text
doc = Document(raw_text=text_content)

# Define concepts to extract from the document
doc.concepts = [
    StringConcept(
        name="Company Name",
        description="Full name of the company",
    ),
    StringConcept(
        name="CEO Name",
        description="Full name of the company's CEO",
    ),
    NumericalConcept(
        name="Employee Count",
        description="Total number of employees at the company",
        numeric_type="int",
    ),
    StringConcept(
        name="Annual Revenue",
        description="Company's total revenue for the year",
    ),
]

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# Extract concepts from the document
extracted_concepts = llm.extract_concepts_from_document(doc)

# Access extracted concept information
print("Concepts extracted from document:")
for concept in extracted_concepts:
    print(f"  {concept.name}: {[item.value for item in concept.extracted_items]}")

extract_concepts_from_aspect()

Extracts _Concept instances associated with a given Aspect in a Document.

The aspect must already be processed before concept extraction can occur. This means that the aspect should have gone through extraction, which identifies the relevant context (text segments) in the document that match the aspect’s description. This extracted context is then used as the foundation for concept extraction, allowing concepts to be identified specifically within the scope of the aspect.
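
The scoping described above can be illustrated with a toy two-stage pipeline. This is only an analogy: keyword matching and a regular expression stand in for the LLM, and the sample paragraphs, FINANCIAL_KEYWORDS, and find_percentages are made up for the illustration.

```python
import re

# Made-up sample paragraphs
paragraphs = [
    "DataFlow Systems was established in 2020 in Austin, Texas.",
    "The company announced $12 million in annual revenue for 2024.",
    "This represents a 40% increase compared to 2023.",
    "The office cafeteria is 15% larger after renovation.",
]

# Stage 1 (stand-in for aspect extraction): keep paragraphs relevant to the aspect
FINANCIAL_KEYWORDS = ("revenue", "increase", "funding")
aspect_paragraphs = [
    p for p in paragraphs if any(k in p.lower() for k in FINANCIAL_KEYWORDS)
]


# Stage 2 (stand-in for concept extraction): search only the given context
def find_percentages(texts: list[str]) -> list[float]:
    return [float(m) for t in texts for m in re.findall(r"(\d+(?:\.\d+)?)%", t)]


print(find_percentages(aspect_paragraphs))  # [40.0]
print(find_percentages(paragraphs))         # [40.0, 15.0]
```

Restricting the search to the aspect's context drops the unrelated 15% figure, which is the benefit of extracting concepts from an aspect rather than from the whole document.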

Note

See supported concept types in Supported Concepts. All public concept types inherit from the internal _Concept base class.

Method Signature:

def extract_concepts_from_aspect(
    self,
    aspect: Aspect,
    document: Document,
    from_concepts: Optional[list[_Concept]] = None,
    overwrite_existing: bool = False,
    max_items_per_call: int = 0,
    use_concurrency: bool = False,
    max_paragraphs_to_analyze_per_call: int = 0,
) -> list[_Concept]

Note

An async equivalent extract_concepts_from_aspect_async() is also available.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| aspect | Aspect | Required | The aspect from which to extract concepts. It must already have been processed through aspect extraction before concepts can be extracted. |
| document | Document | Required | The document that contains the aspect with the attached concepts to be extracted. |
| from_concepts | Optional[list[_Concept]] | None | Specific concepts to extract from the aspect. If None, extracts all concepts attached to the aspect. This allows you to selectively process only certain concepts rather than the entire set. |
| overwrite_existing | bool | False | Whether to overwrite already processed concepts with newly extracted information. This is particularly useful when reprocessing documents with updated LLMs or extraction parameters. |
| max_items_per_call | int | 0 | Maximum number of _Concept instances with the same extraction parameters to process in a single LLM call (single LLM prompt). 0 means all concept instances with the same extraction parameters are processed in one call. This is particularly useful for complex tasks, as it prevents prompt overloading and lets the LLM focus on a smaller set of extraction tasks at once. |
| use_concurrency | bool | False | Enables concurrent processing of multiple _Concept instances. This can significantly reduce processing time by executing multiple extraction tasks concurrently, which is especially beneficial for aspects with many concepts. However, it might cause rate limit errors with LLM providers. When enabled, adjust the async_limiter on your DocumentLLM to control request frequency (default is 3 acquisitions per 10 seconds). For optimal results, combine with max_items_per_call=1 to maximize concurrency, although this increases LLM API costs because each concept is processed in a separate LLM call (LLM prompt). See Optimizing for Speed for examples of concurrency configuration. |
| max_paragraphs_to_analyze_per_call | int | 0 | Maximum number of the aspect’s paragraphs to analyze in a single LLM call (single LLM prompt). 0 means all the aspect’s paragraphs. This parameter is crucial when working with long documents or aspects that cover extensive portions of text that might exceed the LLM’s context window. By limiting the number of paragraphs per call, you can break down analysis into manageable chunks or allow the LLM to focus more deeply on smaller sections of text at a time. For more details on handling long documents, see Dealing with Long Documents. |
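
Since both limits trade prompt size for call count, a rough call estimate helps when weighing cost against speed. The sketch below assumes that all concepts share the same extraction parameters and that every concept batch is run against every paragraph chunk, so the two counts multiply; that combination rule is an assumption, not documented behavior, and estimate_llm_calls is a hypothetical helper.

```python
from math import ceil


def estimate_llm_calls(
    n_items: int,
    n_paragraphs: int,
    max_items_per_call: int = 0,
    max_paragraphs_per_call: int = 0,
) -> int:
    """Rough number of LLM calls, assuming item batches x paragraph chunks."""
    # A limit of 0 means "everything in one call" for that dimension
    item_batches = ceil(n_items / max_items_per_call) if max_items_per_call else 1
    paragraph_chunks = (
        ceil(n_paragraphs / max_paragraphs_per_call) if max_paragraphs_per_call else 1
    )
    return item_batches * paragraph_chunks


print(estimate_llm_calls(6, 40))                                # 1
print(estimate_llm_calls(6, 40, max_items_per_call=1))          # 6
print(estimate_llm_calls(6, 40, max_items_per_call=1,
                         max_paragraphs_per_call=10))           # 24
```

The jump from 1 to 24 calls in this made-up scenario shows why small per-call limits are best paired with use_concurrency and a suitably configured async_limiter.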


Return Value:

Returns a list of _Concept instances that were processed during extraction from the specified aspect. If from_concepts was specified, returns only those concepts; otherwise returns all concepts attached to the aspect. Each concept in the returned list will have its extracted_items field populated with the extracted information, and if applicable, reference_paragraphs/reference_sentences will be set based on the extraction parameters.

Example Usage:

Extracting concepts from an aspect
# ContextGem: Extracting Concepts from Specific Aspects

import os

from contextgem import Aspect, Document, DocumentLLM, NumericalConcept, StringConcept

# Sample text content
text_content = """
DataFlow Systems is an innovative fintech startup that was established in 2020 in Austin, Texas.
The company has rapidly grown to 150 employees and operates in 8 major cities across North America.
DataFlow's core platform, FinanceStream, is used by more than 5,000 small businesses for automated accounting.
In their latest financial report, DataFlow Systems announced $12 million in annual revenue for 2024.
This represents an impressive 40% increase compared to their 2023 performance.
The company has secured $25 million in Series B funding and plans to expand internationally next year.
"""

# Create a Document object from text
doc = Document(raw_text=text_content)

# Define an aspect to extract from the document
financial_aspect = Aspect(
    name="Financial Performance",
    description="Revenue, growth metrics, and financial indicators",
)

# Add concepts to the aspect
financial_aspect.concepts = [
    StringConcept(
        name="Annual Revenue",
        description="Total revenue reported for the year",
    ),
    NumericalConcept(
        name="Growth Rate",
        description="Percentage growth rate compared to previous year",
        numeric_type="float",
    ),
    NumericalConcept(
        name="Revenue Year",
        description="The year for which revenue is reported",
        numeric_type="int",  # a year is a whole number
    ),
]

# Attach the aspect to the document
doc.aspects = [financial_aspect]

# Configure DocumentLLM with your API parameters
llm = DocumentLLM(
    model="azure/gpt-4.1",
    api_key=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_VERSION"),
    api_base=os.getenv("CONTEXTGEM_AZURE_OPENAI_API_BASE"),
)

# First, extract the aspect from the document (required before concept extraction)
extracted_aspects = llm.extract_aspects_from_document(doc)
financial_aspect = extracted_aspects[0]

# Extract concepts from the specific aspect
extracted_concepts = llm.extract_concepts_from_aspect(financial_aspect, doc)

# Access extracted concepts for the aspect
print(f"Aspect: {financial_aspect.name}")
print(f"Extracted items: {[item.value for item in financial_aspect.extracted_items]}")
print("\nConcepts extracted from this aspect:")
for concept in extracted_concepts:
    print(f"  {concept.name}: {[item.value for item in concept.extracted_items]}")