Creating Documents

This guide explains how to create and configure Document instances to process textual and visual content for analysis.

Documents serve as the container for the content from which information (aspects and concepts) can be extracted.

⚙️ Configuration Parameters

At minimum, a document requires at least one of raw_text, paragraphs, or images:

Document creation
from pathlib import Path

from contextgem import Document, Paragraph, create_image


# Create a document with raw text content
contract_document = Document(
    raw_text=(
        "...This agreement is effective as of January 1, 2025.\n\n"
        "All parties must comply with the terms outlined herein. The terms include "
        "monthly reporting requirements and quarterly performance reviews.\n\n"
        "Failure to adhere to these terms may result in termination of the agreement. "
        "Additionally, any breach of confidentiality will be subject to penalties as "
        "described in this agreement.\n\n"
        "This agreement shall remain in force for a period of three (3) years unless "
        "otherwise terminated according to the provisions stated above..."
    ),
    paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
)

# Create a document with more advanced paragraph segmentation using a SaT model
report_document = Document(
    raw_text=(
        "Executive Summary "
        "This report outlines our quarterly performance. "
        "Revenue increased by [15%] compared to the previous quarter.\n\n"
        "Customer satisfaction metrics show positive trends across all regions..."
    ),
    paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
    sat_model_id="sat-3l-sm",  # Specify which SaT model to use
)

# Create a document with predefined paragraphs, e.g. when you use a custom
# paragraph segmentation tool
document_from_paragraphs = Document(
    paragraphs=[
        Paragraph(raw_text="This is the first paragraph."),
        Paragraph(raw_text="This is the second paragraph with more content."),
        Paragraph(raw_text="Final paragraph concluding the document."),
        # ...
    ]
)

# Create document with images

# The image path is resolved relative to the repository root for doc tests;
# replace it with the path to your own image file
current_file = Path(__file__).resolve()
root_path = current_file.parents[4]
image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

# Create a document with only images (no text)
image_document = Document(
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ]
)

# Create a document with both text and images
mixed_document = Document(
    raw_text="This document contains both text and visual elements.",
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ],
)
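
To make the difference between the two segmentation modes more concrete, the following is a rough, standard-library-only approximation of what splitting on newlines amounts to. The helper `split_on_newlines` is hypothetical and is not part of ContextGem; the library's own handling of whitespace and empty lines may differ.

```python
import re


def split_on_newlines(raw_text: str) -> list[str]:
    """Hypothetical helper (not a ContextGem function): split text into
    paragraphs on runs of newline characters, dropping empty pieces."""
    parts = re.split(r"\n+", raw_text)
    return [part.strip() for part in parts if part.strip()]


# A heading that is not followed by a newline stays fused to its body text,
# which is the kind of input where a model-based mode such as "sat" may help.
sample = (
    "Executive Summary This report outlines our quarterly performance.\n\n"
    "Customer satisfaction metrics show positive trends."
)
paragraphs = split_on_newlines(sample)
for index, paragraph in enumerate(paragraphs, start=1):
    print(index, paragraph)
```

In this sketch the heading "Executive Summary" ends up inside the first paragraph because no newline separates it from the sentence that follows.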

The Document class accepts the following parameters:

| Parameter | Type | Default Value | Description |
| --- | --- | --- | --- |
| raw_text | str \| None | None | The main text of the document as a single string. |
| paragraphs | list[Paragraph] | [] | List of Paragraph instances in consecutive order as they appear in the document. Normally auto-populated from raw_text. |
| images | list[Image] | [] | List of Image instances attached to or representing the document. Used for visual content analysis. |
| aspects | list[Aspect] | [] | List of Aspect instances associated with the document for focused analysis. Must have unique names and descriptions. See Aspect Extraction for more details. |
| concepts | list[_Concept] | [] | List of _Concept instances associated with the document for information extraction. Must have unique names and descriptions. See supported concept types in Supported Concepts. |
| paragraph_segmentation_mode | Literal["newlines", "sat"] | "newlines" | Mode for paragraph segmentation. "newlines" splits on newline characters, "sat" uses a SaT (Segment Any Text) model for intelligent segmentation. |
| sat_model_id | SaTModelId | "sat-3l-sm" | SaT model ID for paragraph/sentence segmentation or a local path to a SaT model. See wtpsplit models for available options. |
| pre_segment_sentences | bool | False | Whether to pre-segment sentences during Document initialization. When False, sentence segmentation is deferred until sentences are actually needed, improving initialization performance. |
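
The trade-off behind pre_segment_sentences can be illustrated with a small toy model of eager versus deferred segmentation. Everything below is an assumption made for illustration: `LazySentences`, its `segmentation_runs` counter, and the naive regex splitter are hypothetical and do not reflect how ContextGem implements sentence segmentation internally.

```python
import re


class LazySentences:
    """Toy stand-in for a document; NOT the ContextGem Document class."""

    def __init__(self, raw_text: str, pre_segment_sentences: bool = False):
        self.raw_text = raw_text
        self.segmentation_runs = 0  # counts how often segmentation really ran
        self._sentences = None
        if pre_segment_sentences:
            # Eager mode: pay the segmentation cost during initialization
            self._sentences = self._segment()

    def _segment(self) -> list[str]:
        self.segmentation_runs += 1
        # Naive split after ., ! or ? followed by whitespace (illustrative only)
        pieces = re.split(r"(?<=[.!?])\s+", self.raw_text.strip())
        return [piece for piece in pieces if piece]

    @property
    def sentences(self) -> list[str]:
        # Deferred mode: segment on first access, then reuse the cached result
        if self._sentences is None:
            self._sentences = self._segment()
        return self._sentences


text = "Revenue increased. Costs fell! Is growth sustainable?"
lazy_doc = LazySentences(text)  # no segmentation yet
eager_doc = LazySentences(text, pre_segment_sentences=True)  # segmented now
print(lazy_doc.segmentation_runs, eager_doc.segmentation_runs)
print(lazy_doc.sentences)
```

With deferred segmentation, a document whose sentences are never requested never pays the segmentation cost, which is the initialization benefit the parameter description refers to.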

🔄 DOCX Document Conversion

ContextGem provides a built-in DocxConverter to easily transform DOCX files into LLM-ready Document instances.

For detailed usage examples and configuration options, see DOCX Converter.

🎯 Adding Aspects and Concepts for Extraction

Before extracting information from a document with an LLM, you must define and add aspects and concepts to your document instance. These components serve as the foundation for targeted analysis and structured information extraction.

Aspects define the text segments (sections, topics, themes) to be extracted from the document. They can be combined with concepts for comprehensive analysis.

Concepts define specific data points to be extracted or inferred from the document content: entities, insights, structured objects, classifications, numerical calculations, dates, ratings, and assessments.
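
The parameter table states that the aspects and concepts attached to a document must have unique names and descriptions. As a minimal sketch of what such a constraint means in practice, the snippet below checks a list of items before they would be attached. `Item` and `check_unique` are hypothetical names introduced here; they are not ContextGem classes or functions, and how the library itself reports a violation is not covered by this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for an aspect or a concept (not a ContextGem class)."""

    name: str
    description: str


def check_unique(items: list[Item]) -> None:
    """Raise ValueError if any name or any description appears more than once."""
    for field_name in ("name", "description"):
        seen = set()
        for item in items:
            value = getattr(item, field_name)
            if value in seen:
                raise ValueError(f"Duplicate {field_name}: {value!r}")
            seen.add(value)


items = [
    Item("Termination", "Clauses describing how the agreement can end"),
    Item("Effective date", "Date on which the agreement takes effect"),
]
check_unique(items)  # passes silently because all names and descriptions differ
```

Running a check like this before attaching items makes it easier to spot two components that would otherwise be indistinguishable by name or by description.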

For detailed guidance on creating and configuring these components, see: