Creating Documents

This guide explains how to create and configure Document instances to process textual and visual content for analysis.

A Document is the container for the content from which information (aspects and concepts) is extracted.

⚙️ Configuration Parameters#

At minimum, a document requires at least one of raw_text, paragraphs, or images. Text and images can also be combined in the same document:

Document creation#

from pathlib import Path

from contextgem import Document, Paragraph, create_image


# Create a document with raw text content
contract_document = Document(
    raw_text=(
        "...This agreement is effective as of January 1, 2025.\n\n"
        "All parties must comply with the terms outlined herein. The terms include "
        "monthly reporting requirements and quarterly performance reviews.\n\n"
        "Failure to adhere to these terms may result in termination of the agreement. "
        "Additionally, any breach of confidentiality will be subject to penalties as "
        "described in this agreement.\n\n"
        "This agreement shall remain in force for a period of three (3) years unless "
        "otherwise terminated according to the provisions stated above..."
    ),
    paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
)

# Create a document with more advanced paragraph segmentation using a SaT model
report_document = Document(
    raw_text=(
        "Executive Summary "
        "This report outlines our quarterly performance. "
        "Revenue increased by [15%] compared to the previous quarter.\n\n"
        "Customer satisfaction metrics show positive trends across all regions..."
    ),
    paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
    sat_model_id="sat-3l-sm",  # Specify which SaT model to use
)

# Create a document with predefined paragraphs, e.g. when you use a custom
# paragraph segmentation tool
document_from_paragraphs = Document(
    paragraphs=[
        Paragraph(raw_text="This is the first paragraph."),
        Paragraph(raw_text="This is the second paragraph with more content."),
        Paragraph(raw_text="Final paragraph concluding the document."),
        # ...
    ]
)

# Create document with images

# Path is adapted for doc tests; replace it with the path to your own image
current_file = Path(__file__).resolve()
root_path = current_file.parents[4]
image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

# Create a document with only images (no text)
image_document = Document(
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ]
)

# Create a document with both text and images
mixed_document = Document(
    raw_text="This document contains both text and visual elements.",
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ],
)
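
To make the "newlines" mode used in the first example more concrete, the following is a minimal plain-Python sketch of what splitting on newline characters amounts to. It is an illustration under stated assumptions, not ContextGem's implementation: the helper name split_on_newlines is hypothetical, and the library's actual treatment of whitespace and blank lines may differ.

```python
def split_on_newlines(raw_text: str) -> list[str]:
    """Hypothetical helper: split text on newline characters.

    Empty pieces (produced by blank lines) are dropped, and surrounding
    whitespace is stripped from each remaining piece.
    """
    return [part.strip() for part in raw_text.split("\n") if part.strip()]


text = (
    "This agreement is effective as of January 1, 2025.\n\n"
    "All parties must comply with the terms outlined herein.\n\n"
    "Failure to adhere to these terms may result in termination."
)

paragraphs = split_on_newlines(text)
for index, paragraph in enumerate(paragraphs, start=1):
    print(index, paragraph)
```

Text without reliable line breaks, such as the "Executive Summary" report in the second example, would not segment well under this approach, which is the case the "sat" mode is meant to cover.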

The Document class accepts the following parameters:

Parameter	Type	Default Value	Description
`raw_text`	`str | None`	`None`	The main text of the document as a single string.
`paragraphs`	`list[Paragraph]`	`[]`	List of `Paragraph` instances in consecutive order as they appear in the document. Normally auto-populated from `raw_text`.
`images`	`list[Image]`	`[]`	List of `Image` instances attached to or representing the document. Used for visual content analysis.
`aspects`	`list[Aspect]`	`[]`	List of `Aspect` instances associated with the document for focused analysis. Must have unique names and descriptions. See Aspect Extraction for more details.
`concepts`	`list[_Concept]`	`[]`	List of `_Concept` instances associated with the document for information extraction. Must have unique names and descriptions. See supported concept types in Supported Concepts.
`paragraph_segmentation_mode`	`Literal["newlines", "sat"]`	`"newlines"`	Mode for paragraph segmentation. `"newlines"` splits on newline characters, `"sat"` uses a SaT (Segment Any Text) model for intelligent segmentation.
`sat_model_id`	`SaTModelId`	`"sat-3l-sm"`	SaT model ID for paragraph/sentence segmentation or a local path to a SaT model. See wtpsplit models for available options.
`pre_segment_sentences`	`bool`	`False`	Whether to pre-segment sentences during Document initialization. When `False`, sentence segmentation is deferred until sentences are actually needed, improving initialization performance.
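
The deferred behavior described for `pre_segment_sentences=False` can be pictured with a small sketch of lazy, cached computation. Everything here is an assumption for illustration: LazyParagraph is a hypothetical stand-in rather than a ContextGem class, and the regex split is a naive substitute for the SaT model the library uses.

```python
import re
from functools import cached_property


class LazyParagraph:
    """Hypothetical stand-in: segments sentences only on first access."""

    def __init__(self, raw_text: str) -> None:
        self.raw_text = raw_text
        self.segmentation_runs = 0  # counts how often segmentation actually ran

    @cached_property
    def sentences(self) -> list[str]:
        # Naive split after ., ! or ? followed by whitespace (illustration only).
        self.segmentation_runs += 1
        pieces = re.split(r"(?<=[.!?])\s+", self.raw_text.strip())
        return [piece for piece in pieces if piece]


para = LazyParagraph("Revenue increased. Costs fell. Margins improved.")
runs_before_access = para.segmentation_runs  # nothing computed at construction

first_access = para.sentences   # segmentation runs here
second_access = para.sentences  # served from the cache, no second run
```

Construction stays cheap because no segmentation work happens until `sentences` is read, and repeated reads reuse the cached result. That is the trade-off the parameter controls: pay the cost up front, or only when sentences are actually needed.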

🔄 DOCX Document Conversion#

ContextGem provides a built-in DocxConverter to easily transform DOCX files into LLM-ready Document instances.

For detailed usage examples and configuration options, see DOCX Converter.

🎯 Adding Aspects and Concepts for Extraction#

Before extracting information from a document with an LLM, you must define and add aspects and concepts to your document instance. These components serve as the foundation for targeted analysis and structured information extraction.

Aspects define the text segments (sections, topics, themes) to be extracted from the document. They can be combined with concepts for comprehensive analysis.

Concepts define specific data points to be extracted or inferred from the document content: entities, insights, structured objects, classifications, numerical calculations, dates, ratings, and assessments.

For detailed guidance on creating and configuring these components, see:

Aspect Extraction - Complete guide to defining and using aspects
Supported Concepts - All available concept types and how to use them
