Creating Documents
This guide explains how to create and configure Document instances to process textual and visual content for analysis.
Documents serve as the container for the content from which information (aspects and concepts) can be extracted.
⚙️ Configuration Parameters
The minimum configuration for a document requires at least one of raw_text, paragraphs, or images:
from pathlib import Path

from contextgem import Document, Paragraph, create_image

# Create a document with raw text content
contract_document = Document(
    raw_text=(
        "...This agreement is effective as of January 1, 2025.\n\n"
        "All parties must comply with the terms outlined herein. The terms include "
        "monthly reporting requirements and quarterly performance reviews.\n\n"
        "Failure to adhere to these terms may result in termination of the agreement. "
        "Additionally, any breach of confidentiality will be subject to penalties as "
        "described in this agreement.\n\n"
        "This agreement shall remain in force for a period of three (3) years unless "
        "otherwise terminated according to the provisions stated above..."
    ),
    paragraph_segmentation_mode="newlines",  # Default mode, splits on newlines
)

# Create a document with more advanced paragraph segmentation using a SaT model
report_document = Document(
    raw_text=(
        "Executive Summary "
        "This report outlines our quarterly performance. "
        "Revenue increased by [15%] compared to the previous quarter.\n\n"
        "Customer satisfaction metrics show positive trends across all regions..."
    ),
    paragraph_segmentation_mode="sat",  # Use SaT model for intelligent paragraph segmentation
    sat_model_id="sat-3l-sm",  # Specify which SaT model to use
)

# Create a document with predefined paragraphs, e.g. when you use a custom
# paragraph segmentation tool
document_from_paragraphs = Document(
    paragraphs=[
        Paragraph(raw_text="This is the first paragraph."),
        Paragraph(raw_text="This is the second paragraph with more content."),
        Paragraph(raw_text="Final paragraph concluding the document."),
        # ...
    ]
)

# Create document with images
# Path is adapted for doc tests
current_file = Path(__file__).resolve()
root_path = current_file.parents[4]
image_path = root_path / "tests" / "images" / "invoices" / "invoice.png"

# Create a document with only images (no text)
image_document = Document(
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ]
)

# Create a document with both text and images
mixed_document = Document(
    raw_text="This document contains both text and visual elements.",
    images=[
        create_image(image_path),  # contextgem.Image instance
        # ...
    ],
)
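The predefined-paragraphs option above assumes you already have paragraph strings from your own segmentation step. As a rough illustration of such a step, the sketch below splits text on blank lines using only the standard library. The helper name split_into_paragraphs is hypothetical and is not part of ContextGem; it is also not a description of how the built-in "newlines" mode works internally.

```python
import re


def split_into_paragraphs(text: str) -> list[str]:
    """Split text on blank lines and drop empty chunks (hypothetical helper)."""
    # One or more blank lines (optionally containing whitespace) separate paragraphs.
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample_text = (
    "This is the first paragraph.\n\n"
    "This is the second paragraph\nwrapped over two lines.\n\n\n"
    "Final paragraph concluding the document."
)
paragraph_texts = split_into_paragraphs(sample_text)
print(paragraph_texts)

# Each string can then be wrapped for ContextGem, as in the example above:
# Document(paragraphs=[Paragraph(raw_text=t) for t in paragraph_texts])
```

Single line breaks inside a paragraph are kept, so hard-wrapped text stays together as one paragraph.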
The Document class accepts the following parameters:
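| Parameter | Type | Default Value | Description |
|---|---|---|---|
| raw_text | str \| None | None | The main text of the document as a single string. |
| paragraphs | list[Paragraph] | [] | List of Paragraph instances. |
| images | list[Image] | [] | List of Image instances. |
| aspects | list[Aspect] | [] | List of Aspect instances. |
| concepts | list[_Concept] | [] | List of concept instances. |
| paragraph_segmentation_mode | Literal["newlines", "sat"] | "newlines" | Mode for paragraph segmentation. |
| sat_model_id | SaTModelId | "sat-3l-sm" | SaT model ID for paragraph/sentence segmentation or a local path to a SaT model. See wtpsplit models for available options. |
| pre_segment_sentences | bool | False | Whether to pre-segment sentences during Document initialization. |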
🔄 DOCX Document Conversion
ContextGem provides a built-in DocxConverter to easily transform DOCX files into LLM-ready Document instances.
For detailed usage examples and configuration options, see DOCX Converter.
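As a minimal sketch of that conversion, the helper below wraps the converter in a small function. It assumes that DocxConverter can be instantiated without arguments and exposes a convert method that takes a file path and returns a Document; the helper name load_docx_as_document is hypothetical. Check the DOCX Converter guide for the exact signature and options.

```python
from pathlib import Path


def load_docx_as_document(docx_path: Path):
    """Convert a DOCX file into a ContextGem Document (sketch, hypothetical helper)."""
    # Reject obviously wrong inputs before touching the converter.
    if docx_path.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got: {docx_path.name}")

    # Imported here so the helper can be defined before contextgem is installed.
    from contextgem import DocxConverter

    converter = DocxConverter()
    # Assumption: convert() accepts a path string and returns a Document.
    return converter.convert(str(docx_path))
```

A call such as load_docx_as_document(Path("contract.docx")) would then yield a Document that can be configured like the ones created above.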
🎯 Adding Aspects and Concepts for Extraction
Before extracting information from a document with an LLM, you must define and add aspects and concepts to your document instance. These components serve as the foundation for targeted analysis and structured information extraction.
Aspects define the text segments (sections, topics, themes) to be extracted from the document. They can be combined with concepts for comprehensive analysis.
Concepts define specific data points to be extracted or inferred from the document content: entities, insights, structured objects, classifications, numerical calculations, dates, ratings, and assessments.
For detailed guidance on creating and configuring these components, see:
Aspect Extraction - Complete guide to defining and using aspects
Supported Concepts - All available concept types and how to use them
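The sketch below illustrates the general shape of attaching aspects and concepts to a document. It assumes that Aspect and StringConcept accept name and description arguments and that Document provides add_aspects and add_concepts methods; the spec values and the helper name build_document_with_targets are illustrative only. Refer to the guides listed above for the exact parameters of each component.

```python
# Plain-data descriptions of what to extract (illustrative values).
ASPECT_SPECS = [
    {
        "name": "Termination",
        "description": "Provisions describing how the agreement can be terminated.",
    },
]
CONCEPT_SPECS = [
    {
        "name": "Effective date",
        "description": "The date on which the agreement takes effect.",
    },
]


def build_document_with_targets(text: str):
    """Create a Document and attach aspects and concepts to it (sketch)."""
    # Imported here so the specs above can be inspected without contextgem installed.
    from contextgem import Aspect, Document, StringConcept

    document = Document(raw_text=text)
    # Assumption: both classes take name/description keyword arguments.
    document.add_aspects([Aspect(**spec) for spec in ASPECT_SPECS])
    document.add_concepts([StringConcept(**spec) for spec in CONCEPT_SPECS])
    return document
```

Keeping the specs as plain data makes it easy to review what will be extracted before any LLM call is made.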