DOCX Converter


ContextGem provides a built-in converter that transforms DOCX files into LLM-ready ContextGem Document objects.

📑 Comprehensive extraction of document elements: paragraphs, headings, lists, tables, comments, footnotes, textboxes, headers/footers, links, embedded images, and inline formatting
🧩 Document structure preservation with rich metadata for improved LLM analysis
🛠️ Built-in converter that directly processes Word XML
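
To illustrate what "directly processes Word XML" can mean in practice, here is a minimal sketch that reads paragraph text straight from a DOCX package using only the Python standard library. It is not ContextGem's implementation; the function names `read_paragraph_texts` and `make_minimal_docx` are hypothetical, and the in-memory package built for the demo contains only `word/document.xml`, so it is not a complete DOCX file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraph_texts(docx_source):
    """Return the text of each <w:p> paragraph in the main document part.

    Illustrative only: this is not how ContextGem is implemented.
    """
    # A DOCX file is a ZIP package; the main body lives in word/document.xml
    with zipfile.ZipFile(docx_source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is spread across <w:t> elements
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def make_minimal_docx(texts):
    """Build a tiny in-memory package for the demo (hypothetical helper)."""
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo_paragraphs = read_paragraph_texts(make_minimal_docx(["Hello", "World"]))
```

A real converter has to handle far more than this (styles, numbering, tables, footnotes, relationships), which is what the metadata-rich extraction described above covers.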

🚀 Usage

# Using ContextGem's DocxConverter

from contextgem import DocxConverter


converter = DocxConverter()

# Convert a DOCX file to an LLM-ready ContextGem Document
# from path
document = converter.convert("path/to/document.docx")
# or from file object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)

# Perform data extraction on the resulting Document object
# document.add_aspects(...)
# document.add_concepts(...)
# llm.extract_all(document)

# You can also use a DocxConverter instance as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)

🔄 Conversion Process

The DocxConverter performs the following operations when converting a DOCX file to a ContextGem Document with the convert() method:

| Elements | Extraction Details | Control Parameter (Default) |
| --- | --- | --- |
| Text | Extracts the full document text as raw text, and optionally applies markdown processing and formatting while preserving raw text separately | `apply_markdown=True` |
| Paragraphs | Extracts `Paragraph` objects with rich metadata serving as additional context for LLM (e.g., “Style: Normal, Table: 3, Row: 1, Column: 3, Table Cell”) | Always included |
| Headings | Preserves heading levels and formats as markdown headings when in markdown mode | Always included |
| Lists | Maintains list hierarchy, numbering, and formatting with proper indentation and list type information | Always included |
| Tables | Preserves table structure and formats tables in markdown mode | `include_tables=True` |
| Headers & Footers | Captures document headers and footers with appropriate metadata | `include_headers=True` / `include_footers=True` |
| Footnotes | Extracts footnotes with references and preserves connection to original text | `include_footnotes=True` |
| Comments | Preserves document comments with author information and timestamps | `include_comments=True` |
| Links | Processes and formats hyperlinks, preserving both link text and target URLs | `include_links=True` |
| Text Boxes | Extracts text from various text box formats | `include_textboxes=True` |
| Inline Formatting | Applies inline formatting such as bold, italic, underline, etc. when in markdown mode | `include_inline_formatting=True` |
| Images | Extracts embedded images and converts them to `Image` objects for further processing with vision models | `include_images=True` |
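
As a rough sketch of how the control parameters in the table might be combined, the snippet below collects the listed defaults into a dictionary and validates overrides before use. It assumes the parameters are keyword arguments of `convert()`, which should be checked against the API reference. The helper `build_convert_options` is hypothetical and not part of ContextGem, and the call to the converter is left commented out because it needs the package and a real file.

```python
# Defaults copied from the table of control parameters.
# Assumption: these are keyword arguments accepted by DocxConverter.convert().
DEFAULT_CONVERT_OPTIONS = {
    "apply_markdown": True,
    "include_tables": True,
    "include_headers": True,
    "include_footers": True,
    "include_footnotes": True,
    "include_comments": True,
    "include_links": True,
    "include_textboxes": True,
    "include_inline_formatting": True,
    "include_images": True,
}


def build_convert_options(**overrides):
    """Merge overrides into the defaults, rejecting names not in the table.

    Hypothetical helper for illustration; not part of ContextGem.
    """
    unknown = set(overrides) - set(DEFAULT_CONVERT_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_CONVERT_OPTIONS, **overrides}


# Example: keep the body text and tables, but skip images and review comments
body_only = build_convert_options(include_images=False, include_comments=False)

# Hypothetical usage, assuming contextgem is installed and the file exists:
# from contextgem import DocxConverter
# document = DocxConverter().convert("path/to/document.docx", **body_only)
```

Rejecting unknown names early means a typo such as `include_image` fails loudly instead of silently leaving the default in place.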

ℹ️ Current Limitations

DocxConverter has the following limitations:

Drawings such as charts are skipped because they are difficult to represent as text.
Inline markdown formatting (bold, italic, etc.) and hyperlink formatting are not supported in specially marked sections (headers, footers, footnotes, comments).
Extraction of generated table of contents (ToC) is not supported. (A ToC is an automatically generated list of document headings with page numbers that Word creates based on heading styles.)
Converting very long DOCX files with complex formatting may be very slow. (We aim to improve this in the future.)
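
Because charts are skipped, it can be useful to know before conversion whether a file contains any. The sketch below checks the DOCX package for chart parts using only the standard library. It relies on the convention that Word stores charts as `word/charts/chartN.xml`. That location is an assumption about typical files, not a guarantee, and `list_chart_parts` is a hypothetical helper, not a ContextGem function.

```python
import io
import posixpath
import zipfile


def list_chart_parts(docx_source):
    """Return the names of chart parts found in a DOCX package.

    Assumes the conventional word/charts/chartN.xml layout; parts stored
    elsewhere would not be detected by this simple check.
    """
    with zipfile.ZipFile(docx_source) as package:
        names = package.namelist()
    return sorted(
        name
        for name in names
        if name.startswith("word/charts/")
        and posixpath.basename(name).startswith("chart")
        and name.endswith(".xml")
    )


# Demo on an in-memory package that mimics the layout (not a complete DOCX)
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, "w") as demo_package:
    demo_package.writestr("word/document.xml", "<doc/>")
    demo_package.writestr("word/charts/chart1.xml", "<chart/>")
    demo_package.writestr("word/charts/style1.xml", "<style/>")
demo_buffer.seek(0)

chart_parts = list_chart_parts(demo_buffer)
if chart_parts:
    print(f"{len(chart_parts)} chart(s) will not appear in the converted text")
```

The function accepts either a path or a binary file object, mirroring the two input styles shown in the usage section.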
