DOCX Converter

Warning

Deprecated since v0.22.0. DocxConverter will be removed in v1.0.0.

Use a dedicated document conversion library, such as Docling or MarkItDown, to convert files to text, then pass the result to Document via the raw_text argument.

# Recommended pattern
from contextgem import Document

# text = <result from Docling / MarkItDown / your converter of choice>
document = Document(raw_text=text)
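
As a slightly fuller sketch of the recommended pattern, the snippet below wires a converter into Document creation. It assumes MarkItDown is installed and that its `MarkItDown().convert(path).text_content` call returns the converted text; the helper names `extract_text`, `markitdown_to_text`, and `build_document` are hypothetical and not part of ContextGem or MarkItDown.

```python
from typing import Callable


def extract_text(path: str, convert: Callable[[str], str]) -> str:
    """Run a converter on `path` and return its text, rejecting empty output."""
    text = convert(path).strip()
    if not text:
        raise ValueError(f"No text extracted from {path!r}")
    return text


def markitdown_to_text(path: str) -> str:
    """Convert a file to Markdown text with MarkItDown (assumed installed)."""
    from markitdown import MarkItDown  # imported lazily; optional dependency

    return MarkItDown().convert(path).text_content


def build_document(path: str, convert: Callable[[str], str] = markitdown_to_text):
    """Create a ContextGem Document from the converted text."""
    from contextgem import Document  # imported lazily; requires contextgem

    return Document(raw_text=extract_text(path, convert))
```

Passing the converter as a callable keeps the choice of library (Docling, MarkItDown, or another tool) in one place, so swapping converters does not touch the Document creation code.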

This page is retained as a reference for users on v0.22.x.

ContextGem’s built-in DOCX converter transforms DOCX files into LLM-ready ContextGem document objects.

📑 Comprehensive extraction of document elements: paragraphs, headings, lists, tables, comments, footnotes, textboxes, headers/footers, links, embedded images, and inline formatting
🧩 Document structure preservation with rich metadata for improved LLM analysis
🛠️ Built-in converter that directly processes Word XML

🚀 Usage

from contextgem import DocxConverter

converter = DocxConverter()
document = converter.convert("path/to/file.docx")

🔄 Conversion Process

DocxConverter performs the following operations when its convert() method turns a DOCX file into a ContextGem Document:

Elements	Extraction Details	Control Parameter (Default)
Text	Extracts the full document text as raw text, and optionally applies markdown processing and formatting while preserving raw text separately	`apply_markdown=True`
Paragraphs	Extracts `Paragraph` objects with rich metadata that serves as additional context for the LLM (e.g., “Style: Normal, Table: 3, Row: 1, Column: 3, Table Cell”)	Always included
Headings	Preserves heading levels and formats as markdown headings when in markdown mode	Always included
Lists	Maintains list hierarchy, numbering, and formatting with proper indentation and list type information	Always included
Tables	Preserves table structure and formats tables in markdown mode	`include_tables=True`
Headers & Footers	Captures document headers and footers with appropriate metadata	`include_headers=True` / `include_footers=True`
Footnotes	Extracts footnotes with references and preserves connection to original text	`include_footnotes=True`
Comments	Preserves document comments with author information and timestamps	`include_comments=True`
Links	Processes and formats hyperlinks, preserving both link text and target URLs	`include_links=True`
Text Boxes	Extracts text from various text box formats	`include_textboxes=True`
Inline Formatting	Applies inline formatting such as bold, italic, underline, etc. when in markdown mode	`include_inline_formatting=True`
Images	Extracts embedded images and converts them to `Image` objects for further processing with vision models	`include_images=True`
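
The control parameters in the table above can be collected in one place before calling the converter. The sketch below assumes they are accepted as keyword arguments of convert() with the defaults listed in the table; `conversion_options` and `convert_docx` are hypothetical helper names, not part of ContextGem.

```python
# Defaults as listed in the "Control Parameter (Default)" column.
DEFAULT_OPTIONS = {
    "apply_markdown": True,
    "include_tables": True,
    "include_headers": True,
    "include_footers": True,
    "include_footnotes": True,
    "include_comments": True,
    "include_links": True,
    "include_textboxes": True,
    "include_inline_formatting": True,
    "include_images": True,
}


def conversion_options(**overrides: bool) -> dict:
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise TypeError(f"Unknown conversion option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


def convert_docx(path: str, **overrides: bool):
    """Convert a DOCX file, assuming convert() accepts these keyword arguments."""
    from contextgem import DocxConverter  # imported lazily; requires contextgem

    return DocxConverter().convert(path, **conversion_options(**overrides))
```

For example, `convert_docx("path/to/file.docx", include_comments=False, include_images=False)` would skip comments and embedded images while leaving every other element at its default.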

ℹ️ Current Limitations

DocxConverter has the following limitations:

Drawings such as charts are skipped because they are difficult to represent as text.
Inline markdown formatting (bold, italic, etc.) and hyperlink formatting are not supported in specially marked sections (headers, footers, footnotes, comments).
Extraction of generated table of contents (ToC) is not supported. (A ToC is an automatically generated list of document headings with page numbers that Word creates based on heading styles.)
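
Because charts are skipped silently, it can help to check whether a file contains any before relying on the converted output. The sketch below is independent of ContextGem: it inspects the DOCX package (a ZIP archive) and assumes Word's conventional part naming, where chart parts are stored as word/charts/chartN.xml. The function name `list_chart_parts` is hypothetical.

```python
import io
import zipfile


def list_chart_parts(docx_file) -> list[str]:
    """Return chart part names found in a DOCX package.

    `docx_file` may be a path or a binary file-like object. Relies on the
    conventional "word/charts/chartN.xml" naming, which is an assumption.
    """
    with zipfile.ZipFile(docx_file) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("word/charts/chart") and name.endswith(".xml")
        )
```

A non-empty result suggests the document contains charts whose content will not appear in the converted text, so those files may need separate handling.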
