DOCX Converter

ContextGem provides a built-in converter that transforms DOCX files into LLM-ready ContextGem Document objects.

  • 📑 Extracts information that other open-source tools often do not capture: misaligned tables, comments, footnotes, textboxes, headers/footers, and embedded images

  • 🧩 Preserves document structure with rich metadata for improved LLM analysis

  • 🛠️ Custom native converter that directly processes Word XML with zero external dependencies
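
To illustrate what "directly processes Word XML" means: a DOCX file is a ZIP archive, and the main text lives in the `word/document.xml` part. The sketch below is not ContextGem's implementation. It only shows, using the Python standard library, how paragraph text can be read from that XML. The helper name `read_docx_paragraphs` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_paragraphs(source):
    """Return the text of each non-empty paragraph in the main document part.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper, for illustration only.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # <w:p> is a paragraph; its text is split across <w:t> elements
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Comments, footnotes, headers, and footers are stored in separate parts of the archive (for example `word/comments.xml` and `word/footnotes.xml`), which is one reason tools that read only the main document part miss them.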

🚀 Usage

# Using ContextGem's DocxConverter

from contextgem import DocxConverter

converter = DocxConverter()

# Convert a DOCX file to an LLM-ready ContextGem Document
# from path
document = converter.convert("path/to/document.docx")
# or from file object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)

# You can also use it as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)
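
When many files need to be processed, the text-extraction call above can be wrapped in a small loop. The following is a minimal sketch: `convert_folder_to_text` is a hypothetical helper, not part of ContextGem, and it relies only on the `convert_to_text_format(path, output_format=...)` call shown above.

```python
from pathlib import Path


def convert_folder_to_text(converter, folder, output_format="markdown"):
    """Map each .docx file name in `folder` to its extracted text.

    `converter` is any object exposing convert_to_text_format(),
    such as a DocxConverter instance.
    """
    texts = {}
    # Sort for a stable, reproducible processing order
    for path in sorted(Path(folder).glob("*.docx")):
        texts[path.name] = converter.convert_to_text_format(
            str(path),
            output_format=output_format,
        )
    return texts


# Example usage (requires contextgem to be installed):
# from contextgem import DocxConverter
# texts = convert_folder_to_text(DocxConverter(), "path/to/folder")
```

Passing the converter in as an argument keeps the helper easy to test with a stand-in object.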

🔄 Conversion Process

The DocxConverter performs the following operations when converting a DOCX file to a ContextGem Document:

Elements

Extraction Details

Text

Extracts the full document text as either raw text or markdown (controlled by the raw_text_to_md parameter)

Paragraphs

Extracts Paragraph objects with rich metadata serving as additional context for LLM (e.g., “Style: Normal, Table: 3, Row: 1, Column: 3, Table Cell”)

Headings

Preserves heading levels and formats as markdown headings when in markdown mode

Lists

Maintains list hierarchy, numbering, and formatting with proper indentation and list type information

Tables

Preserves table structure and formats tables in markdown mode (can be excluded using include_tables=False)

Headers & Footers

Captures document headers and footers with appropriate metadata (can be excluded using include_headers=False and include_footers=False)

Footnotes

Extracts footnotes with references and preserves connection to original text (can be excluded using include_footnotes=False)

Comments

Preserves document comments with author information and timestamps (can be excluded using include_comments=False)

Text Boxes

Extracts text from various text box formats (can be excluded using include_textboxes=False)

Images

Extracts embedded images and converts them to Image objects for further processing with vision models (can be excluded using include_images=False)
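
The exclusion flags listed above can be collected in one place when several element types need to be switched off. This is a sketch under two assumptions: that the flags are keyword arguments of `convert()`, and that each defaults to including the element. `ELEMENT_FLAGS` and `build_exclusion_kwargs` are hypothetical names introduced here for illustration.

```python
# Flag names taken from the table above; each is assumed to default to True
ELEMENT_FLAGS = {
    "tables": "include_tables",
    "headers": "include_headers",
    "footers": "include_footers",
    "footnotes": "include_footnotes",
    "comments": "include_comments",
    "textboxes": "include_textboxes",
    "images": "include_images",
}


def build_exclusion_kwargs(*excluded):
    """Return keyword arguments that switch off the named element types."""
    unknown = set(excluded) - set(ELEMENT_FLAGS)
    if unknown:
        raise ValueError(f"Unknown element types: {sorted(unknown)}")
    return {ELEMENT_FLAGS[name]: False for name in excluded}


# Assumed usage: the flags are passed to convert() as keyword arguments
# options = build_exclusion_kwargs("comments", "images")
# document = converter.convert("path/to/document.docx", **options)
```

Rejecting unknown names early avoids silently passing a misspelled flag to the converter.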

💥 Beyond Standard Libraries

Our evaluation of popular open-source DOCX processing libraries revealed critical limitations: most packages either omit important elements (e.g. comments, textboxes, or embedded images), fail to handle complex structures (such as inconsistently formatted tables), or cannot extract paragraphs with the rich metadata needed for LLM processing.

While it would have been much easier to use an existing open-source package as a dependency, these limitations compelled us to build a custom solution. The DocxConverter was developed specifically to address these gaps, ensuring extraction of the most commonly occurring DOCX elements with their contextual relationships preserved.

ℹ️ Current Limitations

DocxConverter has the following limitations, some of which are intentional:

  • Character-level styling (e.g., bold, underline, italics, strikethrough) is intentionally skipped to ensure proper matching of processed paragraphs and sentences in the DOCX content.

  • Nested tables are preserved but may lead to table cell duplication.

  • Consecutive textboxes are preserved but may lead to textbox content duplication.

  • Drawings such as charts are skipped because they are difficult to represent in text format.
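
If duplicated textbox or table cell content is a problem in extracted text, one possible workaround is to post-process the output. The sketch below is not a ContextGem feature: `drop_consecutive_duplicates` is a hypothetical helper that works on plain strings and removes a line only when it exactly repeats the previous non-empty line. It can also remove legitimately repeated lines, so it may not suit every document.

```python
def drop_consecutive_duplicates(lines):
    """Remove lines that exactly repeat the previous non-empty line."""
    result = []
    previous = None
    for line in lines:
        stripped = line.strip()
        if stripped and stripped == previous:
            continue  # skip an immediate repeat
        result.append(line)
        if stripped:
            previous = stripped  # blank lines do not reset the comparison
    return result


# Assumed usage, given extracted text held in a string `docx_text`:
# cleaned = "\n".join(drop_consecutive_duplicates(docx_text.splitlines()))
```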