DOCX Converter

ContextGem provides a built-in converter that transforms DOCX files into LLM-ready ContextGem Document objects.

  • 📑 Comprehensive extraction of document elements: paragraphs, headings, lists, tables, comments, footnotes, textboxes, headers/footers, links, embedded images, and inline formatting

  • 🧩 Document structure preservation with rich metadata for improved LLM analysis

  • 🛠️ Built-in converter that directly processes Word XML

🚀 Usage

# Using ContextGem's DocxConverter

from contextgem import DocxConverter


converter = DocxConverter()

# Convert a DOCX file to an LLM-ready ContextGem Document
# from path
document = converter.convert("path/to/document.docx")
# or from file object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)

# Perform data extraction on the resulting Document object
# document.add_aspects(...)
# document.add_concepts(...)
# llm.extract_all(document)

# You can also use a DocxConverter instance as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)
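
The standalone text extraction shown above extends naturally to a whole folder of files. The following is a minimal sketch, not part of ContextGem: export_docx_folder is a hypothetical helper, and it assumes that convert_to_text_format() accepts a file path plus the output_format keyword and returns a string, as the snippet above suggests.

```python
from pathlib import Path


def export_docx_folder(converter, src_dir, out_dir, output_format="markdown"):
    """Convert every .docx file in src_dir to text and write it to out_dir.

    `converter` is any object exposing
    convert_to_text_format(path, output_format=...), such as a DocxConverter
    instance. The return value is assumed to be a string.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    suffix = ".md" if output_format == "markdown" else ".txt"

    written = []
    for docx_path in sorted(src.glob("*.docx")):
        # Word creates "~$name.docx" lock files while a document is open;
        # they are not valid DOCX archives, so skip them.
        if docx_path.name.startswith("~$"):
            continue
        text = converter.convert_to_text_format(
            str(docx_path), output_format=output_format
        )
        target = out / (docx_path.stem + suffix)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With ContextGem installed, a call such as export_docx_folder(DocxConverter(), "path/to/docx_dir", "path/to/text_dir") would write one text file per source document. Passing the converter in as an argument keeps the helper easy to test with a stand-in object.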

🔄 Conversion Process

The DocxConverter performs the following operations when converting a DOCX file to a ContextGem Document with the convert() method:

| Elements | Extraction Details | Control Parameter (Default) |
| --- | --- | --- |
| Text | Extracts the full document text as raw text, and optionally applies markdown processing and formatting while preserving raw text separately | apply_markdown=True |
| Paragraphs | Extracts Paragraph objects with rich metadata serving as additional context for LLM (e.g., “Style: Normal, Table: 3, Row: 1, Column: 3, Table Cell”) | Always included |
| Headings | Preserves heading levels and formats as markdown headings when in markdown mode | Always included |
| Lists | Maintains list hierarchy, numbering, and formatting with proper indentation and list type information | Always included |
| Tables | Preserves table structure and formats tables in markdown mode | include_tables=True |
| Headers & Footers | Captures document headers and footers with appropriate metadata | include_headers=True / include_footers=True |
| Footnotes | Extracts footnotes with references and preserves connection to original text | include_footnotes=True |
| Comments | Preserves document comments with author information and timestamps | include_comments=True |
| Links | Processes and formats hyperlinks, preserving both link text and target URLs | include_links=True |
| Text Boxes | Extracts text from various text box formats | include_textboxes=True |
| Inline Formatting | Applies inline formatting such as bold, italic, underline, etc. when in markdown mode | include_inline_formatting=True |
| Images | Extracts embedded images and converts them to Image objects for further processing with vision models | include_images=True |
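
The control parameters listed above can be combined to narrow what ends up in the resulting Document. The sketch below is illustrative only: it assumes that convert() accepts these parameters as keyword arguments, and build_convert_options and convert_body_text_only are hypothetical helpers rather than ContextGem functions.

```python
# Defaults as listed in the Conversion Process table.
DEFAULT_OPTIONS = {
    "apply_markdown": True,
    "include_tables": True,
    "include_headers": True,
    "include_footers": True,
    "include_footnotes": True,
    "include_comments": True,
    "include_links": True,
    "include_textboxes": True,
    "include_inline_formatting": True,
    "include_images": True,
}


def build_convert_options(**overrides):
    """Return the documented defaults with selected options overridden.

    Unknown names raise immediately so that a typo is not silently ignored.
    """
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


def convert_body_text_only(path):
    """Convert a DOCX while skipping comments, headers, footers, and images.

    Assumption: convert() accepts the control parameters as keyword arguments.
    """
    from contextgem import DocxConverter  # lazy import; requires contextgem

    options = build_convert_options(
        include_comments=False,
        include_headers=False,
        include_footers=False,
        include_images=False,
    )
    return DocxConverter().convert(path, **options)
```

Keeping the defaults in one dictionary makes it easy to see which elements a given pipeline drops, and the validation step catches misspelled parameter names before the converter is called.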

ℹ️ Current Limitations

DocxConverter has the following limitations:

  • Drawings such as charts are skipped because they are difficult to represent in text format.

  • Inline markdown formatting (bold, italic, etc.) and hyperlink formatting are not supported in specially marked sections (headers, footers, footnotes, comments).

  • Extraction of generated table of contents (ToC) is not supported. (A ToC is an automatically generated list of document headings with page numbers that Word creates based on heading styles.)

  • Converting very long DOCX files with complex formatting may be slow. (Our goal is to improve this in the future.)
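
Because charts and generated tables of contents are skipped, it can help to check a file for them before conversion. The sketch below uses only the Python standard library and is independent of ContextGem; preflight_docx is a hypothetical helper. It relies on two general properties of the DOCX format: a DOCX file is a ZIP archive whose chart parts are normally stored as word/charts/chartN.xml, and a generated ToC is a field whose instruction text starts with TOC. The check is a heuristic and assumes the main document part lives at the usual word/document.xml location.

```python
import re
import zipfile

# Chart parts are normally named word/charts/chart1.xml, chart2.xml, ...
_CHART_PART = re.compile(r"word/charts/chart\d+\.xml")

# Heuristic for a TOC field in the main document part: either a complex field
# (<w:instrText> TOC ...) or a simple field (w:instr=" TOC ...").
_TOC_FIELD = re.compile(r'<w:instrText[^>]*>\s*TOC\b|w:instr="\s*TOC\b')


def preflight_docx(docx_path_or_file):
    """Report DOCX features that the limitations list says are skipped."""
    with zipfile.ZipFile(docx_path_or_file) as archive:
        chart_parts = [
            name for name in archive.namelist() if _CHART_PART.fullmatch(name)
        ]
        document_xml = archive.read("word/document.xml").decode(
            "utf-8", errors="replace"
        )
        # Uncompressed size of the main part: a rough proxy for document length.
        xml_size = archive.getinfo("word/document.xml").file_size
    return {
        "chart_count": len(chart_parts),
        "has_generated_toc": bool(_TOC_FIELD.search(document_xml)),
        "document_xml_size": xml_size,
    }
```

A caller might log a warning when chart_count is non-zero or has_generated_toc is true, so that missing content in the converted Document is not mistaken for an extraction error.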