DOCX Converter
ContextGem provides a built-in converter that transforms DOCX files into LLM-ready ContextGem document objects.
Extracts information that other open-source tools often do not capture: misaligned tables, comments, footnotes, textboxes, headers/footers, and embedded images
Preserves document structure with rich metadata for improved LLM analysis
Custom native converter that directly processes Word XML with zero external dependencies
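To make "directly processes Word XML" more concrete: a DOCX file is a ZIP archive of XML parts, and elements such as comments live in their own part (`word/comments.xml`) rather than in the main document body. The sketch below is not ContextGem's implementation. It is a minimal standard-library illustration, and the helper names `read_comments` and `make_sample_docx` as well as the sample data are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by DOCX parts
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_comments(docx_bytes: bytes) -> list[dict]:
    """Return author, date and text for each comment in a DOCX archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Comments are stored in a separate part, not in word/document.xml.
        if "word/comments.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/comments.xml"))

    comments = []
    for comment in root.findall("w:comment", NS):
        # Join every text node inside the comment, ignoring run boundaries.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W}}}t"))
        comments.append(
            {
                "author": comment.get(f"{{{W}}}author"),
                "date": comment.get(f"{{{W}}}date"),
                "text": text,
            }
        )
    return comments


def make_sample_docx() -> bytes:
    """Build a tiny in-memory archive with only a comments part (made-up data)."""
    comments_xml = (
        f'<w:comments xmlns:w="{W}">'
        '<w:comment w:id="0" w:author="Reviewer" w:date="2024-01-01T10:00:00Z">'
        "<w:p><w:r><w:t>Please check this clause.</w:t></w:r></w:p>"
        "</w:comment></w:comments>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/comments.xml", comments_xml)
    return buffer.getvalue()


sample_comments = read_comments(make_sample_docx())
print(sample_comments)
```

The sample archive is deliberately incomplete (it has no main document part), so it only demonstrates where comment data sits inside the package.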
Usage
# Using ContextGem's DocxConverter
from contextgem import DocxConverter
converter = DocxConverter()
# Convert a DOCX file to an LLM-ready ContextGem Document
# from path
document = converter.convert("path/to/document.docx")
# or from file object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)
# You can also use it as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)
Conversion Process
The DocxConverter performs the following operations when converting a DOCX file to a ContextGem Document:
| Elements | Extraction Details |
|---|---|
| Text | Extracts the full document text as either raw text or markdown format (controlled by [...]) |
| Paragraphs | Extracts [...] |
| Headings | Preserves heading levels and formats them as markdown headings when in markdown mode |
| Lists | Maintains list hierarchy, numbering, and formatting with proper indentation and list type information |
| Tables | Preserves table structure and formats tables in markdown mode (can be excluded using [...]) |
| Headers & Footers | Captures document headers and footers with appropriate metadata (can be excluded using [...]) |
| Footnotes | Extracts footnotes with references and preserves the connection to the original text (can be excluded using [...]) |
| Comments | Preserves document comments with author information and timestamps (can be excluded using [...]) |
| Text Boxes | Extracts text from various text box formats (can be excluded using [...]) |
| Images | Extracts embedded images and converts them to [...] |
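As a rough illustration of the "Paragraphs" and "Headings" rows, the sketch below reads paragraph style information from WordprocessingML and renders heading styles as markdown headings. It is not ContextGem's code. It assumes style IDs of the form `Heading1` to `Heading6` (the usual built-in IDs in English-language documents), and the function name `paragraphs_to_markdown` and the sample XML are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_to_markdown(document_xml: str) -> list[tuple[str, str]]:
    """Return (style, markdown_line) pairs for each non-empty paragraph."""
    root = ET.fromstring(document_xml)
    results = []
    for para in root.iter(f"{{{W}}}p"):
        # The paragraph style, if any, is stored in w:pPr/w:pStyle/@w:val.
        style_el = para.find("w:pPr/w:pStyle", NS)
        style = (style_el.get(f"{{{W}}}val") if style_el is not None else None) or "Normal"
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        if not text:
            continue
        # Assumption: heading styles are named Heading1..Heading6.
        match = re.fullmatch(r"Heading([1-6])", style)
        if match:
            text = "#" * int(match.group(1)) + " " + text
        results.append((style, text))
    return results


# Made-up document body with two headings and one normal paragraph
SAMPLE_DOCUMENT = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Scope</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>This agreement covers services.</w:t></w:r></w:p>"
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Fees</w:t></w:r></w:p>'
    "</w:body></w:document>"
)

markdown_pairs = paragraphs_to_markdown(SAMPLE_DOCUMENT)
for style, line in markdown_pairs:
    print(f"[{style}] {line}")
```

Keeping the style name next to each line is one simple way to carry paragraph-level metadata alongside the text; the metadata ContextGem attaches may differ.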
Beyond Standard Libraries
Our evaluation of popular open-source DOCX processing libraries revealed critical limitations: most packages either omit important elements (e.g. comments, textboxes, or embedded images), fail to handle complex structures (such as inconsistently formatted tables), or cannot extract paragraphs with the rich metadata needed for LLM processing.
While it would have been much easier to use an existing open-source package as a dependency, these limitations compelled us to build a custom solution. The DocxConverter was developed specifically to address these gaps, ensuring extraction of the most commonly occurring DOCX elements with their contextual relationships preserved.
Current Limitations
DocxConverter has the following limitations, some of which are intentional:
Character-level styling (e.g., bold, underline, italics, strikethrough) is intentionally skipped to ensure proper matching of processed paragraphs and sentences in the DOCX content.
Nested tables are preserved but may lead to table cell duplication.
Consecutive textboxes are preserved but may lead to textbox content duplication.
Drawings such as charts are skipped because they are difficult to represent in text format.
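The first limitation can be illustrated with a small sketch. In WordprocessingML, a paragraph is split into runs, and character styling such as bold is a property of a run. Concatenating run text while ignoring run properties yields plain text that can be searched as a contiguous string, whereas inline markers would break such matches. This is one reading of the matching rationale stated above, not ContextGem's implementation; the function name `plain_paragraph_text` and the sample paragraph are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def plain_paragraph_text(paragraph_xml: str) -> str:
    """Join the text of all runs, ignoring run properties such as bold."""
    para = ET.fromstring(paragraph_xml)
    return "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))


# Made-up paragraph: the middle run is bold (w:b inside w:rPr)
SAMPLE_PARAGRAPH = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">Payment is due </w:t></w:r>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>within 30 days</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> of invoice.</w:t></w:r>'
    "</w:p>"
)

plain = plain_paragraph_text(SAMPLE_PARAGRAPH)
styled = "Payment is due **within 30 days** of invoice."  # same text with bold markers

quote = "within 30 days of invoice"
print(quote in plain)   # the quote matches the plain text
print(quote in styled)  # the bold markers interrupt the quote
```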