DOCX Converter
ContextGem provides a built-in converter that transforms DOCX files into LLM-ready ContextGem document objects.
Comprehensive extraction of document elements: paragraphs, headings, lists, tables, comments, footnotes, textboxes, headers/footers, links, embedded images, and inline formatting
Document structure preservation with rich metadata for improved LLM analysis
Built-in converter that directly processes Word XML
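To make "directly processes Word XML" concrete: a DOCX file is a ZIP archive whose main body lives in `word/document.xml`. The following is a minimal standard-library sketch of reading paragraph text from that part. It is an illustration only, not ContextGem's implementation; `read_paragraph_texts` and `build_minimal_docx` are hypothetical helper names, and the demo archive contains just the one XML part needed for the example rather than a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraph_texts(docx_source):
    """Return the text of each <w:p> paragraph in a DOCX archive.

    `docx_source` may be a file path or a binary file object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is spread over <w:t> elements in its runs
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        texts.append("".join(pieces))
    return texts


def build_minimal_docx(paragraphs):
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_minimal_docx(["Title", "First paragraph."])
    print(read_paragraph_texts(demo))
```

A real converter has to handle much more than this sketch does (styles, numbering, tables, footnotes, comments, text boxes), which is the work DocxConverter takes on.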
Usage
# Using ContextGem's DocxConverter
from contextgem import DocxConverter
converter = DocxConverter()
# Convert a DOCX file to an LLM-ready ContextGem Document
# from path
document = converter.convert("path/to/document.docx")
# or from file object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)
# Perform data extraction on the resulting Document object
# document.add_aspects(...)
# document.add_concepts(...)
# llm.extract_all(document)
# You can also use a DocxConverter instance as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)
Conversion Process
DocxConverter performs the following operations when converting a DOCX file to a ContextGem Document with the convert() method:
Elements | Extraction Details | Control Parameter (Default) |
---|---|---|
Text | Extracts the full document text as raw text, and optionally applies markdown processing and formatting while preserving raw text separately | |
Paragraphs | Extracts | Always included |
Headings | Preserves heading levels and formats as markdown headings when in markdown mode | Always included |
Lists | Maintains list hierarchy, numbering, and formatting with proper indentation and list type information | Always included |
Tables | Preserves table structure and formats tables in markdown mode | |
Headers & Footers | Captures document headers and footers with appropriate metadata | |
Footnotes | Extracts footnotes with references and preserves connection to original text | |
Comments | Preserves document comments with author information and timestamps | |
Links | Processes and formats hyperlinks, preserving both link text and target URLs | |
Text Boxes | Extracts text from various text box formats | |
Inline Formatting | Applies inline formatting such as bold, italic, underline, etc. when in markdown mode | |
Images | Extracts embedded images and converts them to | |
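The heading and list rows above describe mapping document structure onto markdown. A minimal sketch of that idea follows, assuming paragraphs have already been reduced to simple records. The `Para` dataclass and `render_markdown` function are hypothetical names introduced for illustration and are not part of ContextGem's API; the 4-space indent per nesting level is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Hypothetical paragraph record; not a ContextGem class."""

    text: str
    heading_level: Optional[int] = None  # 1-based heading level, if a heading
    list_level: Optional[int] = None  # 0-based nesting depth, if a list item
    ordered: bool = False  # numbered (True) vs bulleted (False) list item


def render_markdown(paragraphs):
    """Render headings as '#' prefixes and list items with nested indentation."""
    lines = []
    counters = {}  # list depth -> last number used for ordered items
    for p in paragraphs:
        if p.heading_level is not None:
            counters.clear()  # a heading ends any running list
            lines.append("#" * p.heading_level + " " + p.text)
        elif p.list_level is not None:
            # Numbering at deeper levels restarts once we return to a shallower one
            for depth in [d for d in counters if d > p.list_level]:
                del counters[depth]
            indent = "    " * p.list_level
            if p.ordered:
                counters[p.list_level] = counters.get(p.list_level, 0) + 1
                marker = f"{counters[p.list_level]}."
            else:
                marker = "-"
            lines.append(f"{indent}{marker} {p.text}")
        else:
            counters.clear()  # a plain paragraph also ends any running list
            lines.append(p.text)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Para("Scope", heading_level=2),
        Para("This section lists the deliverables."),
        Para("Design", list_level=0, ordered=True),
        Para("Wireframes", list_level=1),
        Para("Build", list_level=0, ordered=True),
    ]
    print(render_markdown(sample))
```

Keeping per-depth counters is one simple way to preserve numbering across nested items; the converter's actual handling of Word numbering definitions is likely more involved.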
Current Limitations
DocxConverter has the following limitations:
Drawings such as charts are skipped because they are difficult to represent as text.
Inline markdown formatting (bold, italic, etc.) and hyperlink formatting are not supported in specially marked sections (headers, footers, footnotes, comments).
Extraction of generated table of contents (ToC) is not supported. (A ToC is an automatically generated list of document headings with page numbers that Word creates based on heading styles.)
Converting very long DOCX files with complex formatting can be very slow. (Our goal is to improve this in the future.)