Converters
- class contextgem.public.converters.DocxConverter
Bases: object
Converter for DOCX files into ContextGem documents.
This class handles extraction of text, formatting, tables, images, footnotes, comments, and other elements from DOCX files by directly parsing Word XML.
The resulting ContextGem document is populated with the following:
- Raw text: The raw text of the DOCX file, converted to markdown or left as raw text depending on the raw_text_to_md flag.
- Paragraphs: Paragraph objects with the following metadata:
  - Raw text: The raw text of the paragraph.
  - Additional context: Metadata about the paragraph’s style, list level, table cell position, whether it is part of a footnote or comment, etc. This context gives the LLM extra information for analysis and extraction.
- Images: Image objects constructed from embedded images in the DOCX file.
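As a minimal sketch of how the populated fields might be inspected, the helper below summarizes a converted document. It assumes the document exposes `raw_text`, `paragraphs`, and `images` attributes, and that each paragraph carries `raw_text` and `additional_context`. These attribute names are assumptions inferred from the description above, and `summarize_document` is a hypothetical helper, not part of ContextGem.

```python
def summarize_document(document, preview_chars=60):
    """Return a small summary dict for a converted document.

    Assumes `document` has `raw_text`, `paragraphs`, and `images`
    attributes (names inferred from the description, not verified here).
    """
    paragraphs = list(document.paragraphs or [])
    images = list(document.images or [])
    return {
        "raw_text_length": len(document.raw_text or ""),
        "paragraph_count": len(paragraphs),
        "image_count": len(images),
        # Pair a short preview of each paragraph with its context metadata.
        "paragraph_previews": [
            (p.raw_text[:preview_chars], getattr(p, "additional_context", None))
            for p in paragraphs
        ],
    }
```

With ContextGem installed, this could be called as `summarize_document(DocxConverter().convert("path/to/document.docx"))`.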
- Example:
- DocxConverter usage example

    # Using ContextGem's DocxConverter

    from contextgem import DocxConverter

    converter = DocxConverter()

    # Convert a DOCX file to an LLM-ready ContextGem Document
    # from path
    document = converter.convert("path/to/document.docx")
    # or from file object
    with open("path/to/document.docx", "rb") as docx_file_object:
        document = converter.convert(docx_file_object)

    # You can also use it as a standalone text extractor
    docx_text = converter.convert_to_text_format(
        "path/to/document.docx",
        output_format="markdown",  # or "raw"
    )
- convert_to_text_format(docx_path_or_file, output_format='markdown', include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, strict_mode=False)
Converts a DOCX file directly to text without creating a ContextGem Document.
- Parameters:
  - docx_path_or_file (str | pathlib.Path | typing.BinaryIO) – Path to the DOCX file (as a string or Path object) or a file-like object
  - output_format (typing.Literal['raw', 'markdown']) – Output format (“markdown” or “raw”) (default: “markdown”)
  - include_tables (bool) – If True, include tables in the output (default: True)
  - include_comments (bool) – If True, include comments in the output (default: True)
  - include_footnotes (bool) – If True, include footnotes in the output (default: True)
  - include_headers (bool) – If True, include headers in the output (default: True)
  - include_footers (bool) – If True, include footers in the output (default: True)
  - include_textboxes (bool) – If True, include textbox content (default: True)
  - strict_mode (bool) – If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)
- Return type: str
- Returns: Text in the specified format
Note
When using markdown output format, the following conditions apply:
- Document structure elements (headings, lists, tables) are preserved
- Character-level formatting (bold, italic, underline) is intentionally skipped to ensure proper text matching between markdown and DOCX content
- Headings are converted to markdown heading syntax (# Heading 1, ## Heading 2, etc.)
- Lists are converted to markdown list syntax, preserving numbering and hierarchy
- Tables are formatted using markdown table syntax
- Footnotes, comments, headers, and footers are included as specially marked sections
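The include_* flags and the heading rule in the note above can be combined in a short sketch. The helpers `extract_body_text` and `heading_lines` are hypothetical names introduced for illustration. The first passes only parameters documented for convert_to_text_format(). The second relies on the stated conversion of headings to `#` syntax and would also pick up any non-heading line that happens to start with `#`.

```python
def extract_body_text(converter, docx_path_or_file, output_format="markdown"):
    """Extract main body text, skipping comments, footnotes, headers, footers.

    `converter` is expected to be a DocxConverter instance; the keyword
    arguments mirror the documented convert_to_text_format() parameters.
    """
    return converter.convert_to_text_format(
        docx_path_or_file,
        output_format=output_format,  # "markdown" or "raw"
        include_tables=True,
        include_comments=False,
        include_footnotes=False,
        include_headers=False,
        include_footers=False,
    )


def heading_lines(markdown_text):
    """Return the lines that start with '#', i.e. markdown-style headings."""
    return [line for line in markdown_text.splitlines() if line.startswith("#")]
```

With ContextGem installed, a call might look like `heading_lines(extract_body_text(DocxConverter(), "path/to/document.docx"))`.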
- convert(docx_path_or_file, raw_text_to_md=True, include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_images=True, strict_mode=False)
Converts a DOCX file into a ContextGem Document object.
- Parameters:
  - docx_path_or_file (str | pathlib.Path | typing.BinaryIO) – Path to the DOCX file (as a string or Path object) or a file-like object
  - raw_text_to_md (bool) – If True, convert raw text to markdown (default: True)
  - include_tables (bool) – If True, include tables in the output (default: True)
  - include_comments (bool) – If True, include comments in the output (default: True)
  - include_footnotes (bool) – If True, include footnotes in the output (default: True)
  - include_headers (bool) – If True, include headers in the output (default: True)
  - include_footers (bool) – If True, include footers in the output (default: True)
  - include_textboxes (bool) – If True, include textbox content (default: True)
  - include_images (bool) – If True, extract and include images (default: True)
  - strict_mode (bool) – If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)
- Return type: Document
- Returns: A populated Document object
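One way the strict_mode flag might be used is to attempt a strict conversion first and retry leniently if it fails. The sketch below is an illustration under assumptions: `convert_with_fallback` is a hypothetical helper, `io.BytesIO` is assumed to be an acceptable file-like object for docx_path_or_file, and the exception types raised in strict mode are not listed in this reference, so `Exception` is caught broadly.

```python
import io


def convert_with_fallback(converter, docx_bytes):
    """Try a strict conversion first, then retry leniently on failure.

    Returns a (document, was_strict) tuple. `converter` is expected to be
    a DocxConverter instance.
    """
    try:
        # strict_mode=True raises on any processing error.
        return converter.convert(io.BytesIO(docx_bytes), strict_mode=True), True
    except Exception:
        # strict_mode=False skips problematic elements instead of raising.
        # A fresh BytesIO is used because the first stream may be consumed.
        return converter.convert(io.BytesIO(docx_bytes), strict_mode=False), False
```

The boolean in the result lets the caller log or flag documents that only converted with problematic elements skipped.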