Converters

class contextgem.public.converters.DocxConverter

Bases: object

Converter for DOCX files into ContextGem documents.

This class handles extraction of text, formatting, tables, images, footnotes, comments, and other elements from DOCX files by directly parsing Word XML.

The resulting ContextGem document is populated with the following:

  • Raw text: The full text of the DOCX file, converted to markdown when raw_text_to_md is True and left as plain text otherwise.

  • Paragraphs: Paragraph objects with the following metadata:

    • Raw text: The raw text of the paragraph.

    • Additional context: Metadata about the paragraph’s style, list level, table cell position, being part of a footnote or comment, etc. This context provides additional information that is useful for LLM analysis and extraction.

  • Images: Image objects constructed from embedded images in the DOCX file.

Example:
DocxConverter usage example
# Using ContextGem's DocxConverter

from contextgem import DocxConverter

converter = DocxConverter()

# Convert a DOCX file to an LLM-ready ContextGem Document
# from path
document = converter.convert("path/to/document.docx")
# or from file object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)

# You can also use it as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)
convert_to_text_format(docx_path_or_file, output_format='markdown', include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, strict_mode=False)

Converts a DOCX file directly to text without creating a ContextGem Document.

Parameters:
  • docx_path_or_file (str | pathlib.Path | typing.BinaryIO) – Path to the DOCX file (as string or Path object) or a file-like object

  • output_format (typing.Literal['raw', 'markdown']) – Output format (“markdown” or “raw”) (default: “markdown”)

  • include_tables (bool) – If True, include tables in the output (default: True)

  • include_comments (bool) – If True, include comments in the output (default: True)

  • include_footnotes (bool) – If True, include footnotes in the output (default: True)

  • include_headers (bool) – If True, include headers in the output (default: True)

  • include_footers (bool) – If True, include footers in the output (default: True)

  • include_textboxes (bool) – If True, include textbox content (default: True)

  • strict_mode (bool) – If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)

Return type:

str

Returns:

Text in the specified format

Note

When using markdown output format, the following conditions apply:

  • Document structure elements (headings, lists, tables) are preserved

  • Character-level formatting (bold, italic, underline) is intentionally skipped to ensure proper text matching between markdown and DOCX content

  • Headings are converted to markdown heading syntax (# Heading 1, ## Heading 2, etc.)

  • Lists are converted to markdown list syntax, preserving numbering and hierarchy

  • Tables are formatted using markdown table syntax

  • Footnotes, comments, headers, and footers are included as specially marked sections
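The heading and list rules in the note above can be illustrated with a small standalone sketch. It is not ContextGem's implementation: the helper names `heading_to_markdown` and `list_item_to_markdown` are hypothetical, and the four-space indent per nesting level is an assumption chosen for the illustration.

```python
from typing import Optional


def heading_to_markdown(text: str, level: int) -> str:
    """Render a heading line: level 1 -> '# ...', level 2 -> '## ...'."""
    if not 1 <= level <= 6:
        raise ValueError("markdown defines heading levels 1 through 6")
    return f"{'#' * level} {text.strip()}"


def list_item_to_markdown(
    text: str, depth: int = 0, number: Optional[int] = None
) -> str:
    """Render a list item, keeping its numbering and nesting depth.

    Assumption: each nesting level is indented by four spaces.
    """
    marker = f"{number}." if number is not None else "-"
    return f"{'    ' * depth}{marker} {text.strip()}"


if __name__ == "__main__":
    print(heading_to_markdown("Heading 1", 1))  # -> # Heading 1
    print(heading_to_markdown("Heading 2", 2))  # -> ## Heading 2
    print(list_item_to_markdown("Nested step", depth=1, number=2))
```

Because character-level formatting is skipped, text produced this way differs from the DOCX paragraph text only by the structural prefix, which keeps matching between the two straightforward.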

convert(docx_path_or_file, raw_text_to_md=True, include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_images=True, strict_mode=False)

Converts a DOCX file into a ContextGem Document object.

Parameters:
  • docx_path_or_file (str | pathlib.Path | typing.BinaryIO) – Path to the DOCX file (as string or Path object) or a file-like object

  • raw_text_to_md (bool) – If True, convert raw text to markdown (default: True)

  • include_tables (bool) – If True, include tables in the output (default: True)

  • include_comments (bool) – If True, include comments in the output (default: True)

  • include_footnotes (bool) – If True, include footnotes in the output (default: True)

  • include_headers (bool) – If True, include headers in the output (default: True)

  • include_footers (bool) – If True, include footers in the output (default: True)

  • include_textboxes (bool) – If True, include textbox content (default: True)

  • include_images (bool) – If True, extract and include images (default: True)

  • strict_mode (bool) – If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)

Return type:

contextgem.public.documents.Document

Returns:

A populated Document object
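The difference between the default behaviour and strict_mode=True, as described for both methods, can be sketched as a generic skip-or-raise loop. This is only an illustration of the pattern under stated assumptions, not the converter's actual code: `process_elements` and `handle` are hypothetical names, and the "elements" here are plain strings rather than parsed Word XML.

```python
from __future__ import annotations

from typing import Callable, Iterable


def process_elements(
    elements: Iterable[str],
    handler: Callable[[str], str],
    strict_mode: bool = False,
) -> list[str]:
    """Apply handler to each element, skipping or raising on failure."""
    results = []
    for element in elements:
        try:
            results.append(handler(element))
        except Exception:
            if strict_mode:
                # Strict: surface the first processing error to the caller.
                raise
            # Lenient (default): drop the problematic element and continue.
            continue
    return results


def handle(element: str) -> str:
    """Hypothetical handler that rejects empty elements."""
    if not element:
        raise ValueError("empty element")
    return element.upper()


if __name__ == "__main__":
    print(process_elements(["a", "", "b"], handle))  # -> ['A', 'B']
```

With the real converter, the equivalent choice is made by passing strict_mode=True to convert() or convert_to_text_format(). The lenient default favours getting as much content as possible out of an imperfect file, while strict mode suits pipelines where silently missing content is unacceptable.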