Converters
- class contextgem.public.converters.DocxConverter
Bases: _DocxConverterBase
Converter for DOCX files into ContextGem documents.
This class handles extraction of text, formatting, tables, images, footnotes, comments, and other elements from DOCX files by directly parsing Word XML.
The converter is read-only and does not modify the source DOCX file in any way. It only extracts content for conversion to a ContextGem Document object or to a text format.
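To make "directly parsing Word XML" concrete, here is a minimal sketch that uses only the Python standard library. It is not ContextGem's implementation: the function names `read_paragraph_texts` and `make_minimal_docx` are hypothetical, and the sketch reads only plain paragraph text from `word/document.xml`, ignoring styles, tables, footnotes, comments and images.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraph_texts(docx_file) -> list:
    """Return the text of each w:p paragraph without modifying the file."""
    # A DOCX file is a ZIP archive; mode "r" opens it read-only.
    with zipfile.ZipFile(docx_file, "r") as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Visible text lives in w:t elements nested inside runs (w:r).
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_minimal_docx(texts) -> io.BytesIO:
    """Build a tiny in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer
```

Calling `read_paragraph_texts(make_minimal_docx(["Hello", "World"]))` yields one string per paragraph. The same function accepts a path or a binary file object, mirroring the two input forms the converter documents.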
The resulting ContextGem document is populated with the following:
Raw text: The raw text of the DOCX file.
Paragraphs: Paragraph objects with the following metadata:
    Raw text: The raw text of the paragraph.
    Additional context: Metadata about the paragraph’s style, list level, table cell position, whether it belongs to a footnote or comment, etc. This context gives the LLM additional information for analysis and extraction.
Images: Image objects constructed from embedded images in the DOCX file.
- Example:
DocxConverter usage example

    # Using ContextGem's DocxConverter

    from contextgem import DocxConverter

    converter = DocxConverter()

    # Convert a DOCX file to an LLM-ready ContextGem Document
    # from path
    document = converter.convert("path/to/document.docx")
    # or from file object
    with open("path/to/document.docx", "rb") as docx_file_object:
        document = converter.convert(docx_file_object)

    # Perform data extraction on the resulting Document object
    # document.add_aspects(...)
    # document.add_concepts(...)
    # llm.extract_all(document)

    # You can also use DocxConverter instance as a standalone text extractor
    docx_text = converter.convert_to_text_format(
        "path/to/document.docx",
        output_format="markdown",  # or "raw"
    )
- convert_to_text_format(docx_path_or_file, output_format='markdown', include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_links=True, include_inline_formatting=True, strict_mode=False)
Converts a DOCX file directly to text without creating a ContextGem Document.
- Parameters:
docx_path_or_file (str | pathlib.Path | typing.BinaryIO) – Path to the DOCX file (as string or Path object) or a file-like object
output_format (typing.Literal['raw', 'markdown']) – Output format (“markdown” or “raw”) (default: “markdown”)
include_tables (bool) – If True, include tables in the output (default: True)
include_comments (bool) – If True, include comments in the output (default: True)
include_footnotes (bool) – If True, include footnotes in the output (default: True)
include_headers (bool) – If True, include headers in the output (default: True)
include_footers (bool) – If True, include footers in the output (default: True)
include_textboxes (bool) – If True, include textbox content (default: True)
include_links (bool) – If True, process and format hyperlinks (default: True)
include_inline_formatting (bool) – If True, apply inline formatting (bold, italic, etc.) in markdown mode (default: True)
strict_mode (bool) – If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)
- Return type:
str
- Returns:
Text in the specified format
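The `include_*` flags can be combined to narrow the output. The sketch below groups a few of them into a "body only" preset; the preset and the helper `docx_body_text` are hypothetical names chosen for this example, while the keyword arguments themselves come from the signature documented above. The `contextgem` import is done lazily so the preset can be inspected without the package installed.

```python
# Keyword arguments from the documented convert_to_text_format signature.
# Grouping them into a "body only" preset is this example's own choice.
BODY_ONLY_OPTIONS = {
    "output_format": "raw",
    "include_comments": False,
    "include_footnotes": False,
    "include_headers": False,
    "include_footers": False,
}


def docx_body_text(docx_path, **overrides):
    """Extract main-body text, skipping comments, footnotes, headers and footers."""
    # Lazy import: only needed when a conversion is actually requested.
    from contextgem import DocxConverter

    # Caller-supplied overrides win over the preset values.
    options = {**BODY_ONLY_OPTIONS, **overrides}
    return DocxConverter().convert_to_text_format(docx_path, **options)
```

For example, `docx_body_text("path/to/document.docx", output_format="markdown")` would keep the preset's exclusions but switch the output to markdown.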
Note
When using markdown output format, the following conditions apply:
Document structure elements (headings, lists, tables) are preserved
Headings are converted to markdown heading syntax (# Heading 1, ## Heading 2, etc.)
Lists are converted to markdown list syntax, preserving numbering and hierarchy
Tables are formatted using markdown table syntax
Footnotes, comments, headers, and footers are included as specially marked sections
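As a rough illustration of the markdown conventions listed in the note, the following sketch renders headings, list items and tables as markdown strings. These helpers are hypothetical and are not part of ContextGem; the converter's actual output (spacing, indentation width, marker choice) may differ.

```python
from typing import List, Optional


def heading_to_markdown(level: int, text: str) -> str:
    """Render a heading as '#' repeated `level` times, then the text."""
    if not 1 <= level <= 6:
        raise ValueError("markdown supports heading levels 1 to 6")
    return f"{'#' * level} {text}"


def list_item_to_markdown(text: str, depth: int = 0, number: Optional[int] = None) -> str:
    """Render one list item; `depth` sets nesting, `number` makes it ordered."""
    marker = f"{number}." if number is not None else "-"
    # Four spaces per nesting level (an assumption of this sketch).
    return f"{'    ' * depth}{marker} {text}"


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Here `heading_to_markdown(2, "Scope")` produces `## Scope`, matching the `## Heading 2` pattern described in the note.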
- convert(docx_path_or_file, apply_markdown=True, raw_text_to_md=None, include_tables=True, include_comments=True, include_footnotes=True, include_headers=True, include_footers=True, include_textboxes=True, include_images=True, include_links=True, include_inline_formatting=True, strict_mode=False)
Converts a DOCX file into a ContextGem Document object.
- Parameters:
docx_path_or_file (str | pathlib.Path | typing.BinaryIO) – Path to the DOCX file (as string or Path object) or a file-like object
apply_markdown (bool) – If True, applies markdown processing and formatting to the document content while preserving raw text separately (default: True)
raw_text_to_md (bool) – [DEPRECATED] Use apply_markdown instead. Will be removed in v1.0.0. Note: This parameter previously controlled whether raw_text would contain raw or markdown text. The new apply_markdown parameter instead controls whether to apply markdown processing while keeping raw text and processed text separate.
include_tables (bool) – If True, include tables in the output (default: True)
include_comments (bool) – If True, include comments in the output (default: True)
include_footnotes (bool) – If True, include footnotes in the output (default: True)
include_headers (bool) – If True, include headers in the output (default: True)
include_footers (bool) – If True, include footers in the output (default: True)
include_textboxes (bool) – If True, include textbox content (default: True)
include_images (bool) – If True, extract and include images (default: True)
include_links (bool) – If True, process and format hyperlinks (default: True)
include_inline_formatting (bool) – If True, apply inline formatting (bold, italic, etc.) in markdown mode (default: True)
strict_mode (bool) – If True, raise exceptions for any processing error instead of skipping problematic elements (default: False)
- Return type:
Document
- Returns:
A populated Document object
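The difference between the default lenient behavior and `strict_mode=True` can be pictured with a generic skip-or-raise loop. This is only a sketch of the pattern the parameter description implies; `process_elements` and `parse_number` are hypothetical and do not reflect how ContextGem handles errors internally.

```python
from typing import Callable, Iterable, List


def process_elements(
    elements: Iterable[str],
    handler: Callable[[str], str],
    strict_mode: bool = False,
) -> List[str]:
    """Apply `handler` to each element, skipping or raising on failure."""
    results = []
    for element in elements:
        try:
            results.append(handler(element))
        except Exception:
            if strict_mode:
                raise  # strict: surface the first failure to the caller
            continue  # lenient: skip the problematic element
    return results


def parse_number(raw: str) -> str:
    """Toy handler that fails on non-numeric input."""
    return str(int(raw))
```

In lenient mode the caller gets partial results with bad elements dropped; in strict mode the first bad element aborts the run, which is usually preferable when silent data loss is unacceptable.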