Configuring LLM(s)#

This guide explains how to configure DocumentLLM instances to process documents using various LLM providers. ContextGem uses LiteLLM under the hood, providing uniform access to a wide range of models. For more information on supported LLMs, see Supported LLMs.

🚀 Basic Configuration#

The minimum configuration for a cloud-based LLM requires the model parameter and an api_key:

Using a cloud-based LLM#
from contextgem import DocumentLLM


# Pattern for using any cloud LLM provider
llm = DocumentLLM(
    model="<provider>/<model_name>",
    api_key="<api_key>",
)

# Example - Using OpenAI LLM
llm_openai = DocumentLLM(
    model="openai/gpt-4.1-mini",
    api_key="<api_key>",
    # see DocumentLLM API reference for all configuration options
)

# Example - Using Azure OpenAI LLM
llm_azure_openai = DocumentLLM(
    model="azure/o4-mini",
    api_key="<api_key>",
    api_version="<api_version>",
    api_base="<api_base>",
    # see DocumentLLM API reference for all configuration options
)

For local models, you usually need to specify the api_base instead of an API key:

Using a local LLM#
from contextgem import DocumentLLM


local_llm = DocumentLLM(
    model="ollama_chat/<model_name>",
    api_base="http://localhost:11434",  # Default Ollama endpoint
)

# Example - Using Llama 3.3 LLM via Ollama
llm_llama = DocumentLLM(
    model="ollama_chat/llama3.3:70b",
    api_base="http://localhost:11434",
    # see DocumentLLM API reference for all configuration options
)

# Example - Using DeepSeek R1 reasoning model via Ollama
llm_deepseek = DocumentLLM(
    model="ollama_chat/deepseek-r1:32b",
    api_base="http://localhost:11434",
    # see DocumentLLM API reference for all configuration options
)

Note

LM Studio Connection Error: If you encounter a connection error (litellm.APIError: APIError: Lm_studioException - Connection error) when using LM Studio, make sure you provide a dummy API key. While API keys are usually not expected for local models, this is a specific case where LM Studio requires one:

LM Studio with dummy API key#
from contextgem import DocumentLLM


llm = DocumentLLM(
    model="lm_studio/mistralai/mistral-small-3.2",
    api_base="http://localhost:1234/v1",
    api_key="dummy-key",  # dummy key to avoid connection error
)

# This is a known issue with calling LM Studio API in litellm:
# https://github.com/openai/openai-python/issues/961


📝 Configuration Parameters#

The DocumentLLM class accepts the following parameters:

model (str, required)
    Model identifier in the format <provider>/<model_name>. See LiteLLM Providers for all supported providers.

api_key (str | None, default: None)
    API key for authentication. Required for most cloud providers but not for local models.

api_base (str | None, default: None)
    Base URL of the API endpoint. Required for local models and some cloud providers (e.g. Azure OpenAI).

deployment_id (str | None, default: None)
    Deployment ID for the model. Primarily used with Azure OpenAI.

api_version (str | None, default: None)
    API version. Primarily used with Azure OpenAI.

role (str, default: "extractor_text")
    Role type for the LLM. Values: "extractor_text", "reasoner_text", "extractor_vision", "reasoner_vision". The role parameter is an abstraction that can be explicitly assigned to extraction components (aspects and concepts) in the pipeline. ContextGem then routes extraction tasks based on these assigned roles, matching components with LLMs of the same role. This allows you to structure your pipeline with different models for different tasks (e.g., using simpler models for basic extractions and more powerful models for complex reasoning). For more details, see 🏷️ LLM Roles.

system_message (str | None, default: None)
    If not provided (or set to None), ContextGem automatically sets a default system message optimized for extraction tasks, rendered based on the configured output_language. The default system message template can be found in the source code. Note that for certain models (such as OpenAI's o1-preview), system messages are not supported and will be ignored. Overriding this is typically only necessary for advanced use cases, such as custom priming or non-extraction tasks. For simple chat interactions, consider setting system_message='' to disable the default entirely (meaning no system message will be sent).

max_tokens (int, default: 4096)
    Maximum tokens in the generated response (applicable to most models).

max_completion_tokens (int, default: 16000)
    Maximum tokens for output completions in reasoning (CoT-capable) models.

reasoning_effort (str | None, default: None)
    Reasoning effort for reasoning (CoT-capable) models. Values: "minimal" (gpt-5 models only), "low", "medium", "high".

timeout (int, default: 120)
    Timeout in seconds for LLM API calls.

num_retries_failed_request (int, default: 3)
    Number of retries when an LLM request fails.

max_retries_failed_request (int, default: 0)
    LLM provider-specific retry count for failed requests.

max_retries_invalid_data (int, default: 3)
    Number of retries when an LLM request succeeds but returns invalid data (JSON parsing and validation fails).

pricing_details (LLMPricing | None, default: None)
    LLMPricing object with pricing details for cost calculation.

auto_pricing (bool, default: False)
    Enable automatic cost calculation using genai-prices based on the configured model. Mutually exclusive with pricing_details. Not supported for local models (e.g., ollama/, ollama_chat/, lm_studio/).

auto_pricing_refresh (bool, default: False)
    When auto_pricing is enabled, allow genai-prices to auto-refresh its cached pricing data at runtime.

is_fallback (bool, default: False)
    Indicates whether the LLM is a fallback model. Fallback LLMs are optionally assigned to the primary LLM instance and are used when the primary LLM fails.

fallback_llm (DocumentLLM | None, default: None)
    DocumentLLM to use as a fallback if the current one fails. Must have the same role as the primary LLM.

output_language (str, default: "en")
    Language for output text. Values: "en" or "adapt" (adapts to the document language). Setting the value to "adapt" ensures that the text output (e.g. justifications, conclusions, explanations) is in the same language as the document. This is particularly useful when working with non-English documents. For example, if you're extracting anomalies from a contract in Spanish, setting output_language="adapt" ensures that anomaly justifications are also in Spanish, making them immediately understandable by local end users reviewing the document.

temperature (float | None, default: 0.3)
    Sampling temperature (0.0 to 1.0) controlling response creativity. Lower values produce more predictable outputs, higher values generate more varied responses.

top_p (float | None, default: 0.3)
    Nucleus sampling value (0.0 to 1.0) controlling output focus/randomness. Lower values make output more deterministic, higher values produce more diverse outputs.

seed (int | None, default: None)
    Seed for random number generation to help produce more consistent outputs across multiple runs. When set to a specific integer value, the LLM will attempt to use this seed for sampling operations. However, deterministic output is still not guaranteed even with the same seed, as other factors may influence the model's response.

async_limiter (AsyncLimiter, default: AsyncLimiter(3, 10))
    Relevant when concurrency is enabled for extraction tasks. Controls the frequency of async LLM API requests for concurrent tasks. Defaults to allowing 3 acquisitions per 10-second period to prevent rate limit issues. See the aiolimiter documentation for AsyncLimiter configuration details. See Optimizing for Speed for an example of how to easily set up concurrency for extraction.
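
As an illustration, the sketch below combines several of these parameters on a single instance. The values shown are arbitrary examples rather than recommendations; adjust them for your provider and workload.

Combining several configuration parameters#
from aiolimiter import AsyncLimiter

from contextgem import DocumentLLM


llm = DocumentLLM(
    model="openai/gpt-4.1-mini",
    api_key="<api_key>",
    temperature=0.3,  # lower values produce more predictable outputs
    top_p=0.3,
    seed=42,  # best-effort reproducibility; not guaranteed
    timeout=120,  # seconds per LLM API call
    num_retries_failed_request=3,  # retries when the request fails
    max_retries_invalid_data=3,  # retries when returned data fails validation
    async_limiter=AsyncLimiter(3, 10),  # 3 acquisitions per 10-second period
    # see DocumentLLM API reference for all configuration options
)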

Warning

Auto-pricing accuracy

When using auto_pricing=True, cost estimates are approximate and will not be 100% accurate, because model providers do not publish exact price information for their APIs in a format that can be reliably processed. See Pydantic's genai-prices for more details.

💡 Advanced Configuration Examples#

🔄 Configuring a Fallback LLM#

You can set up a fallback LLM that will be used if the primary LLM fails:

Configuring a fallback LLM#
from contextgem import DocumentLLM


# Primary LLM
primary_llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key="<your-openai-api-key>",
    role="extractor_text",  # default role
)

# Fallback LLM
fallback_llm = DocumentLLM(
    model="anthropic/claude-3-5-haiku",
    api_key="<your-anthropic-api-key>",
    role="extractor_text",  # Must match the primary LLM's role
    is_fallback=True,
)

# Assign fallback LLM to primary
primary_llm.fallback_llm = fallback_llm

# Then use the primary LLM as usual
# document = primary_llm.extract_all(document)

💰 Setting Up Cost Tracking#

You can configure pricing parameters to track costs:

Setting up LLM cost tracking#
from contextgem import DocumentLLM, LLMPricing


# Option 1: Set up an LLM with pricing details
llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key="<your-openai-api-key>",
    pricing_details=LLMPricing(
        input_per_1m_tokens=0.150,  # Cost per 1M input tokens
        output_per_1m_tokens=0.600,  # Cost per 1M output tokens
    ),
)

# Option 2: Set up an LLM with auto-pricing
llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key="<your-openai-api-key>",
    auto_pricing=True,
)

# Perform some extraction tasks

# Later, you can check the cost
cost_info = llm.get_cost()

🧠 Using Model-Specific Parameters#

For reasoning (CoT-capable) models (such as OpenAI's o1/o3/o4), you can set reasoning-specific parameters:

Using model-specific parameters#
from contextgem import DocumentLLM


llm = DocumentLLM(
    model="openai/o3-mini",
    api_key="<your-openai-api-key>",
    max_completion_tokens=8000,  # Specific to reasoning (CoT-capable) models
    reasoning_effort="medium",  # Optional
)

⚙️ Explicit Capability Declaration#

Model vision capabilities are automatically detected using litellm.supports_vision(). If this function does not correctly identify your model's capabilities, ContextGem will typically issue a warning, and you can explicitly declare the capability by setting _supports_vision=True on the LLM instance:

from contextgem import DocumentLLM

# Example: Explicitly declare vision capability
# Warning will be issued if automatic vision capability detection fails
llm = DocumentLLM(
    model="some_provider/custom_vision_model",
    api_base="http://localhost:3456/v1",
    role="extractor_vision"
)
# Declare capability if automatic detection fails (warning was issued)
llm._supports_vision = True

Warning

Explicit capability declarations should only be used when automatic capability detection fails. Incorrectly setting this flag may lead to unexpected behavior or API errors.

🤖🤖 LLM Groups#

For complex document processing, you can organize multiple LLMs with different roles into a group:

Using LLM group#
from contextgem import DocumentLLM, DocumentLLMGroup


# Create LLMs with different roles
text_extractor = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key="<your-openai-api-key>",
    role="extractor_text",
    output_language="adapt",
)

text_reasoner = DocumentLLM(
    model="openai/o3-mini",
    api_key="<your-openai-api-key>",
    role="reasoner_text",
    max_completion_tokens=16000,
    reasoning_effort="high",
    output_language="adapt",
)

# Create a group
llm_group = DocumentLLMGroup(
    llms=[text_extractor, text_reasoner],
    output_language="adapt",  # All LLMs in the group must share the same output language setting
)

# Then use the group as usual
# document = llm_group.extract_all(document)

See a practical example of using an LLM group in 🔄 Using a Multi-LLM Pipeline to Extract Data from Several Documents.
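
To route work within such a group, extraction components are assigned matching roles. The sketch below illustrates this idea; it assumes that an extraction component such as an Aspect can be given an llm_role, as described in 🏷️ LLM Roles, so that its extraction is handled by the group's "reasoner_text" model.

Assigning a role to an extraction component (sketch)#
from contextgem import Aspect


# Aspect assigned the role matching text_reasoner above
anomalies_aspect = Aspect(
    name="Anomalies",
    description="Unusual or unexpected clauses in the document",
    llm_role="reasoner_text",  # routed to the LLM with role "reasoner_text"
)

# When a document containing this aspect is processed with
# llm_group.extract_all(document), this aspect is extracted by the
# group's "reasoner_text" model, while components with the default
# "extractor_text" role are handled by the extractor LLM.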

📊 Accessing Usage and Cost Statistics#

You can track input/output token usage and costs:

Tracking usage and cost#
from contextgem import DocumentLLM


llm = DocumentLLM(
    model="anthropic/claude-3-5-haiku",
    api_key="<your-anthropic-api-key>",
    auto_pricing=True,  # or set `pricing_details=LLMPricing(...)` manually
)

# Perform some extraction tasks

# Get usage statistics
usage_info = llm.get_usage()

# Get cost statistics
cost_info = llm.get_cost()

# Reset usage and cost statistics
llm.reset_usage_and_cost()

# The same methods are available for LLM groups, with optional filtering by LLM role
# usage_info = llm_group.get_usage(llm_role="extractor_text")
# cost_info = llm_group.get_cost(llm_role="extractor_text")
# llm_group.reset_usage_and_cost(llm_role="extractor_text")

The usage statistics include not only token counts but also detailed information about each individual call made to the LLM. You can access the call history, including prompts, responses, and timestamps:

Accessing detailed usage information#
from contextgem import DocumentLLM


llm = DocumentLLM(
    model="openai/gpt-4.1",
    api_key="<your-openai-api-key>",
)

# Perform some extraction tasks

usage_info = llm.get_usage()

# Access the first usage container in the list (for the primary LLM)
llm_usage = usage_info[0]

# Get detailed call information
for call in llm_usage.usage.calls:
    print(f"Prompt: {call.prompt}")
    print(f"Response: {call.response}")  # original, unprocessed response
    print(f"Sent at: {call.timestamp_sent}")
    print(f"Received at: {call.timestamp_received}")