Bug: LiteLLM + Ollama Integration Returns 404 Error #899

@Suidge

Description

When configuring OpenViking to use a local Ollama model as the VLM, memory extraction fails with a `404 page not found` error, even though the Ollama API works correctly when tested directly.

Environment

  • OpenViking version: 0.2.9
  • Python: 3.12.13
  • OS: macOS (Darwin 25.3.0 arm64)
  • Ollama version: 0.6.x
  • Model: qwen3-vl:2b (local)

Configuration

{
  "vlm": {
    "provider": "litellm",
    "model": "ollama/qwen3-vl:2b",
    "api_base": "http://localhost:11434/v1",
    "api_key": "EMPTY",
    "temperature": 0.3,
    "max_tokens": 512,
    "max_retries": 2,
    "max_concurrent": 10
  }
}
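A possible configuration fix to try (an assumption, not verified against OpenViking/LiteLLM internals): if LiteLLM's `ollama/` provider prefix routes requests to Ollama's native API rather than the OpenAI-compatible one, then `api_base` should point at the server root without the `/v1` suffix:

```json
{
  "vlm": {
    "provider": "litellm",
    "model": "ollama/qwen3-vl:2b",
    "api_base": "http://localhost:11434",
    "api_key": "EMPTY",
    "temperature": 0.3,
    "max_tokens": 512,
    "max_retries": 2,
    "max_concurrent": 10
  }
}
```

The only change from the failing configuration is dropping `/v1` from `api_base`; everything else is kept as reported above.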

Symptoms

  1. Memory extraction fails with 404 error
  2. extractZeroCount increases rapidly
  3. VLM observer shows "No token usage data available"

Logs

2026-03-23 14:21:58,052 - openviking.session.memory_extractor - ERROR - Memory extraction failed: litellm.APIConnectionError: OllamaException - 404 page not found
2026-03-23 14:21:58,053 - uvicorn.access - INFO - 127.0.0.1:65106 - "POST /api/v1/sessions/fe282187-4e2e-439a-80e9-73fc6c69ff03/extract HTTP/1.1" 200

Verification

Direct calls to both Ollama APIs work correctly:

# OpenAI-compatible endpoint - works
curl -s -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-vl:2b", "messages": [{"role": "user", "content": "hello"}]}' 
# Returns: {"id":"chatcmpl-668","object":"chat.completion",...}

# Native Ollama API - works
curl -s -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-vl:2b", "messages": [{"role": "user", "content": "hello"}]}'
# Returns streaming response correctly

# Model list - works
curl -s http://localhost:11434/v1/models
# Returns: {"object":"list","data":[{"id":"qwen3-vl:2b",...}]}

Root Cause Hypothesis

The issue appears to be in how LiteLLM constructs the request to Ollama. Possible causes:

  1. Model name format: LiteLLM may be using ollama/qwen3-vl:2b literally instead of stripping the ollama/ prefix
  2. Endpoint path mismatch: LiteLLM may be calling a different endpoint than /v1/chat/completions
  3. Request format: LiteLLM may be sending parameters that Ollama doesn't recognize
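Hypothesis 2 would also explain the exact error text: `404 page not found` is the literal body Ollama's HTTP router returns for an unknown path. A minimal sketch of the suspected mismatch (plain string handling to illustrate the idea; this is not LiteLLM's actual code):

```python
# Hypothetical illustration: how an api_base that already ends in /v1
# produces a nonexistent route if the client appends a native-API path.
api_base = "http://localhost:11434/v1"  # as set in the OpenViking config

# OpenAI-compatible route (the one the direct curl test hits successfully):
openai_url = api_base.rstrip("/") + "/chat/completions"

# Route produced if the client appends Ollama's native chat path instead:
native_url = api_base.rstrip("/") + "/api/chat"

print(openai_url)  # http://localhost:11434/v1/chat/completions (valid route)
print(native_url)  # http://localhost:11434/v1/api/chat (no such route on Ollama)
```

If this is the cause, either the `/v1` suffix or the `ollama/` model prefix needs to change so the two conventions are not mixed.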

Workaround

Switching to a cloud VLM provider (e.g., DashScope qwen3-vl-flash) resolves the issue, but this defeats the purpose of running local models for privacy and cost reasons.

Expected Behavior

LiteLLM should successfully call Ollama's OpenAI-compatible API at http://localhost:11434/v1/chat/completions with the model name qwen3-vl:2b (without the ollama/ prefix if that's the convention).

Additional Context

This issue affects users who want to use local Ollama models as VLM for privacy, cost, or offline scenarios. The integration should work seamlessly since Ollama provides an OpenAI-compatible API.
