Bug: LiteLLM + Ollama Integration Returns 404 Error #899

@Suidge

Description

When configuring OpenViking to use a local Ollama model as the VLM, memory extraction fails with a `404 page not found` error, even though the Ollama API works correctly when tested directly.

Environment

  • OpenViking version: 0.2.9
  • Python: 3.12.13
  • OS: macOS (Darwin 25.3.0 arm64)
  • Ollama version: 0.6.x
  • Model: qwen3-vl:2b (local)

Configuration

{
  "vlm": {
    "provider": "litellm",
    "model": "ollama/qwen3-vl:2b",
    "api_base": "http://localhost:11434/v1",
    "api_key": "EMPTY",
    "temperature": 0.3,
    "max_tokens": 512,
    "max_retries": 2,
    "max_concurrent": 10
  }
}
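A possible configuration fix to try (an assumption, not verified against OpenViking/LiteLLM internals): if LiteLLM's `ollama/` provider prefix routes requests to Ollama's native API rather than the OpenAI-compatible one, then `api_base` should point at the server root without the `/v1` suffix:

```json
{
  "vlm": {
    "provider": "litellm",
    "model": "ollama/qwen3-vl:2b",
    "api_base": "http://localhost:11434",
    "api_key": "EMPTY",
    "temperature": 0.3,
    "max_tokens": 512,
    "max_retries": 2,
    "max_concurrent": 10
  }
}
```

The only change from the failing configuration is dropping `/v1` from `api_base`; everything else is kept as reported above.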

Symptoms

  1. Memory extraction fails with 404 error
  2. extractZeroCount increases rapidly
  3. VLM observer shows "No token usage data available"

Logs

2026-03-23 14:21:58,052 - openviking.session.memory_extractor - ERROR - Memory extraction failed: litellm.APIConnectionError: OllamaException - 404 page not found
2026-03-23 14:21:58,053 - uvicorn.access - INFO - 127.0.0.1:65106 - "POST /api/v1/sessions/fe282187-4e2e-439a-80e9-73fc6c69ff03/extract HTTP/1.1" 200

Verification

Direct calls to both Ollama APIs work correctly:

# OpenAI-compatible endpoint - works
curl -s -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-vl:2b", "messages": [{"role": "user", "content": "hello"}]}' 
# Returns: {"id":"chatcmpl-668","object":"chat.completion",...}

# Native Ollama API - works
curl -s -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-vl:2b", "messages": [{"role": "user", "content": "hello"}]}'
# Returns streaming response correctly

# Model list - works
curl -s http://localhost:11434/v1/models
# Returns: {"object":"list","data":[{"id":"qwen3-vl:2b",...}]}

Root Cause Hypothesis

The issue appears to be in how LiteLLM constructs the request to Ollama. Possible causes:

  1. Model name format: LiteLLM may be using ollama/qwen3-vl:2b literally instead of stripping the ollama/ prefix
  2. Endpoint path mismatch: LiteLLM may be calling a different endpoint than /v1/chat/completions
  3. Request format: LiteLLM may be sending parameters that Ollama doesn't recognize
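Hypothesis 2 would also explain the exact error text: `404 page not found` is the literal body Ollama's HTTP router returns for an unknown path. A minimal sketch of the suspected mismatch (plain string handling to illustrate the idea; this is not LiteLLM's actual code):

```python
# Hypothetical illustration: how an api_base that already ends in /v1
# produces a nonexistent route if the client appends a native-API path.
api_base = "http://localhost:11434/v1"  # as set in the OpenViking config

# OpenAI-compatible route (the one the direct curl test hits successfully):
openai_url = api_base.rstrip("/") + "/chat/completions"

# Route produced if the client appends Ollama's native chat path instead:
native_url = api_base.rstrip("/") + "/api/chat"

print(openai_url)  # http://localhost:11434/v1/chat/completions (valid route)
print(native_url)  # http://localhost:11434/v1/api/chat (no such route on Ollama)
```

If this is the cause, either the `/v1` suffix or the `ollama/` model prefix needs to change so the two conventions are not mixed.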

Workaround

Switching to a cloud VLM provider (e.g., DashScope qwen3-vl-flash) resolves the issue, but this defeats the purpose of running local models for privacy and cost reasons.

Expected Behavior

LiteLLM should successfully call Ollama's OpenAI-compatible API at http://localhost:11434/v1/chat/completions with the model name qwen3-vl:2b (without the ollama/ prefix if that's the convention).

Additional Context

This issue affects users who want to use local Ollama models as VLM for privacy, cost, or offline scenarios. The integration should work seamlessly since Ollama provides an OpenAI-compatible API.
