feat(retrieve): add provenance metadata to search results #852

Open
mvanhorn wants to merge 2 commits into volcengine:main from mvanhorn:osc/feat-search-provenance-metadata

@mvanhorn mvanhorn commented Mar 21, 2026

Description

Adds an opt-in include_provenance parameter to the search/find API endpoints. When enabled, the response includes a provenance array showing which directories were traversed, which tier (L0/L1/L2) each result came from, match reasons, and the full thinking trace.
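The opt-in behavior can be pictured with a small stand-in for the request model. This is only a sketch under assumptions: the real `FindRequest` lives in openviking/server/routers/search.py and may be defined differently; the `query` field name and the `parse_find_request` helper are hypothetical and exist here only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FindRequest:
    """Stdlib stand-in for the request model; not the project's actual class."""

    query: str  # assumed field name, not confirmed by this PR
    include_provenance: bool = False  # opt-in flag described in this PR


def parse_find_request(body: Dict[str, Any]) -> FindRequest:
    """Hypothetical helper: build a request from a decoded JSON body."""
    return FindRequest(
        query=body["query"],
        # A missing key falls back to False, so older clients are unaffected.
        include_provenance=bool(body.get("include_provenance", False)),
    )


old_client = parse_find_request({"query": "architecture design patterns"})
new_client = parse_find_request(
    {"query": "architecture design patterns", "include_provenance": True}
)
```

A client that never sends the field gets `include_provenance=False`, which is the backward-compatibility property the description relies on.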

The README lists "Visualized Retrieval Trajectory" as a core feature, and the internal data structures already collect this information (MatchedContext.level, QueryResult.thinking_trace, QueryResult.searched_directories). This change exposes it through the API.

Evidence:

| Source | Evidence |
|---|---|
| README | "Visualized Retrieval Trajectory - Supports visualization of directory retrieval trajectories" |
| #274 | Code retrieval optimization - retrieval quality is a priority |
| #350 | Decoupling ingestion - 3 thumbsup, community wants pipeline visibility |

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added include_provenance: bool = False to FindRequest and SearchRequest in openviking/server/routers/search.py
  • Extended FindResult.to_dict() to accept include_provenance and conditionally include query_results with thinking trace
  • Added _query_result_to_dict() to serialize query results with tier labels (L0/L1/L2)
  • Passed include_provenance through the search/find endpoint handlers
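The changes listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the dataclasses are simplified stand-ins for OpenViking's internal types, the `resources` field and the integer-to-label mapping for tiers are assumptions, and only the names `FindResult.to_dict`, `_query_result_to_dict`, `MatchedContext.level`, `QueryResult.thinking_trace`, and `QueryResult.searched_directories` come from the description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MatchedContext:
    uri: str
    level: int  # assumed to be 0, 1, or 2
    score: float
    match_reason: str = ""


@dataclass
class QueryResult:
    query: str
    matched_contexts: List[MatchedContext] = field(default_factory=list)
    searched_directories: List[str] = field(default_factory=list)
    thinking_trace: Optional[Dict[str, Any]] = None


def _query_result_to_dict(qr: QueryResult) -> Dict[str, Any]:
    """Serialize one query result, labelling each context with its tier."""
    return {
        "query": qr.query,
        "searched_directories": list(qr.searched_directories),
        "matched_contexts": [
            {
                "uri": ctx.uri,
                "tier": f"L{ctx.level}",  # assumed mapping: level 2 -> "L2"
                "score": ctx.score,
                "match_reason": ctx.match_reason,
            }
            for ctx in qr.matched_contexts
        ],
        "thinking_trace": qr.thinking_trace,
    }


@dataclass
class FindResult:
    resources: List[Dict[str, Any]] = field(default_factory=list)  # assumed field
    query_results: Optional[List[QueryResult]] = None

    def to_dict(self, include_provenance: bool = False) -> Dict[str, Any]:
        out: Dict[str, Any] = {"resources": list(self.resources)}
        if include_provenance:
            # Tolerate a missing query_results list instead of raising.
            out["provenance"] = [
                _query_result_to_dict(qr) for qr in (self.query_results or [])
            ]
        return out
```

Keeping the flag as a keyword argument with a `False` default means existing callers of `to_dict()` get the same dictionary as before.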

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

5 unit tests added in tests/retrieve/test_provenance.py covering:

  • Default behavior (no provenance)
  • Provenance enabled with full data
  • Provenance enabled without query_results (no crash)
  • Backward compatibility (existing fields unchanged)

Screenshots

Provenance output (installed from modified branch, tested with realistic retrieval data):

[Image: provenance example]

The provenance section shows: which directories were searched, which tier (L0/L1/L2) each result came from, match reasons, and thinking trace statistics. All opt-in via include_provenance: true - existing clients see no change.

Example JSON response with provenance:

{
  "provenance": [{
    "query": "architecture design patterns",
    "searched_directories": ["resources/", "resources/docs/", "user/default/memories/"],
    "matched_contexts": [
      {"uri": "viking://resources/docs/architecture.md", "tier": "L2", "score": 0.87, "match_reason": "semantic_match"},
      {"uri": "viking://user/.../meeting-notes", "tier": "L1", "score": 0.62, "match_reason": "directory_match"}
    ],
    "thinking_trace": {"statistics": {"directories_searched": 2, "candidates_collected": 1}, "events": [...]}
  }]
}
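On the client side, a response shaped like the example above could be consumed as follows. This is an illustrative sketch: `uris_by_tier` is a hypothetical helper, and the `events` list, which is elided in the example, is represented by an empty list here rather than reconstructed.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List

# Payload mirrors the example response; "events" is elided in the source,
# so an empty list stands in for it.
payload = json.loads("""
{
  "provenance": [{
    "query": "architecture design patterns",
    "searched_directories": ["resources/", "resources/docs/", "user/default/memories/"],
    "matched_contexts": [
      {"uri": "viking://resources/docs/architecture.md", "tier": "L2", "score": 0.87, "match_reason": "semantic_match"},
      {"uri": "viking://user/.../meeting-notes", "tier": "L1", "score": 0.62, "match_reason": "directory_match"}
    ],
    "thinking_trace": {"statistics": {"directories_searched": 2, "candidates_collected": 1}, "events": []}
  }]
}
""")


def uris_by_tier(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group matched URIs by tier label; empty if provenance is absent."""
    tiers: Dict[str, List[str]] = defaultdict(list)
    for entry in response.get("provenance", []):
        for ctx in entry.get("matched_contexts", []):
            tiers[ctx["tier"]].append(ctx["uri"])
    return dict(tiers)


grouped = uris_by_tier(payload)
```

Using `response.get("provenance", [])` lets the same client code handle responses from servers or requests where provenance was not enabled.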

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

This contribution was developed with AI assistance (Claude Code). The feature was proposed after dogfooding OpenViking's search API and noticing the provenance data was collected internally but not exposed to clients.

Adds an opt-in `include_provenance` parameter to the search/find API
endpoints. When enabled, the response includes a `provenance` array
with per-query retrieval details: which directories were traversed,
which tier (L0/L1/L2) each result came from, match reasons, and the
full thinking trace.

The internal data was already being collected in MatchedContext.level,
MatchedContext.context_type, and QueryResult.thinking_trace. This
change surfaces it through the API for retrieval observability, which
the README lists as a core design goal ("Visualized Retrieval
Trajectory").

Backward compatible: defaults to false, existing clients see no change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Failed to generate code suggestions for PR
