feat(retrieval): add file-level chunked vectorization #860
Open
mildred522 wants to merge 4 commits into volcengine:main
This PR overlaps with #858 in problem space, but the approach is different. #858 focuses on configurable truncation / text source selection to avoid oversized embedding inputs. If maintainers prefer the smaller config-only step first, I can rebase this PR later.
Description
This PR contributes to Category 3 ("Embedding Processing & Chunking Strategy") in the token/cost optimization tracker
described in #744.
It focuses on the file-level part of that work. Previously, long text files could be indexed as a single
coarse file-level embedding, which reduced retrieval quality for oversized documents and left file-level
chunking behavior undefined.
This change adds configurable chunked vectorization for long text files and collapses chunk-level hits back
to the base file URI during retrieval, so long files gain finer-grained vector coverage internally while
retrieval results remain file-level externally.
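As a rough illustration of the chunking step described above, here is a minimal sketch of character-based chunking with overlap. The function names and the exact validation rules (positive chunk size, overlap smaller than the chunk size) are assumptions for illustration, not the code in this PR; only the two settings mirror the file_chunk_chars and file_chunk_overlap options.

```python
def validate_chunk_settings(chunk_chars: int, overlap: int) -> None:
    """Reject settings that would make chunking loop forever or be meaningless.

    Hypothetical rules; the PR's actual validation may differ.
    """
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    if overlap < 0 or overlap >= chunk_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_chars")


def chunk_text(text: str, chunk_chars: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_chars characters that share `overlap` characters."""
    validate_chunk_settings(chunk_chars, overlap)
    # Short files stay as a single chunk, i.e. one file-level embedding.
    if len(text) <= chunk_chars:
        return [text]
    step = chunk_chars - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(chunk_text("abcdefghij", chunk_chars=4, overlap=1)):
        print(i, chunk)
```

Each chunk would then be embedded separately, so a long file contributes several vectors instead of one.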
Related Issue
Type of Change
Changes Made
- file_chunk_chars and file_chunk_overlap config options, plus validation in OpenVikingConfig
- embedding_utils.py
- HierarchicalRetriever
Testing
Targeted local verification completed:
py -3.11 -m ruff check openviking_cli\utils\config\open_viking_config.py openviking\utils\embedding_utils.py openviking\retrieve\hierarchical_retriever.py tests\misc\test_openviking_config_file_chunking.py tests\misc\test_file_chunk_vectorization.py tests\retrieve\test_hierarchical_retriever_chunk_collapse.py
py -3.11 -m pytest tests\misc\test_openviking_config_file_chunking.py tests\misc\test_file_chunk_vectorization.py tests\retrieve\test_hierarchical_retriever_chunk_collapse.py -q
Checklist
Additional Notes
This PR is intentionally scoped to file-level chunked vectorization only.
It does not attempt to define chunking behavior for memory or directory indexing.
Chunk-level candidates are currently collapsed back to file-level results using the generated
chunk URI suffix convention, while preserving source_uri in metadata.
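To make the collapse step concrete, here is a minimal sketch of how chunk-level hits might be folded back into file-level results. The "#chunk_NNNN" suffix pattern, the Hit dataclass, and the keep-the-best-score rule are all hypothetical stand-ins; the PR's actual suffix convention and HierarchicalRetriever logic may differ. Only the use of source_uri in metadata comes from the description above.

```python
import re
from dataclasses import dataclass, field

# Hypothetical suffix convention; the real chunk URI format is not shown in this PR text.
CHUNK_SUFFIX = re.compile(r"#chunk_\d+$")


@dataclass
class Hit:
    uri: str
    score: float
    metadata: dict = field(default_factory=dict)


def base_uri(hit: Hit) -> str:
    """Prefer source_uri from metadata, else strip the assumed chunk suffix."""
    return hit.metadata.get("source_uri") or CHUNK_SUFFIX.sub("", hit.uri)


def collapse_hits(hits: list[Hit]) -> list[Hit]:
    """Keep one result per base file URI, using the best-scoring chunk as its score."""
    best: dict[str, Hit] = {}
    for hit in hits:
        key = base_uri(hit)
        if key not in best or hit.score > best[key].score:
            best[key] = Hit(uri=key, score=hit.score, metadata=dict(hit.metadata))
    # Highest score first, as a retriever would typically return results.
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


if __name__ == "__main__":
    hits = [
        Hit("file:///docs/a.md#chunk_0000", 0.4),
        Hit("file:///docs/a.md#chunk_0001", 0.9),
        Hit("file:///docs/b.md", 0.7),
    ]
    for h in collapse_hits(hits):
        print(h.uri, h.score)
```

Taking the maximum chunk score is only one possible aggregation; averaging or summing the top few chunks would rank long files differently.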