feat: make file vectorization strategy configurable #858

Open
ningfeemic-dev wants to merge 2 commits into volcengine:main from ningfeemic-dev:feat/configurable-file-vectorization

@ningfeemic-dev

Summary

Refs #857.

This PR makes the vectorization strategy for text files configurable, to reduce oversize failures from the embedding backend on long text inputs.

What changed

  • add embedding.text_source with supported values:
    • summary_first
    • summary_only
    • content_only
  • add embedding.max_text_chars to cap raw text sent to embeddings
  • update vectorize_file() to respect the new config
  • add minimal validation/unit tests for config and strategy behavior
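
A minimal sketch of how the two options above could interact, not the code in this PR. The names EmbeddingTextConfig and select_embedding_text are hypothetical, and several behaviors are assumptions: summary_first falls back to raw content when the summary is empty, the cap applies only to raw content, and the defaults (content_only, no cap) mirror the current full-text behavior.

```python
from dataclasses import dataclass
from typing import Optional

# The three strategy values listed in this PR description.
ALLOWED_TEXT_SOURCES = ("summary_first", "summary_only", "content_only")


@dataclass(frozen=True)
class EmbeddingTextConfig:
    """Hypothetical stand-in for the embedding config section."""

    text_source: str = "content_only"     # assumed default: full-text embedding
    max_text_chars: Optional[int] = None  # assumed: None means no cap

    def __post_init__(self) -> None:
        # Reject unknown strategies and non-positive caps early.
        if self.text_source not in ALLOWED_TEXT_SOURCES:
            raise ValueError(f"unsupported text_source: {self.text_source!r}")
        if self.max_text_chars is not None and self.max_text_chars <= 0:
            raise ValueError("max_text_chars must be a positive integer")


def select_embedding_text(summary: str, content: str, cfg: EmbeddingTextConfig) -> str:
    """Pick the text to embed for one file under the configured strategy."""
    # Cap raw content only (assumption); the summary is passed through unchanged.
    capped = content if cfg.max_text_chars is None else content[: cfg.max_text_chars]
    if cfg.text_source == "summary_only":
        return summary
    if cfg.text_source == "summary_first" and summary.strip():
        return summary
    # content_only, or summary_first with an empty summary.
    return capped
```

With text_source set to summary_first and max_text_chars set to 5, a file with an empty summary would be embedded from the first 5 characters of its content, while a file with a summary would be embedded from the summary alone.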

Why

Upstream currently defaults to embedding the full text of text files. In real deployments that use OpenAI-compatible embedding backends, this can trigger repeated oversize failures such as "input (...) is too large to process", which reduces indexing completeness and operational stability.

Making this configurable gives operators a safer, backward-compatible way to balance:

  • stability
  • indexing completeness
  • retrieval quality

Notes

  • This PR intentionally keeps the scope small.
  • It does not introduce more complex chunking logic yet.
  • It focuses on configurable text source selection plus max raw-text length control.

Validation

  • source files and tests pass py_compile
  • runtime validation in the current environment was limited because the upstream repo checkout was missing test dependencies, so this PR includes focused unit tests for the new config and strategy paths

@CLAassistant

CLAassistant commented Mar 22, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions

Failed to generate code suggestions for PR

@ningfeemic-dev
Author

Addressed the review points:

  • renamed max_text_chars to max_input_chars
  • unified summary/content selection via effective_text_source

Please take another look when convenient.
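
For readers following along, here is a hedged sketch of what unifying the selection might look like. The revised code is not shown in this conversation, so the signature and semantics of effective_text_source below are assumptions, as are the helper build_embedding_input and the choice to apply max_input_chars to whichever text is selected.

```python
from typing import Optional


def effective_text_source(text_source: str, has_summary: bool) -> str:
    """Resolve a configured strategy to 'summary' or 'content' (hypothetical)."""
    if text_source == "content_only":
        return "content"
    if text_source == "summary_only":
        return "summary"
    if text_source == "summary_first":
        # Assumed fallback: use raw content when no summary is available.
        return "summary" if has_summary else "content"
    raise ValueError(f"unsupported text_source: {text_source!r}")


def build_embedding_input(
    summary: str,
    content: str,
    text_source: str,
    max_input_chars: Optional[int] = None,
) -> str:
    """Select the text once, then apply a single length cap (assumed behavior)."""
    source = effective_text_source(text_source, bool(summary.strip()))
    text = summary if source == "summary" else content
    return text if max_input_chars is None else text[:max_input_chars]
```

Resolving the strategy in one place means the length cap and any later logic only need to handle two concrete cases, summary or content, instead of re-checking all three config values.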
