
perf(semantic): run batch overview generation and file summaries concurrently#840

Open
ahmedhesham6 wants to merge 2 commits into volcengine:main from stakpak:perf/concurrent-semantic-batches

Conversation

@ahmedhesham6
Contributor

Description

The semantic processor generates directory overviews by splitting large directories into batches of 50 files and calling the VLM for each batch. Previously, both file summary generation in _process_memory_directory and batch overview generation in _batched_generate_overview ran sequentially — each VLM call blocked the next. For directories with 1000+ files (common in memory directories like entities, events, preferences), this caused a single queue item to take 15+ minutes, blocking the entire semantic queue.

This change runs both operations concurrently using asyncio.gather, bounded by the existing max_concurrent_llm semaphore.
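The pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `call_vlm`, `generate_all`, and the prompt strings are hypothetical stand-ins; only the `asyncio.gather` plus semaphore structure mirrors the change.

```python
import asyncio

# Hypothetical stand-in for a VLM request; simulates network latency.
async def call_vlm(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"summary of {prompt}"

async def generate_all(prompts: list[str], max_concurrent_llm: int = 20) -> list[str]:
    llm_sem = asyncio.Semaphore(max_concurrent_llm)

    async def bounded(prompt: str) -> str:
        # The semaphore caps in-flight VLM requests; gather still
        # returns results in the order the coroutines were passed in.
        async with llm_sem:
            return await call_vlm(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))

results = asyncio.run(generate_all([f"batch-{i}" for i in range(5)]))
print(results[0])
```

All batches are dispatched up front, but at most `max_concurrent_llm` are awaiting the VLM at any moment, so total wall time approaches the latency of the slowest batch rather than the sum of all batches.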

Related Issue

N/A — discovered during production usage with large memory directories (1000+ entity memories).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • _process_memory_directory: Changed file summary generation for modified/added files to run concurrently via asyncio.gather instead of sequential await in a loop. Cached summaries for unchanged files are still reused. Order is preserved via a pre-allocated indexed list.
  • _batched_generate_overview: All batch prompts are pre-built in the existing loop, then dispatched concurrently via asyncio.gather. Each VLM call is bounded by async with llm_sem to respect max_concurrent_llm. Batch ordering is preserved via an indexed list. The final merge step remains sequential as it depends on all batches completing.
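The pre-allocated indexed list mentioned in both bullets keeps results in input order while mixing cached and freshly generated entries. A rough sketch, with `summarize`, `process_files`, and the file names as illustrative assumptions rather than the PR's real identifiers:

```python
import asyncio

# Placeholder for the per-file VLM summary call (illustrative only).
async def summarize(path: str) -> str:
    await asyncio.sleep(0)
    return f"summary:{path}"

async def process_files(files: list[str], cache: dict[str, str]) -> list[str]:
    # Pre-allocate so each task writes into its own slot, preserving
    # the original file order regardless of completion order.
    summaries: list[str | None] = [None] * len(files)
    tasks = []
    for i, f in enumerate(files):
        if f in cache:
            summaries[i] = cache[f]  # reuse cached summary for unchanged files
        else:
            async def fill(idx: int = i, path: str = f) -> None:
                summaries[idx] = await summarize(path)
            tasks.append(fill())
    await asyncio.gather(*tasks)
    return summaries

cache = {"b.md": "cached:b.md"}
out = asyncio.run(process_files(["a.md", "b.md", "c.md"], cache))
print(out)
```

Writing into fixed slots avoids any post-hoc sorting step, and cached entries never spawn a task at all.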

Testing

  • Tested in production with max_concurrent_llm=20 against a directory with 1,214 memory files split into 20 batches
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux

Before (sequential): memories/entities (1,214 files, 20 batches) — ~15 minutes for batch overview step alone
After (concurrent): Same directory — ~23 seconds for batch overview step (~40x improvement)

| Directory | Files | Batches | Before | After |
| --- | --- | --- | --- | --- |
| memories/entities | 1,214 | 20 | ~15 min | ~90 sec total |
| memories/cases | 398 | 8 | ~5 min | ~47 sec total |
| memories/patterns | 73 | 2 | ~2 min | ~25 sec total |

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

The semantic queue processes items sequentially (one at a time). When a single memory directory with 1000+ files enters the queue, it blocks all other items for the duration of its processing. This change does not alter that single-consumer behavior — it only parallelizes the VLM calls within a single queue item.

The max_concurrent_llm semaphore (configured via vlm.max_concurrent in ov.conf) controls the degree of parallelism. The default of 100 is appropriate for most VLM providers. The change is fully backward-compatible — with max_concurrent_llm=1 the behavior is identical to sequential execution.
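The backward-compatibility claim can be checked directly: with a semaphore of size 1, `asyncio.gather` degenerates to one-at-a-time execution. A small self-contained demo (all names here are illustrative; the peak counter is only for demonstration):

```python
import asyncio

async def demo(limit: int) -> int:
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def call(_: int) -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated VLM latency
            in_flight -= 1

    await asyncio.gather(*(call(i) for i in range(10)))
    return peak  # highest observed concurrency

# With limit=1 the gather is effectively sequential execution.
print(asyncio.run(demo(1)))
print(asyncio.run(demo(5)))
```

The first run never exceeds one in-flight call, matching the old sequential behavior; the second saturates the semaphore at its limit.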

…urrently

The semantic processor generates directory overviews by splitting large
directories into batches of 50 files and calling the VLM for each batch.
Previously, both file summary generation in _process_memory_directory and
batch overview generation in _batched_generate_overview ran sequentially,
causing directories with 1000+ files to take 15+ minutes as each VLM call
blocked the next.

This change runs both operations concurrently using asyncio.gather, bounded
by the existing max_concurrent_llm semaphore:

- _process_memory_directory: changed files now generate summaries in parallel
  instead of awaiting each one sequentially. Cached summaries are still
  reused for unchanged files.

- _batched_generate_overview: all batch prompts are pre-built, then
  dispatched concurrently via asyncio.gather with the llm semaphore
  controlling concurrency. Batch ordering is preserved via indexed list.

With max_concurrent_llm=20, a 1000-file directory that previously took
~15 minutes for the batch step now completes in ~23 seconds (~40x
improvement). The final merge step remains sequential as it depends on
all batches completing.

…ormatting

Thread llm_sem through _generate_overview and _batched_generate_overview
so callers can share a single semaphore across the full pipeline, preventing
concurrent calls from exceeding the intended concurrency limit.
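Threading one semaphore through the call chain might look like the sketch below. The function names mirror the commit message, but the signatures and bodies are assumptions for illustration, not the repository's actual code:

```python
import asyncio

# Both helpers accept the same semaphore so the whole pipeline
# shares a single concurrency budget (illustrative signatures).
async def _generate_overview(batch: list[str], llm_sem: asyncio.Semaphore) -> str:
    async with llm_sem:
        await asyncio.sleep(0)  # placeholder for the VLM call
        return f"overview({len(batch)} files)"

async def _batched_generate_overview(files: list[str],
                                     llm_sem: asyncio.Semaphore,
                                     batch_size: int = 50) -> list[str]:
    batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
    return await asyncio.gather(*(_generate_overview(b, llm_sem) for b in batches))

llm_sem = asyncio.Semaphore(20)  # one shared limit for the full pipeline
overviews = asyncio.run(
    _batched_generate_overview([f"f{i}" for i in range(120)], llm_sem)
)
print(len(overviews))
```

Passing the semaphore as a parameter, rather than creating one per function, is what prevents two concurrent callers from together exceeding the configured limit.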
