Skip to content

fix(session): prevent concurrent commit re-committing old messages#783

Merged
qin-ctx merged 4 commits intovolcengine:mainfrom
deepakdevp:fix/session-commit-race
Mar 23, 2026
Merged

fix(session): prevent concurrent commit re-committing old messages#783
qin-ctx merged 4 commits intovolcengine:mainfrom
deepakdevp:fix/session-commit-race

Conversation

@deepakdevp
Copy link
Contributor

Summary

  • Adds an asyncio.Lock to Session to serialize concurrent commit_async() calls
  • Reorders Phase 1 of commit_async() to clear live messages before the slow LLM summary generation, closing the race window where a second commit could see stale data
  • Includes rollback logic if the file-clear fails, so messages aren't lost

Fixes #580

Root Cause

commit_async() had no synchronization. When called concurrently, both calls would copy the same self._messages, generate separate archives, and trigger duplicate memory extraction. The race window spanned the entire LLM summary generation (seconds), during which the live messages.jsonl still contained the old messages.

Changes Made

  • openviking/session/session.py:
    • Added self._commit_lock = asyncio.Lock() to Session.__init__
    • Wrapped Phase 1 (copy + clear + file write) in async with self._commit_lock
    • Moved empty-check inside the lock to prevent both callers from passing it
    • Lock released before slow operations (LLM summary, memory extraction)
    • Added rollback: if file-clear fails, messages are restored from the copy
  • tests/session/test_session_commit_race.py (new): 2 tests
    • Concurrent commits produce exactly 1 archive (dedup)
    • Messages added during commit are not lost

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • 2 new race condition tests
  • Ruff lint + format clean
  • Existing commit tests unaffected (pre-existing VLM config requirement)

🤖 Generated with Claude Code

deepakdevp and others added 2 commits March 19, 2026 22:27
commit_async() now acquires an asyncio.Lock during Phase 1 (copy +
clear + file write). This prevents concurrent commits from re-committing
the same messages. The lock is released before the slow LLM summary and
memory extraction, so it doesn't block other operations.

The phase order is changed: live messages are cleared BEFORE the archive
summary is generated, closing the race window where a second commit
could see stale data. If the file-clear fails, messages are rolled back.

Fixes volcengine#580.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify that:
- Two concurrent commit_async() calls on the same session produce
  exactly one archive (the other returns early)
- Messages added while a commit is running are preserved in the
  session and not lost or re-committed

Part of fix for volcengine#580.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qin-ctx
Copy link
Collaborator

qin-ctx commented Mar 20, 2026

Thanks for the contribution! The race condition analysis is spot-on — there is indeed a window during commit_async() where concurrent calls can re-commit the same messages. However, we'd like to suggest a different approach for the locking mechanism.

Existing protections

The session commit path already has several layers of protection:

  1. HTTP API layer — TaskTracker (server/routers/sessions.py): create_if_no_running("session_commit", session_id) atomically rejects duplicate background commits for the same session at the API entry point.
  2. Phase 2 crash recovery — RedoLog: On restart, LockManager._recover_pending_redo() replays incomplete memory extraction tasks.

What's missing is exactly what you identified: Phase 1 atomicity — the gap between copying messages and clearing them, with a slow LLM summary call in between.

Why asyncio.Lock() is not the right fit here

The asyncio.Lock you added is an in-process, in-memory lock. It only protects concurrent calls within the same Python process on the same Session object. It won't help when:

  • The HTTP server runs multiple workers (each has its own process/memory)
  • Multiple service instances handle the same session
  • Direct Python client usage across threads

Suggested approach: use existing PathLock

We already have a filesystem-based distributed lock infrastructure in openviking/storage/transaction/PathLock + LockManager + LockContext. It:

  • Locks via .path.ovlock files on AGFS — works across processes and instances
  • Has fencing tokens, stale lock cleanup, and timeout support
  • Is already used elsewhere in the codebase for write coordination

The session has self._session_uri which maps to an AGFS path, so it naturally fits into this locking scheme. The Phase 1 critical section would look roughly like:

from openviking.storage.transaction import LockContext, get_lock_manager

session_path = self._viking_fs._uri_to_path(self._session_uri, ctx=self.ctx)
async with LockContext(get_lock_manager(), [session_path], "point"):
    if not self._messages:
        return result
    self._compression.compression_index += 1
    messages_to_archive = self._messages.copy()
    self._messages.clear()
    await self._write_to_agfs_async(messages=[])
# Lock released — slow LLM summary + archive write proceeds without holding the lock

What we do agree on

The reordering of Phase 1 operations (copy + clear before the LLM summary call) is the right call — it closes the race window regardless of lock implementation. The rollback logic on _write_to_agfs_async failure is also a good idea.

Would you be open to reworking this PR to use PathLock instead of asyncio.Lock? Happy to help if you have questions about the transaction module.

Replaces the in-process asyncio.Lock with the existing PathLock
(distributed filesystem lock via LockContext) for Phase 1 of
commit_async(). This ensures commit serialization works across
multiple HTTP workers and service instances, not just within a
single Python process.

Addresses review feedback from qin-ctx on PR volcengine#783.
@deepakdevp
Copy link
Contributor Author

Thanks @qin-ctx for the thorough review and the suggestion! You're absolutely right that asyncio.Lock is insufficient for multi-worker deployments.

I've replaced it with LockContext(get_lock_manager(), [session_path], lock_mode="point") — same pattern used in the content router and resource processor. The Phase 1 reordering and rollback logic remain unchanged.

Changes in the latest push:

  • Removed import asyncio and self._commit_lock = asyncio.Lock()
  • Phase 1 now uses async with LockContext(...) with the session's AGFS path
  • Docstring updated to reflect PathLock

Please take another look when you get a chance!

# Use filesystem-based distributed lock so this works across workers/processes.
session_path = self._viking_fs._uri_to_path(self._session_uri, ctx=self.ctx)
async with LockContext(get_lock_manager(), [session_path], lock_mode="point"):
if not self._messages:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing — the old code returned early on empty _messages before acquiring any lock. Now every commit_async() call on an empty session takes a filesystem PathLock just to check the list length and return. Probably negligible in practice, but if something is calling commit_async frequently (e.g. a keep-alive or periodic flush), the lock contention could add up. Worth moving the empty check back above the async with, or is there a reason it needs to be inside the lock now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — applied a double-check locking pattern in 2f85807. A fast pre-check for not self._messages now sits above the async with LockContext, so empty sessions skip the filesystem lock entirely (common case, zero I/O). The authoritative check inside the lock is still there to handle the race where two concurrent callers both pass the pre-check but only the first should archive. Thanks for the suggestion!

Add a double-check locking pattern to commit_async(): a fast pre-check
for empty _messages before acquiring the PathLock, with an authoritative
check inside the lock to handle concurrent callers.

This avoids unnecessary filesystem lock acquisition (and the associated
AGFS .path.ovlock round-trip) for the common case where commit_async()
is called on a session with no pending messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@qin-ctx qin-ctx merged commit 186157f into volcengine:main Mar 23, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Bug: async session commit can re-commit old messages before live session switch completes

3 participants