fix(session): prevent concurrent commit re-committing old messages by deepakdevp · Pull Request #783 · volcengine/OpenViking

deepakdevp · 2026-03-19T13:30:27Z

Summary

Adds an asyncio.Lock to Session to serialize concurrent commit_async() calls
Reorders Phase 1 of commit_async() to clear live messages before the slow LLM summary generation, closing the race window where a second commit could see stale data
Includes rollback logic if the file-clear fails, so messages aren't lost

Fixes #580

Root Cause

commit_async() had no synchronization. When called concurrently, both calls would copy the same self._messages, generate separate archives, and trigger duplicate memory extraction. The race window spanned the entire LLM summary generation (seconds), during which the live messages.jsonl still contained the old messages.

Changes Made

openviking/session/session.py:
- Added self._commit_lock = asyncio.Lock() to Session.__init__
- Wrapped Phase 1 (copy + clear + file write) in async with self._commit_lock
- Moved empty-check inside the lock to prevent both callers from passing it
- Lock released before slow operations (LLM summary, memory extraction)
- Added rollback: if file-clear fails, messages are restored from the copy
tests/session/test_session_commit_race.py (new): 2 tests
- Concurrent commits produce exactly 1 archive (dedup)
- Messages added during commit are not lost

Type of Change

Bug fix (non-breaking change which fixes an issue)

Testing

2 new race condition tests
Ruff lint + format clean
Existing commit tests unaffected (pre-existing VLM config requirement)

🤖 Generated with Claude Code

commit_async() now acquires an asyncio.Lock during Phase 1 (copy + clear + file write). This prevents concurrent commits from re-committing the same messages. The lock is released before the slow LLM summary and memory extraction, so it doesn't block other operations. The phase order is changed: live messages are cleared BEFORE the archive summary is generated, closing the race window where a second commit could see stale data. If the file-clear fails, messages are rolled back. Fixes volcengine#580. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tests verify that: - Two concurrent commit_async() calls on the same session produce exactly one archive (the other returns early) - Messages added while a commit is running are preserved in the session and not lost or re-committed Part of fix for volcengine#580. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

qin-ctx · 2026-03-20T03:06:10Z

Thanks for the contribution! The race condition analysis is spot-on — there is indeed a window during commit_async() where concurrent calls can re-commit the same messages. However, we'd like to suggest a different approach for the locking mechanism.

Existing protections

The session commit path already has several layers of protection:

HTTP API layer — TaskTracker (server/routers/sessions.py): create_if_no_running("session_commit", session_id) atomically rejects duplicate background commits for the same session at the API entry point.
Phase 2 crash recovery — RedoLog: On restart, LockManager._recover_pending_redo() replays incomplete memory extraction tasks.

What's missing is exactly what you identified: Phase 1 atomicity — the gap between copying messages and clearing them, with a slow LLM summary call in between.

Why `asyncio.Lock()` is not the right fit here

The asyncio.Lock you added is an in-process, in-memory lock. It only protects concurrent calls within the same Python process on the same Session object. It won't help when:

The HTTP server runs multiple workers (each has its own process/memory)
Multiple service instances handle the same session
Direct Python client usage across threads

Suggested approach: use existing `PathLock`

We already have a filesystem-based distributed lock infrastructure in openviking/storage/transaction/ — PathLock + LockManager + LockContext. It:

Locks via .path.ovlock files on AGFS — works across processes and instances
Has fencing tokens, stale lock cleanup, and timeout support
Is already used elsewhere in the codebase for write coordination

The session has self._session_uri which maps to an AGFS path, so it naturally fits into this locking scheme. The Phase 1 critical section would look roughly like:

from openviking.storage.transaction import LockContext, get_lock_manager

session_path = self._viking_fs._uri_to_path(self._session_uri, ctx=self.ctx)
async with LockContext(get_lock_manager(), [session_path], "point"):
    if not self._messages:
        return result
    self._compression.compression_index += 1
    messages_to_archive = self._messages.copy()
    self._messages.clear()
    await self._write_to_agfs_async(messages=[])
# Lock released — slow LLM summary + archive write proceeds without holding the lock

What we do agree on

The reordering of Phase 1 operations (copy + clear before the LLM summary call) is the right call — it closes the race window regardless of lock implementation. The rollback logic on _write_to_agfs_async failure is also a good idea.

Would you be open to reworking this PR to use PathLock instead of asyncio.Lock? Happy to help if you have questions about the transaction module.

Replaces the in-process asyncio.Lock with the existing PathLock (distributed filesystem lock via LockContext) for Phase 1 of commit_async(). This ensures commit serialization works across multiple HTTP workers and service instances, not just within a single Python process. Addresses review feedback from qin-ctx on PR volcengine#783.

deepakdevp · 2026-03-20T03:37:25Z

Thanks @qin-ctx for the thorough review and the suggestion! You're absolutely right that asyncio.Lock is insufficient for multi-worker deployments.

I've replaced it with LockContext(get_lock_manager(), [session_path], lock_mode="point") — same pattern used in the content router and resource processor. The Phase 1 reordering and rollback logic remain unchanged.

Changes in the latest push:

Removed import asyncio and self._commit_lock = asyncio.Lock()
Phase 1 now uses async with LockContext(...) with the session's AGFS path
Docstring updated to reflect PathLock

Please take another look when you get a chance!

Bortlesboat · 2026-03-22T05:15:22Z

openviking/session/session.py

+        # Use filesystem-based distributed lock so this works across workers/processes.
+        session_path = self._viking_fs._uri_to_path(self._session_uri, ctx=self.ctx)
+        async with LockContext(get_lock_manager(), [session_path], lock_mode="point"):
+            if not self._messages:


Small thing — the old code returned early on empty _messages before acquiring any lock. Now every commit_async() call on an empty session takes a filesystem PathLock just to check the list length and return. Probably negligible in practice, but if something is calling commit_async frequently (e.g. a keep-alive or periodic flush), the lock contention could add up. Worth moving the empty check back above the async with, or is there a reason it needs to be inside the lock now?

Good catch — applied a double-check locking pattern in 2f85807. A fast pre-check for not self._messages now sits above the async with LockContext, so empty sessions skip the filesystem lock entirely (common case, zero I/O). The authoritative check inside the lock is still there to handle the race where two concurrent callers both pass the pre-check but only the first should archive. Thanks for the suggestion!

Add a double-check locking pattern to commit_async(): a fast pre-check for empty _messages before acquiring the PathLock, with an authoritative check inside the lock to handle concurrent callers. This avoids unnecessary filesystem lock acquisition (and the associated AGFS .path.ovlock round-trip) for the common case where commit_async() is called on a session with no pending messages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

deepakdevp and others added 2 commits March 19, 2026 22:27

github-project-automation bot added this to OpenViking project Mar 19, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 19, 2026

Bortlesboat reviewed Mar 22, 2026

View reviewed changes

qin-ctx approved these changes Mar 23, 2026

View reviewed changes

qin-ctx merged commit 186157f into volcengine:main Mar 23, 2026
6 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(session): prevent concurrent commit re-committing old messages#783

fix(session): prevent concurrent commit re-committing old messages#783
qin-ctx merged 4 commits intovolcengine:mainfrom
deepakdevp:fix/session-commit-race

deepakdevp commented Mar 19, 2026

Uh oh!

qin-ctx commented Mar 20, 2026

Uh oh!

deepakdevp commented Mar 20, 2026

Uh oh!

Bortlesboat Mar 22, 2026

Uh oh!

deepakdevp Mar 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

deepakdevp commented Mar 19, 2026

Summary

Root Cause

Changes Made

Type of Change

Testing

Uh oh!

qin-ctx commented Mar 20, 2026

Existing protections

Why asyncio.Lock() is not the right fit here

Suggested approach: use existing PathLock

What we do agree on

Uh oh!

deepakdevp commented Mar 20, 2026

Uh oh!

Bortlesboat Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

deepakdevp Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Why `asyncio.Lock()` is not the right fit here

Suggested approach: use existing `PathLock`