
Store Durable Object alarms per-namespace on disk#6104

Open
threepointone wants to merge 3 commits into main from alarm-storage

Conversation

@threepointone
Contributor

(thanks opus)

Alarms in workerd are currently stored in a single global in-memory SQLite database. This means all scheduled alarms are lost on process restart, which prevents testing alarm resiliency scenarios and doesn't match production behavior.

This PR moves alarm scheduler ownership from Server into ActorNamespace, so each DO namespace gets its own AlarmScheduler backed by metadata.sqlite in the namespace's storage directory. On-disk namespaces get persistent alarms; in-memory namespaces get in-memory alarms.

Follows up on the design discussion in #605 (comment).

Design decisions

Per-namespace, not global. Kenton's original feedback on #605 was that a single global alarm database is problematic: it decouples alarm storage from namespace data, it makes splitting/combining configs lossy, and it creates the confusing possibility of on-disk DOs with in-memory alarms (or vice versa). Each namespace now owns its scheduler and stores alarms alongside its actor .sqlite files.

No new configuration. Alarm storage mode is inferred from the existing durableObjectStorage setting. If the namespace uses localDisk, alarms go on disk. If it's inMemory or none, alarms stay in memory. This was an explicit goal from the #605 discussion -- no new config knobs needed.

metadata.sqlite as the filename. Named generically (rather than alarms.sqlite) so it can hold other per-namespace metadata in the future, as Kenton suggested.

AlarmScheduler class unchanged. The class already accepted a VFS + path in its constructor, so no API changes were needed. The only change is where schedulers are created and who owns them.

Changes

  • server.h: Removed global AlarmScheduler member, startAlarmScheduler() declaration, and the alarm-scheduler.h include (moved to .c++).
  • server.c++: ActorNamespace now owns a kj::Maybe<kj::Own<AlarmScheduler>>. Created in link() using the namespace's own VFS (on-disk) or an in-memory VFS (fallback). The namespace self-registers with its scheduler at link time. Removed the global scheduler, LinkedIoChannels::alarmScheduler, and all related wiring.
  • server-test.c++: Added a new "Durable Object alarm persistence (on disk)" test that sets an alarm, tears down the server, restarts with the same storage directory, and verifies the alarm survived. Updated two existing tests whose file-count assertions now include metadata.sqlite.

Notes for reviewers

  • alarm-scheduler.h and alarm-scheduler.c++ are completely untouched.
  • The kj::systemPreciseCalendarClock() call in the ActorNamespace constructor is a pre-existing pattern -- the old startAlarmScheduler() called it the same way. Threading a calendar clock through the Server constructor would be cleaner but significantly more churn for no immediate benefit.
  • Declaration order in ActorNamespace matters: ownAlarmScheduler is declared before actors so it outlives all actors that reference it (same constraint as the existing actorStorage field).
  • The this capture in the registerNamespace lambda is safe because the scheduler is owned by the namespace and destroyed first. This matches the original pattern which captured &actorNs.

@threepointone threepointone requested review from a team as code owners February 18, 2026 14:22
@github-actions

github-actions bot commented Feb 18, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@threepointone
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@threepointone
Contributor Author

recheck

github-actions bot added a commit that referenced this pull request Feb 18, 2026
@kentonv kentonv self-requested a review February 18, 2026 14:40
@threepointone
Contributor Author

the workers-sdk test failure seems to be just about fixing the file count on the workers-sdk side
I don't understand the windows failure tho

@southpolesteve
Contributor

/bigbonk roast this PR. Hi Sunil. testing bonk

@threepointone
Contributor Author

/bigbonk roast this PR

@ask-bonk
Contributor

ask-bonk bot commented Feb 25, 2026

OK, I've seen enough. Time to roast.


The Roast

"(thanks opus)" -- The PR description is literally better written than most theses I've reviewed, and you're crediting it to a chatbot. Sunil, this is the equivalent of showing up to a potluck with a Michelin-star catered dish and going "oh yeah my microwave did this."

The actual code though? Let's talk about it.

The link() method is doing a three-act play. Lines 2211-2249 of server.c++: Act 1: maybe create an on-disk scheduler. Act 2: if that didn't work, maybe create an in-memory scheduler. Act 3: if either worked, register the namespace. This method is begging to be broken up. It reads like a choose-your-own-adventure book where every path ends at the same page anyway.

You check config.tryGet<Durable>() THREE separate times in link(). Once at line 2213, again at line 2226, and a third time at line 2239. The config isn't changing between those calls. You could just... save it in a variable. The Durable config didn't run away.

The this capture in the registerNamespace lambda (line 2242) -- you wrote a whole paragraph in the PR description justifying why it's safe, which is the universal code smell for "I know this looks sketchy." The old code captured &actorNs (a reference to a local) which was also sketchy, so I guess you're maintaining tradition.

kj::systemPreciseCalendarClock() is now threaded into ActorNamespace as a const kj::Clock& member (line 2906), but you openly admit in the PR description this is tech debt you're choosing to perpetuate. Bold move to document your own sins and ship them anyway.

The test (line 2293+) sets an alarm for 4102444800000 (January 1, 2100). Optimistic that workerd will still be around in 74 years. Even more optimistic that someone won't break this test before then. At least you didn't pick Date.now() + 1000 and create a flaky test, so credit where it's due.

File count assertions -- you bumped 4 to 6 and 2 to 3 across multiple tests. These are the kind of magic-number assertions that will haunt every future contributor who adds a new file to namespace storage. The next person to touch this will spend 30 minutes figuring out why the count is off by one, look at the git blame, and curse your name.

metadata.sqlite as a filename -- the PR description says "named generically so it can hold other per-namespace metadata in the future." This is the architectural equivalent of buying a 5-bedroom house "in case we have kids someday." You might, but right now it's just one alarm table rattling around in there.

The deletion of startAlarmScheduler() and the global alarmScheduler member from Server is clean, I'll give you that. Removing ~30 lines of global state wiring and the LinkedIoChannels::alarmScheduler field is the kind of simplification that actually makes the codebase better. This is the one part Opus can't take credit for because it required understanding the existing architecture.

The facets test fix (line 5031+) where you added || name.startsWith("metadata.sqlite") to the assertion loop -- this is the "I broke an existing test and patched it" special. Functional? Yes. Elegant? It's duct tape on a tuxedo.

Bottom line: The design is sound -- per-namespace alarm storage is clearly the right call, the PR description is immaculate (thanks, Opus), and the test coverage is genuinely good. But the implementation has the telltale signs of "an AI wrote the structure and a human glued it together" -- slightly redundant control flow, a few too many "trust me, this is fine" comments, and magic numbers scattered like confetti.

7/10 -- would merge after a cleanup pass. The architecture carries it.

github run

Member

@kentonv kentonv left a comment


Sorry for slow review.

Comment on lines +2214 to +2219
this->actorStorage.emplace(dir.openSubdir(
kj::Path({d.uniqueKey}), kj::WriteMode::CREATE | kj::WriteMode::MODIFY));

// Create per-namespace alarm scheduler backed by on-disk storage in the
// namespace directory, alongside the per-actor .sqlite files.
auto& as = KJ_ASSERT_NONNULL(this->actorStorage);

Suggested change

-this->actorStorage.emplace(dir.openSubdir(
-    kj::Path({d.uniqueKey}), kj::WriteMode::CREATE | kj::WriteMode::MODIFY));
-// Create per-namespace alarm scheduler backed by on-disk storage in the
-// namespace directory, alongside the per-actor .sqlite files.
-auto& as = KJ_ASSERT_NONNULL(this->actorStorage);
+auto& as = this->actorStorage.emplace(dir.openSubdir(
+    kj::Path({d.uniqueKey}), kj::WriteMode::CREATE | kj::WriteMode::MODIFY));
+// Create per-namespace alarm scheduler backed by on-disk storage in the
+// namespace directory, alongside the per-actor .sqlite files.


KJ_IF_SOME(d, config.tryGet<Durable>()) {
auto idFactory = kj::heap<ActorIdFactoryImpl>(d.uniqueKey);
sched->registerNamespace(d.uniqueKey,

So, AlarmScheduler is currently designed to handle multiple namespaces registered at once, and to that end the table it creates has a column called actor_unique_key which identifies the DO namespace.

But now that we store a separate database per namespace, we don't really want that anymore. It's just wasting bytes and potentially causing headaches if someone wants to change keys or something.

Seems like we should update the AlarmScheduler implementation to remove the redundant functionality.

This is something that's best to do now, not later after a bunch of these tables have already been written to disk...

@threepointone
Contributor Author

got the clanker to rebase and force push a rewrite, 🏓

@kentonv
Member

kentonv commented Mar 4, 2026

LGTM but how do we make Bonk re-review?

@jasnell
Collaborator

jasnell commented Mar 5, 2026

/bigbonk review this pr again

@ask-bonk
Contributor

ask-bonk bot commented Mar 5, 2026

ResolveMessage: Cannot find module '@opencode-ai/plugin' from '/home/runner/work/workerd/workerd/.opencode/tools/bazel-deps.ts'

github run

@ask-bonk
Contributor

ask-bonk bot commented Mar 5, 2026

@threepointone Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

@threepointone
Contributor Author

/bigbonk review this pr again

@ask-bonk
Contributor

ask-bonk bot commented Mar 9, 2026

error: Your local changes to the following files would be overwritten by checkout:
.opencode/package.json
Please commit your changes or stash them before you switch branches.
Aborting

github run

@kentonv
Member

kentonv commented Mar 12, 2026

Maybe you need to rebase on main to pick up bonk fixes?

@kentonv
Member

kentonv commented Mar 15, 2026

/bigbonk review this pr again

@github-actions

github-actions bot commented Mar 15, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

Contributor

@ask-bonk ask-bonk bot left a comment


Review Summary

kentonv's two review comments are addressed: the actor_unique_key column has been removed from the _cf_ALARM schema, and the registerNamespace() / multi-namespace machinery has been stripped from AlarmScheduler in favor of a single GetActorFn passed at construction time. kentonv marked LGTM.

Findings below are minor — the design is sound and the implementation is correct. The per-namespace ownership model is clean, lifetime safety looks good, and the persistence test is a valuable addition.

This review was generated by an AI assistant (ask-bonk) and may contain errors. Please verify any suggestions before applying them.


}

this->alarmScheduler = alarmScheduler;
KJ_IF_SOME(d, config.tryGet<Durable>()) {

Nit: link() calls config.tryGet<Durable>() twice -- once at line 2229 inside the storage block, and again here for the alarm scheduler block. kentonv's original suggestion was to fold these together. The second check could be nested inside the first to avoid the redundant dispatch:

void link(kj::Maybe<const kj::Directory&> serviceActorStorage) {
  KJ_IF_SOME(d, config.tryGet<Durable>()) {
    KJ_IF_SOME(dir, serviceActorStorage) {
      this->actorStorage.emplace(dir.openSubdir(
          kj::Path({d.uniqueKey}), kj::WriteMode::CREATE | kj::WriteMode::MODIFY));
    }

    auto idFactory = kj::heap<ActorIdFactoryImpl>(d.uniqueKey);
    // ... rest of alarm scheduler setup
  }
}

That said, kentonv already marked LGTM on the current structure, so this is just a nit.


// Having torn everything down, the WAL files should be gone.
-KJ_EXPECT(dir->openSubdir(kj::Path({"mykey"}))->listNames().size() == 2);
+KJ_EXPECT(dir->openSubdir(kj::Path({"mykey"}))->listNames().size() == 3);

After teardown the count is 3 (two actor .sqlite files + metadata.sqlite), but metadata.sqlite isn't explicitly asserted here unlike the two actor files on lines 2271-2274. Consider adding:

KJ_EXPECT(dir->exists(kj::Path({"mykey", "metadata.sqlite"})));

for symmetry with the assertions above and to make the test self-documenting about what the third file is.

auto dir = kj::newInMemoryDirectory(kj::nullClock());

// A far-future alarm time (won't fire during the test).
kj::StringPtr alarmTime = "4102444800000";

Nit: 4102444800000 is January 1, 2100 UTC in milliseconds. A comment or named constant would make this clearer at a glance, e.g.:

// January 1, 2100 00:00:00 UTC in milliseconds since epoch
kj::StringPtr alarmTime = "4102444800000";

The comment on line 2359 helps, but embedding the date in a constant name or a more explicit inline comment would help future readers.

@ask-bonk
Contributor

ask-bonk bot commented Mar 15, 2026

Review posted on PR #6104. Here's a summary of what I found:

Status: kentonv's two review comments are both addressed in the second commit. He asked to (1) restructure link() to use the emplace() return value directly, and (2) remove the redundant actor_unique_key column from the alarm schema. The second ask was fully implemented; the first was partially addressed through a restructuring. kentonv marked LGTM.

Findings posted (all minor nits):

  1. server.c++:2235 -- config.tryGet<Durable>() is called twice in link(). Could be consolidated into a single outer KJ_IF_SOME block.

  2. server-test.c++:2270 -- After teardown, the file count asserts 3 but only 2 files are explicitly checked. Adding an explicit assertion for metadata.sqlite would make the test self-documenting.

  3. server-test.c++:2360 -- The magic number 4102444800000 (Jan 1, 2100 UTC ms) could use a more explicit inline comment about what date it represents.

No blocking issues found. The design is sound -- per-namespace alarm storage with correct lifetime ordering, safe kj::Clock& reference (backed by static constexpr), proper VFS scoping, and the this capture in the getActor lambda is safe because the scheduler is owned by the namespace.

github run

@kentonv
Member

kentonv commented Mar 15, 2026

I rebased on main. But it appears that #1918 was merged recently and it conflicts with this PR -- not explicitly like a merge conflict, but the resulting code doesn't compile.

At a glance I'm not really sure what the right answer is. That PR makes abortAllDurableObjects() delete alarms, and says this is important for vitest-pool-workers.

But in the abstract, it doesn't seem like "abort all durable objects" should imply "delete all alarms"! Aborting is not deleting!

Perhaps in practice this abortAllDurableObjects() is only used from vitest? And in that context, deleting the alarms makes sense? I don't know.

Do we really just want to delete the alarms, or would it make more sense to delete all the data? I feel like "delete alarms but not content" just means your DOs are left in an inconsistent state. Maybe vitest is already deleting content in these cases, after it aborts the DOs?

Can someone with context on vitest suggest what is really the right thing to do here? @petebacondarwin, @dario-piotrowicz, @penalosa?

@petebacondarwin
Contributor

> I rebased on main. But it appears that #1918 was merged recently and it conflicts with this PR -- not explicitly like a merge conflict, but the resulting code doesn't compile.
>
> At a glance I'm not really sure what the right answer is. That PR makes abortAllDurableObjects() delete alarms, and says this is important for vitest-pool-workers.
>
> But in the abstract, it doesn't seem like "abort all durable objects" should imply "delete all alarms"! Aborting is not deleting!
>
> Perhaps in practice this abortAllDurableObjects() is only used from vitest? And in that context, deleting the alarms makes sense? I don't know.
>
> Do we really just want to delete the alarms, or would it make more sense to delete all the data? I feel like "delete alarms but not content" just means your DOs are left in an inconsistent state. Maybe vitest is already deleting content in these cases, after it aborts the DOs?
>
> Can someone with context on vitest suggest what is really the right thing to do here? @petebacondarwin, @dario-piotrowicz, @penalosa?

I think for Vitest we just want to cancel a "session", which involves aborting DOs but also deleting alarms. How about we split the API into two (abort DOs and delete alarms) then Vitest can call both and abortAllDurableObjects would no longer delete the alarms?

threepointone and others added 2 commits March 18, 2026 18:14
Previously, all alarms were stored in a single global in-memory SQLite
database, meaning they were lost on process restart. This made it
impossible to test alarm resiliency scenarios.

Move alarm scheduler ownership from Server into ActorNamespace, so each
DO namespace gets its own AlarmScheduler backed by metadata.sqlite in
the namespace's storage directory. On-disk namespaces get persistent
alarms; in-memory namespaces get in-memory alarms. No new configuration
is needed -- alarm storage follows the existing durableObjectStorage
setting.

Ref: #605 (comment)
Co-authored-by: Cursor <cursoragent@cursor.com>
Address review feedback from kentonv: since each DO namespace now has
its own separate SQLite database for alarm storage, the actor_unique_key
column in the _cf_ALARM table is redundant. Previously, AlarmScheduler
was designed to handle multiple namespaces in a single database, using
actor_unique_key to distinguish between them. Now that we store a
separate database per namespace, this column just wastes bytes and could
cause headaches if someone wants to change keys later. As kentonv noted,
this is best fixed now before any of these tables are written to disk in
the wild.

Changes:

- Simplified ActorKey struct to contain only actorId (removed uniqueKey)
- Simplified _cf_ALARM schema from (actor_unique_key, actor_id,
  scheduled_time) with a compound primary key to (actor_id, scheduled_time)
  with actor_id as the sole primary key
- Removed registerNamespace() -- the GetActorFn is now passed directly
  to the AlarmScheduler constructor, since each scheduler serves exactly
  one namespace
- Replaced the namespaces HashMap with a single GetActorFn member
- Restructured ActorNamespace::link() in server.c++ to create the
  GetActorFn once and pass it to whichever storage path (on-disk or
  in-memory) is used, eliminating duplication
- Removed unused setAlarmInMemory declaration and unused byte type alias
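Based on the column names given above, the schema change can be sketched as the following DDL (the exact statements in workerd may differ, and the composition of the old compound primary key is an assumption):

```sql
-- Before: one global table served every namespace, so rows were keyed by
-- namespace (actor_unique_key) plus actor. Compound key shown is assumed.
CREATE TABLE IF NOT EXISTS _cf_ALARM (
  actor_unique_key TEXT,
  actor_id TEXT,
  scheduled_time INTEGER,
  PRIMARY KEY (actor_unique_key, actor_id)
);

-- After: one metadata.sqlite per namespace, so the namespace is implicit
-- and actor_id alone identifies the row.
CREATE TABLE IF NOT EXISTS _cf_ALARM (
  actor_id TEXT PRIMARY KEY,
  scheduled_time INTEGER
);
```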
@kentonv
Member

kentonv commented Mar 18, 2026

OK, #1918 was reverted in #6355 as it turns out it's no longer relevant -- vitest-pool-workers actually stopped using the relevant API last week.

I've rebased this on top of the revert, so hopefully it is now able to compile again.

@penalosa
Contributor

penalosa commented Mar 19, 2026

I think the workers-sdk test failures are genuine. I imagine we'll need to filter out the metadata.sqlite filename when listing created DOs, which should be simple enough: cloudflare/workers-sdk#12960

penalosa added a commit to cloudflare/workers-sdk that referenced this pull request Mar 19, 2026
An upcoming workerd change (cloudflare/workerd#6104) stores per-namespace
alarm metadata in a metadata.sqlite file alongside per-actor .sqlite
files. Filter it out so it isn't counted as a Durable Object instance.
@kentonv
Member

kentonv commented Mar 19, 2026

The Windows failure looks genuine but unfortunately the stack trace isn't symbolized which makes it pretty hard for me to debug. :/

@penalosa
Contributor

> I think the workers-sdk test failures are genuine. I imagine we'll need to filter out the metadata.sqlite filename when listing created DOs, which should be simple enough: cloudflare/workers-sdk#12960

These are now green after the workers-sdk fix

@kentonv
Member

kentonv commented Mar 20, 2026

I've sent both Opus 4.6 and GPT 5.4 off to carefully read the code and try to blind-debug the Windows issue, since I don't have a Windows machine available.

If that doesn't work, I'm going to need someone with a Windows machine to try to do some debugging.

@kentonv
Member

kentonv commented Mar 20, 2026

They both got it.

ChatGPT 5.4: 99k tokens (7 minutes)
Claude Opus 4.6: 129k tokens (23 minutes)

I think Opus is running a little slow today, maybe overloaded, but GPT is the clear winner either way.

GPT:
gpt-answer

Opus:
opus-answer
