feat: Uptime Monitoring Alarm Integration#328

Open
dagangtj wants to merge 2 commits into databuddy-analytics:main from dagangtj:feat/uptime-alarm-integration

Conversation

@dagangtj

Closes #268

Summary

Implements uptime monitoring alarm integration as requested in bounty #268.

Changes

Database Schema

  • Added alarms table with support for multiple notification channels (Slack, Discord, Email, Webhook)
  • Added alarm_trigger_history table for audit trail
  • Proper indexes on user_id, organization_id, website_id, and enabled fields
  • Foreign key constraints with cascade delete
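As a rough illustration, rows from the two tables could be modeled with shapes like these. This is a hedged sketch inferred from the bullets above, not the PR's actual Drizzle schema; column and field names are assumptions.

```typescript
// Illustrative row shapes only; the real definitions live in
// packages/db/src/drizzle/schema.ts and may differ.

type NotificationChannel = "slack" | "discord" | "email" | "webhook";

interface AlarmRow {
  id: string;
  userId: string;                // indexed
  organizationId: string | null; // indexed
  websiteId: string;             // indexed, FK -> websites.id (ON DELETE CASCADE)
  enabled: boolean;              // indexed
  notificationChannels: { type: NotificationChannel; target: string }[];
  conditions: {
    consecutiveFailuresThreshold?: number;
    responseTimeThresholdMs?: number;
  };
}

interface AlarmTriggerHistoryRow {
  id: string;
  alarmId: string; // FK -> alarms.id
  websiteId: string;
  event: "down" | "up";
  triggeredAt: Date;
}

// Example row, as the uptime service might read it:
const exampleAlarm: AlarmRow = {
  id: "alarm_1",
  userId: "user_1",
  organizationId: null,
  websiteId: "site_1",
  enabled: true,
  notificationChannels: [{ type: "slack", target: "https://hooks.example.com/x" }],
  conditions: { consecutiveFailuresThreshold: 3 },
};
```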

Uptime Service Integration

  • State Tracking: Implemented in-memory state tracker to monitor consecutive failures and status changes
  • Alarm Processing: Integrated alarm trigger logic into uptime check workflow
  • Smart Notifications: Only triggers on status changes (up ↔ down) or threshold breaches
  • Duplicate Prevention: State tracker prevents spam notifications
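The state-tracking behaviour described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the type and method names are assumptions.

```typescript
// Minimal sketch of the in-memory state tracker described above.
// Type and method names are assumptions, not the PR's exact API.

type Status = "up" | "down";

interface MonitorState {
  status: Status;
  consecutiveFailures: number;
  lastStatusChange: number; // epoch ms of the last up <-> down transition
}

interface StateUpdate {
  previousStatus?: Status;
  consecutiveFailures: number;
  downtimeDuration?: number; // set only when the monitor just recovered
}

class StateTracker {
  private states = new Map<string, MonitorState>();

  updateState(monitorId: string, status: Status, timestamp: number): StateUpdate {
    const prev = this.states.get(monitorId);
    const statusChanged = prev !== undefined && prev.status !== status;

    // Compute the downtime from the PREVIOUS transition time, before it
    // is overwritten below; otherwise recovery always reports 0 ms.
    const downtimeDuration =
      statusChanged && status === "up" && prev
        ? timestamp - prev.lastStatusChange
        : undefined;

    let consecutiveFailures = 0;
    if (status === "down") {
      consecutiveFailures = prev && !statusChanged ? prev.consecutiveFailures + 1 : 1;
    }

    let lastStatusChange = timestamp;
    if (prev && !statusChanged) {
      lastStatusChange = prev.lastStatusChange;
    }

    this.states.set(monitorId, { status, consecutiveFailures, lastStatusChange });
    return { previousStatus: prev?.status, consecutiveFailures, downtimeDuration };
  }
}
```

Status-change detection (`previousStatus !== status`) is what lets callers trigger only on up/down transitions rather than on every poll.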

Notification System

  • Uses @databuddy/notifications package for Slack and Discord webhooks
  • Down Notification: Includes URL, HTTP status, downtime start, consecutive failures, error details
  • Up Notification: Includes URL, recovery time, downtime duration, response time
  • Proper error handling - notification failures don't crash uptime service

Features Implemented

✅ Uptime trigger integration (down/up events)
✅ Consecutive failures threshold support
✅ Response time threshold support (optional/stretch)
✅ Alarm assignment to websites via websiteId
✅ Duplicate notification prevention
✅ Alarm trigger history logging
✅ Multi-channel support (Slack, Discord ready; Email/Webhook structure in place)

Technical Details

  • Follows existing codebase patterns in apps/uptime/
  • Uses @databuddy/notifications package helpers
  • TypeScript with strict types
  • Proper error handling with tracing integration
  • Non-blocking alarm processing (failures logged but don't affect uptime checks)

Files Changed

  • packages/db/src/drizzle/schema.ts - Database schema for alarms
  • apps/uptime/src/alarms.ts - Alarm trigger and notification logic
  • apps/uptime/src/state-tracker.ts - State tracking for consecutive failures
  • apps/uptime/src/index.ts - Integration into uptime service

Testing Notes

  • Alarm processing only runs when websiteId is present in schedule
  • State tracker maintains in-memory state for each monitor
  • Notifications sent via Promise.allSettled (failures don't block)
  • All errors captured via tracing system
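The non-blocking dispatch pattern described above can be sketched like this. `sendNotification` is a hypothetical stand-in for the real `@databuddy/notifications` helpers, which this sketch does not reproduce.

```typescript
// Sketch of non-blocking notification dispatch via Promise.allSettled.
// sendNotification is a hypothetical stand-in for the real webhook helpers.

interface Channel {
  type: "slack" | "discord";
  webhookUrl: string;
}

async function sendNotification(channel: Channel, message: string): Promise<void> {
  // Stand-in: a real implementation would POST `message` to channel.webhookUrl.
  if (!channel.webhookUrl) {
    throw new Error(`missing webhook URL for ${channel.type}: ${message}`);
  }
}

// Returns how many channels were notified; rejections are logged, never
// rethrown, so a failed notification cannot break the surrounding uptime check.
async function dispatchAll(channels: Channel[], message: string): Promise<number> {
  const results = await Promise.allSettled(
    channels.map((channel) => sendNotification(channel, message))
  );
  for (const result of results) {
    if (result.status === "rejected") {
      console.error("notification failed:", result.reason);
    }
  }
  return results.filter((result) => result.status === "fulfilled").length;
}
```

Unlike `Promise.all`, `Promise.allSettled` never rejects, so one failing channel does not suppress delivery to the others.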

Next Steps (UI - Not in Scope)

This PR provides the backend foundation. UI implementation would include:

  • Alarm management page in dashboard settings
  • Alarm assignment UI in uptime monitoring section
  • Test notification button
  • Alarm trigger history view

Dependencies

Notes

  • Email and custom webhook providers are structured but not fully implemented (awaiting email service configuration)
  • Focused on core uptime alarm functionality as requested in bounty

- Add alarms and alarm_trigger_history tables to database schema
- Implement alarm trigger logic in uptime service
- Add state tracking for consecutive failures and status changes
- Integrate with @databuddy/notifications package for Slack/Discord
- Support configurable thresholds and notification channels
- Log alarm trigger history for audit trail
- Prevent duplicate notifications with smart state tracking

Closes databuddy-analytics#268
@vercel

vercel bot commented Feb 26, 2026

@dagangtj is attempting to deploy a commit to the Databuddy OSS Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@coderabbitai

coderabbitai bot commented Feb 26, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@dosubot

dosubot bot commented Feb 26, 2026

Related Documentation

Checked 1 published document(s) in 1 knowledge base(s). No updates required.


@greptile-apps

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR implements uptime monitoring alarm integration, adding two new database tables (alarms, alarm_trigger_history), an in-memory state tracker for consecutive failure counting, and alarm/notification dispatch logic wired into the existing uptime check flow. The overall architecture is sound and follows codebase conventions, but there are two functional bugs that need to be fixed before merging.

Key changes:

  • packages/db/src/drizzle/schema.ts — New alarms and alarm_trigger_history tables with indexes and FK constraints (missing FK on alarm_trigger_history.website_id)
  • apps/uptime/src/state-tracker.ts — In-memory singleton tracking status changes and consecutive failure counts; contains a critical bug where downtimeDuration always evaluates to 0
  • apps/uptime/src/alarms.ts — Alarm query, condition evaluation, notification dispatch, and history logging; the consecutive-failures threshold path has no deduplication and will spam notifications once exceeded
  • apps/uptime/src/index.ts — Clean integration of state tracker + alarm processing as a non-blocking side-effect of each uptime check

Issues found:

  • Critical: downtimeDuration in state-tracker.ts always calculates as 0 because lastStatusChange is overwritten to timestamp before the duration formula runs, making every recovery notification report "0 minutes" of downtime
  • Significant: The consecutive-failures threshold in shouldTriggerAlarm fires on every poll once consecutiveFailures >= threshold with no "already alerted" guard, causing notification spam — contradicting the stated duplicate-prevention behaviour
  • Notable: The in-memory StateTracker loses all state on service restart or when multiple replicas run, which can cause missed or duplicate alarms; this should at minimum be documented
  • No database migration file was included alongside the schema changes

Confidence Score: 2/5

  • Not safe to merge — a critical calculation bug means recovery notifications always report 0 minutes of downtime, and the threshold path will spam users once triggered.
  • Two functional bugs require fixes before this is production-ready: the downtimeDuration always-zero bug makes recovery notifications misleading, and the missing deduplication on the consecutive-failures threshold directly contradicts the advertised spam-prevention feature. The integration layer itself is clean and non-breaking, but the core alarm logic cannot ship as-is.
  • apps/uptime/src/state-tracker.ts (critical downtimeDuration bug) and apps/uptime/src/alarms.ts (threshold deduplication gap) need the most attention.

Important Files Changed

Filename Overview
apps/uptime/src/state-tracker.ts Contains a critical bug where lastStatusChange is overwritten before being used to compute downtimeDuration, causing the recovery notification to always report 0 minutes of downtime. Also stores all state in-memory with no persistence, losing context on restart or across replicas.
apps/uptime/src/alarms.ts Alarm processing and notification dispatch logic is generally sound, but the consecutive-failures threshold path has no deduplication guard, meaning an alarm fires on every poll once the threshold is exceeded — contradicting the stated spam-prevention behaviour. Minor: uses Promise<unknown>[] in violation of the project style guide.
packages/db/src/drizzle/schema.ts New alarms and alarm_trigger_history tables added with proper indexes and most foreign keys; alarm_trigger_history.website_id is missing a FK constraint. No accompanying migration file was added.
apps/uptime/src/index.ts Integration of state tracker and alarm processing is clean and non-blocking; errors are captured and logged without affecting the primary uptime check flow.

Sequence Diagram

sequenceDiagram
    participant QStash
    participant index.ts
    participant StateTracker
    participant alarms.ts
    participant DB
    participant Slack/Discord

    QStash->>index.ts: POST / (uptime check trigger)
    index.ts->>index.ts: lookupSchedule(scheduleId)
    index.ts->>index.ts: checkUptime(url)
    index.ts->>index.ts: sendUptimeEvent(result)

    alt websiteId present
        index.ts->>StateTracker: updateState(monitorId, status, timestamp)
        StateTracker-->>index.ts: { previousStatus, consecutiveFailures, downtimeDuration }

        index.ts->>alarms.ts: processUptimeAlarms(context)
        alarms.ts->>DB: getWebsiteAlarms(websiteId)
        DB-->>alarms.ts: alarms[]

        loop for each alarm
            alarms.ts->>alarms.ts: shouldTriggerAlarm(alarm, context)
            alt alarm triggers
                alarms.ts->>alarms.ts: buildDownNotification / buildUpNotification
                alarms.ts->>Slack/Discord: sendNotification (Promise.allSettled)
                alarms.ts->>DB: logAlarmTrigger (alarm_trigger_history)
            end
        end
    end

    index.ts-->>QStash: 200 OK

Last reviewed commit: 61fa51d


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 53 to 68

    if (statusChanged) {
      // Status changed
      consecutiveFailures = currentStatus === MonitorStatus.DOWN ? 1 : 0;
      lastStatusChange = timestamp;
    } else if (currentStatus === MonitorStatus.DOWN) {
      // Still down, increment failures
      consecutiveFailures += 1;
    } else {
      // Still up, reset failures
      consecutiveFailures = 0;
    }

    const downtimeDuration =
      statusChanged && currentStatus === MonitorStatus.UP
        ? timestamp - lastStatusChange
        : undefined;


downtimeDuration always calculates as zero

lastStatusChange is overwritten to timestamp inside the statusChanged block on line 56 — before being used in the downtime duration calculation on line 65. This means timestamp - lastStatusChange is always 0 when the site recovers, and the recovery notification will always show "0 minutes" for the downtime duration.

The original lastStatusChange value (from existing.lastStatusChange) must be captured before the local variable is reassigned:

Suggested change — before:

    if (statusChanged) {
      // Status changed
      consecutiveFailures = currentStatus === MonitorStatus.DOWN ? 1 : 0;
      lastStatusChange = timestamp;
    } else if (currentStatus === MonitorStatus.DOWN) {
      // Still down, increment failures
      consecutiveFailures += 1;
    } else {
      // Still up, reset failures
      consecutiveFailures = 0;
    }
    const downtimeDuration =
      statusChanged && currentStatus === MonitorStatus.UP
        ? timestamp - lastStatusChange
        : undefined;

After:

    const previousLastStatusChange = existing.lastStatusChange;
    if (statusChanged) {
      // Status changed
      consecutiveFailures = currentStatus === MonitorStatus.DOWN ? 1 : 0;
      lastStatusChange = timestamp;
    } else if (currentStatus === MonitorStatus.DOWN) {
      // Still down, increment failures
      consecutiveFailures += 1;
    } else {
      // Still up, reset failures
      consecutiveFailures = 0;
    }
    const downtimeDuration =
      statusChanged && currentStatus === MonitorStatus.UP
        ? timestamp - previousLastStatusChange
        : undefined;

Comment on lines 77 to 85

    // Check consecutive failures threshold
    if (
      conditions?.consecutiveFailuresThreshold &&
      context.consecutiveFailures
    ) {
      if (context.consecutiveFailures >= conditions.consecutiveFailuresThreshold) {
        return true;
      }
    }

Consecutive-failures threshold causes spam notifications

Once consecutiveFailures >= consecutiveFailuresThreshold, this condition is true on every subsequent check for as long as the site stays down. There is no "already triggered" guard for this path, so every uptime poll after the threshold is exceeded will fire a new notification — directly contradicting the "Duplicate Prevention" claim in the PR description.

The status-change path is fine (it only fires when previousStatus !== currentStatus), but this threshold path needs its own deduplication. One approach is to record a lastThresholdAlertAt (or a flag like thresholdAlerted: boolean) in MonitorState and reset it only when the site recovers. For example, in StateTracker:

// In MonitorState
thresholdAlerted: boolean;

// In updateState – reset on recovery
if (statusChanged && currentStatus === MonitorStatus.UP) {
  thresholdAlerted = false;
}

Then gate the threshold check in shouldTriggerAlarm so it only fires once per downtime episode.
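Concretely, the gate could look like this — a sketch following the reviewer's proposed `thresholdAlerted` flag; the `MonitorState` shape and function name here are illustrative, not the PR's actual code.

```typescript
// Sketch of once-per-episode threshold alerting, per the suggestion above.
// MonitorState and field names are illustrative, not the PR's exact shapes.

interface MonitorState {
  status: "up" | "down";
  consecutiveFailures: number;
  thresholdAlerted: boolean; // set once a threshold alarm fires this episode
}

function shouldFireThresholdAlarm(state: MonitorState, threshold: number): boolean {
  if (state.status === "up") {
    state.thresholdAlerted = false; // recovery re-arms the alarm
    return false;
  }
  if (state.consecutiveFailures >= threshold && !state.thresholdAlerted) {
    state.thresholdAlerted = true; // fire once, then stay silent while down
    return true;
  }
  return false;
}
```

With this guard, a monitor that stays down keeps incrementing `consecutiveFailures` but produces exactly one threshold notification per downtime episode.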

Comment on lines 154 to 155

    const channels = alarm.notificationChannels;
    const promises: Promise<unknown>[] = [];

unknown type violates project style guide

The project's style guide explicitly prohibits using any, unknown, or never — prefer an explicit type. Since all notification helpers return void (they're fire-and-forget), Promise<void> is more accurate here.

Suggested change (only the second line changes):

      const channels = alarm.notificationChannels;
    - const promises: Promise<unknown>[] = [];
    + const promises: Promise<void>[] = [];

Context Used: Context from dashboard - Basic guidelines for the project so vibe coders don't fuck it up (source)


Comment on lines +14 to +15

    class StateTracker {
      private states = new Map<string, MonitorState>();

In-memory state is lost on service restart

StateTracker stores all monitor state in a process-local Map. Any restart (deploy, crash, scale-down) wipes the entire state. After a restart:

  • A site that is already down will be treated as a first-ever check (previousStatus: undefined), so the shouldTriggerAlarm status-change condition (previousStatus && status !== previousStatus) will be false and the alarm won't fire until the next check.
  • The consecutiveFailures counter resets to 1, meaning a site that was already at or past a configured failure threshold will briefly stop triggering threshold-based alarms.

This is a known trade-off for in-memory state, but it should be documented in the class or flagged as a known limitation, especially since the PR advertises duplicate-prevention semantics that are violated after restarts.

If the service runs multiple replicas, each instance has a completely independent state — the same event may fire an alarm on one instance but be silently ignored by another, or fire redundantly on all.

@greptile-apps

greptile-apps bot commented Feb 26, 2026

Additional Comments (1)

packages/db/src/drizzle/schema.ts
alarm_trigger_history.website_id has no foreign key constraint

The alarms table has a properly constrained websiteId with a CASCADE DELETE foreign key to websites.id. alarmTriggerHistory also carries a websiteId column but has no corresponding constraint — orphaned history rows will accumulate if a website is deleted. Given that the alarms FK already cascades, the corresponding history rows would survive the website deletion but their websiteId would point to a non-existent row.

Consider adding:

foreignKey({
  columns: [table.websiteId],
  foreignColumns: [websites.id],
  name: "alarm_trigger_history_website_id_fkey",
}).onDelete("set null"),

- Fix downtimeDuration calculation bug in state-tracker.ts (was always 0)
- Add deduplication for consecutive failure threshold alerts
- Add missing foreign key constraint on alarm_trigger_history.website_id with CASCADE DELETE
- Fix Promise type annotation in alarms.ts (Promise<void>[] instead of Promise<unknown>[])
@dagangtj

Greptile Review Issues Fixed

Thanks for the detailed review! I've addressed all the issues mentioned:

1. ✅ Fixed downtimeDuration calculation bug (Critical)

Problem: lastStatusChange was being updated before calculating downtimeDuration, causing recovery notifications to always report 0 minutes of downtime.

Fix: Moved the duration calculation to happen BEFORE updating lastStatusChange in state-tracker.ts (lines 51-54).

2. ✅ Added deduplication for consecutive failure threshold (Significant)

Problem: Once consecutiveFailures reached the configured threshold, the alarm would fire on every poll, causing notification spam.

Fix: Added a Map to track which alarms have already fired at specific failure counts. The alarm now only triggers once when the threshold is first reached, and resets on status change (line 38 and lines 75-90).

3. ✅ Added missing foreign key constraint (Notable)

Problem: alarm_trigger_history.website_id had no FK constraint, allowing orphaned history rows when websites are deleted.

Fix: Added a foreign key constraint with CASCADE DELETE from alarm_trigger_history.website_id to websites.id in packages/db/src/drizzle/schema.ts (lines 1074-1078).

4. ✅ Fixed type annotation

Changed Promise<unknown>[] to Promise<void>[] in alarms.ts to match the project style guide.


Note on in-memory state: The in-memory design is intentional for this initial implementation. For production deployments with multiple replicas, we can add Redis/database persistence in a follow-up PR if needed.

All fixes are now pushed to this branch.



Development

Successfully merging this pull request may close these issues.

🎯 Bounty: Uptime Monitoring Alarm Integration

2 participants