Add Batch Upgrade Operations and Fix Reconciler/Orphan Handling Bugs#58

Merged
krutten merged 17 commits into blacksmith-community:main from haochenhu233:main
Feb 13, 2026
@haochenhu233

This PR introduces batch upgrade capabilities for service instances and fixes critical bugs in the reconciler's orphan detection logic.

New Features

Batch Service Upgrades

  • Upgrade existing services: Operators can now upgrade service instances to newer stemcells directly from the Web UI
  • Stemcell filtering by CPI: The stemcell list shows only stemcells that match the correct Cloud Provider Interface (e.g., *.bosh suffix), which avoids duplicate entries from legacy CPIs
  • Separate BatchDirector connection pool: Dedicated BOSH connection pool for batch operations to avoid impacting normal broker operations
  • Stemcell information display: Batch Operation tab now shows current stemcell info for each service instance
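
The CPI filter described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Stemcell` struct, its fields, the `filterStemcellsByCPI` helper, and the sample stemcell names and versions are all hypothetical, and it assumes the filter is a plain suffix match on the CPI name plus de-duplication by name and version.

```go
package main

import (
	"fmt"
	"strings"
)

// Stemcell is a hypothetical, trimmed-down view of a BOSH stemcell record.
type Stemcell struct {
	Name    string
	Version string
	CPI     string
}

// filterStemcellsByCPI keeps stemcells whose CPI name ends with cpiSuffix
// and drops repeated name/version pairs, keeping the first match.
func filterStemcellsByCPI(all []Stemcell, cpiSuffix string) []Stemcell {
	seen := make(map[string]bool)
	var out []Stemcell
	for _, s := range all {
		if !strings.HasSuffix(s.CPI, cpiSuffix) {
			continue // uploaded for a different (e.g. legacy) CPI
		}
		key := s.Name + "/" + s.Version
		if seen[key] {
			continue // already listed for a matching CPI
		}
		seen[key] = true
		out = append(out, s)
	}
	return out
}

func main() {
	// Sample data is made up for illustration.
	all := []Stemcell{
		{Name: "ubuntu-jammy", Version: "1.200", CPI: "vsphere.bosh"},
		{Name: "ubuntu-jammy", Version: "1.200", CPI: "vsphere-legacy"},
		{Name: "ubuntu-jammy", Version: "1.300", CPI: "vsphere.bosh"},
	}
	for _, s := range filterStemcellsByCPI(all, ".bosh") {
		fmt.Println(s.Name, s.Version, s.CPI)
	}
}
```

With the sample data, only the two `vsphere.bosh` entries remain, so each stemcell version appears once in the upgrade dropdown.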

VM Monitor Enhancements

  • VM monitor status endpoint: New /b/vm-monitor-status endpoint for debugging VM health tracking
  • Stemcell info fallback: VM monitor provides stemcell information when manifest data is unavailable

Bug Fixes

Reconciler & Orphan Handling (Critical)

  • Fixed vicious cycle bug: Services marked orphaned for >24 hours were incorrectly treated as "deleted", causing FilterServiceDeployments() to skip their existing BOSH deployments permanently. They could never be un-orphaned.
  • Fixed empty scan marking all as orphaned: When GetDeployments() failed or returned empty (BOSH temporarily unavailable), ALL services were incorrectly marked as orphaned.
  • Added un-orphan logic: Services are now un-orphaned when their BOSH deployment is found in subsequent scans.
  • Added orphan timer reset on startup: Each Blacksmith redeploy resets orphan timers, giving operators 24 hours to investigate before cleanup.
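
The four fixes above interact, so a small sketch may help. It is an assumption-laden illustration rather than the reconciler's actual code: `Instance`, `Reconciler`, `Scan`, `ResetOrphanTimers`, and `PastGracePeriod` are hypothetical names, and it assumes a failed or empty scan is simply skipped.

```go
package main

import (
	"fmt"
	"time"
)

// orphanGracePeriod mirrors the 24-hour window mentioned in the description.
const orphanGracePeriod = 24 * time.Hour

// Instance is a hypothetical service-instance record.
type Instance struct {
	ID         string
	Deployment string
	OrphanedAt time.Time // zero value means "not orphaned"
}

// Orphaned reports whether the instance is currently marked orphaned.
func (i *Instance) Orphaned() bool { return !i.OrphanedAt.IsZero() }

// Reconciler holds an injectable clock so the logic is easy to test.
type Reconciler struct {
	now func() time.Time
}

func NewReconciler() *Reconciler { return &Reconciler{now: time.Now} }

// Scan applies one deployment scan. It returns false, and changes nothing,
// when the scan failed or came back empty (BOSH may be unavailable).
func (r *Reconciler) Scan(instances []*Instance, deployments []string, scanErr error) bool {
	if scanErr != nil || len(deployments) == 0 {
		return false
	}
	found := make(map[string]bool, len(deployments))
	for _, d := range deployments {
		found[d] = true
	}
	for _, inst := range instances {
		switch {
		case found[inst.Deployment]:
			inst.OrphanedAt = time.Time{} // un-orphan, however long it was orphaned
		case !inst.Orphaned():
			inst.OrphanedAt = r.now() // start the grace period
		}
	}
	return true
}

// ResetOrphanTimers restarts the grace period, e.g. at process startup.
func (r *Reconciler) ResetOrphanTimers(instances []*Instance) {
	for _, inst := range instances {
		if inst.Orphaned() {
			inst.OrphanedAt = r.now()
		}
	}
}

// PastGracePeriod reports whether an orphan is old enough for cleanup.
func (r *Reconciler) PastGracePeriod(inst *Instance) bool {
	return inst.Orphaned() && r.now().Sub(inst.OrphanedAt) > orphanGracePeriod
}

func main() {
	r := NewReconciler()
	a := &Instance{ID: "a", Deployment: "redis-a"}
	b := &Instance{ID: "b", Deployment: "redis-b"}
	all := []*Instance{a, b}

	fmt.Println(r.Scan(all, nil, nil), b.Orphaned())                            // false false
	fmt.Println(r.Scan(all, []string{"redis-a"}, nil), b.Orphaned())            // true true
	fmt.Println(r.Scan(all, []string{"redis-a", "redis-b"}, nil), b.Orphaned()) // true false
}
```

The key ordering choice in this sketch is that the "deployment found" check runs before any age check, so an instance orphaned for more than 24 hours is still un-orphaned once its deployment reappears.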

Web UI Fixes

  • Fixed display for new/deleted services: Corrected rendering issues for services in transitional states
  • Fixed manifest display: Resolved bug where manifest data wasn't displaying correctly
  • Fixed timestamp display: Service instance timestamps now display correctly

Files Changed

  • pkg/reconciler/reconciler.go - Orphan detection fixes, startup timer reset
  • pkg/reconciler/synchronizer.go - Un-orphan logic, validation against BOSH deployments
  • internal/handlers/bosh/handler.go - VM monitor status endpoint, stemcell filtering
  • internal/bosh/director.go - Batch director pool
  • ui/* - Web UI enhancements for batch operations

Problem: The BOSH connection pool (default size 4) was shared between general
API operations and batch upgrades. When multiple batch jobs ran simultaneously,
they exhausted the pool, causing API calls such as GetDeployment to fail.

Solution:
- Create BatchDirector with its own semaphore-based connection pool
- Only UpdateDeployment (the blocking operation) uses the batch pool
- Other methods pass through to base director without rate limiting

Additional changes:
- Link max_batch_jobs setting to BatchDirector pool size
- Sync pool with Vault settings at startup
- Add MaxBatchConnections config option (default: 10)
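
A semaphore-based wrapper of the kind described in the solution could look roughly like this. It is a sketch under assumptions, not the contents of internal/bosh/director.go: the two-method `Director` interface, the method signatures, `fakeDirector`, and `runBatch` are hypothetical, and the real director interface is certainly larger. Only the default of 10 comes from the text above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Director is a hypothetical two-method subset of the base BOSH director.
type Director interface {
	GetDeployment(name string) (string, error)
	UpdateDeployment(name, manifest string) error
}

// BatchDirector embeds the base director, so every method passes through
// unthrottled except UpdateDeployment, which is overridden below.
type BatchDirector struct {
	Director
	slots chan struct{} // buffered channel used as a counting semaphore
}

// NewBatchDirector falls back to 10 slots, the stated MaxBatchConnections default.
func NewBatchDirector(base Director, maxConns int) *BatchDirector {
	if maxConns <= 0 {
		maxConns = 10
	}
	return &BatchDirector{Director: base, slots: make(chan struct{}, maxConns)}
}

// UpdateDeployment blocks until a batch slot is free, then delegates.
func (b *BatchDirector) UpdateDeployment(name, manifest string) error {
	b.slots <- struct{}{}        // acquire
	defer func() { <-b.slots }() // release
	return b.Director.UpdateDeployment(name, manifest)
}

// fakeDirector records how many updates run at once (demo only).
type fakeDirector struct {
	mu                      sync.Mutex
	inFlight, peak, updates int
}

func (f *fakeDirector) GetDeployment(name string) (string, error) {
	return "manifest-for-" + name, nil
}

func (f *fakeDirector) UpdateDeployment(name, manifest string) error {
	f.mu.Lock()
	f.inFlight++
	f.updates++
	if f.inFlight > f.peak {
		f.peak = f.inFlight
	}
	f.mu.Unlock()
	time.Sleep(10 * time.Millisecond) // stand-in for a long-running BOSH task
	f.mu.Lock()
	f.inFlight--
	f.mu.Unlock()
	return nil
}

// runBatch upgrades all named deployments concurrently and waits for them.
func runBatch(d Director, names []string) {
	var wg sync.WaitGroup
	for _, n := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			manifest, _ := d.GetDeployment(name) // not rate limited
			_ = d.UpdateDeployment(name, manifest)
		}(n)
	}
	wg.Wait()
}

func main() {
	fake := &fakeDirector{}
	bd := NewBatchDirector(fake, 2)
	runBatch(bd, []string{"a", "b", "c", "d", "e", "f"})
	fmt.Println("updates:", fake.updates, "peak concurrency:", fake.peak)
}
```

In this sketch, six upgrades are submitted at once but the peak number of concurrent UpdateDeployment calls never exceeds the pool size of 2, while GetDeployment calls are never queued behind them.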
@krutten left a comment

Additional functionality looks good. Addition to Mocks present and abstraction seems clean. LGTM

@krutten merged commit aff56b0 into blacksmith-community:main on Feb 13, 2026
2 checks passed