Skip to content

feat: backward and forward LET testing + runbook + tool #1502

Open
arnaubennassar wants to merge 25 commits intodevelopfrom
feat/back-and-for-let
Open

feat: backward and forward LET testing + runbook + tool #1502
arnaubennassar wants to merge 25 commits intodevelopfrom
feat/back-and-for-let

Conversation

@arnaubennassar
Copy link
Collaborator

@arnaubennassar arnaubennassar commented Mar 2, 2026

🔄 Changes Summary

  • Added support for debugging aggkit using VSCode on the E2E envs
  • BackwardLET/ForwardLET bug fixes: Fixed two bugs in bridgesync/processor.go that caused the L2 bridge syncer to halt during BackwardLET+ForwardLET recovery: off-by-one in
    handleForwardLETEvent (PreviousDepositCount is already the next leaf index, no +1 needed) and incorrect > vs >= in archiveAndDeleteBridgesAbove (bridges at exactly DC=N must also be
    archived after a BackwardLET to DC=N).
  • BackwardLET to DC=0: Special-cased BackwardToIndex for DC=0 to use exitTree.Reorg(0) (full clear) since BackwardToIndex(-1) is invalid.
  • GetBridgeByDepositCount query fix: Removed AND origin_network = 0 filter — deposit_count is a unique monotonic counter per bridge event and filtering by origin_network=0 incorrectly
    excluded L2-native token bridges.
  • Empty LER sanity check: Added special-case handling in sanityCheckLatestLER and handleForwardLETEvent for the empty LER (empty Merkle tree root maps to 0x00...0, not to the standard
    empty hash).
  • aggsender fromBlock derivation fix: flow_base.go now re-derives fromBlock from on-chain/bridgesync data for settled certificates using the same logic as the local validator, preventing
    stale ToBlock=0 values (e.g. from the debug endpoint) from breaking subsequent certs.
  • Debug SendCertificate RPC endpoint: Added opt-in aggsender_debugSendCertificate RPC method for manual recovery; when enabled, the normal cert-sending loop is disabled. Gated behind
    EnableDebugSendCertificate config flag (default false; never enable in production).
  • New aggsender_getCertificateBridgeExits RPC: Returns stored bridge exits for a given certificate height, used by the recovery tool to reconstruct certificates without agglayer access.
  • BackwardForwardLET recovery tool (tools/backward_forward_let/): New CLI tool that diagnoses BackwardLET/ForwardLET divergences and drives automated recovery by replaying bridge exits
    from the aggsender DB or agglayer admin API.
  • E2E tests (test/e2e/backwardforwardlet_test.go): 5 new E2E tests covering BackwardLET/ForwardLET scenarios (no divergence, Case 1–4, and aggsender API fallback).
  • MintableERC20 test contract: Added for use in BFL E2E tests that need L2-native token bridges.

⚠️ Breaking Changes

  • 🛠️ Config: Two new optional fields added to [AggSender] in TOML — no breaking change if omitted (defaults to false/zero address).

📋 Config Updates

  [AggSender]
  # Optional — default false. NEVER enable in production.
  EnableDebugSendCertificate = false
  # Address whose signature is required to use the debug endpoint.
  DebugSendCertificateAuthAddress = "0x0000000000000000000000000000000000000000"

✅ Testing

  • 🤖 Automatic:
    • TestBackwardForwardLET_NoDivergence — baseline: normal operation, no LET events
    • TestBackwardForwardLET_Case1 — single BackwardLET then ForwardLET, no divergence
    • TestBackwardForwardLET_Case2 — BackwardLET to DC=0 (full tree wipe) then ForwardLET
    • TestBackwardForwardLET_Case3 — multiple BackwardLET/ForwardLET cycles
    • TestBackwardForwardLET_Case4 — divergence requiring the recovery tool
    • TestBackwardForwardLET_AggsenderAPIFallback — recovery with empty aggsender DB, falling back to agglayer admin API

🐞 Issues

🔗 Related PRs

📝 Notes

  • The debug SendCertificate endpoint is intended for manual recovery only (BackwardLET/ForwardLET scenarios where the aggsender is stuck). When EnableDebugSendCertificate = true, the
    normal sending loop is fully disabled — this is intentional to prevent conflicts between the operator's manual certificates and the automatic sender.
  • docs/backward_forward_let_runbook.md and tools/backward_forward_let/RECOVERY_PROCEDURE.md contain full operator runbooks.
  • The deposit_count uniqueness assumption (used to drop origin_network = 0 from GetBridgeByDepositCount) holds because the L1 escape hatch contract assigns a strictly monotonic global
    counter regardless of token type.

@arnaubennassar arnaubennassar marked this pull request as draft March 2, 2026 10:05
@arnaubennassar arnaubennassar self-assigned this Mar 2, 2026
@arnaubennassar arnaubennassar changed the title Feat/back and for let feat: backward and forward LET testing + runbook + tool Mar 2, 2026
@agglayer agglayer deleted a comment from claude bot Mar 5, 2026
@claude
Copy link

claude bot commented Mar 6, 2026

Claude finished @arnaubennassar's task in 3m 30s —— View job


Code Review Complete


🚀 What's New

This PR introduces comprehensive BackwardLET/ForwardLET recovery functionality for handling Local Exit Tree (LET) divergences between L2 and L1. It includes critical bug fixes in bridgesync, a new recovery tool with automated diagnosis, debug RPC endpoints for manual certificate submission, and extensive E2E test coverage across 6 test scenarios.

🐛 Bug Fixes

  • Off-by-one fix in handleForwardLETEvent (bridgesync/processor.go:1935): PreviousDepositCount already represents the next leaf index, no +1 needed
  • Comparison fix in archiveAndDeleteBridgesAbove (bridgesync/processor.go:1838): Changed from > to >= so bridges at exactly DC=N are archived after BackwardLET to DC=N
  • BackwardLET to DC=0 special case (bridgesync/processor.go:1749-1753): Uses exitTree.Reorg(0) for full tree clear instead of invalid BackwardToIndex(-1)
  • GetBridgeByDepositCount query fix: Removed incorrect AND origin_network = 0 filter that excluded L2-native token bridges
  • Empty LER handling: Added special-case logic for empty Merkle tree root (0x00...0) in sanityCheckLatestLER and handleForwardLETEvent
  • aggsender fromBlock derivation: Re-derives fromBlock from on-chain/bridgesync data for settled certificates to prevent stale ToBlock=0 values from breaking subsequent certs

📋 Config Updates

[AggSender]
# Optional — default false. NEVER enable in production.
EnableDebugSendCertificate = false
# Address whose signature is required to use the debug endpoint (default: 0x00...0).
DebugSendCertificateAuthAddress = "0x0000000000000000000000000000000000000000"

⚠️ Breaking Changes

None — the new config fields are optional with safe defaults.


Code Review Feedback

🔴 Required Issues

1. Security: Debug endpoint authorization check incomplete

File: aggsender/rpc/aggsender_rpc.go:164-166

The debug endpoint checks if the signer matches debugAuthAddress, but if debugAuthAddress is set to the zero address (the default), any signature will be accepted since crypto.PubkeyToAddress will never return the zero address.

if signer != b.debugAuthAddress {
    return nil, rpc.NewRPCError(rpc.DefaultErrorCode,
        fmt.Sprintf("unauthorized: signer %s does not match auth address %s", signer.Hex(), b.debugAuthAddress.Hex()))
}

Impact: With the default config, any attacker can send arbitrary certificates to the AggLayer.

Recommendation: Add validation that rejects requests when debugAuthAddress == common.Address{}:

if b.debugAuthAddress == (ethCommon.Address{}) {
    return nil, rpc.NewRPCError(rpc.DefaultErrorCode, 
        "debug endpoint requires DebugSendCertificateAuthAddress to be configured")
}

Fix this →

2. Security: Signature malleability not handled

File: aggsender/rpc/aggsender_rpc.go:159

The code uses crypto.SigToPub without checking the signature's v value for malleability. Ethereum signatures can be malleable (v can be 27, 28, or higher values for EIP-155).

Recommendation: Use crypto.Ecrecover instead or validate that v is in the expected range (27-28 for non-EIP-155, or use EIP-191 for signing messages). Also consider using eth_sign message format with proper prefix.

Fix this →

3. Critical: Off-by-one in handleForwardLETEvent deposit count initialization

File: bridgesync/processor.go:1933-1936

var newDepositCount uint32
if event.PreviousRoot != bridgesynctypes.EmptyLER {
    newDepositCount = uint32(event.PreviousDepositCount.Uint64())
}

The PR description states "PreviousDepositCount is already the next leaf index, no +1 needed", but this code does not add 1 when the tree is empty (PreviousRoot == EmptyLER). This seems correct for the empty tree case (start at 0), but the comment/logic should be clearer.

Issue: If the first ForwardLET after genesis has PreviousRoot=EmptyLER and PreviousDepositCount=0, the code will start newDepositCount=0 (correct). But if PreviousDepositCount is actually meant to be the "previous" count (not "next"), then starting at 0 would mean the first bridge gets DC=0, which conflicts with the fix description.

Recommendation: Add explicit comment clarifying the semantics and verify with a test case for ForwardLET after genesis (DC=0→N).

4. Data Integrity: archiveAndDeleteBridgesAbove >= change needs verification

File: bridgesync/processor.go:1838

Changed from WHERE deposit_count > $1 to WHERE deposit_count >= $1. The PR description says "bridges at exactly DC=N must also be archived after a BackwardLET to DC=N."

Concern: After BackwardLET to DC=N, the tree should contain leaves 0..N-1 (N total leaves). Bridge at DC=N would be the (N+1)th bridge, which should indeed be removed. But if the tree is supposed to have N leaves (0..N-1), then DC=N is the next deposit to be added, not an existing one.

Verification needed: Confirm that "BackwardLET to DC=N" means "tree has N leaves (0..N-1)" vs "tree has leaves up to index N (N+1 total)". The code comment at line 1745-1746 says "BackwardToIndex(N) keeps positions 0..N (N+1 leaves)", which suggests the former interpretation is correct.

🟡 Suggested Improvements

5. Code Quality: Inconsistent error handling in recovery tool

File: tools/backward_forward_let/recovery.go:54-56

defer func() {
    if deactivateErr := stepDeactivateEmergency(ctx, env, unpauserAuth, callOpts); deactivateErr != nil {
        fmt.Printf("WARNING: failed to deactivate emergency state: %v\n", deactivateErr)
    }
}()

Emergency state deactivation failure is only logged but not propagated. If the recovery succeeds but emergency state remains active, the bridge is unusable.

Recommendation: Return an aggregated error or at least panic so operators know to manually deactivate.

6. Test Coverage: Missing test for DC=0 BackwardLET

Files: bridgesync/processor_test.go, test/e2e/backwardforwardlet_test.go

The PR description mentions "BackwardLET to DC=0: Special-cased BackwardToIndex for DC=0 to use exitTree.Reorg(0)" but I don't see a unit test specifically for this edge case in processor_test.go. TestBackwardForwardLET_Case2 covers it in E2E, but a unit test would be valuable.

Recommendation: Add unit test: TestProcessor_BackwardLET_ToZero that verifies:

  • Reorg(0) is called instead of BackwardToIndex
  • All bridges are archived
  • Tree is fully cleared

7. Performance: Recovery tool could batch bridge fetches

File: tools/backward_forward_let/helpers.go

fetchL2LeafHashesUpTo and similar functions fetch bridges one-by-one. For large deposit counts, this is inefficient.

Recommendation: Add a batched bridge fetch API or query multiple deposit counts in parallel.

8. Documentation: Empty LER value needs clarification

Files: bridgesync/types/types.go, agglayer/types/types.go

The PR introduces special handling for "empty LER" (0x27ae5ba08d7291c96c8cbddcc148bf48a6d68c7974b94356f53754ef6171d757) vs zero hash (0x00...0). This distinction is subtle and not well-documented.

Recommendation: Add package-level documentation explaining:

  • Why empty LER != zero hash
  • When each is used
  • How they map to Merkle tree states

9. Logging: Debug endpoint usage should be audited

File: aggsender/rpc/aggsender_rpc.go:169-196

When debug endpoint is used, the certificate is sent to AggLayer and stored in DB, but there's only a logger.Warnf if storage fails. This should be logged at INFO level for audit trail.

Recommendation: Add structured logging at INFO level when debug endpoint is called:

b.logger.Infof("Debug endpoint: sending certificate height=%d certID=%s signer=%s", 
    signedRequest.Certificate.Height, certHash.Hex(), signer.Hex())

10. Config Validation: Debug endpoint auth address should be validated

File: aggsender/config/config.go:189-212

The Validate() function doesn't check if EnableDebugSendCertificate=true while DebugSendCertificateAuthAddress is zero.

Recommendation: Add validation:

if c.EnableDebugSendCertificate && c.DebugSendCertificateAuthAddress == (ethCommon.Address{}) {
    return fmt.Errorf("DebugSendCertificateAuthAddress must be set when EnableDebugSendCertificate is enabled")
}

🟢 Optional Improvements

11. Code Style: Use constants for magic numbers

File: bridgesync/processor.go:1749

if leafIndex == 0 {
    if err := p.exitTree.Reorg(tx, 0); err != nil {

The value 0 for Reorg is used to mean "delete all". Consider a named constant like ReorgFullClear = 0.

12. Test Quality: E2E test timeouts are aggressive

File: test/e2e/backwardforwardlet_test.go:33-35

const (
    bflCertSettleTimeout = 2 * time.Minute
    bflRestartTimeout    = 2 * time.Minute
    bflBridgeIndexWait   = 2 * time.Minute
)

2 minutes might be too short for CI environments under load. Consider making these configurable or increasing to 5 minutes.

13. Documentation: MANUAL_RECOVERY_PLAN.md should be removed

File: MANUAL_RECOVERY_PLAN.md

This 673-line file appears to be developer notes rather than production documentation. The formal runbook is in docs/backward_forward_let_runbook.md.

Recommendation: Delete MANUAL_RECOVERY_PLAN.md or move it to docs/archive/ if needed for reference.

14. Code Organization: coverage_aggsender.out should be gitignored

File: coverage_aggsender.out

This 2357-line coverage file was committed by accident.

Recommendation: Remove from git and add to .gitignore:

git rm coverage_aggsender.out
echo "coverage*.out" >> .gitignore

15. Code Organization: TODO.md should not be in repo root

File: TODO.md

This file with 16 lines of todos should either be converted to GitHub issues or removed.


Summary

This is a critical bug fix PR that addresses serious LET divergence issues. The core fixes are sound, but the debug endpoint has significant security issues that must be addressed before merge.

Must Fix Before Merge (🔴):

  1. Add zero-address check for debug auth address
  2. Fix signature malleability in debug endpoint
  3. Clarify/verify ForwardLET deposit count initialization logic
  4. Verify >= vs > semantics in archiveAndDeleteBridgesAbove

Recommended (🟡):

5-10: Error handling, test coverage, performance, documentation improvements

Nice to Have (🟢):

11-15: Code cleanup, style improvements

Overall Assessment: ⚠️ Conditional Approve — Core functionality is excellent, but security issues in the debug endpoint must be fixed. The debug endpoint should never be enabled in production, but defense-in-depth requires proper validation even for "never enable" features.


| Branch: feat/back-and-for-let

arnaubennassar and others added 7 commits March 6, 2026 09:47
- Add nolint:dogsled for triple blank identifier in ComputeBackwardLETParams test
- Add nolint:dupl for structurally similar test functions in rpcclient and helpers_test
- Fix gci import ordering (stdlib → external/internal sorted)
- Extract repeated address literals into constants to fix goconst
- Add nolint:lll for long interface method signatures in run.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arnaubennassar arnaubennassar marked this pull request as ready for review March 6, 2026 14:45
@@ -0,0 +1,18 @@
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this file is local, must not be part of git (in my case I have my own configurations there that will be overwritten)

@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 6, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants