
Fuzz Network Test Framework #326

Merged
yacovm merged 17 commits into main from random-network on Mar 20, 2026

Conversation

samliok (Collaborator) commented Jan 14, 2026

No description provided.

@samliok samliok changed the base branch from main to refactor-testutil January 20, 2026 22:21
@samliok samliok marked this pull request as ready for review January 21, 2026 20:05
@samliok samliok linked an issue Jan 22, 2026 that may be closed by this pull request
@yacovm yacovm mentioned this pull request Feb 2, 2026
}

func (n *Network) StartInstances() {
panic("Call Run() Instead")
Collaborator:

This method isn't used. Why is it needed?

samliok (Author):

Because BasicNetwork is embedded in Network, I don't want the caller accidentally calling StartInstances thinking it's a valid method.
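The shadowing trick being discussed can be sketched as follows. This is a minimal illustration of Go method promotion, not the framework's actual types: `BasicNetwork` here is a stand-in.

```go
package main

import "fmt"

// BasicNetwork is a stand-in for the embedded type; names here are
// illustrative, not the framework's actual API.
type BasicNetwork struct{}

func (b *BasicNetwork) StartInstances() {
	fmt.Println("instances started")
}

// Network embeds BasicNetwork, so BasicNetwork's methods are promoted
// onto Network. Shadowing StartInstances with a panicking override
// keeps callers from invoking the promoted method by accident.
type Network struct {
	BasicNetwork
}

func (n *Network) StartInstances() {
	panic("Call Run() Instead")
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	n := &Network{}
	n.StartInstances() // hits the shadowing method, not the embedded one
}
```

Note the embedded method is still reachable explicitly via `n.BasicNetwork.StartInstances()`; the override only blocks the promoted call path.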

yacovm (Collaborator) commented Feb 2, 2026

Can we rebase on top of main?

Base automatically changed from refactor-testutil to main February 3, 2026 16:07
yacovm (Collaborator) commented Feb 25, 2026

This is still non-deterministic to some degree:

When I create a digest of the crashed and recovered node IDs in the schedule I get two possible outcomes: 9b9336fc010a3ccf10d41bb4344e36f2b2ec5c1c2f3940c76a4994005c1560c8 and 753a40743c43bf352998aad8d5a12df4df01fcd001d0baff1d513f3e85741a20

Patch is below:

diff --git a/testutil/random_network/network.go b/testutil/random_network/network.go
index e381fc0..188c7e9 100644
--- a/testutil/random_network/network.go
+++ b/testutil/random_network/network.go
@@ -4,6 +4,8 @@
 package random_network
 
 import (
+       "crypto/sha256"
+       "encoding/hex"
        "math/rand"
        "sync"
        "testing"
@@ -27,6 +29,9 @@ type Network struct {
 
        // tx stats
        allIssuedTxs int
+
+       recoveredNodes []simplex.NodeID
+       crashedNodes   []simplex.NodeID
 }
 
 func NewNetwork(config *FuzzConfig, t *testing.T) *Network {
@@ -224,6 +229,7 @@ func (n *Network) crashAndRecoverNodes() {
                        // randomly decide to recover based on NodeRecoverPercentage
                        if n.randomness.Float64() < n.config.NodeRecoverPercentage {
                                n.startNode(i)
+                               n.recoveredNodes = append(n.recoveredNodes, node.E.ID)
                                recoveredNodes = append(recoveredNodes, node.E.ID)
                                maxLeftToCrash++
                        }
@@ -294,6 +300,18 @@ func (n *Network) PrintStatus() {
                node.PrintMessageTypesSent()
        }
 
+       digest := sha256.New()
+
+       for _, node := range n.crashedNodes {
+               digest.Write([]byte(node.String()))
+       }
+
+       for _, node := range n.recoveredNodes {
+               digest.Write([]byte(node.String()))
+       }
+
+       n.logger.Info("Crashed and Recovered Nodes Digest", zap.String("digest", hex.EncodeToString(digest.Sum(nil))))
+
        // prints total issued txs and failed txs
        n.logger.Info("Transaction Stats", zap.Int("total issued txs", n.allIssuedTxs))
 }
@@ -302,6 +320,7 @@ func (n *Network) crashNode(idx int) {
        n.logger.Debug("Crashing node", zap.Stringer("nodeID", n.nodes[idx].E.ID))
        n.nodes[idx].isCrashed.Store(true)
        n.nodes[idx].Stop()
+       n.crashedNodes = append(n.crashedNodes, n.nodes[idx].E.ID)
 }
 
 func (n *Network) startNode(idx int) {

When we have 9b9336fc010a3ccf10d41bb4344e36f2b2ec5c1c2f3940c76a4994005c1560c8 we have 50 or 51 iterations of the loop.

When we have 753a40743c43bf352998aad8d5a12df4df01fcd001d0baff1d513f3e85741a20 we have 49 iterations of the loop.

Can we make this fully deterministic? 🙏

It looks like we're very close to having it deterministic.
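The divergence described above is characteristic of a seeded PRG whose draw count varies between runs: once one run consumes an extra draw, every subsequent value differs. A minimal illustration, with `schedule` as a hypothetical helper rather than the framework's code:

```go
package main

import (
	"fmt"
	"math/rand"
)

// schedule seeds a PRG, burns extraDraws data-dependent draws, then
// returns the next three values. Runs that burn a different number of
// draws diverge from that point on, which is exactly the
// 50-vs-51-iteration effect producing two possible digests.
func schedule(seed int64, extraDraws int) []float64 {
	r := rand.New(rand.NewSource(seed))
	for i := 0; i < extraDraws; i++ {
		r.Float64() // a data-dependent draw shifts the whole stream
	}
	out := make([]float64, 3)
	for i := range out {
		out[i] = r.Float64()
	}
	return out
}

func main() {
	fmt.Println(schedule(42, 0))
	fmt.Println(schedule(42, 1)) // same seed, stream shifted by one draw
}
```

Fixing the number of PRG calls per step, independent of how many loop iterations happen, is what restores a single reproducible schedule.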


}

func (m *Mempool) moveTxsToUnaccepted(block *Block) {
Collaborator:

Why do we need to do this? We only delete from unacceptedTxs in AcceptBlock, so if a transaction is in unacceptedTxs in a child block or in a grandchild block, it won't be removed from it anyway.
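The invariant the comment relies on can be sketched as follows. This is a heavily simplified stand-in; the Mempool fields and method names here are assumptions, not the framework's actual code:

```go
package main

import "fmt"

// Minimal mempool sketch: transactions sit in unacceptedTxs until a
// block containing them is accepted, and AcceptBlock is the only
// place that deletes from unacceptedTxs.
type Block struct{ txIDs []string }

type Mempool struct {
	unacceptedTxs map[string]struct{}
}

func (m *Mempool) AddTx(id string) { m.unacceptedTxs[id] = struct{}{} }

func (m *Mempool) AcceptBlock(b *Block) {
	for _, id := range b.txIDs {
		delete(m.unacceptedTxs, id) // sole removal point
	}
}

func main() {
	m := &Mempool{unacceptedTxs: map[string]struct{}{}}
	m.AddTx("tx1")
	m.AddTx("tx2")
	m.AcceptBlock(&Block{txIDs: []string{"tx1"}})
	fmt.Println(len(m.unacceptedTxs)) // tx2 stays until its block is accepted
}
```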

node.mempool.lock.Lock()
n.logger.Debug("Not all txs accepted yet by node", zap.Stringer("nodeID", node.E.ID), zap.Int("unaccepted txs in mempool", len(node.mempool.unacceptedTxs)))
node.mempool.lock.Unlock()
allAccepted = false
Collaborator:

we can just break here

samliok (Author):

Why can we break here? This is when allAccepted is false, and we want to break when all are accepted.
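For clarity, the break being suggested exits the node scan once one node with unaccepted txs is found: at that point the round's answer is already known, so checking the remaining nodes is wasted work. A minimal sketch with hypothetical names (`node`, `unacceptedTxs` here are stand-ins):

```go
package main

import "fmt"

// node is a stand-in type; field names are illustrative.
type node struct{ unacceptedTxs int }

// allTxsAccepted returns early once any node still holds unaccepted
// txs, rather than scanning the rest. This expresses
// "allAccepted = false; break" as an early return.
func allTxsAccepted(nodes []node) bool {
	for _, n := range nodes {
		if n.unacceptedTxs > 0 {
			return false // equivalent to: allAccepted = false; break
		}
	}
	return true
}

func main() {
	fmt.Println(allTxsAccepted([]node{{0}, {3}, {0}}))
	fmt.Println(allTxsAccepted([]node{{0}, {0}}))
}
```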

return minHeightNodeID
}

func (n *Network) recoverToHeight(height uint64) {
Collaborator:

The way below, we are more deterministic (written by me, not thanks to CC):

func (n *Network) recoverToHeight(height uint64) {
	// Decide which crashed nodes to recover and in what order.
	// This keeps all randomness calls deterministic (fixed count per invocation),
	// independent of the non-deterministic AdvanceTime loop below.
	var nodesToRecover []int
	for i, node := range n.nodes {
		if node.isCrashed.Load() {
			nodesToRecover = append(nodesToRecover, i)
		}
	}
	n.randomness.Shuffle(len(nodesToRecover), func(i, j int) {
		nodesToRecover[i], nodesToRecover[j] = nodesToRecover[j], nodesToRecover[i]
	})

	// Recover nodes one at a time in the chosen order, advancing time until
	// all nodes have caught up before recovering the next one.
	for _, idx := range nodesToRecover {
		n.logger.Debug("Recovering node", zap.Stringer("nodeID", n.nodes[idx].E.ID))
		n.startNode(idx)

		for n.nodes[idx].storage.NumBlocks() < height {
			n.logger.Debug("Advancing network time", zap.Uint64("num crashed nodes", n.numCrashedNodes()),
				zap.Uint64("min height", n.getMinHeight()),
				zap.Uint64("max height", n.getMaxHeight()),
				zap.Stringer("Smallest node ID", n.getMinHeightNodeID()),
				zap.Uint64("target height", height),
			)

			for _, node := range n.nodes {
				n.lock.Lock()
				node.AdvanceTime(n.config.AdvanceTimeTickAmount)
				n.lock.Unlock()
			}
		}
	}

	// Advance time until all remaining nodes reach the target height.
	for n.getMinHeight() < height {
		n.logger.Debug("Advancing network time", zap.Uint64("num crashed nodes", n.numCrashedNodes()),
			zap.Uint64("min height", n.getMinHeight()),
			zap.Uint64("max height", n.getMaxHeight()),
			zap.Stringer("Smallest node ID", n.getMinHeightNodeID()),
			zap.Uint64("target height", height),
		)

		n.lock.Lock()
		n.BasicInMemoryNetwork.AdvanceTime(n.config.AdvanceTimeTickAmount)
		n.lock.Unlock()

	}
}

samliok (Author):

for _, node := range n.nodes {
n.lock.Lock()
node.AdvanceTime(n.config.AdvanceTimeTickAmount)
n.lock.Unlock()
}

Why not just advance the time using

		n.BasicInMemoryNetwork.AdvanceTime(n.config.AdvanceTimeTickAmount)

Also, I think this change is good; can we add it as a follow-up though?

Collaborator:

sure

yacovm (Collaborator) commented Feb 25, 2026

will continue the review tomorrow

return &TX{ID: idArr, shouldFailVerification: aTX.ShouldFailVerification}
}

func CreateNewTX() *TX {
Collaborator:

Does the randomness here have any influence on the outcome? It sure has influence on the block hashes.

Should we perhaps inject randomness from our PRG?

samliok (Author):

Eh, I think it's fine not to include our PRG here. I've found block hashes aren't that helpful to keep deterministic, imo, at least when I'm debugging.

}

// go through any blocks that build off of this one and move their txs
func (m *Mempool) purgeChildren(block *Block) {
Collaborator:

aren't we missing a call to m.moveTxsToUnaccepted(block) ?

Because what about the transactions of the sibling block itself?

samliok (Author):

Yep, I think this commit fixes that issue: b2f99b1

m.lock.Lock()
defer m.lock.Unlock()

// Ensure the block has not already been verified or accepted
Collaborator:

I'm wondering if we should fail the test if we encounter this and the below... not just return an error. Thoughts?

samliok (Author):

Added error logs, which should fail the test.

yacovm (Collaborator) commented Mar 9, 2026

Will make another pass tomorrow, I think that's enough comments for now.

return n.BasicNode.HandleMessage(&msgCopy, from)
}

// copyMessage creates a copy of the message and its relevant fields to avoid mutating shared state in the in-memory network
Collaborator:

so basically we just need to assign our own mempool, is that it?

samliok (Author):

yep

yacovm (Collaborator) commented Mar 11, 2026

I think that's all comments I have for now.

@yacovm yacovm merged commit afe0932 into main Mar 20, 2026
5 checks passed


Development

Successfully merging this pull request may close these issues.

Create a testing framework that generates random test cases

2 participants