DAOS-18593 test: replace sleep with retry in rebuild/interactive.py#17559

Open
daltonbohning wants to merge 3 commits into master from dbohning/daos-18593

Conversation

@daltonbohning
Contributor

@daltonbohning daltonbohning commented Feb 13, 2026

Replace the arbitrary sleep with a retry on the expected DER_NONEXIST error.
This mitigates a race condition in which dmg pool query reports rebuild as busy
before the rebuild has actually started.
When dmg pool rebuild stop fails with DER_NONEXIST, the test simply waits and retries.

Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@daltonbohning daltonbohning self-assigned this Feb 13, 2026
@github-actions

Ticket title is 'rebuild/interactive.py: remove arbitrary sleep'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18593

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17559/1/execution/node/791/log

Replace arbitrary sleep with a retry on expected DER_NONEXIST.

Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
@daltonbohning
Contributor Author

daltonbohning commented Feb 18, 2026

This change mitigates a race condition in which dmg pool query reports rebuild as busy before the rebuild has actually started. When dmg pool rebuild stop fails with DER_NONEXIST, the test simply waits and retries.
The race condition is hard to reproduce, so for testing purposes I removed the rebuild busy detection to verify that the DER_NONEXIST handling works.

This sample run shows the DER_NONEXIST handling working:
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17559/3/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/rebuild/interactive.py/repeat001/job.log

2026-02-17 22:18:56,586 process          L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1'
2026-02-17 22:18:56,722 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.722631 main.go:228: debug output enabled
2026-02-17 22:18:56,723 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.723297 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2026-02-17 22:18:56,730 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.729999 rpc.go:278: request hosts: [hdr-112:10001 hdr-113:10001 hdr-114:10001 hdr-115:10001 hdr-116:10001]
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.779013 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.779179 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:56.779235 pool.go:1084: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,779 process          L0416 DEBUG| [stderr] ERROR: dmg: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist
2026-02-17 22:18:56,782 process          L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1' finished with 1 after 0.19339227676391602s
2026-02-17 22:18:56,802 general_utils    L0176 INFO | Error occurred running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1': Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1' failed.
stdout: b''
stderr: b'DEBUG 2026/02/17 22:18:56.722631 main.go:228: debug output enabled\nDEBUG 2026/02/17 22:18:56.723297 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml\nDEBUG 2026/02/17 22:18:56.729999 rpc.go:278: request hosts: [hdr-112:10001 hdr-113:10001 hdr-114:10001 hdr-115:10001 hdr-116:10001]\nDEBUG 2026/02/17 22:18:56.779013 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist\nDEBUG 2026/02/17 22:18:56.779179 response.go:179: hdr-112:10001: err: DER_NONEXIST(-1005): The specified entity does not exist\nDEBUG 2026/02/17 22:18:56.779235 pool.go:1084: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist\nERROR: dmg: pool-rebuild stop failed: DER_NONEXIST(-1005): The specified entity does not exist\n'
additional_info: None
2026-02-17 22:18:56,802 interactive      L0107 INFO | Assuming rebuild is not started yet. Retrying in 3 seconds...
2026-02-17 22:18:59,805 command_utils_ba L0203 DEBUG| Updated param pool => TestPool_1
2026-02-17 22:18:59,805 command_utils_ba L0203 DEBUG| Updated param force => False
2026-02-17 22:18:59,806 general_utils    L0151 INFO | Command environment vars:
  {}
2026-02-17 22:18:59,806 process          L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1'
2026-02-17 22:18:59,945 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.944884 main.go:228: debug output enabled
2026-02-17 22:18:59,946 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.946206 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2026-02-17 22:18:59,950 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.950520 rpc.go:278: request hosts: [hdr-112:10001 hdr-114:10001 hdr-116:10001 hdr-117:10001 hdr-118:10001]
2026-02-17 22:18:59,998 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.998910 response.go:179: hdr-112:10001: *mgmt.DaosResp status:DER_SUCCESS(0): Success
2026-02-17 22:18:59,999 process          L0416 DEBUG| [stderr] DEBUG 2026/02/17 22:18:59.999101 pool.go:1091: Pool-rebuild stop request succeeded
2026-02-17 22:18:59,999 process          L0416 DEBUG| [stdout] Pool-rebuild stop request succeeded
2026-02-17 22:19:01,002 process          L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d pool rebuild stop TestPool_1' finished with 0 after 1.1937801837921143s
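
For readers who want the pattern in isolation, the retry visible in the log above could be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `CommandFailure`, `stop_rebuild_with_retry`, the `stop_rebuild` callable, and the default of 5 attempts are all hypothetical names and values. Only the DER_NONEXIST check and the 3-second wait are taken from the log.

```python
import time


class CommandFailure(Exception):
    """Hypothetical stand-in for the error raised when a dmg command fails."""


def stop_rebuild_with_retry(stop_rebuild, retries=5, interval=3, sleep=time.sleep):
    """Call stop_rebuild() until it succeeds or the attempts are exhausted.

    stop_rebuild is a hypothetical callable wrapping 'dmg pool rebuild stop'.
    Only a failure mentioning DER_NONEXIST is treated as "rebuild has not
    started yet"; any other failure is re-raised immediately.
    Returns the number of attempts that were made.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            stop_rebuild()
            return attempt
        except CommandFailure as error:
            # Give up on unexpected errors or when no attempts remain.
            if "DER_NONEXIST" not in str(error) or attempt == retries:
                raise
            print(f"Assuming rebuild is not started yet. Retrying in {interval} seconds...")
            sleep(interval)
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the retry path be exercised without real delays, which matters because the race itself is hard to reproduce.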

This reverts commit 7318e84.

Test-repeat: 10
Test-tag: RbldInteractive
Skip-unit-tests: true
Skip-fault-injection-test: true
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
@daltonbohning daltonbohning marked this pull request as ready for review February 18, 2026 21:44
@daltonbohning daltonbohning requested review from a team as code owners February 18, 2026 21:44
Collaborator

@jamesanunez jamesanunez left a comment

I don't think you need 'break'. Please review.
