DAOS-18425 rebuild: coalesce tasks for multi-rank operations #17382

Draft
kccain wants to merge 2 commits into master from kccain/daos_18425

@kccain (Contributor) commented Jan 15, 2026

Many multi-rank failure events, and those initiated by control plane commands (e.g., dmg), are processed rank by rank. This leads to a rapid sequence of pool map updates and rebuild scheduling activities, and can result in serialized rebuilds, one per rank, rather than a single rebuild job that consolidates the related operations. Serialized rebuilds can run more slowly and can confuse an administrator who is monitoring rebuild progress via pool query and may want to use interactive rebuild controls (e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates schedule a rebuild with a 5-second delay, so the resulting entry in rebuild_gst.rg_queue_list specifies a scheduling time (dst_schedule_time) in the near future. In addition, the logic in rebuild_ults() no longer dequeues a rebuild task whose scheduling time is in the future. This allows compatible changes that may be processed imminently to be merged into the queued task by their ds_rebuild_schedule() call.
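
For illustration only, here is a toy C model of the delayed scheduling and merging described above. It is not the code in this PR: the toy_* names, the fixed-size arrays, and the merge rule (same pool and same operation) are assumptions made for the sketch. Only the 5 second delay, the role of dst_schedule_time, and the division of work between ds_rebuild_schedule() and rebuild_ults() come from the description. Whether the real merge also pushes the scheduling time back is not stated, so the sketch leaves it unchanged.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TASKS 8
#define TOY_MAX_RANKS 16
#define TOY_DELAY_SEC 5 /* the 5 second delay from the description */

enum toy_op { TOY_EXCLUDE, TOY_REINT, TOY_DRAIN };

/* Hypothetical stand-in for a queued rebuild task. */
struct toy_task {
	int         pool_id;
	enum toy_op op;
	uint64_t    schedule_time; /* plays the role of dst_schedule_time */
	uint32_t    ranks[TOY_MAX_RANKS];
	size_t      nr_ranks;
};

/* Hypothetical stand-in for rebuild_gst.rg_queue_list. */
struct toy_queue {
	struct toy_task tasks[TOY_MAX_TASKS];
	size_t          nr_tasks;
};

/*
 * Models the scheduling side: queue a rebuild for one rank.
 * If a task for the same pool and operation is still queued, the rank is
 * merged into it (assumed compatibility rule). Otherwise a new task is
 * appended, due at now + TOY_DELAY_SEC.
 * Returns 1 if merged, 0 if a new task was queued, -1 if out of space.
 */
int
toy_schedule(struct toy_queue *q, int pool_id, enum toy_op op, uint32_t rank, uint64_t now)
{
	struct toy_task *t;
	size_t           i;

	for (i = 0; i < q->nr_tasks; i++) {
		t = &q->tasks[i];
		if (t->pool_id == pool_id && t->op == op) {
			if (t->nr_ranks == TOY_MAX_RANKS)
				return -1;
			t->ranks[t->nr_ranks++] = rank;
			return 1;
		}
	}

	if (q->nr_tasks == TOY_MAX_TASKS)
		return -1;

	t                = &q->tasks[q->nr_tasks++];
	t->pool_id       = pool_id;
	t->op            = op;
	t->schedule_time = now + TOY_DELAY_SEC;
	t->ranks[0]      = rank;
	t->nr_ranks      = 1;
	return 0;
}

/*
 * Models the dequeue side: hand out the first task whose scheduling time
 * has arrived, and skip tasks scheduled in the future.
 * Returns 1 and fills *out if a task was dequeued, 0 otherwise.
 */
int
toy_dequeue_ready(struct toy_queue *q, uint64_t now, struct toy_task *out)
{
	size_t i, j;

	for (i = 0; i < q->nr_tasks; i++) {
		if (q->tasks[i].schedule_time > now)
			continue; /* not due yet: leave it queued so it can still merge */
		*out = q->tasks[i];
		for (j = i; j + 1 < q->nr_tasks; j++)
			q->tasks[j] = q->tasks[j + 1];
		q->nr_tasks--;
		return 1;
	}
	return 0;
}
```

In this model, three toy_schedule() calls for ranks 3, 4, and 5 of the same pool at times 100, 101, and 102 leave one queued task. toy_dequeue_ready() returns nothing at time 104 and returns that single task, carrying all three ranks, at time 105.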

Features: rebuild

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration'
Status is 'In Progress'
Labels: 'Rebuild,test_2.8'
https://daosio.atlassian.net/browse/DAOS-18425

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

@daosbuild3 (Collaborator) commented Jan 16, 2026

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/

The rebuild/widely_striped.py failure (pool query timed out against its 5-minute deadline) appears to be an instance of the existing issue https://daosio.atlassian.net/browse/DAOS-18302

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/3/execution/node/468/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/3/testReport/

@kccain force-pushed the kccain/daos_18425 branch from 1f838a6 to 1853800 on February 7, 2026 03:23

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1281/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1322/log

@kccain force-pushed the kccain/daos_18425 branch from 1853800 to 3af9e59 on February 9, 2026 16:31

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1338/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1348/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/10/testReport/

@daosbuild3 (Collaborator)

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/10/execution/node/657/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/456/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/466/log

Many multi-rank failure events, and those initiated by control
plane commands (e.g., dmg), are processed rank by rank. This leads
to a rapid sequence of pool map updates and rebuild scheduling
activities, and can result in serialized rebuilds, one per rank,
rather than a single rebuild job that consolidates the related
operations. Serialized rebuilds can run more slowly and can confuse
an administrator who is monitoring rebuild progress via pool query
and may want to use interactive rebuild controls
(e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates
schedule a rebuild with a 5-second delay, so the resulting entry in
rebuild_gst.rg_queue_list specifies a scheduling time
(dst_schedule_time) in the near future. In addition, the logic in
rebuild_ults() no longer dequeues a rebuild task whose scheduling
time is in the future. This allows compatible changes that may be
processed imminently to be merged into the queued task by their
ds_rebuild_schedule() call.

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
and minor changes to ds_rebuild_admin_stop for rgt==NULL

Features: rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>