DAOS-18425 rebuild: coalesce tasks for multi-rank operations #17382
|
Ticket title is 'interactive rebuild: "dmg system rebuild stop" not working in case of rank reintegration' |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/ |
|
Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/2/testReport/ rebuild/widely_striped.py failure (pool query timed out with 5 minute deadline) seems to be an instance of existing issue https://daosio.atlassian.net/browse/DAOS-18302 |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/3/execution/node/468/log |
|
Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/3/testReport/ |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/4/testReport/ |
|
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/4/execution/node/1013/log |
1f838a6 to 1853800
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/5/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1281/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/5/execution/node/1322/log |
1853800 to 3af9e59
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/7/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1338/log |
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/7/execution/node/1348/log |
3af9e59 to 7f880e3
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/10/testReport/ |
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/10/execution/node/657/log |
7f880e3 to 551ce8e
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/456/log |
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17382/12/execution/node/466/log |
Many multi-rank events, whether caused by failures or initiated by control plane commands (e.g., dmg), are processed rank by rank. This produces a rapid sequence of pool map updates and rebuild scheduling activity, and it can lead to serialized rebuilds, one per rank, rather than a single rebuild job that consolidates the related operations. The result can be slower execution, and confusion for an administrator who is monitoring rebuild progress via pool query and may want to use the interactive rebuild controls (e.g., rebuild stop|start).

With this change, exclude, reintegration, and drain pool map updates schedule a rebuild with a 5 second delay, so the resulting entry in rebuild_gst.rg_queue_list specifies a scheduling time (dst_schedule_time) in the near future. In addition, the logic in rebuild_ults() no longer dequeues a rebuild task whose scheduling time is in the future. This allows compatible changes (that may be processed imminently) to be merged into the queued task by their ds_rebuild_schedule() call.

Features: rebuild
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
and minor changes to ds_rebuild_admin_stop for rgt==NULL

Features: rebuild
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
551ce8e to da370d3
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17382/13/testReport/ |
Many multi-rank events, whether caused by failures or initiated by control plane commands (e.g., dmg), are processed rank by rank. This produces a rapid sequence of pool map updates and rebuild scheduling activity, and it can lead to serialized rebuilds, one per rank, rather than a single rebuild job that consolidates the related operations. The result can be slower execution, and confusion for an administrator who is monitoring rebuild progress via pool query and may want to use the interactive rebuild controls (e.g., rebuild stop|start).
With this change, exclude, reintegration, and drain pool map updates schedule a rebuild with a 5 second delay, so the resulting entry in rebuild_gst.rg_queue_list specifies a scheduling time (dst_schedule_time) in the near future. In addition, the logic in rebuild_ults() no longer dequeues a rebuild task whose scheduling time is in the future. This allows compatible changes (that may be processed imminently) to be merged into the queued task by their ds_rebuild_schedule() call.
Features: rebuild
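For readers who want a mental model of the coalescing described above, here is a minimal sketch in Python. It is not the DAOS implementation: `RebuildTask`, `RebuildQueue`, `schedule()`, and `dequeue_ready()` are hypothetical names that only stand in for the queue entry, rebuild_gst.rg_queue_list, ds_rebuild_schedule(), and the dequeue check in rebuild_ults(). Two behaviors are assumptions made for illustration: tasks are treated as "compatible" when they have the same operation type, and a merge leaves the original scheduling time unchanged.

```python
from dataclasses import dataclass

COALESCE_DELAY_S = 5.0  # the 5 second delay named in the description


@dataclass
class RebuildTask:
    """Hypothetical stand-in for one queued rebuild entry."""
    op: str               # "exclude", "reint", or "drain"
    ranks: set            # ranks covered by this rebuild
    schedule_time: float  # plays the role of dst_schedule_time


class RebuildQueue:
    """Hypothetical stand-in for rebuild_gst.rg_queue_list."""

    def __init__(self):
        self.tasks = []

    def schedule(self, op, rank, now):
        # Models the merge step: if a compatible task is still queued,
        # fold this rank into it instead of queuing a second rebuild.
        # ASSUMPTION: "compatible" means "same operation type".
        for task in self.tasks:
            if task.op == op:
                task.ranks.add(rank)
                return task
        # Otherwise queue a new task that becomes eligible after the delay.
        task = RebuildTask(op, {rank}, now + COALESCE_DELAY_S)
        self.tasks.append(task)
        return task

    def dequeue_ready(self, now):
        # Models the dequeue check: a task whose scheduling time is
        # still in the future stays on the queue.
        for i, task in enumerate(self.tasks):
            if task.schedule_time <= now:
                return self.tasks.pop(i)
        return None


# Three ranks reintegrated in quick succession, processed rank by rank.
q = RebuildQueue()
for i, rank in enumerate([3, 4, 5]):
    q.schedule("reint", rank, now=100.0 + 0.1 * i)

early = q.dequeue_ready(now=101.0)  # None: the task is not yet eligible
task = q.dequeue_ready(now=105.0)   # one task covering all three ranks
print(early, task.op, sorted(task.ranks))
```

In this model the three per-rank updates arrive inside the delay window, so they end up in one queued task rather than three serialized rebuilds. Without the delay, the first task could be dequeued before the second update arrived, leaving nothing on the queue to merge into.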
Steps for the author:
After all prior steps are complete: