vmm_tests: fix create_io_queue validation in vmm testing #2966
gurasinghMS merged 18 commits into microsoft:main
Conversation
Pull request overview
Fixes the servicing_keepalive_create_io_queue_on_new_cpu VMM test so it reliably exercises and validates NVMe create_io_queue() behavior by ensuring storvsp exposes enough channels for per-CPU IO and by asserting that an IO issuer is created on the targeted CPU.
Changes:
- Configure the VM/VTL2 settings to provide multiple SCSI sub-channels (matching VP count) so IO can be issued on all CPUs.
- Refactor “find CPUs with IO issuers” inspect parsing into a helper and use it to select a target CPU deterministically.
- Add a post-IO assertion that the target CPU gained an IO issuer after pinning IO.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
Updates an OpenHCL servicing regression test to reliably trigger and verify NVMe create_io_queue() after keepalive restore by ensuring per-CPU IO can be driven (via additional storvsp channels) and by asserting the IO-issuer appears for the selected CPU.
Changes:
- Configure the test VM/storage setup to provide enough SCSI sub-channels for per-CPU IO (matching vp_count).
- Refactor CPU-pinned IO and inspect parsing into helper functions, and add an explicit post-IO assertion that the target CPU gained an IO issuer.
Pull request overview
This PR updates an OpenHCL servicing vmm test to actually validate that create_io_queue() runs by forcing I/O on a CPU that initially has no NVMe IO issuer, and adjusts VTL2 SCSI channel settings so the scenario can occur on multi-CPU VMs.
Changes:
- Configure VTL2 SCSI sub-channels to allow I/O across all VPs (enabling per-CPU queue creation coverage).
- Refactor CPU-pinned I/O and NVMe issuer discovery into helper functions and add an assertion that an issuer appears on the targeted CPU.
- Set processor topology (vp_count, vps_per_socket) to make CPU pinning deterministic for this test.
Pull request overview
Improves the reliability of the servicing_keepalive_create_io_queue_on_new_cpu VMM test by ensuring the guest storage stack can create per-CPU IO queues and by explicitly validating that queue/issuer creation occurs after pinned IO.
Changes:
- Configure VTL2 SCSI to use multiple channels (sub-channels) so IO can occur on all vCPUs.
- Refactor the test to (a) pick a CPU without an existing IO issuer via inspect, (b) run CPU-pinned IO, and (c) assert the IO issuer appears afterward.
- Add helper functions to run CPU-pinned IO via cpuset cgroups and to query CPUs with IO issuers via inspect.
```rust
// channel which is always present, so vp_count - 1 yields a total
// of vp_count channels.
v.fixed.as_mut().unwrap().scsi_sub_channels = Some(vp_count - 1);
})
```
This approach of creating the Vtl2FixedSettings struct and defining scsi_sub_channels is hacky, but in the moment it was the only way I was able to get this working. The struct seems to be defined in some proto files, and clicking through took me to the rust_analyzer folder instead of a crate that we own. Maybe I missed something here, though.
Pull request overview
Updates an OpenHCL servicing keepalive VMM test to reliably exercise and validate NVMe create_io_queue() on a CPU that initially lacks an IO issuer, addressing a prior gap where the test didn’t actually validate queue creation due to single-channel storvsp behavior.
Changes:
- Configure the VM topology explicitly (vp_count, vps_per_socket) and adjust the VTL2 SCSI channel configuration to provide one channel per vCPU.
- Refactor CPU-pinned IO into a reusable helper and add a helper to query IO issuer CPUs via inspect.
- Add post-IO validation that the targeted CPU now has an IO issuer (i.e., queue creation occurred).
```rust
// Uses inspect to find and return a vector of issuers in the nvme driver. Proto
// queues are not reported in the returned value.
async fn find_cpus_with_io_issuers(
    vm: &PetriVm<OpenVmmPetriBackend>,
) -> Result<Vec<String>, anyhow::Error> {
    // Query inspect to find which CPUs have IO issuers after boot.
    // Only CPUs with initialized issuers appear in the per_cpu map (unset
    // OnceLock entries are absent). This makes the test deterministic —
    // we guarantee we target a CPU that triggers create_io_queue().
    let devices = vm.inspect_openhcl("vm/nvme/devices", None, None).await?;
    let devices: serde_json::Value = serde_json::from_str(&format!("{}", devices.json()))?;
    let device = devices
        .as_object()
        .expect("devices should be an object")
        .values()
        .next()
        .expect("should have at least one NVMe device");
    let per_cpu = &device["driver"]["driver"]["io_issuers"]["per_cpu"];
    let per_cpu_map = per_cpu.as_object().expect("per_cpu should be an object");
    Ok(per_cpu_map.keys().cloned().collect::<Vec<_>>())
}
```
Any reason to NOT return a Vec<u32>? Vec<u32> seems like a natural container for a list of CPUs (or a set), if one ignores the implementation detail that the numbers come as strings from inspect.
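A minimal sketch of that conversion, with a hypothetical helper name (assuming the inspect keys are decimal CPU indices, which is an assumption rather than anything the inspect API guarantees):

```rust
// Hypothetical helper: convert inspect's string CPU keys into sorted u32s.
// Propagating the parse error surfaces any unexpected non-numeric key
// instead of silently dropping it.
fn parse_cpu_keys(keys: &[&str]) -> Result<Vec<u32>, std::num::ParseIntError> {
    let mut cpus = keys
        .iter()
        .map(|k| k.parse::<u32>())
        .collect::<Result<Vec<_>, _>>()?;
    cpus.sort_unstable();
    Ok(cpus)
}
```

Sorting also makes "pick the first CPU without an issuer" deterministic, which string keys from a JSON map do not guarantee on their own.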
Pull request overview
Updates an OpenHCL servicing vmm test to reliably exercise and validate NVMe create_io_queue() on a CPU that did not previously have an IO issuer, by ensuring storvsp provides enough channels for per-CPU IO and by asserting issuer creation after pinned IO.
Changes:
- Configure processor topology (incl. vps_per_socket: Some(1)) and set scsi_sub_channels to provide one storvsp channel per VP.
- Refactor CPU selection logic into an inspect-based helper and add an assertion that pinned IO results in issuer creation on the targeted CPU.
- Extract cpuset-cgroup pinned IO into a reusable helper used across both servicing cycles.
alandau left a comment
LGTM, thanks! A few nits, mainly the u32 thing. Also, copilot has a lot to say on this PR, feel free to take or ignore its feedback.
Pull request overview
This PR makes the servicing_keepalive_create_io_queue_on_cpu VMM test validate that an IO queue is actually created after pinning IO to a CPU that initially lacks an issuer, and adjusts VTL2/SCSI channel configuration to enable per-CPU IO.
Changes:
- Configure VTL2 SCSI sub-channels so the guest can perform IO on all vCPUs (one channel per CPU).
- Refactor CPU-selection logic into find_cpus_with_io_issuers() and add an assertion that pinning IO results in an issuer on the target CPU.
- Extract cpuset-based pinned IO logic into run_cpu_pinned_io().
Pull request overview
This PR updates the OpenHCL servicing keepalive NVMe vmm test to reliably exercise and validate the NVMe driver’s create_io_queue() path by ensuring StorVSP/SCSI has enough channels to allow IO dispatch across all vCPUs, and by asserting that the targeted CPU gains an IO issuer after pinned IO.
Changes:
- Configure VM topology and VTL2 SCSI settings so the total SCSI channel count matches vp_count (enabling IO on all CPUs).
- Refactor the test logic to use helpers for (a) finding CPUs with NVMe IO issuers via inspect and (b) running cpuset-pinned IO.
- Strengthen validation by asserting inspect reflects IO issuer creation on the targeted CPU after pinned IO.
```rust
// Ensure the agent is responsive after the restart before returning.
agent.ping().await?;
```
A general comment, not for this PR: These 2 lines seem like a worthwhile change for any servicing test. Instead of repeating and synchronizing them between all the tests, do we want to create a mini-framework that handles that and all the other pre- and post- things? E.g. one that either gets a callback that runs the test and does the setup and teardown, or (preferably) something like the function that creates the common config that will do all the other pre- and post- things.
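A rough shape for that idea, as a sketch only: the function name, the generic context, and the plain error type are all placeholders, not the Petri API, and the real version would be async.

```rust
// Hypothetical sketch of the suggested mini-framework: the common pre/post
// steps (e.g. the post-restart agent ping) live in one wrapper, and each
// servicing test supplies only its body as a callback.
fn run_servicing_test<Ctx, E>(
    setup: impl FnOnce() -> Result<Ctx, E>,
    body: impl FnOnce(&mut Ctx) -> Result<(), E>,
    post_checks: impl FnOnce(&mut Ctx) -> Result<(), E>,
) -> Result<(), E> {
    // Shared setup, e.g. building the common VM config.
    let mut ctx = setup()?;
    // The test-specific servicing scenario.
    body(&mut ctx)?;
    // Shared validation/teardown, e.g. agent.ping().await? in the real tests.
    post_checks(&mut ctx)
}
```

The closure-taking variant keeps each test honest about running the shared checks; the alternative mentioned above (a function that builds the common config and registers the pre/post steps) trades that for less nesting at call sites.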
```diff
@@ -1226,7 +1234,7 @@ async fn run_cpu_pinned_io(
     // command didn't complete (the exact production failure scenario).
     let pinned_dd = format!(
         "echo $$ > /sys/fs/cgroup/pin{target_cpu}/cgroup.procs; \
-        dd if={disk_path} of=/dev/null bs=4k count=256 iflag=direct"
+        dd if={disk_path} of=/dev/null bs=4k count=256 iflag=direct status=none"
```
Do we expect the dd to fail? I guess not, but just double checking. If it can fail, status=none would hide the stats dd prints at the end telling how much it was able to copy. IIRC, adding it was copilot's suggestion, so up to you what to do here.
It really should not be failing imo. Unless someone is messing around with the test tbh.
The servicing_keepalive_create_io_queue_on_cpu test does not validate the creation of the IO queue (and previously the queue was not being created, because storvsp was operating on a single channel). This PR adds a total of 4 channels to storvsp, one per CPU, and updates the vmm test to verify queue creation.