Debugging: add builtin gdbstub component.#12771
Debugging: add builtin gdbstub component.#12771cfallin wants to merge 3 commits intobytecodealliance:mainfrom
Conversation
|
This is stacked on top of #12756 until that one lands; only the last commit is new. I haven't added end-to-end tests that spawn/interact with LLDB yet; depending on how that goes I might be able to include that here or might defer to another PR if that's OK. |
d7959df to
fc1f75a
Compare
2719201 to
71bd19d
Compare
71bd19d to
34e9d51
Compare
This adds a debug component that makes use of the debug-main world defined in bytecodealliance#12756 and serves the gdbstub protocol, with Wasm extensions, compatible with LLDB. This component is built and included inside the Wasmtime binary, and is loaded using the lower-level `-D debugger=...` debug-main option; the user doesn't need to specify the `.wasm` adapter component. Instead, the user simply runs `wasmtime run -g <PORT> program.wasm ...` and Wasmtime will load and prepare to run `program.wasm` as the debuggee, waiting for a gdbstub connection on the given TCP port before continuing. The workflow is: ``` $ wasmtime run -g 1234 program.wasm [ wasmtime starts and waits for connection ] $ /opt/wasi-sdk/bin/lldb # use LLDB from wasi-sdk release 32 or later (lldb) process connect --plugin wasm connect://localhost:1234 Process 1 stopped * thread #1, stop reason = signal SIGTRAP frame #0: 0x40000000000001cc -> 0x40000000000001cc: unreachable 0x40000000000001cd: end 0x40000000000001ce: local.get 0 0x40000000000001d0: call 13 (lldb) si Process 1 stopped * thread #1, stop reason = instruction step into frame #0: 0x4000000000000184 -> 0x4000000000000184: block 0x4000000000000186: block 0x4000000000000188: global.get 1 0x400000000000018e: i32.const 3664 [ ... ] ``` This makes use of the `gdbstub` third-party crate, into which I've upstreamed support for the Wasm extensions in daniel5151/gdbstub#188, daniel5151/gdbstub#189, daniel5151/gdbstub#190, and daniel5151/gdbstub#192. (I'll add vets as part of this PR.)
34e9d51 to
c0c1f02
Compare
|
Rebased out #12756; should be good to review now. |
Subscribe to Label Actioncc @fitzgen DetailsThis issue or pull request has been labeled: "wizer"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
| authors.workspace = true | ||
| edition.workspace = true | ||
| license = "Apache-2.0 WITH LLVM-exception" | ||
| publish = false |
There was a problem hiding this comment.
Hm ok here's a sticking point I didn't realize: this makes the wasmtime-cli crate non-publishable because there's a dependency on something that doesn't exist on crates.io. I'm also realizing now that it's not as simple as publishing this crate since it's fundamentally not publishable, it relies on being part of this workspace to build a sibling crate, which isn't present when built as a dep from crates.io.
The "true fix" for this is https://doc.rust-lang.org/nightly/cargo/reference/unstable.html#artifact-dependencies, an unstable Cargo feature. Unfortunately we can't rely on that even in a nightly-conditional context I believe, if we tried to use that it would mean that Wasmtime would always require nightly.
Some possible ideas:
- Check in the gdbstub binary and verify it's built in CI. I suspect it's a bit large and will receive many changes, so not my first choice.
- Download gdbstub from github releases for released wasmtime-cli artifacts. We don't currently have HTTP/network dependencies in the CLI outside of WASI impls, so that'll be hard.
- Publish this crate to crates.io, but handle failures in the
build.rsscript. That'd mean that the gdbstub binary would have to be an off-by-default optional feature which we enable for our release builds but would be off-by-default for crates.io.
Well, "some possible ideas" aka I think randomly in real time until something semi-reasonable pops out... Maybe that last one?
There was a problem hiding this comment.
Oh, hmm, I definitely didn't foresee this one being a problem either!
Question on the last option: what do you mean by "handle failures in build.rs"; in other words, why would such a failure be any different than some other build failure for a crate on crates.io (some of which are e.g. crates that wrap C code and use cc, or do other build-time shenanigans)?
There was a problem hiding this comment.
Sort of two things:
- I forget what cwd is used for
build.rsso I don't know what effectcargo build -p thingwill have when something is published to crates.io. If that tries to build other random crates in a workspace, for example, I think that'd be bad. - I think this would ideally have a better error message than "package
thingnot found" when built from crates.io, aka something like "you can't enable the gdbstub component when wasmtime is built from crates.io" or similar.
Basically, yeah, build-time weirdness is expected, but I'd like to ideally tame it. Another example would be printing a better error if wasm32-wasip2 weren't installed, but rustc does a decent job of this already.
| let wasm = out_dir | ||
| .join("wasm32-wasip2") | ||
| .join("release") | ||
| .join("wasmtime_internal_gdbstub_component.wasm"); |
There was a problem hiding this comment.
To gauge, how big is this binary? Are we talking ~30M or something more like ~1M? Given that this is included in the CLI uncompressed it might be reasonable to try to apply simple size optimizations where possible if it's extra large.
| [dependencies] | ||
| wit-bindgen = { workspace = true, features = ["macros"] } | ||
| anyhow = { workspace = true } | ||
| structopt = { workspace = true } |
There was a problem hiding this comment.
Instead of pulling in structopt, can you use clap? They should be pretty similar and easy to transition between, but my impression is that structopt is more-or-less deprecated in favor of clap
| bail!("-g/--gdb cannot be combined with -Ddebugger="); | ||
| } | ||
| self.run.common.debug.debugger = Some("<built-in gdbstub>".into()); | ||
| self.run.common.debug.arg.push(format!("0.0.0.0:{port}")); |
There was a problem hiding this comment.
To me defaulting to localhost feels more natural, but at the same time if doing so we should probably have the option to handle both 0.0.0.0 and localhost. Maybe the gdbstub_port option is renamed to just gdbstub, and then internally the wasm does parsing to figure out if it's a port or port-and-address?
| } | ||
| self.run.common.debug.debugger = Some("<built-in gdbstub>".into()); | ||
| self.run.common.debug.arg.push(format!("0.0.0.0:{port}")); | ||
| Some(gdbstub_component_artifact::GDBSTUB_COMPONENT.to_vec()) |
There was a problem hiding this comment.
To avoid .to_vec() on a big thing, could the field in RunCommand change to &'static [u8]?
|
Also, to clarify, @cfallin what depth would you like me to review the gdbstub component code itself? I'm happy more-or-less not reviewing it at all in the sense that it's well-sequestered, low-risk, and we'll likely iterate a lot on it in-tree. If you'd prefer though I could give it a closer look in any particular areas of interest. |
I guess my default answer is "to whatever extent allows us to fulfill policy and be comfortable having this code in-repo" :-) I agree that since it's sandboxed, the bar could be lower than for core runtime code. I guess the spirit of our code-review policies is still that someone should give it a once-over -- but up to you how deep you take that! |
| let listener = TcpListener::bind(&self.options.tcp_address) | ||
| .await | ||
| .expect("Could not bind to TCP port"); | ||
|
|
There was a problem hiding this comment.
I forget what QEMU does, but this might be a reasonable place to print "Hey I'm listening on address A.B.C.D:XXXX for a debugger" both to signify that's why the process is halted and also be a sort of "join point" for "if you're a test, now's when you can make your TCP connection" synchronization point.
| if inner.borrow_conn().flush().await.is_err() { | ||
| // Connection closed or other outbound error. | ||
| break 'mainloop; | ||
| } | ||
|
|
There was a problem hiding this comment.
This may want to print something about an error perhaps? Otherwise the guest is still running and it may be "violently deleted" when dropping
| impl Connection for Conn { | ||
| type Error = anyhow::Error; | ||
|
|
||
| fn write(&mut self, byte: u8) -> std::result::Result<(), Self::Error> { | ||
| self.buf.push(byte); | ||
| Ok(()) | ||
| } | ||
|
|
||
| fn flush(&mut self) -> std::result::Result<(), Self::Error> { | ||
| // We cannot flush synchronously; we leave this to the `async | ||
| // fn flush` method called within the main loop. Fortunately | ||
| // the gdbstub cannot wait for a response before returning to | ||
| // the main loop, so we cannot introduce any deadlocks by | ||
| // failing to flush synchronously here. | ||
| Ok(()) | ||
| } | ||
| } |
There was a problem hiding this comment.
Question for you: how come these aren't implemented with blocking I/O? My rough assumption is that blocking I/O is expected while gdbstub is doing stuff and the guest is halted, and the only async-y bit is "wait for I/O on the TCP connection or the guest to hit an event"
| Err(TargetError::NonFatal) | ||
| } | ||
|
|
||
| #[inline(always)] |
There was a problem hiding this comment.
stray debugging annotation?
| fn add_sw_breakpoint(&mut self, addr: u64, _kind: usize) -> TargetResult<bool, Self> { | ||
| let Some(wasm_addr) = WasmAddr::from_raw(addr) else { | ||
| return Ok(false); | ||
| }; | ||
| let debuggee = self.debuggee; | ||
| if let AddrSpaceLookup::Module { module, .. } = self.addr_space.lookup(wasm_addr, debuggee) | ||
| { | ||
| module | ||
| .add_breakpoint(debuggee, wasm_addr.offset()) | ||
| .map_err(|_| TargetError::NonFatal)?; | ||
| Ok(true) | ||
| } else { | ||
| Ok(false) | ||
| } | ||
| } | ||
|
|
||
| fn remove_sw_breakpoint(&mut self, addr: u64, _kind: usize) -> TargetResult<bool, Self> { | ||
| let Some(wasm_addr) = WasmAddr::from_raw(addr) else { | ||
| return Ok(false); | ||
| }; | ||
| let debuggee = self.debuggee; | ||
| if let AddrSpaceLookup::Module { module, .. } = self.addr_space.lookup(wasm_addr, debuggee) | ||
| { | ||
| module | ||
| .remove_breakpoint(debuggee, wasm_addr.offset()) | ||
| .map_err(|_| TargetError::NonFatal)?; | ||
| Ok(true) | ||
| } else { | ||
| Ok(false) | ||
| } | ||
| } |
There was a problem hiding this comment.
Could this perhaps support dual setting breakpoints on mapped addresses and seeing breakpoints on raw addresses in the code section?
…g forward to first opcode. LLDB, when instructed to `break main`, looks at the DWARF metadata for `main` and finds its PC range, then sets a breakpoint at the first PC. This is reasonable behavior for native ISAs! That PC better be a real instruction! On Wasm, however, (i) toolchains typically emit the PC range as *including* the *locals count*, a leb128 value that precedes the first opcode and any types of locals; (ii) our gdbstub component that bridges LLDB to our debug APIs (bytecodealliance#12771) only supports *exact* PCs for breakpoints, so when presented with a PC that does not actually point to an opcode, setting the breakpoint is effectively a no-op. There will always be a difference of at least 1 byte between the start-of-function offset and first-opcode offset (for a leb128 of `0` for no locals), so a breakpoint "on" a function will never work. I initially prototyped a fix that adds a sequence point at the start of every function (which, again, is *guaranteed* to be distinct from the first opcode), and the branch is [here], but I didn't like the developer experience: this meant that when a breakpoint at a function start fired, LLDB had a weird interstitial state where no line-number applied. The behavior that would be closer in line with "native" debug expectations is that we add a bit of fuzzy-ish matching: setting a breakpoint at function start should break at the first opcode, even if that's a few (or many) bytes later. There are two options here: special-case function start, or generally change the semantics of our breakpoint API so that "add breakpoint at `pc`" means "add breakpoint at next opcode at or after `pc`". I opted for the latter in this PR because it's more consistent. The logic is a little subtle because we're effectively defining an n-to-1 mapping with this "snap-to-next" behavior, so we have to refcount each breakpoint (consider setting a breakpoint at function start *and* at the first opcode, then deleting them, one at a time). I believe the result is self-consistent, even if a little more complicated now. And, importantly, with bytecodealliance#12771 on top of this change, it produces the expected behavior for the (very simple!) debug script "`b main`; `continue`". [here]: https://github.com/cfallin/wasmtime/tree/breakpoint-at-func-start
This adds a debug component that makes use of the debug-main world defined in #12756 and serves the gdbstub protocol, with Wasm extensions, compatible with LLDB.
This component is built and included inside the Wasmtime binary, and is loaded using the lower-level
-D debugger=...debug-main option; the user doesn't need to specify the.wasmadapter component. Instead, the user simply runswasmtime run -g <PORT> program.wasm ...and Wasmtime will load and prepare to runprogram.wasmas the debuggee, waiting for a gdbstub connection on the given TCP port before continuing.The workflow is:
This makes use of the
gdbstubthird-party crate, into which I've upstreamed support for the Wasm extensions in daniel5151/gdbstub#188, daniel5151/gdbstub#189, daniel5151/gdbstub#190, and daniel5151/gdbstub#192. (I'll add vets as part of this PR.)