153   blog.trailofbits.com

Is the fuzzing use case burying the lede here? The ability to translate an aarch64 binary so it can run on a GPU seems like it might be interesting in its own right? Maybe the fact that I'm not coming up with obvious reasons is a clue that there aren't any, but it's just remarkable to me that it's possible.

Great job folks! Very cool.

Thanks so much for checking this out!

Like tyoma said in the previous comment, if you had a use case where you needed to run lots of things in parallel, then this would be useful. Latency is much higher on GPUs (clock speeds are lower and memory access latencies are higher), and system call support will make this even worse, so this probably wouldn't fare well unless you had a use case that could utilize that high a degree of parallelism.

Seems like it would be good for anything you would put in a stream e.g. encryption/decryption, encode/decode, compress/decompress, parsing, filtering, routing, etc.
It would only be beneficial to do the translation if the underlying aarch64 (or x86 or other) could be run in parallel on multiple data elements to begin with. Fuzzing naturally has that property, but certainly there are other uses.
Hi HN! I'm the intern that worked on this project, and I would be happy to answer any questions here!
How hard would it be to adapt this to use when source is available? Obviously one could just use the binary, but being able to skip the lifting phase could reduce complexity. If you can compile code to LLVM IR (say, with Clang) anyway it'd be nice if the resulting tool could take that as input.
It would be doable but not trivial. We're depending on remill not only to lift binaries, but also to add instrumentation for interposing on and translating memory accesses and function calls. We could use uninstrumented LLVM IR as input, but would need to write an LLVM pass to add in equivalent instrumentation. This shouldn't be terribly hard, but we're currently focused on getting everything working with remill.
Thanks for the really interesting work and article. A couple of quick questions:

- How does the generated PTX code interface with the rest of the system? Is it embedded into some CUDA code?

- Any plans to open source?

Thanks for checking it out!

1) The generated PTX is written to a file and then dynamically loaded into the fuzzer, which is a CUDA program. Specifically, the cuModuleLoad function can be called to load a PTX file, and then cuModuleGetFunction can be used kind of like dlsym to get handles to functions that were loaded from the PTX.
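
For readers who haven't used the CUDA driver API, here is a rough sketch of that load-then-look-up sequence, written in Python with ctypes against the driver library. This is not the fuzzer's code: the helper names `check` and `load_ptx_function`, the choice of device 0, and the use of the primary context are all assumptions made for illustration. The kernel name has to match an `.entry` name in the PTX file.

```python
import ctypes
import ctypes.util


def check(status: int, what: str) -> None:
    """Raise if a driver API call did not return CUDA_SUCCESS (which is 0)."""
    if status != 0:
        raise RuntimeError(f"{what} failed with CUresult {status}")


def load_ptx_function(ptx_path: str, kernel_name: str):
    """Load a PTX file and look up one kernel in it.

    Returns (module, function) handles, or None when no CUDA driver
    library can be found on this machine. Hypothetical helper.
    """
    libname = ctypes.util.find_library("cuda")
    if libname is None:
        return None
    cuda = ctypes.CDLL(libname)

    device = ctypes.c_int()
    context = ctypes.c_void_p()
    module = ctypes.c_void_p()
    function = ctypes.c_void_p()

    check(cuda.cuInit(0), "cuInit")
    # Assumption: use the first GPU and its primary context.
    check(cuda.cuDeviceGet(ctypes.byref(device), 0), "cuDeviceGet")
    check(cuda.cuDevicePrimaryCtxRetain(ctypes.byref(context), device),
          "cuDevicePrimaryCtxRetain")
    check(cuda.cuCtxSetCurrent(context), "cuCtxSetCurrent")

    # The dlopen-like step: the driver compiles the PTX for the current GPU.
    check(cuda.cuModuleLoad(ctypes.byref(module), ptx_path.encode()),
          "cuModuleLoad")
    # The dlsym-like step: look the kernel up by name.
    check(cuda.cuModuleGetFunction(ctypes.byref(function), module,
                                   kernel_name.encode()),
          "cuModuleGetFunction")
    return module, function
```

The returned function handle is what would then be passed to cuLaunchKernel; launching is left out to keep the sketch short.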

2) We do plan to open source! Currently the code is definitely research grade and needs some more work.

How many compiler bugs have you hit so far :) ?
Haha... More than I had expected. We've hit two confirmed + one possible bug in LLVM and one bug in the PTX assembler. LLVM's PTX backend isn't fully mature yet, and I think the kind of PTX we're generating is very different from what people traditionally do with CUDA, so we are exposing quite a few edge cases in compilers that haven't been dealt with.
How does one determine when a bug is in the compiler vs. just a dumb code error? Examining compiler output?
That's been one of the biggest challenges of this internship, since I'm so used to assuming that any bugs are problems with my code or some library I'm using. In general, I'll first try to debug as I would normally debug my own code, but if inexplicable behavior keeps happening, I try to strip the code down to as small an example as possible and then look at the compiler output. In some cases (e.g. bugs with LLVM), I can just try a different compiler and see if it works (e.g. nvcc), but ptxas is the only PTX assembler out there, so confirming ptxas bugs requires much more work.
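
The "strip it down" step can be partly automated. Below is a minimal sketch of a greedy line reducer in the spirit of delta debugging; it is not the workflow used on this project. The `still_fails` callback is hypothetical: in practice it would write the candidate lines to a file, compile and run them, and report whether the odd behavior is still there.

```python
from typing import Callable, List


def reduce_lines(lines: List[str],
                 still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop chunks of lines while the failure still reproduces."""
    current = list(lines)
    chunk = max(len(current) // 2, 1)
    while True:
        i = 0
        removed_any = False
        while i < len(current):
            # Try the input with one chunk cut out.
            candidate = current[:i] + current[i + chunk:]
            if still_fails(candidate):
                current = candidate   # keep the smaller input
                removed_any = True
            else:
                i += chunk            # this chunk is needed, move on
        if chunk == 1 and not removed_any:
            return current            # no single line can be removed
        chunk = max(chunk // 2, 1)


if __name__ == "__main__":
    program = ["a", "b", "call f", "c", "call g", "d"]
    # Toy predicate: the "bug" needs both calls to be present.
    needs_both = lambda ls: "call f" in ls and "call g" in ls
    print(reduce_lines(program, needs_both))
```

The result is only minimal in the sense that no single remaining line can be dropped, which is usually enough for a bug report.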

Edit: another indicator is if something works at -O0 but breaks at higher optimization levels. That could be undefined behavior in your code, but it could also suggest a bug in the optimizer. Sometimes it's helpful to fiddle with the code to figure out what causes the compiler to break. For example, with the ptxas bug, our code would work fine unless we had a long chain of function calls (even if the functions in the call chain weren't doing anything interesting). That sounds more like a compiler bug than a logic error on our part. Sometimes, you can even figure out which specific pass of the optimizer is breaking the code; LLVM has a bisect tool that allows you to run optimization passes individually until you observe the output breaking.
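
As a sketch of how that bisection can be driven: LLVM's `-opt-bisect-limit=N` option runs only the first N optional passes and skips the rest, and with clang it is passed through `-mllvm`. The code below binary-searches for the smallest N that breaks the output. The `is_broken` callback and both function names are hypothetical, and the search assumes that once the output breaks at some N it stays broken for every larger N.

```python
from typing import Callable, List


def clang_bisect_command(source: str, output: str, limit: int) -> List[str]:
    """Build a clang command that stops running optional passes after `limit`."""
    return ["clang", "-O2", "-mllvm", f"-opt-bisect-limit={limit}",
            source, "-o", output]


def first_bad_limit(is_broken: Callable[[int], bool], upper: int) -> int:
    """Return the smallest limit in [1, upper] for which the build is broken.

    Assumes limit 0 gives a working build, `upper` gives a broken one,
    and the result is monotone in between.
    """
    good, bad = 0, upper
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_broken(mid):
            bad = mid
        else:
            good = mid
    return bad


if __name__ == "__main__":
    # Toy stand-in for "compile with this limit, run it, compare output".
    print(first_bad_limit(lambda n: n >= 37, 256))
    print(" ".join(clang_bisect_command("test.c", "test", 37)))
```

Compiling once more at the limit that was found prints, in the bisect log clang writes to stderr, which pass ran last, and that is the one to look at first.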

How's the fidelity of code that's lifted through LLVM IR and then lowered back down to PTX?
The process is a little brittle right now, but when it works, it works. Remill (the binary lifter) sometimes has issues with certain constructs such as switch statements, and we've hit a number of LLVM and ptxas (PTX assembler) bugs as well, since LLVM's PTX backend isn't fully mature and most CUDA kernels are light on function calls and don't look like typical application code. However, when the process works, the PTX doesn't look too terribly different from the original code.
Disclosure: I work on Google Cloud.

Cool use of preemptible T4s! The Chrome Clusterfuzz folks were a launch partner for Preemptible VMs, so I have a soft spot for preemptible fuzzing :).

Let me know if you need more quota or have any feedback / questions. We recently improved the preemption rate for preemptible GPUs drastically, so I hope you’ve experienced that.

After watching some of gamozolabs fuzz week a few months ago I've been wondering how well it would work to write a simple RISC-V emulator on a GPU for parallelized fuzzing. It sounds like a fun learning project that I hope I can get to eventually.

It's very interesting seeing just how much performance can be squeezed out like this.