
Rust-GPU


We've been trying to design a safe and portable shading language for WebGPU (called WGSL), but it turns out to be really hard. For example, the subgroup ops discussion [1] has so far concluded that even the most limited form still has portability concerns. Even things like texture sampling (which depends on control flow uniformity) could be tricky with regards to how uniformity is defined. At least in WGSL we can be conservative, but Rust-GPU probably does not have this luxury.

Thus, I'm curious to see how Rust will tackle it. Will it try to enforce the portability and safety guarantees with a type system, or will it specifically lower the guarantees compared to native code (thus keeping the syntax of Rust but not the philosophy).

Anyhow, amazing work by Embark, and a stellar example of how to collaborate in open source!

[1] https://github.com/gpuweb/gpuweb/pull/954

> it turns out to be really hard

Everyone has been yelling this at you since the beginning. Ever since Apple proposed "safe HLSL", and then it morphed into WHLSL and then WSL and then the competing Tint proposal which is now WGSL, everyone (Adobe, Valve, Epic, Unity, NVIDIA) was saying "none of this will do what you think it does, we've all been here before and it's very very hard; at the very least, get some IHVs, or at least a team that does port work / emulation tech, on board".

The committee chose not to listen, instead forcing WebGPU down this hell that nobody really wants or appreciates. WebGPU as a whole is three years late and still has no users, despite being one of the most complicated specs to land in years. And yeah, I'm still a bit bitter about it.

I am also skeptical of rust-gpu's approach, because I don't think it's possible to take a CPU-inspired language design and blindly adapt it to GPUs without some difficult questions to answer (they still don't have flow control, but I'm told that's coming).

But at the very least, I think repi and Embark know what they're doing and know about the tough problems, rather than feigning being blindsided.

Today, one of the main drivers of WGSL/Tint is David Neto, the former SPIR-V WG chair. We know very well the problems we are facing; it's not ignorance.
So what are you saying here? WebGPU should have been just Apple Metal?
There were working SPIR-V+WebGPU prototypes quite a while ago, and plenty of people would've been happy with SPIR-V+WebGPU. But a particular big company's petty fights with Khronos derailed that.
Why not call them out? It was Apple that decided that standards weren't for them.

Sure, it might be more of an Apple legal team problem than a programmer problem. Nonetheless, they are the ones that decided to derail the standard.

SPIR-V by itself does not have the needed safety or interoperability properties for the web, or the advantages of text formats that the web has benefitted from. These are the more important factors, not IPR considerations.

Fortunately, we landed on a good compromise with a textual language that maps closely onto SPIR-V semantics. A lot of the hard work remaining is to achieve the level of safety and interoperability required for the web. Even in an approach of using SPIR-V as the syntax, a similar level of effort would have been required on safety and interop aspects.

> SPIR-V by itself does not have the needed safety or interoperability properties for the web

Safety, maybe not 100%, but this could be fixed with extensions, and transformations. The same transformations that WGSL compilers will apply. I'm unsure of what you mean by "interoperability".

> or the advantages of text formats that the web has benefitted from.

??? Nobody has ever enumerated these to me. Or told me why, if these advantages are so great, Apple has not advocated for text-based formats for <audio> and <video>. Binary formats are increasingly becoming standard on the web for all sorts of reasons (images, fonts, video, audio, even code in the form of WebAssembly; the Bloomberg/Facebook JS AST proposal improved performance by removing parsing cost; Mapbox ships its vectors in a binary format; glTF is a binary format), and even in the graphics space, the introduction of the SPIR-V format, as opposed to having to ingest GLSL, is seen as one of the greatest advantages of Vulkan over OpenGL.

If you want a text format, take the text serialization of SPIR-V. It's there, it's standard, it exists.

But you won't bother answering me, because every time you bring this up, we ask the same questions, repeat the same evidence, and you ignore us, preferring just to reply "text is more webby".

Not to mention that you're currently pressuring the WG to add new APIs to avoid the heavy front-end cost of Apple's own MSL shader compiler.

https://github.com/gpuweb/gpuweb/issues/1064

> "text is more webby".

TBF, having to include a binary shader blob in a small WebGL-style demo (e.g. some interactive math visualization embedded in a blog post) would suck compared to just adding a few lines of text.

It would be possible to load a WASM module with a text-to-SPIRV compiler, but that would most likely be a few megabytes.

IMHO one shouldn't discard those small shadertoy-style demos, this is what made WebGL popular, not the "AAA games on the web" pipe dream.

> It would be possible to load a WASM module with a text-to-SPIRV compiler, but that would most likely be a few megabytes.

My understanding is that this was discussed, and the answer is: glslang is much smaller than this. (Glslang now has a bunch of #ifndef WEB to facilitate precisely this.)

I don’t think anyone exhibited a version that got to acceptable binary size and startup/execution time. Might be that it would have been possible with way more effort.
Going from "a few megabytes I'd rather not include, for a simple use case" to "let's design our own shading language, derail the project for three years, and discover from first principles all the reasons that this is actually a hard problem, while telling everyone else it's actually easy!" is quite the leap. I sympathize with people who believe that page sizes are getting larger and we should do something to prevent it, but the latter is not a worthwhile solution that meets the tradeoffs IMO.
Google folks who have been primary designers of the new shader language have been deeply involved in SPIR-V; and Apple has made its own shading language, Metal. Apple also produces GPUs that ship in very high volume via A-series SoCs.

I’m not sure why you think the people involved in this effort are clueless about shader languages. That does not seem supported by the evidence.

Looking at that issue, seems perfectly fine to raise it to me. That's exactly the kind of input I would expect from a member of the working group, and it just increases the quality of the end product.

Let's face it, Metal is nicer than Vulkan. And I bet there are more Metal programmers out there by now than there are Vulkan programmers.

> Safety, maybe not 100%, but this could be fixed with extensions, and transformations. The same transformations that WGSL compilers will apply.

What I’m telling you is that defining these checks and transformations is the hard part, not the syntax. If your claim is that WebGPU is taking longer because of WGSL, I don’t think that holds up.

> I'm unsure of what you mean by "interoperability".

Consistent behavior across browsers and platforms. Some things in SPIR-V are not specified with as much precision as modern web standards like HTML or ECMAScript.

>> or the advantages of text formats that the web has benefitted from.

> ??? Nobody has ever enumerated these to me.

These have been discussed to death in the WebGPU WG. I’m not sure why you think someone owes you personally a direct explanation. But here are some: ease of debugging; learning via “view source”; the ability to develop and publish with no compile step if desired.

> why Apple has not advocated for a text-based formats for <audio> and <video>

There is no textual source format for a media file. Media files also need bespoke compression to transmit efficiently; gzipping a text format would not work. Gzip transfer-encoded WGSL is more compact than gzip transfer-encoded SPIR-V, so the transfer size advantage cuts the other way (though it's not clear this really matters for shader programs).

> Binary formats are widely becoming more standard on the web ... even code in the form of WebAssembly

WebAssembly is intended for a very specific use case, porting native apps to the web. Even so, many in the WebAssembly WG think it was a mistake to make it a binary format instead of text, in retrospect.

> Bloomberg/Facebook's JS AST proposal improved performance by removing parsing cost

We (WebKit team) think they got their results only by starting with a JS engine that had a slow parser. JavaScriptCore parses much faster than SpiderMonkey and is faster than SpiderMonkey + AST prototype. And the proposal is not really getting buy-in at ECMA.

> If you want a text format, take the text serialization of SPIR-V. It's there, it's standard, it exists.

There is no standard text serialization format of SPIR-V. There’s an unofficial format supported by spirv-cross, but it’s not a standard. WGSL was created by starting with that serialization format, and progressively adding concessions to human-authorability while preserving a mapping to SPIR-V semantics. We (WebGPU CG) hope to make it a standard and perhaps even make that standard usable beyond the domain of WebGPU.

> But you won't bother answering me, because every time you bring this up, we ask the same questions, repeat the same evidence, and you ignore us, preferring just to reply "text is more webby".

I just answered you comprehensively. I feel like 0% of what I said is new information beyond what has been stated many times before in many venues. So I think the problem here may be on the reader side, not the writer side.

> Not to mention that you're currently pressuring the WG to add new APIs to avoid the heavy front-end cost of Apple's own MSL shader compiler.

This has nothing to do with shader language choice per se. That said, the WebGPU CG charter includes the goal of working naturally and efficiently on top of Metal, along with Vulkan and DirectX. This proposal, based on perf measurements, is a way of removing overhead for Metal that won’t affect the other two underlying APIs WebGPU targets. Do you feel that WebGPU should not work efficiently on top of Metal? You may not care about this personally, but the CG charter disagrees. Is it bad faith “pressuring” to share perf info about how WebGPU maps to Metal, or just the normal way the standards process is supposed to work? To me it looks like the latter.

> we've all been here before and it's very very hard, at the very least, get some IHVs or at least a team that does port work / emulation tech on board
Yep, if you want safety you have to make the hardware implement it at the op level. Anything else is a false promise that will be exploited and unfixable. GPUs are far worse than CPUs in this respect, and even now Intel has entirely unfixable Spectre vulnerabilities.

We need to stop trying to make software mitigations and start putting pressure on hardware designers.

There are many safe programming languages that do not rely on extra safety features of the hardware. Calling it "false promise that will be exploited and unfixable" is not supported by the evidence.

That is to say, WebGPU is working with IHVs indirectly via Khronos, Microsoft, etc., and is putting pressure on them to introduce safety features at the driver/hardware level. We've been arguing for pinning down more of the undefined behavior in the Vulkan specs, initialization of workgroup memory, device memory ownership, and other things.

That's quite interesting, but I wonder what Rust brings to the table here.

From my understanding of GLSL, it seemed like the hardware was in control of scheduling, and the driver that compiled the shaders was the one inserting locks, fences, etc.

GPUs seem like a bad fit for run-time exceptions and return value checking (? operator?) since they are so poor at branching, being SIMD. The more you can check ahead of time, the better, and Rust isn't such a bad fit there.

But given that Rust's main advantages seem to be the ability to safely control multiple execution threads, and its handling of edge cases, I wonder if the first is going to be used at all, and if the second isn't going to lead to extra instructions?

Another advantage (for Embark) is that their main engine codebase is also in Rust. This allows some code to be shared between GPU and CPU. This isn't uncommon in the game industry: for example, header files that compile as both C++ and HLSL in order to share struct definitions between CPU and GPU code, and CPU math libraries that use HLSL types and semantics.

Rust is different enough from GLSL and HLSL (most game engines use HLSL as "native" and cross compile to GLSL, MSL or use header defines to build PSSL) that introducing a new shading language compatible with Rust is probably worth the effort for them.
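
As a rough illustration of the struct-sharing idea (the names and layout below are made up for illustration, not Embark's actual code): a #[repr(C)] type can be compiled into both the CPU-side engine and the GPU-side shader crate, so the buffer layout is defined exactly once.

    // Hypothetical shared crate, compiled for both the native target and the
    // SPIR-V target. #[repr(C)] pins down the field order and layout.
    #[repr(C)]
    #[derive(Clone, Copy)]
    pub struct LightParams {
        pub position: [f32; 4], // xyz + one float of padding to match GPU alignment
        pub color: [f32; 4],    // rgb + intensity
    }

    // CPU side: view the struct as raw bytes for uploading into a uniform buffer.
    pub fn as_bytes(params: &LightParams) -> &[u8] {
        // Sound here because LightParams is #[repr(C)] plain-old-data made of f32s.
        unsafe {
            core::slice::from_raw_parts(
                params as *const LightParams as *const u8,
                core::mem::size_of::<LightParams>(),
            )
        }
    }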

Rust has hygienic macros and a package manager, at least. It'd also be nice to have a common language that you could run on the CPU for debugging purposes. I see this more as a moonshot type project though, where if you could accomplish it, it would mostly just speed up Rust's time to utility for other similar use cases like embedded development.
The main benefit is that the GPU-side code can better integrate with the CPU-side code, and that you don't need separate tools and a complex build process integration.

For instance it's possible to share data structure between the CPU and GPU side directly from the same source code, or you can run the same code on the CPU or GPU for whatever reason (debugging, unit testing etc...).

The Metal API uses C++ (with some extensions) as shader language, which IMHO is complete overkill too, but one nice side effect is that you can put shared data structures into headers, and then include the same shared headers on the CPU and GPU side.

Ideally it should be possible to mix GPU and CPU code in the same source file, and (for instance) only annotate GPU functions with an attribute that causes them to be compiled as shader code.
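
A sketch of what that could look like, assuming a rust-gpu-style #[spirv(...)] entry-point attribute and the glam vector types; the exact attribute spelling and entry-point signature vary between rust-gpu versions, so treat this as illustrative only:

    use glam::Vec4;

    // Plain Rust, shared by both builds: can be unit-tested on the CPU.
    pub fn tonemap(color: Vec4) -> Vec4 {
        color / (color + Vec4::splat(1.0))
    }

    // GPU-only entry point: the attribute tells the shader compiler to emit this
    // function as a fragment shader (attribute name is illustrative).
    #[cfg(target_arch = "spirv")]
    #[spirv(fragment)]
    pub fn main_fs(input: Vec4, output: &mut Vec4) {
        *output = tonemap(input);
    }

    // CPU-only test of the very same function.
    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn tonemap_stays_in_range() {
            let c = tonemap(Vec4::new(10.0, 0.5, 0.0, 1.0));
            assert!(c.max_element() <= 1.0);
        }
    }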

Will this only target graphics operations, or will it also be suitable for general purpose computations (matrices, tensors, etc)?
From the readme

> Focus on Vulkan graphics shaders first, then after Vulkan compute shaders

So no compute support yet. Also, Rust-GPU's focus is writing shaders in Rust in order to support a game engine, so it's aimed at real-time graphics and simulation. It seems like it will never evolve into a CUDA-like heterogeneous environment of the kind that's more common in scientific computing and ML.

> It seems like it will never evolve into a CUDA-like heterogeneous environment that's more common in scientific computing and ML.

I don't think anyone would be opposed to evolving the project into a strong alternative for scientific computing and machine learning!

> If we do this project right, developers could use open-source crates that provide the graphical effects needed to create beautiful experiences. Instead of sharing snippets of code on forum posts, they could simply add the right crates.

That’s not a very compelling value proposition. You don’t need to switch languages to use a package manager.

Plus, not all engines have the same data layout, so an “effect crate” would either require your engine to work a very specific way, or would require integration work that would probably defeat the purpose of using a package manager in the first place.

And what does Rust itself add that GLSL/HLSL don’t offer? It can’t be worth the headaches, especially when you consider how a TON of devices still only support OpenGL, which not only requires runtime compilation of shaders (so GLSL only, unless your app ships a cross compiler), but whose drivers behave differently from one another, requiring ugly hacks for the sake of performance in some cases.

> Plus, not all engines have the same data layout, so an “effect crate” would either require your engine to work a very specific way, or would require integration work that would probably defeat the purpose of using a package manager in the first place.

You could say the same thing about nodes in shader graph systems, but in practice they work quite well.

> It can’t be worth the headaches, especially when you consider how a TON of devices still only support OpenGL, which not only requires runtime compilation of shaders (so only GLSL unless your app ships a cross compiler)

You can AOT compile SPIR-V to GLSL with SPIRV-Cross.

imagining a simple x86/64 machine that also has a simple GPU...

what does the amd64 assembly look like for "telling the GPU to do something"?

Is there an amd64 instruction that says "Tell the GPU to start executing here?"

Or is it at the OS layer instead of the assembly layer, with an OS syscall that defers to the GPU?

To the CPU, peripherals look like memory, so it just looks like copying a GPU "kernel" (program) from one memory area (from disk) to another (the GPU, via PCIe).

I don't know enough about PCIe to be an authoritative source, but I'll just go along and describe the way most microcontrollers and peripherals work. I doubt x86 has instructions dedicated to PCIe. There will likely be a PCIe controller memory-mapped to some hardware address that the OS writes to as if it were regular memory (that address is physically wired to the controller, be it on the CPU die or on the motherboard). The address is generally specified by the BIOS/UEFI (device trees on most embedded platforms).

Of course, what you write will depend on the controller itself, but the OS will have a driver that contains that information. There is a back-and-forth to list the peripherals on the bus, identify the one you want to communicate with, and set up the access mode (direct memory access, where you map an area of your memory to the peripheral, or just sequential access through the controller). Once you can communicate with the GPU, you start the same dance to give it a program you compiled ahead of time, have it run, and fetch the result.

I don't have low-level experience with that specific part of the system, but that kind of stuff is generic enough that you find the same patterns everywhere, at least for Turing machines, which x86 CPUs are.

This is a bit of an oversimplification, because I don't actually know the details, but it works roughly like this: GPUs are devices on the PCI Express bus. Without an operating system getting in the way, everything on the PCI Express bus is organised into something called ECAM (https://en.wikipedia.org/wiki/PCI_configuration_space). Each device there specifies the memory regions it supports; these regions map to a large number of configuration registers and on-device memory. Since GPUs are really complex, most of their behaviour is not directly controlled but instead predetermined by their firmware (basically an operating system in its own right). Finally, GPUs execute multiple instruction streams with proprietary instruction formats and can independently load data from main memory with their built-in DMA engines. These operations are typically kicked off by pointing various registers on the GPU at regions in main memory, to tell it where to fetch instructions and data from. Everything else happens, for the most part, as a consequence of instructions executing on the GPU.
This specific component is a SPIR-V backend for Rust. In practice, how you use this is you use a native API (Vulkan), give it your generated SPIR-V program, and it "runs it" using its own semantics. This is a combination of a userspace API talking to a kernel driver, and the kernel driver using a combination of IN/OUT instructions (unlikely) and memory-mapped I/O (much more likely) to talk to the GPU.

The actual details of the communication are documented by some IHVs, either through PDFs or source code. Normally, the CPU doesn't wait for programs to complete; instead, the GPU driver keeps a list of tasks to run (a "command buffer") that it can schedule across many different pieces of hardware, and a combination of the driver and the GPU itself determines scheduling, execution, and so on. Note that there's still a lot of work for the driver to do, including managing the device's memory (textures, buffers, and so on), compiling programs into machine language (similar to a JIT), and coordinating multiple different programs trying to use the GPU.
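
For a concrete feel of that hand-off, here is roughly what feeding the generated SPIR-V to Vulkan looks like from Rust, assuming the ash crate (a thin Vulkan binding); the builder-style API shown is from ash ~0.37 and may differ in other versions:

    use ash::vk;

    /// Wrap a blob of SPIR-V words (e.g. produced by the rust-gpu backend)
    /// into a Vulkan shader module that pipelines can reference.
    unsafe fn create_shader_module(
        device: &ash::Device,
        spirv_words: &[u32],
    ) -> Result<vk::ShaderModule, vk::Result> {
        let info = vk::ShaderModuleCreateInfo::builder().code(spirv_words);
        // The driver validates this and later compiles it into the GPU's own ISA.
        device.create_shader_module(&info, None)
    }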

CPU code calls in to drivers which talk to the GPU over the PCIe bus. So assembly code that does stuff with the GPU just looks like loading a shared library, passing in some shader code, and then passing in some drawing commands. This project looks like it aims to compile rust to a shader intermediate format so you can use it instead of more common shading languages like GLSL or HLSL. Basically your HLSL/GLSL/(now Rust) code ends up as an intermediate format, which the CPU passes in to a graphics driver before asking that driver to draw things.
> Or is it at the OS layer instead of the assembly layer, with an OS syscall that defers to the GPU?

Yes. Modern OSes don’t allow userspace code to directly access peripheral devices, GPUs included. Due to the enormous complexity of modern GPUs (a GPU typically has more transistors than the CPU, consumes more electricity, and has a comparable amount of memory), the API surface is huge.

On Windows, the native API surface is Direct3D. It has two large pieces, user-mode and kernel mode. The vendor-provided GPU driver is similarly split in two.

The user-mode parts implement the HLSL compiler (Microsoft), a further downstream compiler that translates DXBC into the proprietary byte code GPUs actually run (vendors), other higher-level stuff like mipmap generation (vendors, at least for nVidia), and the user-facing APIs, i.e. multiple versions of D3D (Microsoft).

The kernel-mode driver interfaces with the actual hardware; Microsoft’s vendor-agnostic part of that is in dxgkrnl.sys.

Various GPU drivers often expose extra user-facing APIs in addition to Direct3D (CUDA, Vulkan, OpenGL, OpenCL), but all of them are optional.

> Modern OSes don’t allow userspace code to directly access peripheral devices, GPUs included

This depends a lot on the OS and on the peripheral, but this is certainly not true across the board.

RDMA verbs, for example, definitely involve userspace directly communicating with the NIC, with no kernel involvement.

At least some of the AMD GPU graphics stacks support userspace submission to the GPU also, see http://www.hsafoundation.com/html/Content/Runtime/Topics/02_....

This requires hardware support, like an MMU and ensuring bad submissions from userspace can't do bad things to other processes, but lots of hardware supports these things.

The windows graphics stack does require going into the kernel to submit work to the GPU, but that isn't the case for all operating systems and all peripherals.

Somewhat related: GPUs started getting support for virtual memory in ~2006 or so [1], but the Windows graphics stack was only improved to actually use that capability in Windows 10, ~10 years later [2].

[1] https://www.anandtech.com/show/2116/2 [2] https://en.wikipedia.org/wiki/Windows_Display_Driver_Model#W...

> this is certainly not true across the board.

Interesting.

> but the windows graphics stack was improved to actually use that capability in Windows 10

No, it happened much earlier, in WDDM 1.0 i.e. Windows Vista: https://en.wikipedia.org/wiki/Windows_Display_Driver_Model#V...

> No, it happened much earlier, in WDDM 1.0

WDDM 1.0 supported virtualized video memory, but it did it in software by patching command buffers to reference the new physical memory addresses.

Not until WDDM 2.0 did WDDM actually support using the hardware virtual memory capabilities of GPUs.

> A new memory model is implemented that gives each GPU a per-process virtual address space. Direct addressing of video memory is still supported by WDDMv2 for graphics hardware that requires it, but that is considered a legacy case. IHVs are expected to develop new hardware that supports virtual addressing.

i.e. prior to WDDM 2.0, each process did not have a separate virtual address space on the GPU, and all hardware was assumed to use "direct" (i.e. physical) addressing of memory.

https://docs.microsoft.com/en-us/windows-hardware/drivers/di...

> Under Windows Display Driver Model (WDDM) v1.x, the device driver interface (DDI) is built such that graphics processing unit (GPU) engines are expected to reference memory through segment physical addresses. As segments are shared across applications and over committed, resources gets relocated through their lifetime and their assigned physical addresses change. This leads to the need to track memory references inside command buffers through allocation and patch location lists, and to patch those buffers with the correct physical memory reference before submission to a GPU engine

> it is necessary to eliminate the need for the video memory manager to inspect and patch every command buffer before submission to a GPU engine.

> To achieve this, WDDM v2 supports GPU virtual addressing

> Modern OSes don’t allow userspace code to directly access peripheral devices, GPUs included

I was under the impression that user-space drivers (like Linux UIO) were given an address that they can mmap for direct reads/writes from/to the peripherals address space. Is this not "direct"?

> were given an address that they can mmap for direct reads/writes from/to the peripherals address space

Kernel-mode drivers indeed do that under the hood, but the majority of GPU I/O bandwidth doesn’t go that way.

With DMA, GPUs have full access to system memory. They have a specialized piece of hardware (exposed to programmers as the copy command queue in D3D12, or the transfer queue in Vulkan) to move large blocks of data in either direction between system RAM and VRAM.

Simplifying a lot, both the CPU and the GPU have access to a few regions of shared memory, which both can read from and write to. Some of it is on the GPU (device local memory) and some of it is on the CPU (host local memory). The CPU writes a list of commands to one of these shared memory regions, then writes to a special register on the GPU (these GPU registers are visible to the CPU as yet another region of shared memory) telling it where that list of commands is.

That is, from an x86-64 assembly point of view, it's just writing to memory. Usually, only the final write to the special register is privileged, so that write will need an OS syscall into the graphics driver; the rest can be written directly by the userspace driver (which earlier used another OS syscall to get access to some of that memory shared with the GPU). The userspace driver also compiles the SPIR-V programs into native GPU code, which can be loaded and executed as instructed by the list of commands.
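
A toy sketch of that flow in Rust (the addresses, command encoding, and register layout here are entirely made up; a real userspace driver gets these mappings from the kernel driver):

    use core::ptr::write_volatile;

    #[repr(C)]
    struct CommandList {
        commands: [u32; 256], // encoded GPU commands, format is device-specific
        len: u32,
    }

    /// `doorbell` points at a GPU register mapped into our address space.
    fn submit(shared: &mut CommandList, doorbell: *mut u64) {
        // 1. Write the command list into memory the GPU can also see.
        shared.commands[0] = 0x1234_5678; // pretend "dispatch" packet
        shared.len = 1;

        // 2. Tell the GPU where the list is by writing its address to the register.
        //    Volatile so the compiler doesn't elide or reorder the device write.
        unsafe { write_volatile(doorbell, shared as *const CommandList as u64) };
        // From here on, the GPU fetches and executes the commands on its own,
        // and signals completion through a fence the CPU can poll or wait on.
    }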

The OS level is purely so that multiple programs can talk to the same hardware at the same time. For example, your Firefox Browser may talk to the GPU to render something (H.264 video), or to send audio over the HDMI cable. But at the same time, you're playing some video game (Civ-6), and the OS needs to prioritize which gets access.

------------

When it comes to I/O, there are two strategies: memory-mapped I/O, and I/O ports (the "in" and "out" assembly instructions). I'm pretty sure most modern hardware is of the memory-mapped type, where you just write to (or read from) certain "memory" locations.

It's not really memory (it doesn't go to RAM); instead it goes to some other component, like the PCIe root complex, for further processing.

------

Something to note: a whole OSI-style stack exists on the motherboard. You've got physical connections, you've got link-layer connections, and even networks similar to TCP/IP. In fact, there are multiple networks: SATA networks, PCIe networks, and USB networks, each with their own addressing rules and physical protocols. Different USB ports, different PCIe addresses, and more!

From the perspective of the CPU, I expect any "GPU command" to simply be a link-level call to the PCIe root complex. You send data to the PCIe controller, which then forwards the data to the GPU. Or to an NVMe drive, or your network card.

To fully make a GPU Command get to the GPU, you need to know the PCIe address, and route the message appropriately (there could be PCIe switches in between). Finally, there's a protocol, similar to application-level code (HTTP) where applications talk to the GPU.

Given the similarities between DirectX, OpenCL, Vulkan, Metal, and CUDA/HIP, I expect that GPUs all have roughly the same "application" interface. Shaders are compiled into machine code. Machine code is loaded into GPU VRAM. Command Queues issue remote-function calls with an event-driven dependency graph that the GPU selects kernels from. Etc. etc.

> developers could use open-source crates that provide the graphical effects needed to create beautiful experiences

That HLSL/GLSL code is tightly coupled to the format and bindings of the input and output data (textures, samplers, constant/uniform buffers, render targets, etc.). These are defined by the combination of non-shader assets and the CPU-side code that implements the rest of the graphics engine.

I’m not sure how a different shader language is going to help.

Oh, I was waiting for something like this to appear. So this allows using Rust to generate SPIR-V for Vulkan instead of GLSL or HLSL?

Can Rust be used also for compute? It would be a great way to dislodge CUDA lock-in, since something in Rust could be a lot better option than either CUDA or OpenCL.

> Can Rust be used also for compute?

It can, but it's worth noting that the programming paradigm is different. For this reason, I assume you would still be writing a DSL within Rust, one that can use Rust's type checker to help you figure out what to write and how; Rust by itself cannot represent, as-is, the semantics required to write efficient parallel code. So you still need to be a domain expert. But automating the bytecode generation is made much easier by Rust's ability to provide packages, etc.

I refactored some OpenMP code to run on SPIR-V using the Vulkan API the OP project uses, and I must say that I cannot see myself doing it in any other language than Rust.
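
To illustrate the paradigm shift in plain Rust (no GPU API involved): the per-element body stays the same, but on the GPU the outer loop disappears and each invocation handles only the index handed to it by the dispatch.

    /// Per-element body, identical for the CPU and GPU versions.
    fn saxpy_one(a: f32, x: f32, y: f32) -> f32 {
        a * x + y
    }

    /// CPU: an explicit loop drives the computation.
    fn saxpy_cpu(a: f32, xs: &[f32], ys: &mut [f32]) {
        for i in 0..xs.len() {
            ys[i] = saxpy_one(a, xs[i], ys[i]);
        }
    }

    /// GPU-style: the runtime launches one invocation per element and hands each
    /// one a `global_id`; there is no loop in the kernel itself. (On a real GPU
    /// this would be a compute entry point and `global_id` a built-in value.)
    fn saxpy_invocation(global_id: u32, a: f32, xs: &[f32], ys: &mut [f32]) {
        let i = global_id as usize;
        if i < xs.len() {
            ys[i] = saxpy_one(a, xs[i], ys[i]);
        }
    }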

From the link: "Focus on Vulkan graphics shaders first, then after Vulkan compute shaders"
How does using shaders in language X with Vulkan compare to the approaches of OpenCL and CUDA? I.e., is it ergonomic to use for compute?
I don't think anyone on HN who regularly uses CUDA would call graphics compute ergonomic, and I often see complaints about it. I also see graphics programmers using CUDA complain about graphics interop (although OptiX seems pretty good), so it might be the result of different use cases rather than anything better or worse about either model.
> I don't think anyone on HN that regularly uses CUDA would call graphics compute ergonomic

I use CUDA, but DirectCompute is OK too.

> I often see complaints about it

People complain about all technologies they use. The more popular the technology is, the more complaints on the internets you're gonna see.

My largest complaint about graphics compute is the inability to spawn threads at runtime. DispatchIndirect and append buffers often help to work around it, but they're still limited, not really an equivalent to CUDA’s dynamic parallelism.

> rather than anything better or worse about either model

Not sure what you mean by that?

CUDA exposes more stuff (the dynamic parallelism mentioned above, intrinsics to shuffle data across threads, and a few others), but still, the model is pretty much the same. CUDA’s triple angle brackets map directly to Dispatch arguments and [numthreads] values in HLSL.

Intrinsics to shuffle data around threads exist in graphics compute but are only really supported in DX12 (wave intrinsics) and Vulkan (subgroup operations). They were first introduced widely to game developers on consoles (Xbox One and PS4), but also exist in a more limited form as vendor-specific extensions in DX11 (and I think OpenGL). Callable shaders (https://docs.microsoft.com/en-us/windows/win32/direct3d12/ca...) in DX12 and Vulkan might give you the dynamic dispatch functionality you want. They are usable in ray generation shaders but I'm not sure if they can be used in general purpose compute yet.

The better or worse comment was more about having a heterogeneous environment vs using a graphics API to dispatch compute. I agree that generally speaking they are the same though, after all it's the same hardware.

So you think there is no need to make something different for Rust than using it as a shading language for Vulkan, or GPU programming can be improved with different approaches?

Basically, I'd like there to be something that's ergonomic, using Rust and can eventually help get rid of CUDA lock-in.

I think there is a need for it. I'm not sure if this will become that, although it might be a piece of it. There seem to be multiple projects to build CUDA-like environments (or to bring CUDA itself to Rust).