Large Language Models like Phi-3 generate text one token (roughly one word) at a time. For each token, the model runs 342 GPU compute operations ("dispatches") across 32 neural network layers. Each layer does matrix multiplications, attention computation, and activation functions, all running on your GPU via WebGPU.
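The per-token loop looks roughly like this (a sketch: the `model.forward` interface and the greedy argmax sampling are illustrative assumptions, not the actual runtime's API):

```javascript
// Index of the largest logit: greedy sampling, the simplest decoding strategy.
function argmax(arr) {
  let best = 0;
  for (let i = 1; i < arr.length; i++) if (arr[i] > arr[best]) best = i;
  return best;
}

// Hypothetical autoregressive decode loop: one full forward pass
// (all 32 layers, ~342 dispatches under TVM) per generated token.
function decode(model, promptIds, maxNewTokens) {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = model.forward(ids); // all the GPU work happens here
    ids.push(argmax(logits));          // pick the next token, repeat
  }
  return ids;
}
```

Real decoders add temperature sampling and a KV cache, but the shape is the same: every new token costs one full trip through the network.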
Normally, a framework called TVM manages these operations. TVM's runtime, compiled to WASM, decides which GPU shader to run, writes the parameters, submits the work, and reads back the result: 342 times per token.
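In WebGPU terms, each of those runtime steps maps onto a handful of API calls. Here is a minimal sketch of what a single dispatch involves (the function and parameter names are mine; TVM's actual runtime does this from inside WASM):

```javascript
// One compute dispatch, WebGPU-style: write params, record a pass, submit.
function runDispatch(device, pipeline, bindGroup, paramsBuffer, params, workgroups) {
  // Write the kernel's scalar parameters into a uniform buffer.
  device.queue.writeBuffer(paramsBuffer, 0, params);
  // Record a compute pass that binds the shader pipeline and its buffers.
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups);
  pass.end();
  // Submit the recorded commands to the GPU queue.
  device.queue.submit([encoder.finish()]);
}
```

Multiply that boilerplate by 342 and you have one token's worth of work.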
I intercepted TVM's GPU calls, captured every shader and buffer, decoded the architecture, then threw away the 85 TVM shaders and wrote 10 of my own (27 WGSL files, 3,078 LOC) that drive the GPU directly. Same weights, same math — 228 dispatches per token instead of 342, fully readable in one sitting, 22% slower than TVM's autotuned kernels on M2 Pro.
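Interception works because TVM has to go through the same `GPUDevice` object you hand it, so wrapping its methods before TVM sees the device captures everything. A minimal sketch of the idea (the wrapper and array names are my own, not the actual capture harness):

```javascript
// Wrap createShaderModule so every WGSL source TVM compiles gets recorded.
function captureShaders(device, captured) {
  const original = device.createShaderModule.bind(device);
  device.createShaderModule = (descriptor) => {
    captured.push(descriptor.code); // keep the raw WGSL text
    return original(descriptor);    // forward to the real device
  };
  return device;
}
```

The same trick applied to buffer creation and queue submission recovers the buffer contents and the dispatch order, which together are enough to reconstruct the architecture.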