WebGPU Shader Interception
for Browser LLMs

I intercepted a 3.8B parameter AI model running in your browser, captured every GPU operation, decoded its architecture, and rewrote the hot path in 10 hand-written WGSL kernels — 228 dispatches per token at ~40 tok/s on M2 Pro, 22% behind WebLLM's TVM-autotuned 85 kernels on identical weights.

85 → 10 — TVM shaders replaced by kernel roles
342 → 228 — dispatches per token
728 — GPU buffers mapped
12,962 → 3,078 — lines of WGSL
~40 tok/s — decode speed (M2 Pro)

How does an LLM run in a browser?

Large Language Models like Phi-3 generate text one token at a time (a token is roughly a word). For each token, the model runs 342 GPU compute operations ("dispatches") across 32 neural network layers. Each layer does matrix multiplication, attention computation, and activation functions — all running on your GPU via WebGPU.

Normally, a framework called TVM manages these operations. TVM's runtime (compiled to WASM) decides which GPU shader to run, writes the parameters, submits the work, and reads the result — 342 times per token.
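
At the API level, each of those operations is a short WebGPU command sequence. A minimal sketch of one dispatch — the WebGPU calls (`createCommandEncoder`, `beginComputePass`, `dispatchWorkgroups`, `queue.submit`) are the real API, but `runOneDispatch` and `DeviceLike` are illustrative names, not TVM's actual code, and pipeline/bind-group creation is omitted:

```typescript
// One GPU compute dispatch, roughly as TVM's runtime issues it through
// WebGPU: encode a compute pass, record the dispatch, submit immediately.
interface DeviceLike {
  createCommandEncoder(): any;
  queue: { submit(buffers: any[]): void };
}

function runOneDispatch(
  device: DeviceLike,
  pipeline: any,   // which compiled WGSL shader to run
  bindGroup: any,  // which buffers it reads/writes
  workgroups: number,
): void {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups);
  pass.end();
  device.queue.submit([encoder.finish()]); // 342 of these per token
}
```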

I intercepted TVM's GPU calls, captured every shader and buffer, decoded the architecture, then threw away the 85 TVM shaders and wrote 10 of my own (27 WGSL files, 3,078 LOC) that drive the GPU directly. Same weights, same math — 228 dispatches per token instead of 342, fully readable in one sitting, 22% slower than TVM's autotuned kernels on M2 Pro.

The 342 Dispatches

Every token the model generates requires exactly 342 GPU compute dispatches. Here's how they break down by role:

Matmul (int4 dequant)
Attention
Norm + Residual
Activation (RoPE, SiLU)
KV Cache
Sampling

85 Captured Shaders → 10 Kernel Roles

I intercepted every createShaderModule call during model load, capturing 85 TVM-generated WGSL compute shaders totaling 12,962 lines. Then I rewrote the hot path: 10 hand-written kernel roles across 27 WGSL files, 3,078 LOC total.
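
The capture itself is a thin wrapper installed before the framework ever sees the device. A minimal sketch of the idea — `wrapDevice` and `CapturedShader` are illustrative names, not the project's actual API, and the real hook also records buffers, pipelines, and dispatch parameters:

```typescript
// Capture every WGSL shader a framework compiles by wrapping the
// device's createShaderModule before handing the device over.
interface CapturedShader { index: number; lines: number; code: string }

interface ShaderDevice {
  createShaderModule(desc: { code: string }): unknown;
}

function wrapDevice(device: ShaderDevice, log: CapturedShader[]): ShaderDevice {
  const original = device.createShaderModule.bind(device);
  device.createShaderModule = (desc: { code: string }) => {
    log.push({
      index: log.length,
      lines: desc.code.split("\n").length,
      code: desc.code,
    });
    return original(desc); // forward to the real implementation unchanged
  };
  return device;
}
```

In the browser the same trick applies to the object returned by `adapter.requestDevice()`; TVM's WASM runtime never notices the wrapper.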

TVM's Per-Layer Pattern (10 dispatches × 32 layers)

Zero-TVM's 10 Kernel Roles

728 GPU Buffers

Every buffer referenced in the 342 dispatches, classified by purpose:

270 — Weight buffers — 3.8 GB
71 — Activation buffers — 424 KB
37 — KV Cache buffers — 19 MB
350 — Uniform buffers — 7.7 KB

Key Findings

Submit batching (337x fewer submits) doesn't help on Apple Silicon

I accumulated all 342 command buffers and submitted them as a single GPU submit — a 337x reduction. The output was correct (coherent English, proper EOS detection). But it was 30% slower than TVM's 1:1 submit pattern.

Why? Chrome's GPU driver on Apple M2 already pipelines submits. While the CPU prepares dispatch N+1, the GPU executes dispatch N. Batching breaks this pipeline — the GPU sits idle during the entire CPU preparation phase.

TVM:     CPU: [setup1][setup2][setup3]...  ← overlaps with GPU
         GPU:    [work1][work2][work3]...

Batched: CPU: [setup1][setup2]...[setup342]  ← GPU idle
         GPU:                               [work1][work2]...[work342]
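
The pipeline argument can be made concrete with a toy latency model: two stage times per dispatch, CPU encode and GPU execute. The per-stage numbers below are illustrative, not measurements.

```typescript
// Toy latency model for submit strategies. cpu = per-dispatch
// encode/setup time, gpu = per-dispatch execution time (arbitrary units).

// 1:1 submits: the GPU executes dispatch i while the CPU encodes
// dispatch i+1, so steady-state cost is the slower of the two stages.
function pipelinedMs(n: number, cpu: number, gpu: number): number {
  return cpu + gpu + (n - 1) * Math.max(cpu, gpu);
}

// One batched submit: the GPU cannot start until all n dispatches are
// encoded, so the two phases add up serially.
function batchedMs(n: number, cpu: number, gpu: number): number {
  return n * cpu + n * gpu;
}
```

With cpu = 2, gpu = 5, and n = 342, the batched run comes out about 40% slower than the pipelined one — the same shape as the measured 30% regression.
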

TVM already fuses elementwise operations

TVM's compiler isn't naive. It already fuses across elementwise boundaries:

fused_dequantize + NT_matmul     — int4 dequant + matmul in one shader
fuse_add_norm_decode             — residual add + RMSNorm in one shader
fused_split_silu_multiply        — gate split + SiLU + elementwise mul

It does NOT fuse across matmul boundaries. My fused shaders (FFN+SiLU, RMSNorm+Matmul) cross this boundary.
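
For reference, the semantics that fused_split_silu_multiply collapses into one shader, written out in TypeScript: split the gate/up projection output, apply SiLU to the gate half, multiply elementwise. This is a reference of the math only — the gate-half-first layout is an assumption, and the actual WGSL operates on f16 activations, not f32.

```typescript
// Reference semantics of a split + SiLU + multiply fusion: the FFN
// gate/up projection produces one concatenated vector; the fused
// kernel splits it in half, applies SiLU (x * sigmoid(x)) to the
// gate half, and multiplies elementwise with the up half.
const silu = (x: number): number => x / (1 + Math.exp(-x));

function splitSiluMultiply(gateUp: Float32Array): Float32Array {
  const half = gateUp.length / 2;
  const out = new Float32Array(half);
  for (let i = 0; i < half; i++) {
    out[i] = silu(gateUp[i]) * gateUp[half + i]; // silu(gate) * up
  }
  return out;
}
```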

GPU f16 parallel reduction is non-deterministic

Running the exact same TVM code twice with the same prompt produces different tokens after position 7-156 (depending on prompt length). The f16 parallel reduction in matmul shaders accumulates partial sums in a tree pattern — different GPU scheduling = different rounding = different result.

This is hardware-level non-determinism on Apple M2. It means any replay-based approach (including mine) can only match TVM's output up to this natural divergence point.
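
The effect is easy to reproduce on the CPU: floating-point addition is not associative, so each reduction order rounds differently. Here is an f32 demonstration using Math.fround, which rounds the running sum after every step the way a GPU rounds after every partial sum; f16 just diverges sooner.

```typescript
// Same three values, two reduction orders, two answers.
const f = Math.fround;

const leftToRight = f(f(1e8 + -1e8) + 1); // (a + b) + c  -> 1
const treeOrder   = f(1e8 + f(-1e8 + 1)); // a + (b + c)  -> 0
// -1e8 + 1 = -99999999 is not representable in f32; it rounds back
// to -1e8, so the tree order loses the +1 entirely.
```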

The memory bandwidth wall

At ~40 tok/s on M2 Pro (200 GB/s memory bandwidth), the model loads ~1.8 GB of weights per token:

1.8 GB / 200 GB/s = 9 ms per token = ~111 tok/s theoretical maximum

Zero-TVM at ~40 tok/s hits ~36% of that bandwidth ceiling; WebLLM at ~51 tok/s hits ~46%. The gap between us and WebLLM is compute efficiency (their matmul tiling, attention reduction, subgroup use) — not dispatch overhead, where we already come out ahead. Neither engine can change the bandwidth limit.
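
The arithmetic, spelled out as a sanity check (all numbers from the text):

```typescript
// Roofline check: if every token must stream the working set of
// weights from memory, bandwidth caps the decode rate regardless of
// compute speed.
function maxTokPerSec(weightsPerTokenGB: number, bandwidthGBs: number): number {
  return bandwidthGBs / weightsPerTokenGB; // (GB/s) / (GB/token) = tokens/s
}

const ceiling = maxTokPerSec(1.8, 200); // ~111 tok/s theoretical maximum
const zeroTvmShare = 40 / ceiling;      // ~0.36 of the bandwidth ceiling
const webLlmShare = 51 / ceiling;       // ~0.46
```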

The attention uniform struct (56 bytes, 14 fields)

The most complex uniform in the model, decoded from the WGSL batch_decode_paged_kv_kernel shader:

struct PODArgs {
  B: i32,                            // offset 0:  batch size (=1)
  k_rope_pos_offset_elem_offset: i32,// offset 4:  =0
  length_info_elem_offset: i32,      // offset 8:  =0
  max_num_pages: i32,                // offset 12: =257
  nnz_pages: i32,                    // offset 16: CHANGES at page boundaries
  page_indptr_elem_offset: i32,      // offset 20: =0
  page_values_elem_offset: i32,      // offset 24: =0
  pages_elem_offset: i32,            // offset 28: =0
  q_rope_position_elem_offset: i32,  // offset 32: =0
  rope_scale: f32,                   // offset 36: =1.0
  rope_theta: f32,                   // offset 40: =10000.0
  rotary_mode: i32,                  // offset 44: =0
  sm_scale: f32,                     // offset 48: =1/sqrt(96)
  packGridDimX: u32                  // offset 52: =1
}

This struct was reverse-engineered from the captured WGSL source and verified against TVM's runtime writes. All 14 fields mapped correctly.
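
Filling that struct from the host side is a 56-byte ArrayBuffer written little-endian, i32 everywhere except the three f32 fields. A sketch — `packPodArgs` is an illustrative name, the constant values are the ones from the table, and only nnz_pages varies:

```typescript
// Pack the 56-byte PODArgs uniform: 14 four-byte fields at fixed
// offsets. WebGPU buffer data is little-endian, hence `true` on
// every DataView write.
function packPodArgs(nnzPages: number): ArrayBuffer {
  const buf = new ArrayBuffer(56);
  const v = new DataView(buf);
  v.setInt32(0, 1, true);                    // B: batch size
  v.setInt32(4, 0, true);                    // k_rope_pos_offset_elem_offset
  v.setInt32(8, 0, true);                    // length_info_elem_offset
  v.setInt32(12, 257, true);                 // max_num_pages
  v.setInt32(16, nnzPages, true);            // nnz_pages (changes at page boundaries)
  v.setInt32(20, 0, true);                   // page_indptr_elem_offset
  v.setInt32(24, 0, true);                   // page_values_elem_offset
  v.setInt32(28, 0, true);                   // pages_elem_offset
  v.setInt32(32, 0, true);                   // q_rope_position_elem_offset
  v.setFloat32(36, 1.0, true);               // rope_scale
  v.setFloat32(40, 10000.0, true);           // rope_theta
  v.setInt32(44, 0, true);                   // rotary_mode
  v.setFloat32(48, 1 / Math.sqrt(96), true); // sm_scale
  v.setUint32(52, 1, true);                  // packGridDimX
  return buf;
}
```

A queue.writeBuffer of this buffer before the attention dispatch is all the "runtime" the struct needs.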

Try It Yourself

Chat with Phi-3-mini running on the 10 hand-written kernels — no TVM runtime, no compiler, no server.

Open Zero-TVM Chat

Chrome or Edge only. ~2 GB model download on first load.