Phi-3 Mini 4K (3.8B, Q4) running 100% in-browser via WebGPU. Reverse-engineered from TVM.
Msg 1: TVM generates + captures full dispatch graph. Msg 2+: replays captured pipelines with 6 per-token value updates. ~100 dispatches vs TVM's 342.
engine-v3.ts chat-v3.ts v3.html
Replaces 228 of 342 dispatches with our 11 hand-written WGSL shaders (post-M4 QKV+RoPE+KV-append fusion). Only attention + LM-head reduction fall back to TVM-style paths. Full prefill + decode.
compiler/ compiler-chat.html
Maps each step to named TVM pipelines (qkv, rope, attention...) with per-layer bind groups. Fused FFN enabled. Supports own prefill + decode via chat().
phi3.ts fast-chat.ts
Monkey-patches GPUDevice to record everything TVM does:
- createShaderModule calls (85 shaders)
- createComputePipeline calls (85 pipelines)
- createBindGroup calls (entries map)
- dispatchWorkgroups calls (342 decode, 343 prefill)
- writeBuffer calls (357 decode writes)
- copyBufferToBuffer for token readback
- State machine: loading → prefill_done → done
Separates prefill vs decode dispatches at the mapAsync boundary.
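The capture technique above can be sketched as a method wrapper: intercept a `GPUDevice` call, record its arguments, then forward to the real implementation. A minimal sketch — the `captured` record shape and function name are illustrative, not the engine's actual format:

```typescript
// Record every shader module TVM creates by wrapping the device method.
type ShaderRecord = { code: string; module: unknown };

const captured: ShaderRecord[] = [];

function patchCreateShaderModule(device: any): void {
  const original = device.createShaderModule.bind(device);
  device.createShaderModule = (desc: { code: string }) => {
    const module = original(desc);
    captured.push({ code: desc.code, module }); // keep WGSL source + handle
    return module;
  };
}
```

The same wrapper pattern extends to `createComputePipeline`, `createBindGroup`, `writeBuffer`, and `dispatchWorkgroups` to build the full capture.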
| Field | Count | Description |
|---|---|---|
| shaders | 85 | WGSL source code + module |
| pipelines | 85 | Entry point + pipeline object |
| dispatches | 342 | Decode: pipeline + bind group entries + workgroups |
| prefillDispatches | 343 | Prefill: same structure |
| writes | 357 | Decode buffer writes (buffer + offset + data) |
| copy | 1 | Token readback: src buffer + offset |
| weights | ~200 | Large buffer writes (>1KB) during load |
Compiler engine: 279 dispatches run on our own shaders + 63 on TVM's (2 per layer for KV append + attention); TVM's 19 sampling dispatches are skipped entirely. The fused FFN saves 32 dispatches.
| Step | Index | Kernel | Owner | Workgroups | Description |
|---|---|---|---|---|---|
| 0 | base+0 | int4_matmul | Ours | 9216 | QKV projection: D=3072 → 3*D=9216 (Q,K,V concatenated) |
| 1 | base+1 | rope_kernel | Ours | 36 | Rotary position embedding on Q,K |
| 2 | base+2 | kv_cache_transpose_append | TVM | varies | Append K,V to paged cache (transpose layout) |
| 3 | base+3 | batch_decode_paged_kv | TVM | varies | Paged KV attention with sliding window |
| 4 | base+4 | int4_matmul | Ours | 3072 | Output projection: head_dim*heads → D |
| 5 | base+5 | add_norm | Ours | 1 | Residual add + RMSNorm (pre-FFN) |
| 6 | base+6 | fused_ffn_kernel | Fused | 8192 | Gate+Up int4 matmul + SiLU in ONE dispatch |
| 7 | base+7 | (skipped) | - | - | Replaced by the fused FFN above |
| 8 | base+8 | int4_matmul | Ours | 3072 | FFN down projection: 8192 → 3072 |
| 9 | base+9 | add_norm | Ours | 1 | Residual add + RMSNorm (post-FFN) |
Dequantizes int4 weights and multiplies in a single pass. 64 threads per workgroup, 6 chunks, tree reduction. Zero-point 7, one scale per 32-element group, f16 accumulation.
D_PACKED = D/8 packed u32s per row
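A CPU reference for the dequantization scheme the shader implements — each u32 packs 8 nibbles, weight = (nibble − 7) × scale, one scale per 32-element group (reading "32-group scales" as group size 32; the shader's exact layout may differ):

```typescript
// Dequantize one int4-packed row to f32 for reference.
function dequantRow(packed: Uint32Array, scales: Float32Array): Float32Array {
  const out = new Float32Array(packed.length * 8);
  for (let i = 0; i < packed.length; i++) {
    const word = packed[i];
    for (let n = 0; n < 8; n++) {
      const q = (word >>> (4 * n)) & 0xf;    // 4-bit value, 0..15
      const idx = i * 8 + n;
      out[idx] = (q - 7) * scales[idx >> 5]; // zero-point 7, group of 32
    }
  }
  return out;
}
```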
Same as int4_matmul but accumulates in f32. Used for LM head (32064 outputs) where f16 precision causes sampling errors.
Fuses gate+up matmul + SiLU into ONE dispatch. Dual dot product (gate row i, up row i+8192), shared memory input cache (3072 f16 = 6KB), f32 sigmoid.
Replaces 2 TVM dispatches per layer with 1, saving 32 dispatches/token across the 32 layers.
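Per output element, the fused kernel computes SiLU(gate) × up, with gate and up taken from the two halves of the concatenated projection. A plain TS sketch of the math only (the real kernel does the int4 matmul and this activation in one WGSL dispatch):

```typescript
// SiLU activation: x * sigmoid(x), computed with f32 sigmoid as in the shader.
function silu(x: number): number {
  return x / (1 + Math.exp(-x));
}

// Combine the concatenated gate/up output: gate rows i, up rows i + half.
function fusedGateUp(gateUp: Float32Array): Float32Array {
  const half = gateUp.length / 2;
  const out = new Float32Array(half);
  for (let i = 0; i < half; i++) {
    out[i] = silu(gateUp[i]) * gateUp[i + half];
  }
  return out;
}
```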
RMSNorm: x * rsqrt(mean(x^2) + eps) * weight. Single workgroup, D=3072 elements. Used for initial norm.
Fused residual add + RMSNorm. y = norm(residual + x) * weight. Saves a dispatch vs separate add+norm.
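The fused add_norm step can be written out as a reference computation, matching the RMSNorm formula above with the residual add folded in (eps value is an assumption):

```typescript
// y = RMSNorm(residual + x) * weight, fused into one pass.
function addNorm(residual: Float32Array, x: Float32Array,
                 weight: Float32Array, eps = 1e-5): Float32Array {
  const d = x.length;
  const sum = new Float32Array(d);
  let meanSq = 0;
  for (let i = 0; i < d; i++) {
    sum[i] = residual[i] + x[i];           // fused residual add
    meanSq += sum[i] * sum[i];
  }
  meanSq /= d;
  const inv = 1 / Math.sqrt(meanSq + eps); // rsqrt(mean(x^2) + eps)
  const out = new Float32Array(d);
  for (let i = 0; i < d; i++) out[i] = sum[i] * inv * weight[i];
  return out;
}
```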
Rotary Position Embedding. Applies cos/sin rotation to Q,K heads based on position. 36 workgroups for 32+4 heads.
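For one (even, odd) dimension pair within a head, the rotation is a standard 2D rotation by a position-dependent angle. A sketch assuming the usual RoPE base of 10000 and pairwise layout — the kernel's exact dimension pairing may differ:

```typescript
// Rotate one RoPE pair at a given position.
function ropePair(x0: number, x1: number, pos: number,
                  pairIdx: number, headDim: number): [number, number] {
  const theta = pos * Math.pow(10000, -2 * pairIdx / headDim);
  const c = Math.cos(theta), s = Math.sin(theta);
  return [x0 * c - x1 * s, x0 * s + x1 * c];
}
```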
Token embedding lookup from dequantized int4 embedding table. Maps token ID → D=3072 hidden state.
Simple argmax over 32064 logits. Replaces TVM's 19-dispatch sampling pipeline (top-p, temperature, etc.) with greedy decoding.
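Greedy decoding reduces sampling to a single pass over the logits. The equivalent in TS:

```typescript
// Pick the highest-logit token ID (greedy decoding).
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```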
Multi-head attention with paged KV cache. (Compiler engine only — main engines use TVM's optimized kernel.)
Append K,V to paged cache. (Compiler engine only — main engines use TVM's transpose-append kernel.)
| Write | Buffer | Value |
|---|---|---|
| w[0] | token_id | Current token ID (u32) |
| w[4] | position_map | Position for RoPE (u32) |
| w[8] | seq_counter | Sequence index (u32) |
| w[11] | q_rope_position | RoPE position (u32) |
| w[12] | length_info | Sequence length (u32) |
| w[349] | seed | Sampling RNG seed (f32) |
Plus: nnz_pages at offset 16 in 32 attention uniform buffers
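The replay loop only has to refresh these few uniforms before re-submitting the captured dispatches each token. A sketch of that update step — the interface shapes and helper names are illustrative, not the engine's API:

```typescript
// One recorded write target: which buffer, at which byte offset.
interface WriteTarget { buffer: unknown; offset: number }

// Minimal queue interface so the sketch stays self-contained.
interface QueueLike { writeBuffer(buf: unknown, offset: number, data: Uint32Array): void }

// Refresh the per-token u32 values before replaying the dispatch graph.
function updatePerTokenValues(queue: QueueLike, w: WriteTarget[],
                              tokenId: number, pos: number, seqLen: number): void {
  const u32 = (v: number) => new Uint32Array([v]);
  queue.writeBuffer(w[0].buffer, w[0].offset, u32(tokenId)); // token_id
  queue.writeBuffer(w[1].buffer, w[1].offset, u32(pos));     // position_map
  queue.writeBuffer(w[2].buffer, w[2].offset, u32(pos));     // q_rope_position
  queue.writeBuffer(w[3].buffer, w[3].offset, u32(seqLen));  // length_info
}
```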
3 scratch buffers cycle through layers:
| Buffer | Size | Role |
|---|---|---|
| BUF#730 | 6KB | Hidden state (3072 f16) |
| BUF#731 | 32KB | FFN intermediate (16384 f16) |
| BUF#732 | 18KB | QKV output (9216 f16) |
Same buffers reused across all 32 layers. Weight buffers are per-layer (~50MB total for Q4).
| File | Engine | Description |
|---|---|---|
| index.html | Landing | Project overview, stats, shader catalog, compare table |
| zero-tvm.html | Zero-TVM | Chat demo on 10 hand-written WGSL kernels (no TVM) |
| validate.html | Zero-TVM | Multi-prompt smoke test driving engine-core.ts |
| webllm-bench.html | WebLLM | Head-to-head harness against identical weights |
| compiler-chat.html | Compiler | TVM capture → replay via own shaders (279/342) |
| demo.html | Demo | Interactive dispatch-timeline visualization |
| dump.html | Analysis | Full TVM architecture dump |
| shaders.html | Analysis | Browse all 85 captured TVM shader sources |
| docs.html | Docs | Shader catalog, URL flags, benchmarks |