In February 2026, Hugging Face shipped Transformers.js v4 — a C++ WebGPU runtime built with Microsoft's ONNX Runtime team — as the production answer for browser LLM inference. In April 2026, this repo shows that for Phi-3 Mini specifically, the answer is 10 kernel roles across 27 WGSL files and ~2,000 lines of TypeScript. No WebLLM, no TVM, no ONNX, no WASM runtime.
Every step from token to token — weights, tokenizer, attention, sampling — runs in the browser with no external calls during inference.
Fetches MLC Q4F16 shards from HuggingFace (or browser cache from a prior WebLLM session). Uploads directly to GPU buffers.
Hand-written BPE tokenizer reads tokenizer.json. Applies Metaspace pre-tokenization and Phi-3 chat template — no tokenizer WASM.
32 transformer layers on the GPU. Embedding → RMSNorm → (QKV + RoPE + KV-append, fused) → Paged Attention → Fused FFN → LM Head. All WGSL, 228 dispatches per token.
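The 228-dispatches-per-token figure can be sanity-checked with simple arithmetic. The per-layer split below is our assumption, read off the kernel roles (QKV+RoPE+KV-append, attention, O-projection, add+norm, fused gate/up FFN, FFN-down, add+norm):

```typescript
// Back-of-the-envelope check of the dispatch counts. The exact split is an
// assumption consistent with the kernel list, not instrumented output.
function dispatchesPerToken(layers: number, kvInt8 = false): number {
  const perLayer = 7;                  // assumed per-layer split, see above
  const globals = 4;                   // embedding, final norm, LM head, argmax
  const kvExtra = kvInt8 ? layers : 0; // int8 KV adds one quantize pass per layer (assumption)
  return layers * perLayer + globals + kvExtra;
}
// dispatchesPerToken(32) gives 228; the int8 KV path adds 32 more, giving 260
```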
GPU argmax over 32,064 logits selects the next token. BPE decode maps IDs back to text. Streams to the UI incrementally.
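A CPU reference for that argmax step, sketched as the same tree reduction a workgroup would perform in shared memory (the helper name is ours, not from the repo's argmax kernel):

```typescript
// Tree-reduction argmax over (value, index) pairs, mirroring a
// workgroup-shared-memory reduction: pair up elements level by level.
function argmaxTree(logits: Float32Array): number {
  let pairs = Array.from(logits, (v, i) => [v, i] as [number, number]);
  while (pairs.length > 1) {
    const next: [number, number][] = [];
    for (let i = 0; i < pairs.length; i += 2) {
      const a = pairs[i], b = pairs[i + 1] ?? a; // odd tail pairs with itself
      next.push(b[0] > a[0] ? b : a);            // keep larger; ties keep lower index
    }
    pairs = next;
  }
  return pairs[0][1];
}
```

On the GPU the same shape runs in log2(workgroup size) steps; the subgroup variants replace the shared-memory levels with one hardware reduction.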
Every compute step is a readable WGSL file you can open and understand. No auto-generated code. No compiled blobs. The extra files are A/B variants — subgroup-reduced kernels, tiled matmuls, and an opt-in int8 KV cache path.
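The A/B variants are selected by URL query flags. A minimal sketch of reading them, using the two flag names documented in the shader catalog (the helper itself is ours, not the repo's actual code):

```typescript
// Reads the opt-in kernel flags from the page URL. ?ffnsg=1 selects the
// tiled-subgroup FFN kernel, ?kv8=1 the int8 KV cache path.
function readKernelFlags(search: string): { ffnsg: boolean; kv8: boolean } {
  const params = new URLSearchParams(search);
  return {
    ffnsg: params.get("ffnsg") === "1", // tiled-subgroup FFN variant
    kv8: params.get("kv8") === "1",     // int8 KV cache path
  };
}
```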
- `embedding.wgsl`: Token ID lookup with Q4F16 dequantization. Group size 32, int4 packed as uint32.
- `rms_norm.wgsl`: Root-mean-square layer normalization. Tree-reduce in workgroup shared memory.
- `qkv_fused.wgsl`: Q/K/V projection, rotary embeddings, and paged-KV write in a single dispatch per layer. TVM ships these as three separate kernels.
- `attention.wgsl`: Multi-head attention over the paged KV cache. Online softmax, scaled dot-product, page-table read folded in.
- `int4_matmul*.wgsl`: Q4F16 output projection from attention. The same kernel family powers the FFN-down and LM-head projections. Eight tile/subgroup variants for GPU portability.
- `add_norm.wgsl`: Residual add and normalize in one pass. Ping-pong buffers avoid GPU aliasing.
- Fused FFN: gate and up projections with the SiLU activation fused in. Q4F16, 8192-dim intermediate. A tiled-subgroup variant is available via `?ffnsg=1`.
- `argmax.wgsl` + `int4_matmul.wgsl`: Projects the hidden state to 32,064 vocab logits (int4 matmul), then GPU argmax selects the next token. No CPU roundtrip.
- `*_sg.wgsl` (6 files): Subgroup-reduced versions of attention, QKV, argmax, and matmul. Same math, with one hardware-accelerated reduction step instead of workgroup shared memory.
- Int8 KV cache: per-page int8 quantization of K and V, with an attention kernel that dequantizes on read. Halves KV memory at a small accuracy cost. Gate with `?kv8=1`.
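Several of these kernels share the Q4F16 weight layout: eight 4-bit values per uint32, one f16 scale per group of 32 weights. A CPU reference of the dequantization, assuming the MLC q4f16_1 convention of a fixed zero-point of 7 (check the repo's kernels for the exact convention):

```typescript
// Reference (CPU) dequantizer for the Q4F16 layout. Each uint32 packs eight
// 4-bit weights; scales are per group of 32 (f16 on the GPU, f32 here).
// The fixed zero-point of 7 is an assumption from MLC's q4f16_1 format.
function dequantQ4(packed: Uint32Array, scales: Float32Array, groupSize = 32): Float32Array {
  const out = new Float32Array(packed.length * 8);
  for (let i = 0; i < packed.length; i++) {
    const word = packed[i];
    for (let j = 0; j < 8; j++) {
      const q = (word >>> (4 * j)) & 0xf;             // extract nibble j
      const idx = i * 8 + j;
      const scale = scales[Math.floor(idx / groupSize)];
      out[idx] = (q - 7) * scale;                      // zero-point 7, per-group scale
    }
  }
  return out;
}
```

The WGSL kernels fuse this unpacking into the matmul inner loop rather than materializing dequantized weights.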
Similar goal, very different philosophy. We trade peak speed for full transparency and zero dependencies.
| | Zero-TVM (ours) | TVM / WebLLM | Transformers.js v4 + ORT-WebGPU |
|---|---|---|---|
| Unique WGSL kernels | 10 roles / 27 files | 85 captured | ORT-generated (not user-visible) |
| WGSL LOC | 3,078 hand-written (incl. A/B variants) | 12,962 (compiler-generated) | 0 (runtime-generated) |
| Dispatches / decode token | 228 (f16 KV) / 260 (int8 KV) | 342 (measured) | not measured here |
| Runtime language | TypeScript | Apache TVM WASM | C++ via ONNX Runtime |
| JS bundle (chat page, excl. weights) | 157 kB / 33 kB gz | 5.9 MB / 2.1 MB gz | ~10 MB JS + native |
| Decode speed (Phi-3, M2 Pro) | ~40 tok/s | ~51 tok/s | not measured here |
| Model scope | Phi-3 Mini Q4F16_1 only | ~20 MLC-compiled models | ~200 architectures |
| Readable source | ✓ 27 WGSL files | ✗ Compiled blob | ✗ Generated ops |
| Designed for | "Readable in one sitting" | General production | General production |
Transformers.js v4 is the right choice for most production work. Zero-TVM is not trying to replace it. The point of the contrast: when a stack can assume a single model and precision, most of the compiler and runtime surface is optional.
It is a minimal-surface-area demonstration that, for one fixed model shape, the compiler and runtime machinery is not required for end-to-end browser LLM inference.
The repo is organized so each HTML file demonstrates one claim. Open any of them to inspect the stack at a different layer.
- `/zero-tvm.html`: The result. Phi-3 Mini decoding on 10 hand-written WGSL kernels; WebLLM is never touched.
- `/webllm-bench.html`: Head-to-head against the same Phi-3 Q4F16 weights. Measured ~51 vs. ~40 tok/s on an M2 Pro.
- Smoke test: multi-prompt run driving engine-core.ts end-to-end, with no DOM dependencies.
- `/compiler-chat.html`: The intermediate milestone. Captures TVM dispatches and replays 279 of 342 through our shaders.
- `/demo.html`: Interactive timeline of every GPU dispatch (342 for WebLLM, 228 for Zero-TVM).
- `/architecture.html`: Decode pipeline diagram, buffer layout, KV-cache paging, entry-point map.
- `/dump.html`: Pipeline catalog, dispatch DAG, uniform decoding, and buffer flow; a full reverse-engineering trace of WebLLM's 342-dispatch decode loop.
- `/shaders.html`: Raw WGSL source for all 85 of TVM's generated kernels, captured live from a running WebLLM session. 12,962 lines total.
- `/docs.html`: Shader catalog, URL flags, BPE tokenizer, weight loader, benchmarks reference.
- `github.com/abgnydn/zero-tvm`: 27 WGSL files and ~2,000 lines of TypeScript. Read it in one sitting.

Phi-3 Mini runs entirely on your GPU. Your conversation never leaves your browser.