⚡ WebGPU · No Framework · No TVM

Run Phi-3 in your browser
with 10 kernels / 27 WGSL files

In February 2026, Hugging Face shipped Transformers.js v4 — a C++ WebGPU runtime built with Microsoft's ONNX Runtime team — as the production answer for browser LLM inference. In April 2026, this repo shows that for Phi-3 Mini specifically, the answer is 10 kernel roles across 27 WGSL files and ~2,000 lines of TypeScript. No WebLLM, no TVM, no ONNX, no WASM runtime.

Open Chat → GitHub ↗
~40
tok/s on M2 Pro
10 / 27
kernel roles / WGSL files
3.8B
parameter model
0
ML framework deps
2.1 GB
weights, Q4F16
zero-tvm · phi-3-mini · browser
// No WebLLM. No TVM. No WASM runtime.
// 10 kernel roles · 27 WGSL files · raw WebGPU.

[local] ndarray-cache.json
[cache] params_shard_0.bin 49.3 MB
...
✓ All weights loaded · 2.1 GB · GPU ready

YOU What is the capital of Australia?

PHI-3 The capital of Australia is Canberra.
It was selected as the capital in 1908 as a
compromise between rivals Sydney and Melbourne.

88 tokens · ~40 tok/s (M2 Pro) · 100% local · zero server

How it works

Every step from prompt to streamed token — weights, tokenizer, attention, sampling — runs in the browser with no external calls during inference.

STEP 01
📦

Load weights

Fetches MLC Q4F16 shards from Hugging Face (or from the browser cache of a prior WebLLM session). Uploads directly to GPU buffers.
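The shard manifest can be navigated with a few lines of TypeScript. A minimal sketch, assuming field names that follow MLC's ndarray-cache.json layout (dataPath, byteOffset, nbytes) — treat them as assumptions, not this repo's exact API:

```typescript
// Locate one named tensor inside the MLC shard manifest so its byte
// range can be fetched and written straight into a GPU buffer.
interface TensorRecord { name: string; byteOffset: number; nbytes: number; }
interface ShardRecord { dataPath: string; records: TensorRecord[]; }

function locateTensor(
  shards: ShardRecord[],
  name: string,
): { dataPath: string; byteOffset: number; nbytes: number } | null {
  for (const shard of shards) {
    for (const t of shard.records) {
      if (t.name === name) {
        return { dataPath: shard.dataPath, byteOffset: t.byteOffset, nbytes: t.nbytes };
      }
    }
  }
  return null;
}
```

Once located, the range goes to the GPU with a single `device.queue.writeBuffer` call — no intermediate framework tensor.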

STEP 02
🔤

Tokenize

Hand-written BPE tokenizer reads tokenizer.json. Applies Metaspace pre-tokenization and Phi-3 chat template — no tokenizer WASM.
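The two pre-steps the card names can be sketched in plain TypeScript. The Metaspace rule (replace " " with "▁", prepend one) matches the Hugging Face tokenizers default, and the chat template string follows Phi-3's published format; both are assumptions about this repo's exact settings:

```typescript
// Metaspace pre-tokenization: spaces become U+2581 ("▁"), with one
// prepended, before the BPE merges run.
const META = "\u2581";

function metaspace(text: string): string {
  return META + text.replace(/ /g, META);
}

// Phi-3 chat template for a single user turn (hypothetical helper name).
function phi3ChatTemplate(userMsg: string): string {
  return `<|user|>\n${userMsg}<|end|>\n<|assistant|>\n`;
}
```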

STEP 03

GPU forward pass

32 transformer layers on the GPU. Embedding → RMSNorm → (QKV + RoPE + KV-append, fused) → Paged Attention → Fused FFN → LM Head. All WGSL, 228 dispatches per token.

STEP 04
🎯

Argmax + decode

GPU argmax over 32,064 logits selects the next token. BPE decode maps IDs back to text. Streams to the UI incrementally.
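What the argmax shader computes has a one-loop CPU reference — a sketch, not the WGSL itself:

```typescript
// CPU reference for the GPU argmax: index of the maximum logit,
// ties broken by the lower index.
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```

On the GPU the same result comes from a parallel reduction, so only the winning token ID crosses back to JavaScript.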

💬
Prompt
text
🔤
BPE Tokenizer
JS · no WASM
🔢
Embedding
shader 1
🔁
× 32 Layers
6 fused kernels
🎯
LM Head + Argmax
int4 + reduce
💬
Token
streamed

10 kernel roles. 27 WGSL files.

Every compute step is a readable WGSL file you can open and understand. No auto-generated code. No compiled blobs. The extra files are A/B variants — subgroup-reduced kernels, tiled matmuls, and an opt-in int8 KV cache path.

Q4F16 dequant f16 arithmetic paged KV cache RoPE RMSNorm SiLU
01 🔤

Embedding

Token ID lookup with Q4F16 dequantization. Group size 32, int4 packed as uint32.

embedding.wgsl
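The packing the card describes — 32 int4 weights per group, 8 nibbles per uint32 — has a short CPU reference. The zero-point of 8 is an assumption about the repo's Q4F16 scheme, not a confirmed detail:

```typescript
// Reference dequant for one group of 32 int4 weights packed into four
// uint32 words (8 nibbles per word), one f16 scale per group.
function dequantGroup(packed: Uint32Array, scale: number): Float32Array {
  const out = new Float32Array(packed.length * 8);
  for (let w = 0; w < packed.length; w++) {
    for (let n = 0; n < 8; n++) {
      const q = (packed[w] >>> (n * 4)) & 0xf; // extract one 4-bit value
      out[w * 8 + n] = (q - 8) * scale;        // assumed zero-point of 8
    }
  }
  return out;
}
```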
02 📐

RMSNorm

Root mean square layer normalization. Tree-reduce in workgroup shared memory.

rms_norm.wgsl
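The math the shader tree-reduces is small enough to state as a CPU reference; the eps of 1e-5 matches Phi-3's published config but is an assumption about this repo:

```typescript
// RMSNorm reference: out = x * weight / sqrt(mean(x^2) + eps).
function rmsNorm(x: Float32Array, weight: Float32Array, eps = 1e-5): Float32Array {
  let ss = 0;
  for (let i = 0; i < x.length; i++) ss += x[i] * x[i];
  const inv = 1 / Math.sqrt(ss / x.length + eps); // the reduced quantity
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) out[i] = x[i] * inv * weight[i];
  return out;
}
```

The WGSL version computes the same `ss` with a workgroup tree-reduce instead of a serial loop.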
03 🌀

QKV + RoPE + KV-append (fused)

Q/K/V projection, rotary embeddings, and paged-KV write in a single dispatch per layer. TVM ships these as three separate kernels.

qkv_fused.wgsl
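The rotary step inside the fused kernel can be sketched for one head. The theta base of 10000 is standard for Phi-3; the (i, i + d/2) pairing convention is an assumption about the shader's layout:

```typescript
// Rotary position embedding for one head vector at position `pos`.
function ropeRotate(q: Float32Array, pos: number, base = 10000): Float32Array {
  const d = q.length;
  const half = d / 2;
  const out = new Float32Array(d);
  for (let i = 0; i < half; i++) {
    const theta = pos / Math.pow(base, (2 * i) / d);
    const c = Math.cos(theta), s = Math.sin(theta);
    out[i] = q[i] * c - q[i + half] * s;        // rotate the (i, i+half) pair
    out[i + half] = q[i] * s + q[i + half] * c;
  }
  return out;
}
```

Fusing this with the Q/K/V matmuls and the paged-KV write means the rotated K goes straight into its cache page, never touching an intermediate buffer.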
04 👁️

Paged Attention

Multi-head attention over the paged KV cache. Online softmax, scaled dot-product, page-table read folded in.

attention.wgsl
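Online softmax is the trick that lets the kernel stream over KV pages in one pass. A scalar CPU reference of the accumulation (a sketch of the technique, not the shader):

```typescript
// One-pass softmax-weighted sum: rescale the running numerator and
// denominator whenever a new running max appears, so no second pass
// over the scores is needed.
function onlineSoftmaxWeightedSum(scores: number[], values: number[]): number {
  let m = -Infinity, denom = 0, acc = 0;
  for (let i = 0; i < scores.length; i++) {
    const mNew = Math.max(m, scores[i]);
    const rescale = m === -Infinity ? 0 : Math.exp(m - mNew);
    const p = Math.exp(scores[i] - mNew);
    denom = denom * rescale + p;
    acc = acc * rescale + p * values[i];
    m = mNew;
  }
  return acc / denom;
}
```

In the real kernel `values[i]` is a V row and the page-table lookup decides which KV page each `i` reads from.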
05 ✖️

Int4 Matmul (O-proj)

Q4F16 output projection from attention. Same kernel family powers FFN-down and LM-head. Eight tile/subgroup variants for GPU portability.

int4_matmul*.wgsl
06

Add + RMSNorm (fused)

Residual add and normalize in one pass. Ping-pong buffers avoid read/write aliasing hazards on the GPU.

add_norm.wgsl
07

Fused FFN (Gate · Up · SiLU)

Gate and up projections with SiLU activation fused. Q4F16, 8192-dim intermediate. Tiled-subgroup variant available via ?ffnsg=1.

fused_ffn.wgsl
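The elementwise heart of the fusion is `silu(gate) * up`. A CPU reference of just that step (the shader also computes both Q4F16 matmuls in the same dispatch):

```typescript
// Gated SiLU activation: out = silu(gate) * up, silu(x) = x * sigmoid(x).
function siluGate(gate: Float32Array, up: Float32Array): Float32Array {
  const out = new Float32Array(gate.length);
  for (let i = 0; i < gate.length; i++) {
    const sig = 1 / (1 + Math.exp(-gate[i])); // sigmoid
    out[i] = gate[i] * sig * up[i];
  }
  return out;
}
```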
08 🎯

LM Head + Argmax

Projects hidden state to 32,064 vocab logits (int4 matmul), then GPU argmax selects the next token. No CPU roundtrip.

argmax.wgsl + int4_matmul.wgsl
09 🧪

Subgroup variants

Subgroup-reduced versions of attention, QKV, argmax, and matmul — same math, one hardware-accelerated reduction step instead of workgroup shared memory.

*_sg.wgsl (6 files)
10 💾

Int8 KV cache (opt-in)

Per-page int8 quantization of K and V with an attention kernel that dequantizes on read. Halves KV memory at a small accuracy cost. Gate with ?kv8=1.

kv_quantize_int8 + attention_int8
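The per-page scheme above can be sketched as a round-trip. The symmetric absmax scale (one per page) is an assumption about the shader; the halved memory and bounded rounding error follow directly from it:

```typescript
// Per-page symmetric int8 quantization: one scale per KV page,
// chosen from the page's max |value|; dequantized on read.
function quantizePage(vals: Float32Array): { q: Int8Array; scale: number } {
  let absMax = 0;
  for (const v of vals) absMax = Math.max(absMax, Math.abs(v));
  const scale = absMax / 127 || 1; // guard against an all-zero page
  const q = new Int8Array(vals.length);
  for (let i = 0; i < vals.length; i++) q[i] = Math.round(vals[i] / scale);
  return { q, scale };
}

function dequantPage(q: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = q[i] * scale;
  return out;
}
```

Each element round-trips to within half a quantization step, which is the "small accuracy cost" the card refers to.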

vs WebLLM · vs Transformers.js v4

Similar goal, very different philosophy. We trade peak speed for full transparency and zero dependencies.

                                     | Zero TVM (ours)                         | TVM / WebLLM                | Transformers.js v4 + ORT-WebGPU
Unique WGSL kernels                  | 10 roles / 27 files                     | 85 captured                 | ORT-generated (not user-visible)
WGSL LOC                             | 3,078 hand-written (incl. A/B variants) | 12,962 (compiler-generated) | 0 (runtime-generated)
Dispatches / decode token            | 228 (f16 KV) / 260 (int8 KV)            | 342 (measured)              | not measured here
Runtime language                     | TypeScript                              | Apache TVM WASM             | C++ via ONNX Runtime
JS bundle (chat page, excl. weights) | 157 kB / 33 kB gz                       | 5.9 MB / 2.1 MB gz          | ~10 MB JS + native
Decode speed (Phi-3, M2 Pro)         | ~40 tok/s                               | ~51 tok/s                   | not measured here
Model scope                          | Phi-3 Mini Q4F16_1 only                 | ~20 MLC-compiled models     | ~200 architectures
Readable source                      | ✓ 27 WGSL files                         | ✗ Compiled blob             | ✗ Generated ops
Designed for                         | "Readable in one sitting"               | General production          | General production

Transformers.js v4 is the right choice for most production work. Zero-TVM is not trying to replace it. The point of the contrast: when a stack can assume a single model and precision, most of the compiler and runtime surface is optional.

What this repo is not

  • Not a compiler. Nothing here emits WGSL — the 27 shader files are human-written.
  • Not cross-architecture. Hard-coded to Phi-3 Mini with Q4F16_1 weights. No Llama / Qwen / Gemma path without shader work.
  • Not cross-runtime. Browser WebGPU only. No Node / Bun / Deno target.
  • Not a production replacement for Transformers.js v4. It does fewer things, on purpose.
  • Not faster than WebLLM on Apple Silicon. ~22% behind on M2 Pro (~40 vs ~51 tok/s). Closing that gap is work-in-progress, not a shipped claim.

It is a minimal-surface-area demonstration that, for one fixed model shape, compiler and runtime complexity are not required for end-to-end browser LLM inference.


Every page is a milestone of the argument.

The repo is organized so each HTML file demonstrates one claim. Open any of them to inspect the stack at a different layer.



Open the chat. No install. No account.

Phi-3-mini runs entirely on your GPU. Your conversation never leaves your browser.
