On March 31, 2026, PrismML emerged from stealth to announce 1-bit Bonsai — a family of open-weight language models the company calls the first commercially viable 1-bit LLMs. The 8B flagship fits in just 1.15 GB of memory, runs 8× faster than a standard FP16 8B model, and matches its benchmark performance. Backed by $16.25M from Khosla Ventures and built on years of mathematical research at Caltech, Bonsai makes a credible claim to being a turning point for efficient AI.
In a standard language model, each weight is stored as a 16-bit or 32-bit floating-point number. In Bonsai, every weight is a single binary value — 0 or 1 — mapped to −scale or +scale, where each group of 128 weights shares one FP16 scale factor. The effective storage is just 1.125 bits per weight. Critically, this is not post-training quantization applied to an existing full-precision model; it is a native 1-bit architecture trained from scratch on Google TPU v4 hardware.
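The 1.125-bits-per-weight figure follows directly from the grouping scheme: one sign bit per weight plus 16 bits of FP16 scale shared across 128 weights. A minimal sketch of how such grouped binary weights could be packed and decoded (the function names and layout here are illustrative assumptions, not PrismML's actual on-disk format):

```python
import numpy as np

GROUP = 128  # weights sharing one FP16 scale, per the article

def pack_weights(w):
    """Pack full-precision weights into sign bits plus one FP16 scale per group."""
    w = w.reshape(-1, GROUP)
    scales = np.abs(w).mean(axis=1).astype(np.float16)  # one scale per 128 weights
    bits = (w >= 0).astype(np.uint8)                    # keep only the sign
    return np.packbits(bits, axis=1), scales            # 128 bits -> 16 bytes

def unpack_weights(packed, scales):
    """Decode back to the ±scale values used at inference time."""
    bits = np.unpackbits(packed, axis=1)[:, :GROUP]
    signs = bits.astype(np.float32) * 2 - 1             # {0, 1} -> {-1, +1}
    return signs * scales[:, None].astype(np.float32)

# Storage check: 16 bytes of sign bits + 2 bytes of scale per 128 weights
bits_per_weight = (16 * 8 + 16) / GROUP  # = 1.125, matching the article
```

Only the signs survive packing; the per-group scale preserves the overall magnitude of each block of weights.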
The practical consequence for inference is significant: the matrix multiplications that dominate transformer compute collapse into simple additions, since multiplying by ±1 requires no FPU. This is why the energy and speed gains are so large. PrismML CEO Babak Hassibi, a Caltech professor, put it plainly: “We spent years developing the mathematical theory required to compress a neural network without losing its reasoning capabilities.”
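To make the multiply-free claim concrete, here is a sketch of a binary matrix-vector product: where a weight's sign bit is 1 the activation is added, where it is 0 the activation is subtracted, and a single scale is applied afterward. For simplicity this uses one scale per output row rather than per 128-weight group, and it is an illustration of the idea, not PrismML's kernel:

```python
import numpy as np

def binary_matvec(sign_bits, scales, x):
    """Matvec with ±scale weights: activations are added or subtracted
    according to the sign bit; no per-weight multiplications occur."""
    # sign_bits: (out_features, in_features) in {0, 1}; scales: (out_features,)
    acc = np.where(sign_bits == 1, x, -x).sum(axis=1)  # adds/subtracts only
    return acc * scales                                # one multiply per row

# Sanity check against the equivalent full-precision multiply
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(4, 8))
scales = rng.random(4).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
w = (bits * 2 - 1) * scales[:, None]  # reconstructed ±scale weight matrix
assert np.allclose(binary_matvec(bits, scales, x), w @ x, atol=1e-5)
```

On real hardware the additions are further vectorized over packed bits, which is where the large speed and energy gains come from.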
PrismML released three models simultaneously, all under the Apache 2.0 license, available on Hugging Face in GGUF and MLX formats for immediate use with llama.cpp, Ollama, and Apple Silicon.
The headline claim is a ~14× reduction in memory footprint with no meaningful accuracy regression compared to full-precision 8B models. PrismML measures this with a custom metric they call Intelligence Density — performance per gigabyte — where Bonsai 8B scores 1.06/GB compared to 0.10/GB for Qwen3 8B.
Beyond memory, the energy story is compelling. On an RTX 4090, 1-bit Bonsai 8B consumes 0.276 mWh per token compared to 1.134 mWh for a standard 8B 16-bit model — a 4× reduction. On an M4 Pro, the absolute figures shrink but the relative savings grow to roughly 5.6× (0.074 vs. 0.415 mWh/token). The 1.7B model is the only variant tested on an iPhone 17 Pro Max, since the full 8B 16-bit model does not fit in phone memory at all.
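The relative savings can be checked directly from the quoted figures; the per-device factors below are arithmetic on the article's numbers, not additional measurements:

```python
# Energy-per-token figures quoted in the article (mWh/token), 8B models
rtx_4090 = {"bonsai_8b": 0.276, "fp16_8b": 1.134}
m4_pro = {"bonsai_8b": 0.074, "fp16_8b": 0.415}

for name, d in (("RTX 4090", rtx_4090), ("M4 Pro", m4_pro)):
    factor = d["fp16_8b"] / d["bonsai_8b"]
    tokens_per_wh = 1000 / d["bonsai_8b"]  # tokens per watt-hour for Bonsai
    print(f"{name}: {factor:.1f}x less energy; ~{tokens_per_wh:,.0f} tokens/Wh")
```

This works out to about 4.1× on the RTX 4090 and about 5.6× on the M4 Pro, where Bonsai 8B generates on the order of 13,500 tokens per watt-hour.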
This arithmetic opens up hardware markets previously inaccessible to 8B-class reasoning: microcontrollers, edge devices, and battery-constrained mobile deployments. Amir Salek, founder of Google’s TPU program and an investor in PrismML, called it “a fundamental change in the power-to-compute equation.”
Microsoft’s BitNet series has been pursuing 1-bit LLMs since 2023, but prior efforts consistently fell short of matching full-precision models at practical scales. What’s different with Bonsai?
PrismML argues three things close the gap: (1) a native 1-bit training regime rather than quantization after the fact, grounded in new mathematical theory from Caltech; (2) sufficient scale — 8B parameters appears to be large enough that the 1-bit constraint no longer causes meaningful accuracy degradation; and (3) standard deployment formats (GGUF, MLX) that let the models slot directly into existing inference stacks without custom infrastructure.
Independent community benchmarking is still forthcoming — the models were released just days ago. The “Intelligence Density” metric is also PrismML’s own framing, and replication of raw benchmark numbers against established leaderboards will be the real test. That said, the combination of open weights, permissive licensing, and credible academic backing makes Bonsai worth watching closely.
