On February 21, 2026, Taalas — a Toronto-based AI hardware startup — unveiled the HC1, its first commercial product: a custom ASIC that hard-codes Meta’s Llama 3.1 8B language model directly into silicon. The result is an AI accelerator that delivers up to 17,000 tokens per second per user, roughly ten times faster than today’s best GPU-based solutions, at a fraction of the cost and power.
Conventional AI accelerators — GPUs, TPUs, NPUs — are general-purpose processors that load model weights from memory at runtime. This creates a fundamental bottleneck: the chip must continuously shuttle billions of floating-point values across a power-hungry memory interface, requiring exotic technologies like High Bandwidth Memory (HBM), 3D stacking, and liquid cooling just to keep pace.
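The scale of that bottleneck is easy to see with back-of-envelope arithmetic. The sketch below is not from Taalas; it simply assumes an 8-billion-parameter model quantized to 4 bits, with every weight read from memory once per generated token (no batching amortization), and asks what memory bandwidth the HC1's claimed per-user rate would demand of a conventional design:

```python
# Back-of-envelope: weight traffic a memory-bound accelerator must sustain.
# Assumptions (illustrative, not from Taalas): 8B parameters, 4-bit weights,
# each weight read once per generated token.

PARAMS = 8e9             # Llama 3.1 8B parameter count
BYTES_PER_WEIGHT = 0.5   # 4-bit quantization
TOKENS_PER_SEC = 17_000  # HC1's claimed per-user rate

bytes_per_token = PARAMS * BYTES_PER_WEIGHT   # bytes streamed per token
bandwidth = bytes_per_token * TOKENS_PER_SEC  # bytes per second

print(f"{bytes_per_token / 1e9:.1f} GB of weights per token")
print(f"{bandwidth / 1e12:.0f} TB/s of sustained memory bandwidth")
```

Under these assumptions the answer comes out to tens of terabytes per second for a single user, well beyond what any current HBM stack delivers to one chip, which is why hardwiring the weights sidesteps the problem entirely.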
Taalas takes a radically different approach. Rather than running Llama 3.1 8B on hardware, they cast the model into hardware. CEO Ljubisa Bajic — a former AMD GPU architect and co-founder of Tenstorrent — describes the technique: “We can store four bits and do the multiply related to it with a single transistor,” using a mask ROM recall fabric paired with SRAM for the KV cache and fine-tuning adapters.
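Taking Bajic's quote at face value, a rough transistor budget follows. This is a speculative sketch, not Taalas's actual floorplan: it assumes all 8 billion weights are stored at 4 bits, at one transistor per weight as claimed, and compares that against the HC1's published die total:

```python
# Rough transistor budget implied by the "one transistor per 4-bit weight"
# claim. Assumption (not confirmed by Taalas): all weights stored this way.

PARAMS = 8e9                 # Llama 3.1 8B weights
TRANSISTORS_PER_WEIGHT = 1   # per Bajic's quoted claim
DIE_TRANSISTORS = 53e9       # HC1 total transistor count

weight_transistors = PARAMS * TRANSISTORS_PER_WEIGHT
fraction = weight_transistors / DIE_TRANSISTORS

print(f"~{weight_transistors / 1e9:.0f}B transistors for weight storage "
      f"({fraction:.0%} of the die)")
```

If the arithmetic holds, the hardwired weights would occupy only a modest share of the 53-billion-transistor die, leaving the bulk of the silicon for the SRAM KV cache, adapters, and datapath logic.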
The HC1 is built on TSMC’s 6nm N6 process node, packs 53 billion transistors onto an 815 mm² die, and draws approximately 200 watts per card. A dual-socket x86 server hosting 10 HC1 cards operates within a standard 2,500-watt power envelope — no liquid cooling required.
Taalas positions the HC1 against Cerebras and NVIDIA's latest datacenter parts, claiming roughly an order-of-magnitude advantage in per-user throughput at a fraction of the cost and power.
Live testing via Taalas’s public chatbot demo showed 15,000–16,000 tokens/second on typical queries. In one benchmark, a 100-page book outline was generated at 15,651 tokens/second — completed in just 0.064 seconds.
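The two quoted figures are internally consistent: tokens generated is simply rate times time, which puts the outline at roughly a thousand tokens.

```python
# Sanity check on the quoted demo numbers: tokens = rate x time.
RATE = 15_651   # tokens/second, quoted benchmark rate
TIME = 0.064    # seconds, quoted completion time

tokens = RATE * TIME
print(f"~{tokens:.0f} tokens")  # ~1002 tokens, i.e. a brief outline
```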
Despite being hardwired to a single model, the HC1 retains practical flexibility: context window size is configurable, and fine-tuning is supported via low-rank adapters (LoRAs).
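The reason LoRA fits a hardwired chip is that the base weights never change: only a small low-rank delta is added alongside them. The sketch below is a generic illustration of the technique with made-up dimensions, not Taalas's API; the frozen matrix W stands in for the weights cast into silicon, while the small A and B matrices play the role of the SRAM-resident adapter.

```python
# Minimal LoRA sketch: frozen base weight W (the hardwired analogue) plus a
# trainable low-rank correction B @ A. Names and shapes are illustrative.
import numpy as np

d, k, r = 512, 512, 8            # layer dimensions and low rank r << d
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen base weight, never updated
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero init

def forward(x):
    # Base path plus low-rank correction; with B = 0 this is exactly the
    # unmodified base model, so fine-tuning starts from the frozen weights.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(k)
assert np.allclose(forward(x), W @ x)  # zero-init adapter changes nothing
```

The adapter holds only (d + k) * r parameters per layer instead of d * k, which is why it can live in on-chip SRAM while the full weight matrix stays frozen in the mask ROM fabric.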
Taalas frames the HC1 as the first step toward making AI inference as cheap and widespread as transistors made computing in the 1970s. The company was founded just 2.5 years ago by three former Tenstorrent engineers and has raised over $200 million in venture funding, spending only $30 million to reach this milestone — a deliberate choice to stay disciplined with capital.
The 24-person team has also built a platform that can transform any AI model into custom silicon within two months of receiving the weights, reducing what previously took years to a seasonal design cycle. This rapid customization pipeline is central to Taalas’s business model: companies could conceivably “print” their fine-tuned models into hardware on a quarterly basis.
The roadmap is aggressive. A mid-sized reasoning LLM on the HC1 platform is expected in Taalas’s labs this spring, followed by integration into their inference API. Later in 2026, the second-generation HC2 silicon — offering considerably higher density — will host a frontier-class LLM deployed across multiple HC cards.
A chatbot demo and an inference API are available to developers today. Whether hardwired silicon can challenge NVIDIA’s programmable dominance at scale remains to be seen, but the HC1’s benchmark numbers make a compelling opening argument.
