Taalas HC1: Absurdly Fast, Per-User Inference at 17,000 tokens/second
The Weekly Kaitchup #131
Hi everyone,
In this edition of The Weekly Kaitchup, let’s discuss what Taalas has been quietly building: a “model-on-silicon” chip that bakes an LLM directly into hardware to deliver absurdly fast, per-user inference.
Taalas unveiled “HC1,” a “hardcore model” chip that runs a single model, Meta’s Llama 3.1 8B, at extreme speed by effectively hardwiring the model (including weights) into silicon.

The result: “~17K tokens/sec per user”.
That’s far above publicly discussed per-user throughput figures for other inference-specialist hardware. For comparison, on the same model, Cerebras reaches ~2,000 tokens/sec and Groq ~600 tokens/sec per user.

And it’s real!
I’ve been working with Taalas for some time as a contractor, mainly on evaluation, quantization, and fine-tuning. The team is small (24 people), and it’s honestly remarkable that a group this lean pulled this off.
You can give it a try yourself and see thousands of tokens appear instantly:
a chatbot demo powered by HC1: https://chatjimmy.ai
an API access request flow (application form)
Fully quantizing the model (3 to 6-bit, which is very aggressive for Llama 3.1) does come with trade-offs, and in this first silicon it clearly impacts model quality. But this is v1: the hard part, and the priority, was proving the end-to-end flow works, and the next iterations are already geared toward much better fidelity. Improving quantization quality is actually the easy part here.
How did they achieve this speed?
HC1’s speed comes from hardwiring an entire model, including its weights, onto the chip, removing almost all programmability; some SRAM remains for things like the KV cache and fine-tuned weights.
Taalas frames this as “merging storage and computation” and eliminating the usual memory↔compute boundary that drives HBM stacks, advanced packaging, high-speed I/O, and often liquid cooling.
I won’t go deeper into the hardware specifics (not my area). Here is what they publicly disclosed (source: eetimes):
Process/fab: TSMC N6 (6nm)
Die size: 815 mm²
Form factor: shown as a PCIe card
Power: reported around ~250W; so 10-card server at ~2.5kW, compatible with standard air-cooled deployments.
On-chip memory usage: includes SRAM used for KV cache and fine-tuned weights.
Implementing a new model takes a reasonable amount of time: roughly a two-month turnaround “from a previously unseen model” to working PCIe cards running inference (done with a “foundry-optimal workflow” with TSMC).
The economics
For this Llama 3.1 8B version:
~$0.0075 per 1M tokens
Simulated multi-chip configuration for DeepSeek-R1 671B: ~12,000 tokens/sec per user at 7.6 cents per 1M tokens
For comparison, Cerebras still serves Llama 3.1 8B at $0.10 per 1M tokens (as of February 20). HC1 is ~13x cheaper and ~8x faster.
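Those ratios follow directly from the quoted figures, a quick sanity check:

```python
# All figures are taken from the text; prices are USD per 1M tokens.
hc1_price, cerebras_price = 0.0075, 0.10
hc1_tps, cerebras_tps = 17_000, 2_000   # tokens/sec per user

cost_ratio = cerebras_price / hc1_price   # how many times cheaper HC1 is
speed_ratio = hc1_tps / cerebras_tps      # how many times faster HC1 is

print(f"HC1 is ~{cost_ratio:.0f}x cheaper and ~{speed_ratio:.1f}x faster")
```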
Fine-tuning: “Frozen” doesn’t mean “unchangeable”
The model is hardwired, but it retains flexibility through a configurable context window and fine-tuning via LoRA adapters.
So the base weights are “baked,” but large adapters can sit on top. If you are good at fine-tuning adapters, this opens endless possibilities, even for an older Llama 3.1.
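As a reminder of why adapters fit this design: LoRA never touches the base weight matrix W. It adds a low-rank update BA on top, so only the two small factor matrices need to live in writable memory (here, the on-chip SRAM). A minimal NumPy sketch of the idea, with illustrative shapes that are not HC1’s actual dimensions:

```python
import numpy as np

d, r = 512, 8                        # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))      # "baked" base weight: read-only
A = rng.standard_normal((r, d))      # trainable LoRA factors: tiny compared
B = np.zeros((d, r))                 # to W, stored in writable SRAM

x = rng.standard_normal(d)

# Forward pass: base output plus low-rank correction; W itself never changes.
y = x @ W.T + x @ (B @ A).T

# Standard LoRA init (B = 0) makes the adapter a no-op until fine-tuned.
assert np.allclose(y, x @ W.T)

# The adapter holds 2*d*r parameters vs d*d for the frozen base.
print(f"adapter params: {2 * d * r:,} vs base params: {d * d:,}")
```

The design choice this illustrates: everything that must change per customer (the adapter) is a tiny fraction of everything that stays fixed (the base model), which is exactly what lets a hardwired chip still be customizable.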
What’s next?
Second model (still on HC1 platform): a mid-sized reasoning LLM. See it as scaling the demo to something actually useful. Still not a frontier model, but showing that it works with larger/newer models. This one will be really good.
Second-generation silicon (HC2): HC2 has higher density and faster execution, with deployment soon.
Imagine a frontier model with agentic behavior at that speed! Tasks that take hours to complete today would take minutes, and cost less.
I’ve been thinking about this a lot, and I’m very curious about the new use cases this unlocks. At ~17k tokens/sec, a reasoning model can “think” in under a second, which suddenly makes large reasoning models feel interactive. And if we can get the reasoning trace faster, we can spend a bigger reasoning budget for higher accuracy: sample multiple outputs and select the best, or run longer traces via budget forcing (or use an adapter trained to take advantage of longer budgets).
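The “sample multiple outputs and select the best” idea is simple best-of-n: at this speed, you can draw several candidates in the time a slower backend produces one, then keep the highest-scoring answer. A sketch with a stand-in generator and scorer (both hypothetical; a real version would call the inference API and a task-specific verifier or reward model):

```python
import random

def generate(prompt: str, seed: int) -> str:
    # Stand-in for one sampled completion from the model API.
    rng = random.Random(seed)
    return f"candidate-{rng.randint(0, 9)}"

def score(answer: str) -> float:
    # Stand-in for a verifier/reward model; here, a toy heuristic.
    return float(answer.rsplit("-", 1)[-1])

def best_of_n(prompt: str, n: int = 8) -> str:
    # At ~17k tokens/sec, n samples cost roughly what one sample
    # costs on slower hardware, so the extra accuracy is nearly free.
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("Solve the puzzle"))
```

The same loop extends naturally to budget forcing: instead of n independent samples, you let one trace keep going past its natural stopping point and re-score it as it grows.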
I plan to write one or two tutorials, using their API, to demonstrate what we can do with that inference speed.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!