Tenstorrent: The Open Silicon Insurgency
#technology #AI #RISC-V #opensource #silicon
[!abstract] Summary Tenstorrent is Jim Keller’s bet that open-source silicon can break NVIDIA’s monopoly on AI compute. The company simultaneously builds RISC-V AI accelerators, licenses high-performance CPU IP, and ships fully open-source systems from compiler to kernel. With $693M+ in Series D funding at a $2.6B valuation and partnerships spanning Samsung, Hyundai, LG, Razer, and sovereign AI initiatives in Abu Dhabi, Tenstorrent is positioning itself as the anti-NVIDIA — open where NVIDIA is closed, distributed where NVIDIA is centralized, and cheap where NVIDIA is expensive.
The Keller Factor
Jim Keller’s resume reads like a history of modern processor design: AMD K8 (the architecture that saved AMD), Apple A4/A5 (the chips that launched the iPhone revolution), AMD Zen (the comeback architecture), Tesla’s Full Self-Driving chip, and stints at DEC (Alpha), Intel, and Broadcom. He’s arguably the most accomplished CPU architect alive.
At Tenstorrent, Keller is attempting something more ambitious than any single chip: building an open-source alternative to the entire NVIDIA ecosystem. His stated strategy is blunt — “Whatever NVIDIA does, we’ll do the opposite.”
Where NVIDIA builds monolithic, expensive, proprietary systems, Tenstorrent builds modular, affordable, open ones. Where NVIDIA uses HBM (scarce, expensive), Tenstorrent uses GDDR6 (commodity, available). Where CUDA locks you in, TT-Metalium lets you see every instruction.
The Tensix Architecture: Not a GPU, Not a CPU
The Tensix core is unlike anything in mainstream computing. Each core contains:
- 5 “Baby” RISC-V cores — tiny, in-order, single-issue processors that dispatch instructions to compute engines. They don’t do the math; they orchestrate it.
- A matrix/tensor engine (FPU) — handles matrix multiplication, convolutions, and dense linear algebra at high throughput.
- A vector engine (SFPU) — for element-wise operations, activations, normalization.
- An unpacker and packer — translate data between memory formats and compute-friendly tile formats.
- ~1.5MB SRAM per core — local memory that enables data reuse without round-tripping to DRAM.
- Two NoC (Network-on-Chip) interfaces — two unidirectional networks running in opposite directions around a 2D torus, giving roughly full-duplex communication. Every core can reach every other core.
The programming model is fundamentally dataflow: data flows from DRAM → NoC → unpacker → compute → packer → NoC → DRAM. Three kernels run per Tensix: a reader, a compute kernel, and a writer, synchronized via hardware-backed circular buffers. This is software pipelining on a distributed grid.
It looks like a systolic array but it isn’t one. The Baby RISC-V cores give full programmability — you can use it as SPMD, systolic, or anything in between. Tenstorrent’s own efficient Scaled Dot Product Attention exploits the physical grid topology. Custom operators are just C++ with APIs. No special kernel language needed.
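To make the reader/compute/writer pattern concrete, here is a minimal host-side C++ sketch of the same software-pipelining idea: three threads connected by small bounded circular buffers, where the limited buffer capacity forces the stages to overlap. This is not the TT-Metalium API — the real kernels run on the Baby RISC-V cores against hardware-backed circular buffers — and every name below (`CircularBuffer`, `Tile`, the element-wise stand-in "compute") is illustrative.

```cpp
// Illustrative sketch of the reader -> compute -> writer pipeline pattern,
// modeled with std::thread and bounded queues standing in for hardware CBs.
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// A bounded, thread-safe queue standing in for a hardware circular buffer.
template <typename T>
class CircularBuffer {
public:
    explicit CircularBuffer(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {                       // blocks while the buffer is full
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_ || closed_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }
    std::optional<T> pop() {                  // blocks until data arrives or producer closes
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }
    void close() {                            // signal downstream: no more tiles coming
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
        not_full_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
    bool closed_ = false;
};

using Tile = std::vector<float>;              // stand-in for a compute-friendly tile

int main() {
    std::vector<Tile> dram(8, Tile(1024, 2.0f));   // "DRAM": 8 input tiles
    CircularBuffer<Tile> cb_in(2), cb_out(2);      // tiny buffers force pipelining

    // Reader kernel: DRAM -> input circular buffer.
    std::thread reader([&] {
        for (auto& t : dram) cb_in.push(t);
        cb_in.close();
    });
    // Compute kernel: input CB -> (element-wise square, a stand-in for real math) -> output CB.
    std::thread compute([&] {
        while (auto t = cb_in.pop()) {
            for (auto& x : *t) x = x * x;
            cb_out.push(std::move(*t));
        }
        cb_out.close();
    });
    // Writer kernel: output circular buffer -> "DRAM" (here, just a count).
    std::thread writer([&] {
        std::size_t n = 0;
        while (auto t = cb_out.pop()) ++n;
        std::printf("wrote %zu tiles\n", n);
    });

    reader.join();
    compute.join();
    writer.join();
    return 0;
}
```

The essential property is the same as on the hardware: each stage blocks only on its neighboring buffer, so reads, math, and writes overlap automatically once the pipeline fills.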
[!important] The Key Insight Unlike GPUs (massive register files, little local memory, off-chip HBM) or pure-SRAM architectures (Groq, Cerebras), Tenstorrent uses large per-core SRAM + commodity GDDR6. This sits between the extremes — more flexible than Groq, more memory-efficient than GPUs, and critically, avoids the HBM supply bottleneck that constrains NVIDIA.
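A back-of-envelope using the figures in this note (~1.5MB per core, 140 Tensix cores on Blackhole per the table below) shows how much distributed on-chip memory that adds up to:

$$
140 \times 1.5\ \mathrm{MB} \approx 210\ \mathrm{MB}\ \text{of SRAM per chip}
$$

That is enough to keep working tiles resident close to the compute engines, which is what makes the reuse-without-round-tripping claim plausible.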
Three Chip Generations
| Feature | Grayskull | Wormhole | Blackhole |
|---|---|---|---|
| Process | 12nm | 12nm | 6nm |
| Tensix Cores | 120 (10×12 grid) | 80 | 140 |
| Performance | ~315 TOPS (INT8) | ~328 TOPS (FP8) | 745 TOPS (FP8) |
| Memory | LPDDR4 | 12-24GB GDDR6, 288-576 GB/s | 28-32GB GDDR6, 448-512 GB/s |
| Vector Precision | 64-element, 19-bit FP | 32-element, 32-bit FP | 32-element, 32-bit FP |
| PCIe | Gen4 x16 | Gen4 x16 | Gen5 x16 |
| Ethernet | None | 16×100 Gbps | 10×400 Gbps |
| Host CPU | None | None | 16 “Big RISC-V” 64-bit cores |
| Standalone | No (PCIe accelerator) | No (PCIe accelerator) | Yes (runs Linux on-chip) |
Blackhole is the inflection point. Those 16 “Big RISC-V” cores are beefy enough to run Linux natively, making Blackhole a standalone AI computer — not just an accelerator card. Combined with 752 Baby RISC-V cores (700 in Tensix + 52 for DRAM, Ethernet, management, PCIe), each chip has 768 RISC-V cores total.
Blackhole Galaxy: Scale-Out via Ethernet
32 Blackhole chips mesh into a Galaxy server: 23.8 petaFLOPS at FP8, 1TB memory, 16 TB/s bandwidth. The trick: Tenstorrent uses its own chips as network switches. No InfiniBand, no NVLink — just Ethernet all the way down. An entire training cluster can be built from identical Galaxy “Lego blocks.”
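The headline figures are straight multiples of the per-chip Blackhole numbers from the table above (745 FP8 TFLOPS, 32GB GDDR6, 512 GB/s) across 32 chips:

$$
32 \times 745\ \mathrm{TFLOPS} \approx 23.8\ \mathrm{PFLOPS}, \qquad
32 \times 32\ \mathrm{GB} = 1\ \mathrm{TB}, \qquad
32 \times 512\ \mathrm{GB/s} \approx 16\ \mathrm{TB/s}
$$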
As of mid-2025, Tenstorrent has built a training cluster of 6 Galaxies (192 chips), with plans for 48 Galaxies: 16 for compute, 16 for switching, 16 as optimizers.
The Product Stack (March 2026)
Developer Cards (PCIe)
- Blackhole p100a: 120 Tensix, 28GB GDDR6, 448 GB/s — $999
- Blackhole p150a: 140 Tensix, 32GB GDDR6, 512 GB/s, 4× QSFP-DD 800G — $1,399
- Wormhole n150d: 72 Tensix, 12GB GDDR6, 288 GB/s — $1,099
- Wormhole n300d: 128 Tensix (2 chips), 24GB GDDR6, 576 GB/s — $1,449
- Rumored p300: dual Blackhole, 64GB GDDR6, ~1 TB/s — price TBD. This would be the game-changer.
TT-QuietBox 2 (Desktop Workstation)
Announced March 11, 2026. Ships Q2 2026 at $9,999.
- 4 Blackhole ASICs in unified mesh
- 480 Tensix cores, 2,654 TFLOPS (BlockFP8)
- 128GB GDDR6 + 256GB DDR5
- Liquid-cooled, whisper-quiet, standard 120V outlet
- Runs Linux (Ubuntu 24.04) out of the box
- Pre-loaded with GPT-OSS 120B, Llama 3.1 70B (476.5 tok/s!), Qwen3-32B, Flux, Wan 2.2 video, Boltz-2 protein folding
Compact AI Accelerator (with Razer)
Unveiled at CES 2026. Portable, modular, desktop-class edge AI. Razer designed the industrial form factor. Details are still sparse, but the partnership signals the consumer ambition.
Galaxy Servers (Datacenter)
Scale-out training and inference for enterprises and sovereign AI deployments.
Ascalon: The RISC-V CPU Play
Tenstorrent isn’t just building AI chips — it’s building general-purpose RISC-V CPUs and licensing them to the world.
Ascalon is a family of RVA23-compliant, 64-bit, out-of-order, superscalar CPUs:
- Ascalon X (extreme): 8-wide decode, 6 integer ALUs, 3 load/store units, 2 FP datapaths, 2× 256-bit vector. IPC approaching Arm Cortex-X class. 10-20 SPECint2006/GHz.
- Ascalon H (high): comparable to Arm Cortex-A720
- Ascalon S (standard): comparable to Arm Cortex-A78
- Ascalon U (ultra-low-power): comparable to Arm Cortex-A72
The Atlantis reference SoC: 8-core Ascalon X, 12nm TSMC, with Imagination GPU, 4K video decoder, Ethernet, USB, PCIe. This is a complete processor for development kits.
Next-gen CPU: Callandor — targeting 2027, aims to nearly double per-cycle throughput. If achieved, this would be a stunning generational leap.
IP Licensing Partners
- LG — consumer electronics
- Hyundai — automotive (ASIL-D safety qualification for Alexandria automotive variant)
- BOS — automotive semiconductor startup
- LSTC/Rapidus — Japan’s 2nm fab, licensing Ascalon for edge AI accelerator
- AutoCore — autonomous driving
- Additional licensees, following the formal productization of the IP offering (September 2025)
The IP licensing business is Tenstorrent’s Arm-like play: become the reference architecture that others build on, funded by royalties.
The Software Stack: Open Source All the Way Down
This is where Tenstorrent’s philosophy diverges most sharply from NVIDIA:
| Layer | Tool | Purpose |
|---|---|---|
| High-level | TT-Forge | MLIR-based compiler. PyTorch, ONNX, TensorFlow, JAX, PaddlePaddle → Tenstorrent hardware. Public beta. |
| Mid-level | TT-NN | Neural network library, optimized operators |
| Low-level | TT-Metalium | Bare-metal SDK. Direct access to RISC-V cores, NoC, matrix/vector engines. Kernels are plain C++. |
| Kernel-level | TT-LLK | Low-level kernel software for Tensix core execution |
| Inference | vLLM fork | Modified vLLM for Tenstorrent hardware |
Everything is on GitHub. The entire stack — from compiler to kernel to mechanical design — is open source. For regulated industries, sovereign AI initiatives, and anyone who needs to audit how their hardware executes models, this is not a nice-to-have. It’s the value proposition.
TT-Forge uses MLIR (the same intermediate representation powering much of the modern compiler ecosystem), accepting standard framework graphs and lowering them to hardware-specific execution plans. Developers can inspect every transformation from model graph to kernel execution.
Funding & Business
- Series D: $693M+ (December 2024), led by Samsung Securities and AFW Partners
- Valuation: ~$2.6-2.7B (unicorn status)
- Key investors: Samsung, Hyundai, LG, Fidelity, Baillie Gifford, Bezos Expeditions, XTX Markets, Export Development Canada, Healthcare of Ontario Pension Plan
- Reportedly raising an additional $800M at a $3.2B valuation (November 2025, led by Fidelity)
- Previously raised $100M (2023) co-led by Hyundai and Samsung Catalyst Fund
The investor list is telling: Korean industrial conglomerates (Samsung, Hyundai, LG) hedging against both Western chip restrictions and Arm’s increasing aggression, plus institutional capital (Fidelity, Baillie Gifford) betting on the RISC-V transition.
Sovereign AI Partnerships
- Infinia (Abu Dhabi) — sovereign AI systems collaboration, signed during Abu Dhabi Finance Week
- Moreh — Korean AI software startup, joint AI data center solution unveiled at SC25
- Koyeb — cloud instances with Tenstorrent hardware, hardware-agnostic infrastructure
- CHASSIS program participation — industry consortium for AI hardware standards
The sovereign AI thesis is central to Tenstorrent’s strategy: nations and organizations that can’t or won’t depend on NVIDIA/CUDA need an alternative. Open-source silicon with no export restriction baggage (Wormhole ships to China today; Blackhole needs de-featuring but provisions exist) serves this market directly.
Honest Assessment: Where It Falls Short
The Bandwidth Problem
This is the elephant in the room. The Blackhole p150a delivers 512 GB/s of memory bandwidth — roughly half the bandwidth of a used RTX 3090 (936 GB/s), a card that costs about $1,000. For LLM token generation, bandwidth is king. More VRAM means you can fit bigger models, but if the pipe is narrow, tokens come out slowly.
The comparison:
- p150a: 32GB, 512 GB/s, $1,399
- RTX 3090 (used): 24GB, 936 GB/s, ~$1,000
- RTX 5090: 32GB, 1,790 GB/s, ~$3,300
Tenstorrent wins on VRAM-per-dollar (especially vs current-gen NVIDIA) but loses badly on bandwidth-per-dollar. For single-card local LLM inference, an RTX 3090 is still better for most users.
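Why bandwidth dominates: single-stream token generation must read (roughly) the entire weight footprint from memory for every token, so a crude upper bound on tokens per second is bandwidth divided by model size. As an illustration only — assuming a model with a ~20 GB weight footprint (small enough to fit either card's VRAM) and ignoring KV-cache traffic and compute:

$$
\text{tok/s} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{weight bytes per token}}
\;\;\Rightarrow\;\;
\frac{512\ \mathrm{GB/s}}{20\ \mathrm{GB}} \approx 26\ \text{(p150a)}, \qquad
\frac{936\ \mathrm{GB/s}}{20\ \mathrm{GB}} \approx 47\ \text{(RTX 3090)}
$$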
Software Maturity
The community consensus from LocalLLaMA and developer forums: “performance is pretty asymmetric — some things are well-supported, some aren’t.” The software stack is in flux. No LM Studio, no Ollama native support. You’re using a vLLM fork and writing C++ kernels if you need custom operations. This is not plug-and-play.
CUDA’s ecosystem was built over 15+ years. Tenstorrent’s is ~3 years old. The gap is real.
Scale vs. Single-Card
Tenstorrent’s architecture shines at scale. The Ethernet-native interconnect means multi-chip scaling is first-class, not an afterthought. Work developed on a QuietBox (4 chips) theoretically scales directly to Galaxy servers (32 chips). For multi-card inference and training, the architecture is sound.
But most individual users have one card. And for one card, NVIDIA wins on raw performance.
My Analysis
Why Tenstorrent Matters
- It's the only fully open-source AI silicon company at this scale. Not "open-source friendly" — actually open source from compiler to kernel to mechanical design. This matters for audit, sovereignty, and long-term ecosystem development.
- The IP licensing play is strategically brilliant. Tenstorrent is simultaneously building its own products AND becoming the Arm of RISC-V AI. If Ascalon X achieves Cortex-X-class performance (which the specs suggest it can), every company currently paying Arm royalties has a reason to look at Tenstorrent. Arm making its own silicon breaks the neutrality that was Arm's primary competitive moat.
- The HBM avoidance is a structural advantage, not just cost reduction. HBM supply is constrained, pricing is volatile, and NVIDIA has long-term purchase agreements that crowd out competitors. By designing around GDDR6 (and large on-chip SRAM), Tenstorrent avoids the supply chain bottleneck entirely.
- Jim Keller has done this before. Multiple times. Against companies larger than NVIDIA. The AMD Zen architecture was designed when AMD was nearly bankrupt. The Apple A-series chips were designed when no one thought ARM could compete with x86. Pattern recognition matters.
Why It Might Not Work
- CUDA's moat is deeper than silicon. The software ecosystem — libraries, frameworks, community knowledge, Stack Overflow answers, university courses — took 15 years to build. You can't replicate it with better hardware alone. Every major inference engine (vLLM, SGLang, TensorRT-LLM) was built CUDA-first.
- The bandwidth gap is structural, not temporary. GDDR6 bandwidth maxes out around 512-576 GB/s per chip. NVIDIA's HBM3 delivers 3+ TB/s per chip. For large-model inference, this is a fundamental physics constraint, not a software optimization problem. The p300 (dual-chip, ~1 TB/s) helps but doesn't close the gap.
- "Works at scale" requires customers at scale. Tenstorrent's 6-Galaxy training cluster is impressive for a startup but tiny compared to the 100,000+ GPU clusters hyperscalers are building. Enterprise customers buy ecosystems, not specs.
- NVIDIA's open-source counter-moves. Dynamo 1.0 is NVIDIA's open-source inference orchestration layer — brilliant strategic positioning that gives the community open tooling while tightening hardware lock-in. The Nemotron Coalition ensures open frontier models are optimized for NVIDIA hardware first.
The Real Opportunity
Tenstorrent won’t beat NVIDIA at NVIDIA’s game. But it doesn’t need to. The opportunity is:
- Sovereign AI infrastructure for nations that can’t access or won’t depend on NVIDIA
- Edge/on-premises inference where open-source auditability matters (regulated industries, defense, healthcare)
- IP licensing to the automotive, IoT, and consumer electronics industries transitioning to RISC-V
- The “good enough” tier — when 80% of NVIDIA’s performance at 30% of the cost is the right tradeoff
The Razer partnership and QuietBox 2 signal consumer/developer ambitions. The Infinia and Moreh partnerships signal sovereign/enterprise ambitions. The Ascalon licensing signals the long game.
If the rumored p300 (64GB, ~1 TB/s, likely <$3,000) materializes with mature software, it becomes the most interesting hardware product for local AI in 2026. That’s the card to watch.
The Sovereignty Convergence
Tenstorrent fits into a broader stack that keeps appearing across this research:
- ISA sovereignty: RISC-V (no royalties, no geopolitical risk) — RISC-V - The Open Silicon Revolution
- Compute sovereignty: Tenstorrent (open-source silicon + compiler + SDK)
- Network sovereignty: Reticulum (cryptographic-first, any-medium networking)
- Identity sovereignty: Nostr keypairs + FROST threshold signing
- Payment sovereignty: Bitcoin + Lightning + Cashu ecash
- Energy sovereignty: Solar + battery + local compute
- Software sovereignty: Self-hosted services, local AI inference
Tenstorrent is the compute layer in this stack. The only AI hardware company where you can read every line of code that touches your data, inspect the silicon architecture documentation, and fork the entire toolchain if the company disappears.
That’s not a feature. That’s the whole point.
Key Links & References
- Tenstorrent Hardware
- TT-Metalium SDK (GitHub)
- TT-Forge Compiler (GitHub)
- Blackhole Architecture at Hot Chips 2024
- Programming Tenstorrent Processors (Martin’s blog) — excellent technical deep dive
- ASPLOS Blackhole Microbenchmark Paper
Researched 2026-03-28