Tenstorrent: The Open Silicon Insurgency
#technology #AI #RISC-V #opensource #silicon
[!abstract] Summary Tenstorrent is Jim Keller’s bet that open-source silicon can break NVIDIA’s monopoly on AI compute. The company simultaneously builds RISC-V AI accelerators, licenses high-performance CPU IP, and ships fully open-source systems from compiler to kernel. With $693M+ in Series D funding at a $2.6B valuation and partnerships spanning Samsung, Hyundai, LG, Razer, and sovereign AI initiatives in Abu Dhabi, Tenstorrent is positioning itself as the anti-NVIDIA — open where NVIDIA is closed, distributed where NVIDIA is centralized, and cheap where NVIDIA is expensive.
The Keller Factor
Jim Keller’s resume reads like a history of modern processor design: AMD K8 (the architecture that saved AMD), Apple A4/A5 (the chips that launched the iPhone revolution), AMD Zen (the comeback architecture), Tesla’s Full Self-Driving chip, and stints at DEC (Alpha), Intel, and Broadcom. He’s arguably the most accomplished CPU architect alive.
At Tenstorrent, Keller is attempting something more ambitious than any single chip: building an open-source alternative to the entire NVIDIA ecosystem. His stated strategy is blunt — “Whatever NVIDIA does, we’ll do the opposite.”
Where NVIDIA builds monolithic, expensive, proprietary systems, Tenstorrent builds modular, affordable, open ones. Where NVIDIA uses HBM (scarce, expensive), Tenstorrent uses GDDR6 (commodity, available). Where CUDA locks you in, TT-Metalium lets you see every instruction.
The Tensix Architecture: Not a GPU, Not a CPU
The Tensix core is unlike anything in mainstream computing. Each core contains:
- 5 “Baby” RISC-V cores — tiny, in-order, single-issue processors that dispatch instructions to compute engines. They don’t do the math; they orchestrate it.
- A matrix/tensor engine (FPU) — handles matrix multiplication, convolutions, and dense linear algebra at high throughput.
- A vector engine (SFPU) — for element-wise operations, activations, normalization.
- An unpacker and packer — translate data between memory formats and compute-friendly tile formats.
- ~1.5MB SRAM per core — local memory that enables data reuse without round-tripping to DRAM.
- Two NoC (Network-on-Chip) interfaces — two unidirectional networks running in opposite directions around a 2D torus, giving roughly full-duplex communication. Every core can reach every other core.
The programming model is fundamentally dataflow: data flows from DRAM → NoC → unpacker → compute → packer → NoC → DRAM. Three kernels run per Tensix: a reader, a compute kernel, and a writer, synchronized via hardware-backed circular buffers. This is software pipelining on a distributed grid.
It looks like a systolic array but it isn’t one. The Baby RISC-V cores give full programmability — you can use it as SPMD, systolic, or anything in between. Tenstorrent’s own efficient Scaled Dot Product Attention exploits the physical grid topology. Custom operators are just C++ with APIs. No special kernel language needed.
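To make the reader/compute/writer pattern concrete, here is a minimal host-side C++ sketch of the same software-pipelining idea: three threads connected by small bounded circular buffers, where the limited buffer capacity forces the stages to overlap. This is not the TT-Metalium API — the real kernels run on the Baby RISC-V cores against hardware-backed circular buffers — and every name below (`CircularBuffer`, `Tile`, the element-wise stand-in "compute") is illustrative.

```cpp
// Illustrative sketch of the reader -> compute -> writer pipeline pattern,
// modeled with std::thread and bounded queues standing in for hardware CBs.
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// A bounded, thread-safe queue standing in for a hardware circular buffer.
template <typename T>
class CircularBuffer {
public:
    explicit CircularBuffer(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {                       // blocks while the buffer is full
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_ || closed_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }
    std::optional<T> pop() {                  // blocks until data arrives or producer closes
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }
    void close() {                            // signal downstream: no more tiles coming
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
        not_full_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
    bool closed_ = false;
};

using Tile = std::vector<float>;              // stand-in for a compute-friendly tile

int main() {
    std::vector<Tile> dram(8, Tile(1024, 2.0f));   // "DRAM": 8 input tiles
    CircularBuffer<Tile> cb_in(2), cb_out(2);      // tiny buffers force pipelining

    // Reader kernel: DRAM -> input circular buffer.
    std::thread reader([&] {
        for (auto& t : dram) cb_in.push(t);
        cb_in.close();
    });
    // Compute kernel: input CB -> (element-wise square, a stand-in for real math) -> output CB.
    std::thread compute([&] {
        while (auto t = cb_in.pop()) {
            for (auto& x : *t) x = x * x;
            cb_out.push(std::move(*t));
        }
        cb_out.close();
    });
    // Writer kernel: output circular buffer -> "DRAM" (here, just a count).
    std::thread writer([&] {
        std::size_t n = 0;
        while (auto t = cb_out.pop()) ++n;
        std::printf("wrote %zu tiles\n", n);
    });

    reader.join();
    compute.join();
    writer.join();
    return 0;
}
```

The essential property is the same as on the hardware: each stage blocks only on its neighboring buffer, so reads, math, and writes overlap automatically once the pipeline fills.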
[!important] The Key Insight Unlike GPUs (massive register files, little local memory, off-chip HBM) or pure-SRAM architectures (Groq, Cerebras), Tenstorrent uses large per-core SRAM + commodity GDDR6. This sits between the extremes — more flexible than Groq, more memory-efficient than GPUs, and critically, avoids the HBM supply bottleneck that constrains NVIDIA.
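A back-of-envelope using the figures in this note (~1.5MB per core, 140 Tensix cores on Blackhole per the table below) shows how much distributed on-chip memory that adds up to:

$$
140 \times 1.5\ \mathrm{MB} \approx 210\ \mathrm{MB}\ \text{of SRAM per chip}
$$

That is enough to keep working tiles resident close to the compute engines, which is what makes the reuse-without-round-tripping claim plausible.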
Three Chip Generations
| Feature | Grayskull | Wormhole | Blackhole |
|---|---|---|---|
| Process | 12nm | 12nm | 6nm |
| Tensix Cores | 120 (10×12 grid) | 80 | 140 |
| Performance | ~315 TOPS (INT8) | ~328 TOPS (FP8) | 745 TOPS (FP8) |
| Memory | LPDDR4 | 12-24GB GDDR6, 288-576 GB/s | 28-32GB GDDR6, 448-512 GB/s |
| Vector Precision | 64-element, 19-bit FP | 32-element, 32-bit FP | 32-element, 32-bit FP |
| PCIe | Gen4 x16 | Gen4 x16 | Gen5 x16 |
| Ethernet | None | 16×100 Gbps | 10×400 Gbps |
| Host CPU | None | None | 16 “Big RISC-V” 64-bit cores |
| Standalone | No (PCIe accelerator) | No (PCIe accelerator) | Yes (runs Linux on-chip) |
Blackhole is the inflection point. Those 16 “Big RISC-V” cores are beefy enough to run Linux natively, making Blackhole a standalone AI computer — not just an accelerator card. Combined with 752 Baby RISC-V cores (700 in Tensix + 52 for DRAM, Ethernet, management, PCIe), each chip has 768 RISC-V cores total.
Blackhole Galaxy: Scale-Out via Ethernet
32 Blackhole chips mesh into a Galaxy server: 23.8 petaFLOPS at FP8, 1TB memory, 16 TB/s bandwidth. The trick: Tenstorrent uses its own chips as network switches. No InfiniBand, no NVLink — just Ethernet all the way down. An entire training cluster can be built from identical Galaxy “Lego blocks.”
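The headline figures are straight multiples of the per-chip Blackhole numbers from the table above (745 FP8 TFLOPS, 32GB GDDR6, 512 GB/s) across 32 chips:

$$
32 \times 745\ \mathrm{TFLOPS} \approx 23.8\ \mathrm{PFLOPS}, \qquad
32 \times 32\ \mathrm{GB} = 1\ \mathrm{TB}, \qquad
32 \times 512\ \mathrm{GB/s} \approx 16\ \mathrm{TB/s}
$$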
As of mid-2025, Tenstorrent has built a training cluster of 6 Galaxies (192 chips), with plans for 48 Galaxies: 16 for compute, 16 for switching, 16 as optimizers.
The Product Stack (March 2026)
Developer Cards (PCIe)
- Blackhole p100a: 120 Tensix, 28GB GDDR6, 448 GB/s — $999
- Blackhole p150a: 140 Tensix, 32GB GDDR6, 512 GB/s, 4× QSFP-DD 800G — $1,399
- Wormhole n150d: 72 Tensix, 12GB GDDR6, 288 GB/s — $1,099
- Wormhole n300d: 128 Tensix (2 chips), 24GB GDDR6, 576 GB/s — $1,449
- Rumored p300: dual Blackhole, 64GB GDDR6, ~1 TB/s — price TBD. This would be the game-changer.
TT-QuietBox 2 (Desktop Workstation)
Announced March 11, 2026. Ships Q2 2026 at $9,999.
- 4 Blackhole ASICs in unified mesh
- 480 Tensix cores, 2,654 TFLOPS (BlockFP8)
- 128GB GDDR6 + 256GB DDR5
- Liquid-cooled, whisper-quiet, standard 120V outlet
- Runs Linux (Ubuntu 24.04) out of the box
- Pre-loaded with GPT-OSS 120B, Llama 3.1 70B (476.5 tok/s!), Qwen3-32B, Flux, Wan 2.2 video, Boltz-2 protein folding
Compact AI Accelerator (with Razer)
Unveiled at CES 2026. Portable, modular, desktop-class edge AI. Razer designed the industrial form factor. Details are still sparse, but the partnership signals the consumer ambition.
Galaxy Servers (Datacenter)
Scale-out training and inference for enterprises and sovereign AI deployments.
Ascalon: The RISC-V CPU Play
Tenstorrent isn’t just building AI chips — it’s building general-purpose RISC-V CPUs and licensing them to the world.
Ascalon is a family of RVA23-compliant, 64-bit, out-of-order, superscalar CPUs:
- Ascalon X (extreme): 8-wide decode, 6 integer ALUs, 3 load/store units, 2 FP datapaths, 2× 256-bit vector. IPC approaching Arm Cortex-X class. 10-20 SPECint2006/GHz.
- Ascalon H (high): comparable to Arm Cortex-A720
- Ascalon S (standard): comparable to Arm Cortex-A78
- Ascalon U (ultra-low-power): comparable to Arm Cortex-A72
The Atlantis reference SoC: 8-core Ascalon X, 12nm TSMC, with Imagination GPU, 4K video decoder, Ethernet, USB, PCIe. This is a complete processor for development kits.
Next-gen CPU: Callandor — targeting 2027, aims to nearly double per-cycle throughput. If achieved, this would be a stunning generational leap.
IP Licensing Partners
- LG — consumer electronics
- Hyundai — automotive (ASIL-D safety qualification for Alexandria automotive variant)
- BOS — automotive semiconductor startup
- LSTC/Rapidus — Japan’s 2nm fab, licensing Ascalon for edge AI accelerator
- AutoCore — autonomous driving
- Additional licensees, following the formal productization of the IP offering (September 2025)
The IP licensing business is Tenstorrent’s Arm-like play: become the reference architecture that others build on, funded by royalties.
The Software Stack: Open Source All the Way Down
This is where Tenstorrent’s philosophy diverges most sharply from NVIDIA:
| Layer | Tool | Purpose |
|---|---|---|
| High-level | TT-Forge | MLIR-based compiler. PyTorch, ONNX, TensorFlow, JAX, PaddlePaddle → Tenstorrent hardware. Public beta. |
| Mid-level | TT-NN | Neural network library, optimized operators |
| Low-level | TT-Metalium | Bare-metal SDK. Direct access to RISC-V cores, NoC, matrix/vector engines. Kernels are plain C++. |
| Kernel-level | TT-LLK | Low-level kernel software for Tensix core execution |
| Inference | vLLM fork | Modified vLLM for Tenstorrent hardware |
Everything is on GitHub. The entire stack — from compiler to kernel to mechanical design — is open source. For regulated industries, sovereign AI initiatives, and anyone who needs to audit how their hardware executes models, this is not a nice-to-have. It’s the value proposition.
TT-Forge uses MLIR (the same intermediate representation powering much of the modern compiler ecosystem), accepting standard framework graphs and lowering them to hardware-specific execution plans. Developers can inspect every transformation from model graph to kernel execution.
Funding & Business
- Series D: $693M+ (December 2024), led by Samsung Securities and AFW Partners
- Valuation: ~$2.6-2.7B (unicorn status)
- Key investors: Samsung, Hyundai, LG, Fidelity, Baillie Gifford, Bezos Expeditions, XTX Markets, Export Development Canada, Healthcare of Ontario Pension Plan
- Reportedly raising an additional $800M at a $3.2B valuation (November 2025, led by Fidelity)
- Previously raised $100M (2023) co-led by Hyundai and Samsung Catalyst Fund
The investor list is telling: Korean industrial conglomerates (Samsung, Hyundai, LG) hedging against both Western chip restrictions and Arm’s increasing aggression, plus institutional capital (Fidelity, Baillie Gifford) betting on the RISC-V transition.
Sovereign AI Partnerships
- Infinia (Abu Dhabi) — sovereign AI systems collaboration, signed during Abu Dhabi Finance Week
- Moreh — Korean AI software startup, joint AI data center solution unveiled at SC25
- Koyeb — cloud instances with Tenstorrent hardware, hardware-agnostic infrastructure
- CHASSIS program participation — industry consortium for AI hardware standards
The sovereign AI thesis is central to Tenstorrent’s strategy: nations and organizations that can’t or won’t depend on NVIDIA/CUDA need an alternative. Open-source silicon with no export restriction baggage (Wormhole ships to China today; Blackhole needs de-featuring but provisions exist) serves this market directly.
Honest Assessment: Where It Falls Short
The Bandwidth Problem
This is the elephant in the room. The Blackhole p150a delivers 512 GB/s of memory bandwidth — roughly half the bandwidth of a used RTX 3090 (936 GB/s), a card that costs about $1,000. For LLM token generation, bandwidth is king. More VRAM means you can fit bigger models, but if the pipe is narrow, tokens come out slowly.
The comparison:
- p150a: 32GB, 512 GB/s, $1,399
- RTX 3090 (used): 24GB, 936 GB/s, ~$1,000
- RTX 5090: 32GB, 1,790 GB/s, ~$3,300
Tenstorrent wins on VRAM-per-dollar (especially vs current-gen NVIDIA) but loses badly on bandwidth-per-dollar. For single-card local LLM inference, an RTX 3090 is still better for most users.
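Why bandwidth dominates: single-stream token generation must read (roughly) the entire weight footprint from memory for every token, so a crude upper bound on tokens per second is bandwidth divided by model size. As an illustration only — assuming a model with a ~20 GB weight footprint (small enough to fit either card's VRAM) and ignoring KV-cache traffic and compute:

$$
\text{tok/s} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{weight bytes per token}}
\;\;\Rightarrow\;\;
\frac{512\ \mathrm{GB/s}}{20\ \mathrm{GB}} \approx 26\ \text{(p150a)}, \qquad
\frac{936\ \mathrm{GB/s}}{20\ \mathrm{GB}} \approx 47\ \text{(RTX 3090)}
$$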
Software Maturity
The community consensus from LocalLLaMA and developer forums: “performance is pretty asymmetric — some things are well-supported, some aren’t.” The software stack is in flux. No LM Studio, no Ollama native support. You’re using a vLLM fork and writing C++ kernels if you need custom operations. This is not plug-and-play.
CUDA’s ecosystem was built over 15+ years. Tenstorrent’s is ~3 years old. The gap is real.
Scale vs. Single-Card
Tenstorrent’s architecture shines at scale. The Ethernet-native interconnect means multi-chip scaling is first-class, not an afterthought. Work developed on a QuietBox (4 chips) theoretically scales directly to Galaxy servers (32 chips). For multi-card inference and training, the architecture is sound.
But most individual users have one card. And for one card, NVIDIA wins on raw performance.
My Analysis
Why Tenstorrent Matters
- It's the only fully open-source AI silicon company at this scale. Not "open-source friendly" — actually open source from compiler to kernel to mechanical design. This matters for audit, sovereignty, and long-term ecosystem development.
- The IP licensing play is strategically brilliant. Tenstorrent is simultaneously building its own products AND becoming the Arm of RISC-V AI. If Ascalon X achieves Cortex-X-class performance (which the specs suggest it can), every company currently paying Arm royalties has a reason to look at Tenstorrent. Arm making its own silicon breaks the neutrality that was Arm's primary competitive moat.
- The HBM avoidance is a structural advantage, not just cost reduction. HBM supply is constrained, pricing is volatile, and NVIDIA has long-term purchase agreements that crowd out competitors. By designing around GDDR6 (and large on-chip SRAM), Tenstorrent avoids the supply chain bottleneck entirely.
- Jim Keller has done this before. Multiple times. Against companies larger than NVIDIA. The AMD Zen architecture was designed when AMD was nearly bankrupt. The Apple A-series chips were designed when no one thought ARM could compete with x86. Pattern recognition matters.
Why It Might Not Work
- CUDA's moat is deeper than silicon. The software ecosystem — libraries, frameworks, community knowledge, Stack Overflow answers, university courses — took 15 years to build. You can't replicate it with better hardware alone. Every major inference engine (vLLM, SGLang, TensorRT-LLM) was built CUDA-first.
- The bandwidth gap is structural, not temporary. GDDR6 bandwidth maxes out around 512-576 GB/s per chip. NVIDIA's HBM3 delivers 3+ TB/s per chip. For large-model inference, this is a fundamental physics constraint, not a software optimization problem. The p300 (dual-chip, ~1 TB/s) helps but doesn't close the gap.
- "Works at scale" requires customers at scale. Tenstorrent's 6-Galaxy training cluster is impressive for a startup but tiny compared to the 100,000+ GPU clusters hyperscalers are building. Enterprise customers buy ecosystems, not specs.
- NVIDIA's open-source counter-moves. Dynamo 1.0 is NVIDIA's open-source inference orchestration layer — brilliant strategic positioning that gives the community open tooling while tightening hardware lock-in. The Nemotron Coalition ensures open frontier models are optimized for NVIDIA hardware first.
The Real Opportunity
Tenstorrent won’t beat NVIDIA at NVIDIA’s game. But it doesn’t need to. The opportunity is:
- Sovereign AI infrastructure for nations that can’t access or won’t depend on NVIDIA
- Edge/on-premises inference where open-source auditability matters (regulated industries, defense, healthcare)
- IP licensing to the automotive, IoT, and consumer electronics industries transitioning to RISC-V
- The “good enough” tier — when 80% of NVIDIA’s performance at 30% of the cost is the right tradeoff
The Razer partnership and QuietBox 2 signal consumer/developer ambitions. The Infinia and Moreh partnerships signal sovereign/enterprise ambitions. The Ascalon licensing signals the long game.
If the rumored p300 (64GB, ~1 TB/s, likely <$3,000) materializes with mature software, it becomes the most interesting hardware product for local AI in 2026. That’s the card to watch.
The Sovereignty Convergence
Tenstorrent fits into a broader stack that keeps appearing across this research:
- ISA sovereignty: RISC-V (no royalties, no geopolitical risk) — RISC-V - The Open Silicon Revolution
- Compute sovereignty: Tenstorrent (open-source silicon + compiler + SDK)
- Network sovereignty: Reticulum (cryptographic-first, any-medium networking)
- Identity sovereignty: Nostr keypairs + FROST threshold signing
- Payment sovereignty: Bitcoin + Lightning + Cashu ecash
- Energy sovereignty: Solar + battery + local compute
- Software sovereignty: Self-hosted services, local AI inference
Tenstorrent is the compute layer in this stack. The only AI hardware company where you can read every line of code that touches your data, inspect the silicon architecture documentation, and fork the entire toolchain if the company disappears.
That’s not a feature. That’s the whole point.
Key Links & References
- Tenstorrent Hardware
- TT-Metalium SDK (GitHub)
- TT-Forge Compiler (GitHub)
- Blackhole Architecture at Hot Chips 2024
- Programming Tenstorrent Processors (Martin’s blog) — excellent technical deep dive
- ASPLOS Blackhole Microbenchmark Paper
Researched 2026-03-28