EDITORIAL

128 Strangers, 5 Minutes, 5 TAO, One Model

Ige A

June 4, 2026 · 4 min read

Const posted something yesterday that, on its face, reads like every other Bittensor milestone tweet:

Just orchestrated a 128 node permissionless decentralized training run, in 5 minutes, for 5 TAO, via @IOTA_SN9

They can do this up to 100B param models.
Unbelievable. https://t.co/hJGZ6O5NrU pic.twitter.com/jvHwDajAUw
— const (@const_reborn) June 3, 2026

He kick-started a 3B-parameter training run in 5 minutes for ~5 TAO, stitching together 128 strangers’ machines over the public internet.

The interesting part is the claim that the same setup can scale to 100B-parameter models as more people join.

Let’s explore what happened, why it’s super hard, and what it’s for.

Took him 5 mins to set it up

128 separate machines, owned by 128 different people who don’t know or trust each other, were stitched into a single training pipeline and started co-training one neural network. Setup took about five minutes. The on-chain cost was ~5 TAO.

No central company provisioned the cluster. No one had to ship their GPUs to a data center. The network spun up, assigned each machine a slice of the work, and started training.

That’s the kind of permissionless, fast, and cheap coordination the Bittensor network enables.

The current run kicked off June 3 under the run ID consts-v0 (yes, named after Const). More details about the training run:

Model size: 3B parameters
Target: 50B training tokens

Why this is hard (the part that makes it impressive)

Training a large model has a brutal physical constraint: the whole model has to fit in your GPU’s memory. A 100B-parameter model doesn’t fit on a gaming GPU, or a MacBook, or anything you own. You need a rack of A100s/H100s wired together with fast interconnects (i.e., a data center).
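
To put rough numbers on that wall, here is a back-of-envelope sketch. The byte counts are assumptions rather than figures from the run: 2 bytes per parameter for 16-bit weights, and roughly 16 bytes per parameter for mixed-precision training with Adam (weights, gradients, and optimizer state), ignoring activations. The 24 GB and 80 GB capacities are illustrative stand-ins for a high-end consumer card and a data-center card.

```python
# Back-of-envelope memory estimate for holding or training a model.
# All byte counts and GPU sizes below are illustrative assumptions, not IOTA figures.

BYTES_WEIGHTS_ONLY = 2      # assumption: 16-bit weights
BYTES_TRAINING_STATE = 16   # assumption: mixed-precision Adam (weights + grads + optimizer state)

CONSUMER_GPU_GB = 24        # assumption: a high-end consumer card
DATACENTER_GPU_GB = 80      # assumption: a large data-center card


def memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory in gigabytes (1 GB = 1e9 bytes) for a given parameter count."""
    return num_params * bytes_per_param / 1e9


def fits(num_params: float, bytes_per_param: int, gpu_gb: float) -> bool:
    """True if the estimated footprint fits in a single GPU of gpu_gb gigabytes."""
    return memory_gb(num_params, bytes_per_param) <= gpu_gb


for name, params in [("3B", 3e9), ("100B", 100e9)]:
    weights = memory_gb(params, BYTES_WEIGHTS_ONLY)
    training = memory_gb(params, BYTES_TRAINING_STATE)
    print(f"{name}: ~{weights:.0f} GB for weights, ~{training:.0f} GB to train")
    print(f"  weights fit on one consumer GPU:    {fits(params, BYTES_WEIGHTS_ONLY, CONSUMER_GPU_GB)}")
    print(f"  weights fit on one data-center GPU: {fits(params, BYTES_WEIGHTS_ONLY, DATACENTER_GPU_GB)}")
```

Under these assumptions, the weights of a 100B model alone come to about 200 GB, more than a single 80 GB card can hold, and the full training state is several times larger again.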

That’s the wall every decentralized-training project hits.

Even IOTA faced this roadblock in its early days. Miners pretrained models from 700M up to 14B params and beat baselines like GPT-2 Large and Falcon-7B, but every miner had to fit the entire model on their own hardware. So the ceiling was “whatever the richest single miner can afford.” And because it was winner-takes-all, everyone’s work except the top model got thrown away. That was wasteful, and it capped the network well short of frontier scale.

What IOTA does differently: cut the model into slices

IOTA’s fix is pipeline parallelism.

Instead of asking each machine to hold the whole model, IOTA’s orchestrator chops the model into stages and hands different stages to different machines. Picture the live consts-v0 run: that 3B model is carved into 7 layers, and the swarm forms an assembly line.

One band of machines runs the first stage, passes its output (“activations”) down to the next band, and so on. Each station only needs the tools for its own step. No single node ever holds the full model.

This means the maximum model size scales with the number of participants, not with the VRAM of the biggest machine. The more people join, the larger the model the swarm can take on. That’s why “up to 100B params” is even sayable. The constraint moved from the size of one machine to the size of the crowd.
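
The assembly-line idea can be sketched in a few lines of Python. This is a toy illustration of pipeline parallelism in general, not IOTA's orchestrator: the names `Stage`, `split_into_stages`, and `run_pipeline` are hypothetical, each "layer" is a simple scalar function, and the 28-layer model is made up. The split into 7 assumes the "7" quoted for consts-v0 counts pipeline stages.

```python
# Toy pipeline parallelism: each "node" stores only its own contiguous slice of layers.
# Stage, split_into_stages and run_pipeline are hypothetical names, not IOTA's API.
from dataclasses import dataclass
from typing import Callable, List

Layer = Callable[[float], float]


@dataclass
class Stage:
    """One pipeline stage: the only layers this node ever stores."""
    layers: List[Layer]

    def forward(self, activation: float) -> float:
        for layer in self.layers:
            activation = layer(activation)
        return activation  # in a real system this is sent over the network


def split_into_stages(layers: List[Layer], num_stages: int) -> List[Stage]:
    """Split layers into num_stages contiguous, near-equal slices."""
    if not 1 <= num_stages <= len(layers):
        raise ValueError("need between 1 and len(layers) stages")
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)  # first `extra` stages get one more layer
        stages.append(Stage(layers[start:start + size]))
        start += size
    return stages


def run_pipeline(stages: List[Stage], x: float) -> float:
    """Pass the activation down the line, one stage at a time."""
    for stage in stages:
        x = stage.forward(x)
    return x


def make_layer(scale: float, shift: float) -> Layer:
    """A stand-in 'layer': a simple affine function of its input."""
    return lambda x: scale * x + shift


# A made-up 28-layer model split across 7 stages.
model = [make_layer(1.0 + 0.01 * i, 0.5) for i in range(28)]
stages = split_into_stages(model, 7)

# Reference: the same model run on one machine that holds every layer.
full = 1.0
for layer in model:
    full = layer(full)

print([len(s.layers) for s in stages])    # layers held per stage
print(run_pipeline(stages, 1.0) == full)  # pipeline and single-machine results match
```

The sketch covers only the forward pass. A real training pipeline also sends gradients back up the line and keeps several micro-batches in flight so stages are not left idle.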

Image: The kinds of hardware used for decentralized training (source: Macrocosmos)

Three pieces of engineering make it survive the open internet:

128× activation compression: those activations getting passed between machines are normally huge. IOTA squeezes them by up to ~128×, so home-grade connections can keep up instead of choking the pipeline.
Butterfly All-Reduce: a trustless way to merge everyone’s contributions into one model without a central server, one that stays resilient when some nodes are malicious or drop offline.
CLASP: a Shapley-value-style accounting method that measures how much each node actually contributed, so emissions are paid out proportionally instead of winner-takes-all. Nobody’s work gets discarded; everybody who helped gets paid for the help.
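
To make the "Shapley-value-style" idea concrete, here is a minimal sketch of exact Shapley values on a three-node toy game. It illustrates the general Shapley method only. How CLASP actually scores nodes is not described here, and the node names, the `TABLE` of coalition values, and the proportional payout rule are all hypothetical.

```python
# Exact Shapley values for a tiny cooperative game.
# Illustrates the general Shapley idea only; this is not CLASP's implementation.
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Average each player's marginal contribution over every possible join order."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


# Hypothetical "training progress" achieved by each subset of nodes.
# Nodes a and b do useful, partly overlapping work; node c adds nothing.
TABLE: Dict[FrozenSet[str], float] = {
    frozenset(): 0.0,
    frozenset({"a"}): 6.0,
    frozenset({"b"}): 6.0,
    frozenset({"c"}): 0.0,
    frozenset({"a", "b"}): 9.0,
    frozenset({"a", "c"}): 6.0,
    frozenset({"b", "c"}): 6.0,
    frozenset({"a", "b", "c"}): 9.0,
}

values = shapley_values(["a", "b", "c"], lambda s: TABLE[s])

# Hypothetical payout rule: split emissions in proportion to Shapley value.
total = sum(values.values())
payouts = {p: v / total for p, v in values.items()}

print(values)
print(payouts)
```

In this toy game the two useful nodes split the credit evenly and the node that adds nothing earns nothing, which is the property that lets rewards follow contribution. Exact computation walks every join order, so it grows factorially with the number of nodes; anything at the scale of a 128-node swarm would need sampling or some other approximation.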

So what is it for?

Two answers, one technical and one economic.

Technical: it’s a path to training genuinely large open models without a hyperscaler. If the swarm keeps growing, the ceiling keeps rising, and the result is frontier-scale pretraining that nobody owns or gatekeeps. That’s the long game for truly open AI infrastructure.

Economic: it turns idle consumer hardware into paid work. Instead of compute flowing up to whoever can afford the data-center rack (where the value also concentrates), IOTA pulls in everyday machines and pays them in $TAO for the layers they run. Const called this a cyclical economy rather than a hierarchical one, where value circulates back to the edge and helps fund the next run, rather than pooling at the top.

It only gets better from here

128 untrusted strangers’ machines were coordinated into one training run, permissionlessly, in minutes, for pocket change, using an architecture where the model can grow as the crowd grows, and where everyone who contributes gets paid for it.

It’s early, and the biggest numbers are still on the way. But the thing being demonstrated right now (cheap, fast, trustless orchestration of heterogeneous hardware into a single pipeline) is the actual hard part of decentralized AI.

And the team isn’t hedging on what comes next. When someone projected IOTA hitting ~1T params between late 2026 and mid-2027, Macrocosmos CEO Will Squires replied “too slow.”

IOTA SN9

Ige A

Senior Editor
