An Interview with Will Squires from Macrocosmos

Full article by: Sami Kassab

Will Squires is the founder of Macrocosmos. Alongside his co-founder, Steffen Cruz, they’ve built a 24-person team that has become a powerhouse within the Bittensor ecosystem. Starting out at the OpenTensor Foundation, the pair earned their stripes early, helping pioneer some of Bittensor’s biggest technical milestones, and collecting more battle scars than just about anyone in the Subnet ecosystem along the way.

In this interview, we focus on Will’s main project, @IOTA_SN9. The team is tackling one of the hardest problems in AI today — distributed training at scale — and the opportunity is enormous. It’s one of the most ambitious deep-tech efforts happening not just in Bittensor, but across the broader crypto-AI space.

We’ve known Will for nearly two years, and it’s hard to find someone more knowledgeable or driven. His energy comes through in every part of this conversation.

We hope you enjoy it.

Macrocosmos is one of the most veteran and experienced teams on the Bittensor network. What is it that your team does better than anyone else?

I think we’ve mastered a lot of the art of building robust, well-designed incentives that scale well. I’m proud of the bold choices we’ve made — we bet big and we go the full mile to prove our theses. Also, we’ve gone through more trial and error than anyone, and I say that with a positive attitude. We’ve been exploited left, right and centre, and know very well how to build systems that are defensible and give you what you need from your miners. It’s not exactly an intuitive skill that people can simply come to the ecosystem with. We see quite a few builders express their ideas using suboptimal incentive mechanism (IM) designs, whereas game-theoretic AI is something of an obsession for us. We spend a lot of time discussing and debating the best ways to get the most out of our subnets, and we’ve learned to red-team everything we build. We’re also very scientific in our approach; we run lots of experiments and have sophisticated monitoring, benchmarking and testing infrastructure. It’s a big part of our culture, which means when we say something, we stand behind it.

Macrocosmos started out as a broad decentralized AI research lab, but your recent work with IOTA (Subnet 9) signals a sharper focus on decentralized training. What made you double down here?

The original thesis of Bittensor was to create a place where people could come together and train models that could compete with the frontier labs. Since then, Bittensor has evolved to represent a broader commodities marketplace: an economically innovative realm for incubating diverse start-ups. We were the first ‘organization’ building within the ecosystem; we were excited to explore Bittensor’s potential, to push the boundaries of the known and the possible. Our first year was about validating and prototyping core ideas across the AI stack, from data collection and indexing to pre- and post-training and agentic inference. In our journey, we learned where strategic advantages (cost, speed, scale) could be created by leveraging Bittensor. We’ve also learned to iterate through ideas quickly and discard the ones which don’t work, because this ecosystem moves lightning fast. Looking back, I believe we pioneered a lot of techniques that have become very useful to the ecosystem, from subnet templates to IMs, which we’re really proud of. In the last 6 months we’ve really been trying to evolve from experimenting with what Bittensor CAN do (proof of concept) to what Bittensor SHOULD do (where its unique and powerful qualities represent real-world scaling potential and competitive advantage). Prior to IOTA, we spent around a year refining a pretraining approach on SN9 which was based on model training competitions.

We trained some respectable models, in fact some of the earliest decentralized 7B and 14B models ever made were ours. However, we were eventually confronted by a rather stark reality: our design had reached full maturity but fell short of our ambitious goals to compete with centralized labs. There was an economic barrier to entry and other inefficiencies which stood between us and our goal to train the first large decentralized models. In other words, the scaling laws that ground model training were not in our favour. When we re-conceived IOTA, we tried to think two or three steps ahead. How could we create a truly collaborative system for training? How could we create a system that aligned its strengths with Bittensor’s (the ability to organise thousands of nodes and co-train a model) and gave us a path to overcome the scaling laws? Working on this has been challenging, but really rewarding. It feels like we’ve come back to our roots, armed to the teeth with more mature ideas and a much deeper understanding of what is technically possible. Our other subnets (1 & 13) are now oriented towards supporting our pretraining efforts so there is much greater cohesion and focus in our work. If we project IOTA forward, we will have a system that we could actually train a frontier model on, that supplies compute for the world’s most valuable task at a fraction of the cost.

How do you picture the end state of IOTA? Is it as a Bitcoin for training, operating autonomously, or as something closer to Together AI, where clients work directly with Macrocosmos to train models on the subnet?

Our current north star objective for IOTA is to make our compute indistinguishable from centralised alternatives for training workloads. This means just as fast (hence our focus on throughput and algorithmic design) and just as large (hence our push to scale the network from 256 nodes to much, much larger – more on that very soon), and we have a lot of ongoing research into numerical instability to get us there. This means that we will be building on the world stage, tackling some of the hardest problems in the industry. We see an opportunity to define the next era of model training. Ultimately, we should just be able to monetise training FLOPS. This could be done by the network, by combining intelligence from model designers, data from other systems, and training compute from IOTA to create phenomenal models, or we could lease the network and swarm to other participants to train their own models. We are doing a lot of work on making the model weights inherently unreconstructable to help unlock this, as privacy preservation is critical for enterprise customers. We want to be the first and only distributed compute layer for model training that is just as good, unlocking the path for organisations, whether decentralised or otherwise, to train brilliant models cheaply with distributed resources.

There are a handful of other strong crypto-AI teams pursuing decentralized training, teams we both know and respect. So this isn’t shade, but what advantage do you have by building on Bittensor that they might not?

Firstly, none of these teams have a live mechanism or a live token. Coming to Bittensor, you are forced to comprehend both from Day One, and we have more war wounds than anyone to learn from. We view ourselves as experts in game-theoretic AI, and these battle scars help us to build better, more defensible, and more performant systems. Secondly, the miner community on Bittensor is second to none. Most competitor teams have effectively done permissioned, friends-and-family runs – this limits the available compute at maximum scale, and means it is, in effect, a hand-off between friends. Without unlocking the permissionless scaling that Bittensor provides, you cannot achieve the upsides that make decentralised training economically competitive, which is a critical proof point in the whole thesis. Our vision is that anyone, anywhere, can train at any time. This means the system must be adversarially resistant – if you don’t solve this issue there will always be a ceiling and you’ll always be fighting gravity. Finally, working on Bittensor can feel like being part of a group of investors and other teams, something like a start-up incubator. We get live feedback on our design, our systems, and our metrics; we have advisors with deep experience in AI, like Jacob Steeves, supporting us; and we have fellow comrades-in-arms like Templar. We have our own bubble of talent to help us succeed.

What’s the current state of IOTA? Are we in a proof-of-concept stage or are we training AGI yet?

IOTA is currently working in production. We’re iterating relentlessly to drive the speed of the network up. We now believe we have the world’s best implementation of pipeline parallelism in speed terms (the crucial metric); we are the only team in the world that has built a ground-up orchestration layer and scheduling system for training multiple models at once (all other teams build on Hivemind, an open-source framework); and we have a huge release coming this month that we think will excite the community and really change the perception of decentralised training in the market.
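
To make the pipeline-parallel idea concrete, here is a minimal, hypothetical PyTorch sketch (none of the layer sizes, stage splits or names come from IOTA’s codebase): the model is cut into sequential stages, each of which could sit on a different node, and only the activations at each stage boundary ever need to cross the network.

```python
import torch
import torch.nn as nn

# Hypothetical toy model split into three pipeline stages.
# In a distributed run each stage would live on a separate node; here they
# execute sequentially in one process purely to illustrate the data flow.
stage_0 = nn.Sequential(nn.Embedding(32_000, 1024), nn.Linear(1024, 1024), nn.GELU())
stage_1 = nn.Sequential(nn.Linear(1024, 1024), nn.GELU())
stage_2 = nn.Linear(1024, 32_000)  # language-model head

def pipeline_forward(token_ids: torch.Tensor) -> torch.Tensor:
    """Forward pass, stage by stage.

    Only the boundary activations returned by each stage would need to be
    shipped between nodes; the weights of each stage stay where they are.
    """
    acts = stage_0(token_ids)   # would run on node A
    acts = stage_1(acts)        # would run on node B
    return stage_2(acts)        # would run on node C

logits = pipeline_forward(torch.randint(0, 32_000, (2, 128)))
print(logits.shape)  # torch.Size([2, 128, 32000])
```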

How are you measuring success right now? What are some of the big milestones you’re looking forward to achieving in the next 6 months?

Our core north star metrics are speed and number of nodes. Right now, we’ve achieved a 20x speed-up in the last month, and we project a 15x speed-up within the next two weeks. We are making huge strides on this objective every day, which is very exciting. The next big objective is the number of nodes. I don’t want to spoil the surprise, but keep your eyes peeled for the pre-Christmas release – we have something very big coming, and we can’t wait to share it.

Who’s actually mining and training models on IOTA? Are we talking about individuals with a few GPUs, small-scale research labs, or full data centers joining the network?

It’s a diverse group. We have professional Bittensor mining teams with relationships to the scale data centre providers funding the UK Sovereign AI Stargate programme, we have nameless crypto pirates in shades, we have AI engineers moonlighting in their spare time, and we have dedicated experts. Compute and talent come from all places, and it’s what makes IOTA so strong.

What was the key technical insight or breakthrough that made IOTA’s methodology of decentralized training possible? What specific problem did you solve that others hadn’t?

Current SOTA methods in decentralised pretraining use a methodology called data parallelism, where each node hosts a full copy of the weights. This means that as model size scales, the requirement for any individual node scales to the point where it becomes just as expensive as centralised training. Our research into compression allowed us to create a novel bottleneck architecture that lets us “fracture” the model, splitting it up into blocks of much smaller size while reducing the amount of data that must be communicated between them by orders of magnitude. For participants, this means that the hardware requirements and the economic barrier to entry are low and independent of model size. In other words, our system is able to perform global cost arbitrage in order to train large models cheaply. We also developed a suite of techniques which make the system adversarially robust, which is a key factor for hitting critical scale thresholds. The only other team working in this space is Pluralis Research, whom we have immense respect for, but we believe our orchestration system and more, on top of building on Bittensor, give us the edge that means we will win.
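
As a rough illustration of the “fracturing” described above, a compressed bottleneck at each block boundary shrinks the tensor that has to travel between nodes; the hidden size, bottleneck width and class name below are hypothetical placeholders, not IOTA’s actual architecture.

```python
import torch
import torch.nn as nn

HIDDEN = 1024      # hypothetical transformer hidden size
BOTTLENECK = 64    # hypothetical compressed boundary width (16x smaller)

class BlockWithBottleneck(nn.Module):
    """One model fragment whose outgoing activations are compressed.

    Only the BOTTLENECK-wide tensor leaves the node, so the bytes on the
    wire drop by roughly HIDDEN / BOTTLENECK versus sending full hidden
    states between pipeline stages.
    """

    def __init__(self) -> None:
        super().__init__()
        self.expand = nn.Linear(BOTTLENECK, HIDDEN)    # decompress incoming boundary
        self.body = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.compress = nn.Linear(HIDDEN, BOTTLENECK)  # compress outgoing boundary

    def forward(self, boundary: torch.Tensor) -> torch.Tensor:
        return self.compress(self.body(self.expand(boundary)))

# Two fragments chained as if they lived on two different miners.
block_a, block_b = BlockWithBottleneck(), BlockWithBottleneck()
x = torch.randn(2, 128, BOTTLENECK)   # compressed activations entering block A
wire = block_a(x)                     # this small tensor is all that crosses the network
out = block_b(wire)
print(wire.shape)                     # torch.Size([2, 128, 64])
```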

What broader market or technological tailwinds point to decentralized training becoming increasingly necessary and ultimately winning out over centralized approaches?

You can’t look anywhere in the AI sector without seeing the compute build-out, the insane GPU deals, or the CAPEX associated with them. This speaks to two key theses we have – one is that compute will continue to be constrained throughout the 2020s, and the second is that the economic barrier to participating in the AI race will keep rising if that CAPEX is necessary. Keith Rush and Arthur Douillard of DeepMind are working on distributed training because they believe it will become a critical sustaining innovation that allows Google to keep training bigger and better models. We view it as a disruptive innovation: one that, whilst many initially thought it impossible, will come to be seen as inevitable. Analogous to cloud computing, we think that in the 2030s many, many more models will be trained in a fully distributed way.
