
Oro AI (SN15) published a full technical report this month explaining how the team used Bittensor subnet data to train a small but powerful shopping agent.
Authored by Shardul Bansal (Oro AI), Seth Schilbe (Oro AI), and Jarrod Barnes (Dynamical Systems), the paper argues that what holds back small AI agents is not the training method but the quality of the data they learn from.
Oro used trajectories generated by their subnet to lift a 4-billion-parameter base model from 18% to 42.7% accuracy on a held-out test set. The result matches frontier-class quality at roughly one-tenth the cost and twice the speed.
The Argument: Data Quality, Not Method
The paper opens with a direct claim about why most small AI agents underperform. The training methods used today are well understood; the problem is the data the model gets to learn from.
Two common sources have known issues:
1. Synthetic Data: A large model is prompted to play both the user and the agent, and the resulting “fake” conversations become training data. The problem is that this data inherits the biases of whichever model generated it, and the variety tends to collapse over time.
2. Production Logs: Real conversations from deployed agents have variety, but they are mostly unranked, dominated by whichever policy is currently winning, and often include cases where the agent reached the right answer for the wrong reasons.
The paper’s central claim is that a third option exists: an incentive-aligned arena where many independent teams compete to produce better agents. The data that falls out of that arena addresses both failure modes at once.
What Makes SN15 Different
Oro designed SN15 so that three properties fall out of the mechanism itself:

1. Diverse Competition: Many independent teams continuously get paid in tokens to find new and better policies. The race has a daily promotion cycle, a 12-hour cooldown between submissions, and an embargo on the winning code until the next weekday. This keeps the competition alive and prevents one team from dominating.
2. Per-Attempt Grading: Every submission is scored on two independent axes. A rule-based scorer checks whether the recommended product meets the user’s criteria, including price, service, SKU, and attribute. An LLM judge then reads the full reasoning trace and assigns a quality score that multiplies the rule-based outcome score. This produces a per-attempt quality signal that synthetic pipelines have to manufacture and that production logs lack entirely.
3. Rotating Held-Out Problems: Three small versioned sets are reserved as untouched evaluation territory. A continuously growing test bank is rotated, with a safeguard that catches paraphrased or reordered versions of the same problem so they cannot sneak into training.
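The two-axis grading in item 2 can be sketched as follows. This is a minimal illustration, not Oro's scorer. The article only says that a rule-based outcome is multiplied by an LLM-judged quality score. The class and function names, the equal weighting of the four rule checks, and the assumption that the judge score lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuleChecks:
    """Outcome of the rule-based scorer (axis names taken from the article)."""
    price: bool
    service: bool
    sku: bool
    attribute: bool

    def outcome(self) -> float:
        # Assumption: each axis counts equally; the article does not say
        # how the four checks are combined.
        checks = [self.price, self.service, self.sku, self.attribute]
        return sum(checks) / len(checks)


def attempt_score(rules: RuleChecks, judge_quality: float) -> float:
    """Multiply the rule-based outcome by the judge's reasoning-quality score."""
    if not 0.0 <= judge_quality <= 1.0:
        raise ValueError("judge_quality is assumed to lie in [0, 1]")
    return rules.outcome() * judge_quality


if __name__ == "__main__":
    # Right product, weak reasoning trace.
    print(attempt_score(RuleChecks(True, True, True, True), 0.5))
    # One criterion missed, clean reasoning trace.
    print(attempt_score(RuleChecks(True, True, False, True), 1.0))
```

Because the two scores multiply, an attempt that reaches the right product through poor reasoning is discounted instead of being counted as a full success.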
The Filter: Keeping the Right Kind of Attempts
Not every attempt the subnet produces is useful for training. Oro built a filter that selects the attempts that teach an AI how to actually be an agent, and rejects the ones that do not.
The most important distinction in the filter is between two types of attempts:
1. Agentic Attempts: The AI itself decides what tools to use and when. It calls a search, looks at the results, refines its query, and eventually commits to a recommendation. This is what an agent should look like.
2. Scripted Attempts: A Python program decides what tools to use, and the AI is only called in to classify, score, or narrate the results. From the outside, this looks like the AI is doing the work, but the AI is not really making the decisions.
Oro keeps the agentic attempts and rejects the scripted ones. A model trained on scripted attempts becomes a good classifier and narrator but never learns to act as an agent.
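A minimal sketch of that keep-or-reject decision, assuming a hypothetical log format in which every tool call records whether the model or the surrounding script issued it. The `issued_by` field and both function names are illustrative only; the article does not describe how Oro's filter detects scripted attempts.

```python
from typing import Dict, List

# Hypothetical step shape: {"tool": "search", "issued_by": "model"}
Step = Dict[str, str]


def is_agentic(steps: List[Step]) -> bool:
    """True when the attempt has tool calls and the model issued every one."""
    tool_steps = [s for s in steps if s.get("tool")]
    return bool(tool_steps) and all(
        s.get("issued_by") == "model" for s in tool_steps
    )


def filter_attempts(attempts: List[List[Step]]) -> List[List[Step]]:
    """Keep agentic attempts and drop scripted ones."""
    return [a for a in attempts if is_agentic(a)]


if __name__ == "__main__":
    agentic = [
        {"tool": "search", "issued_by": "model"},
        {"tool": "recommend", "issued_by": "model"},
    ]
    scripted = [
        {"tool": "search", "issued_by": "script"},
        {"tool": "classify", "issued_by": "model"},
    ]
    print(len(filter_attempts([agentic, scripted])))
```

A single script-issued tool call is enough to reject the attempt here, which is a deliberately strict choice for the sketch.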
The Training Pipeline
The team used a five-stage training process to turn filtered subnet data into a competent shopping agent:
STEP 1: Fine-tune the base model on the filtered subnet data.
STEP 2: Sample multiple attempts per problem from the fine-tuned model, keep only the successful ones, and fine-tune again on the combination.
STEP 3: Generate teacher trajectories using Claude Sonnet 4.6 on the same problems, then fine-tune again on the successful teacher outputs.
STEP 4: Run a preference refinement step that nudges the model toward correct answers and away from incorrect ones.
STEP 5: Run a reinforcement learning stage that rewards the model for matching the teacher’s tool choices at every step.
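Step 2 follows a pattern often called rejection sampling: draw several attempts, keep the ones that succeed, and add them to the training set. The sketch below shows only the data-collection part, under stated assumptions. The `sample` and `is_success` callables stand in for the fine-tuned model and the grader, `k=8` is an arbitrary default, and none of these names come from the report.

```python
from typing import Callable, Dict, List

# Hypothetical record, e.g. {"problem": ..., "trace": ..., "ok": ...}
Attempt = Dict[str, object]


def collect_successes(
    problems: List[str],
    sample: Callable[[str], Attempt],
    is_success: Callable[[Attempt], bool],
    k: int = 8,
) -> List[Attempt]:
    """Sample k attempts per problem and keep only the successful ones."""
    kept: List[Attempt] = []
    for problem in problems:
        for _ in range(k):
            attempt = sample(problem)
            if is_success(attempt):
                kept.append(attempt)
    return kept


def build_round_two_corpus(
    filtered_subnet_data: List[Attempt],
    successes: List[Attempt],
) -> List[Attempt]:
    """Combine the filtered subnet data with the model's own successes."""
    return list(filtered_subnet_data) + list(successes)
```

The second fine-tuning pass would then run on the output of `build_round_two_corpus`.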
The reinforcement learning stage showed a clear climb in process quality over 20 optimization steps. Per-step product hallucinations fell from 14 to zero, partial successes rose from 0 to 24 out of 48 on a held-out test, and the model learned to navigate toward the right neighborhood of products. The curve was still climbing when the time limit hit, so the final model is the version from before this stage.
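The step-level reward in Step 5 can be illustrated with a toy function. This is a sketch under assumptions, not the report's reward: it aligns student and teacher steps by position and gives credit only for an exact tool-name match. The article says only that the model is rewarded for matching the teacher's tool choices at every step.

```python
from typing import List


def step_rewards(student_tools: List[str], teacher_tools: List[str]) -> List[float]:
    """Return 1.0 for each step where the student picks the teacher's tool, else 0.0."""
    rewards: List[float] = []
    for i, tool in enumerate(student_tools):
        # Steps beyond the end of the teacher trajectory earn no reward.
        teacher_tool = teacher_tools[i] if i < len(teacher_tools) else None
        rewards.append(1.0 if tool == teacher_tool else 0.0)
    return rewards


if __name__ == "__main__":
    student = ["search", "view_product", "recommend"]
    teacher = ["search", "filter", "recommend"]
    print(step_rewards(student, teacher))
```

A per-step signal like this gives the optimizer feedback on every tool call, not just on the final recommendation.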
The Numbers
Results on the 75-problem held-out test set:
| Model | Accuracy |
| --- | --- |
| Qwen3-4B base (untrained) | 18.0% |
| GPT-5.5 | 38.7% |
| ORO model (this paper) | 42.7% |
| Published synthetic-data baseline | 43.6% |
| Published baseline with reinforcement learning | 48.7% |
| Claude Sonnet 4.6 (frontier teacher) | 64.0% |
| Top SN15 miner (multi-LLM ensemble) | 77.3% |
The 24.7-point lift from the base model puts the result within the margin of error of the published synthetic-data baseline. The model also has a pass@8 score of 53.3% versus a pass@1 of 34.8%, which means it often produces the right answer somewhere in 8 attempts even when a single attempt is wrong. That headroom is what future stages aim to convert into reliable single-attempt results.
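Two quick calculations help with reading these numbers. The sketch below computes a binomial standard error for an accuracy measured on 75 problems, and the standard unbiased pass@k estimator from the code-generation evaluation literature. Both are assumptions about method: the article does not say how the margin of error or the pass@8 figure was computed, and the standard error treats the 75 problems as independent trials.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    se = binomial_standard_error(0.427, 75)
    gap = 0.436 - 0.427
    print(f"standard error: {se:.3f}, gap to baseline: {gap:.3f}")
```

Under these assumptions the standard error is about 5.7 points, so the 0.9-point gap to the synthetic-data baseline is well inside the noise of a 75-problem test set.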
The Honest Read
The authors are direct about three issues:
1. Scripted Attempts Dominate the Firehose: Of the 12,000 to 27,000 attempts produced per day, the agentic ones (which the filter keeps) are a small minority. The top-scoring miner on the leaderboard at the time of writing was a 1,850-line Python program that called the AI only as a classifier. The trained model therefore draws from a much smaller slice of data than the published training corpora it is compared against.
2. The Testing Harness Matters: The numbers reported here are measured through a 7-tool harness. Published baselines use a 4-tool harness. A reviewer should not over-interpret a direct comparison between the two.
3. The Reasoning Judge Is Not Ground Truth: It is an LLM judgment that inherits its own failure modes.
The headline result is a lower bound on what the subnet can produce, not the ceiling.
What Comes Next
Two threads of follow-up work sit on the roadmap. On the training side, the team plans to run the reinforcement learning stage longer, with three improvements in progress: periodically refreshing the rollout policy to match the current model, a finer turn-by-turn reward signal, and a separate scoring head trained on the individual rule axes.
On the data side, the bigger lever is bringing the scripted attempts into training. Two paths are named:
1. Rewrite Scripted Attempts Into Agentic Form: Use a frontier model to add reasoning between each tool call, grounded in what actually happened. The reasoning is synthetic but the tool calls and outcomes are real.
2. Reweight The Subnet Scoring: Adjust scoring so that agentic policies are not systematically outscored by scripted ones. This shifts the kind of data the subnet produces in the first place. Pilot tests are underway.
The combination changes the picture from “a model trained on a small slice of the data” to “a model trained on all of it.”
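The first path, rewriting scripted attempts into agentic form, amounts to interleaving a reasoning turn before each recorded tool call. The sketch below shows that data transformation only. The message shapes, the `ToolCall` fields, and the `narrate` callable (a stand-in for the frontier model) are hypothetical and are not taken from the report.

```python
from typing import Callable, Dict, List

# Hypothetical shape: {"tool": "search", "args": "red shoes", "result": "12 hits"}
ToolCall = Dict[str, str]


def to_agentic_form(
    calls: List[ToolCall],
    narrate: Callable[[List[ToolCall], ToolCall], str],
) -> List[Dict[str, str]]:
    """Insert a synthetic reasoning turn before each real tool call."""
    messages: List[Dict[str, str]] = []
    for i, call in enumerate(calls):
        # The narrator sees the history so far plus the call it must justify.
        reasoning = narrate(calls[:i], call)
        messages.append({"role": "reasoning", "content": reasoning})
        messages.append(
            {"role": "tool_call", "content": f'{call["tool"]}({call["args"]})'}
        )
        messages.append({"role": "tool_result", "content": call["result"]})
    return messages
```

Only the reasoning turns are generated. The tool calls and results are copied through unchanged, which matches the article's point that the reasoning is synthetic while the actions and outcomes are real.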
The Substrate Argument
The deeper claim in the paper is not about shopping agents or about Qwen3-4B. It is that Bittensor subnets, when designed around the right principles, become a new kind of training data source. Synthetic data is biased by whichever model generated it. Production logs are unranked and uneven.
A continuously running incentive-aligned arena with quality grading and rotating tests produces a different shape of data that addresses both problems at once. Oro’s result is the first published case of that argument running in production.
If the scripted attempts come online and the reinforcement learning stage converges, the same approach should close the remaining gap to frontier performance. The substrate question is the one the team is making the longest bet on.