Quasar releases first public proof that the subnet architecture works at scale


Quasar (SN24) released Quasar-Preview yesterday, an 18B-parameter Mixture-of-Experts (MoE) model with just 2B active parameters and experimental support for a 5-million-token context window.

The team bills it as the “first public proof” that its custom Quasar architecture works at real scale.

The bottleneck it’s attacking

Long-context processing is still one of AI’s most stubborn engineering problems. Leading 2026 models advertise context windows of up to 10 million tokens (Google’s Gemini 3 Pro and Meta’s Llama 4 Scout among them), but real-world performance often degrades sharply past a few hundred thousand tokens.

Most AI models run on a standard Transformer, the same foundation under GPT, Claude, and Gemini. It is powerful, but it has a hard limit: self-attention compute grows with the square of the context length, so doubling the context roughly quadruples the compute. That quadratic wall is why long context remains a bottleneck across the field.
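To make the quadratic scaling concrete, here is a back-of-the-envelope sketch in Python. It counts only the two attention matrix multiplications in a single layer and ignores everything that scales linearly with sequence length. The function name and the hidden size of 4096 are illustrative assumptions, not figures from Quasar or any other model.

```python
def attention_matmul_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for the two attention matmuls in one layer.

    Computing the scores (Q times K-transposed) and applying them to V
    each cost about 2 * seq_len**2 * d_model FLOPs. Projections and MLP
    blocks, which grow only linearly with seq_len, are left out.
    """
    return 4 * seq_len * seq_len * d_model


D_MODEL = 4096  # hypothetical hidden size, chosen only for illustration

base = attention_matmul_flops(128_000, D_MODEL)
doubled = attention_matmul_flops(256_000, D_MODEL)
five_million = attention_matmul_flops(5_000_000, D_MODEL)

print(doubled / base)       # 4.0: double the context, quadruple the compute
print(five_million / base)  # roughly 1,526x the cost of a 128K context
```

Under this simplified model, stretching a 128K window to 5M tokens multiplies attention compute by about 1,500, which is why architectures with linear-cost sequence mixing are attractive at that range.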

The architecture

Quasar’s bet is a hybrid recurrent/attention stack rather than dense scaling or pure sparse-attention tricks:

  • Loop Transformer: A scaffold that reuses decoder layers across multiple passes, raising effective compute depth without inflating parameter count. The Preview runs a single loop with looped anchor injection disabled.
  • Quasar Hybrid Attention: Layers cycle through three branch types: dominant Quasar branches, Raven (slot-routed recurrent attention with Mamba-2-style decay), and GLA (Flash Linear Attention for fast sequence mixing). The Preview uses 20 layers, with active hybrid layers from 4 to 19.
  • Sparse MoE routing: 256 experts, 8 selected per token plus one shared expert, keeping active params at ~2B while the full checkpoint sits at ~18B.
  • Experimental context extension: A “Safe NoPE” (no positional encoding past the first 512 tokens) plus RoPE config enables the 5M-token setting, though the model card flags this path as immature.
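The sparse routing idea in the list above can be sketched in a few lines of Python. This is a toy illustration, not Quasar's implementation: the expert count (256), the top-8 selection, and the single shared expert come from the article, while the softmax over the selected logits, the scalar "hidden state", and the stand-in expert functions are assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 256  # from the article
TOP_K = 8          # from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts by router logit and return (index, weight) pairs.

    Weights are a softmax over the selected logits only (an assumption;
    real routers differ in how they normalize).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Combine the always-on shared expert with the weighted routed experts."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * experts[idx](x)
    return out


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
# Toy stand-ins: each "expert" just scales its input by a different factor.
experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_EXPERTS)]

routed = route_token(logits)
y = moe_forward(1.0, logits, experts, shared_expert=lambda x: x)
print(len(routed), "routed experts plus 1 shared, out of", NUM_EXPERTS)
```

Only 9 expert networks run for any given token, which is how the active parameter count can sit far below the checkpoint size. The exact ~2B versus ~18B split also depends on the always-active non-expert parameters, which the article does not break down.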

The design targets specialized massive-context workloads. The team has hinted that this release is an experiment and a reveal of the long-context architecture, with a tech report and a 10T-token version still to come.

Benchmarks and early pushback

Source: Quasar team

The release includes a table comparing Quasar-Alpha (the lineage leading to Preview) against Covenant-72B, Youtu-LLM-2B, Qwen3-4B-Base, and Gemma-3-4B-PT across MMLU, ARC, PIQA, HellaSwag, OpenBookQA, and MATH-500. Quasar-Alpha looks competitive or better in several categories despite fewer active params and limited training.

The X community flagged a few caveats quickly. The comparators are older or differently sized models; long-context-specific evals (needle-in-a-haystack at scale) are absent; and every Quasar result is bolded regardless of win or loss.

Critics also pressed on VRAM requirements for actually running the full 5M context, and on tool-calling and agentic capability. The team’s response: the release “is not about evals or beating other models” but about proving the architecture scales. A fuller tech report is promised.
