
Quasar (SN24) released Quasar-Preview yesterday, an 18B-parameter Mixture-of-Experts (MoE) model with just 2B active parameters and experimental support for a 5-million-token context window.
The team bills it as the “first public proof” that its custom Quasar architecture works at real scale.
The bottleneck it’s attacking
Long-context processing is still one of AI’s most stubborn engineering problems. Leading 2026 models advertise ranges up to 10 million tokens (Google’s Gemini 3 Pro and Meta’s Llama 4 Scout among them), but real-world performance often degrades sharply past a few hundred thousand.
Most AI models run on a standard Transformer, the same foundation under GPT, Claude, and Gemini. It is powerful, but self-attention has a hard scaling limit: double the context and the attention compute quadruples. That quadratic wall is why long context remains a bottleneck across the field.
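To make the quadratic wall concrete, here is a rough back-of-the-envelope sketch. It is not taken from the Quasar release: the function names and the width `D_MODEL = 2048` are illustrative assumptions, and the counts cover only the sequence-mixing step of a single layer (ignoring projections, MLPs, and kernel-level optimizations).

```python
# Rough FLOP estimates for one layer's sequence mixing.
# D_MODEL is a hypothetical width chosen for illustration only.
D_MODEL = 2048


def attention_flops(n_tokens: int, d_model: int = D_MODEL) -> int:
    """Approximate FLOPs for full self-attention over n_tokens.

    QK^T is an (n x d) @ (d x n) matmul: about 2 * n * n * d FLOPs.
    Applying the weights to V is (n x n) @ (n x d): another 2 * n * n * d.
    """
    return 4 * n_tokens * n_tokens * d_model


def linear_mixer_flops(n_tokens: int, d_model: int = D_MODEL) -> int:
    """Approximate FLOPs for a recurrent / linear-attention style mixer.

    Each token updates a (d x d) state (about 2 * d * d FLOPs) and reads
    it out once (about 2 * d * d more), so cost grows linearly in n_tokens.
    """
    return 4 * n_tokens * d_model * d_model


if __name__ == "__main__":
    for n in (100_000, 200_000, 400_000):
        attn = attention_flops(n)
        lin = linear_mixer_flops(n)
        print(f"{n:>8} tokens: attention {attn:.2e} FLOPs, linear mixer {lin:.2e} FLOPs")
```

Under these assumptions, each doubling of the token count multiplies the attention estimate by four and the linear-mixer estimate by two, which is the gap that recurrent and linear-attention designs try to exploit.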
The architecture
Quasar’s bet is a hybrid recurrent/attention stack rather than dense scaling or pure sparse-attention tricks:
- Loop Transformer: A scaffold that reuses decoder layers across multiple passes, raising effective compute depth without inflating parameter count. The Preview runs a single loop with looped anchor injection disabled.
- Quasar Hybrid Attention: Layers cycle through three branch types: dominant Quasar branches, Raven (slot-routed recurrent attention with Mamba-2-style decay), and GLA (gated linear attention via Flash Linear Attention, for fast sequence mixing). The Preview uses 20 layers, with active hybrid layers from 4 to 19.
- Sparse MoE routing: 256 experts, 8 selected per token plus one shared expert, keeping active params at ~2B while the full checkpoint sits at ~18B.
- Experimental context extension: A “Safe NoPE” (no positional encoding past the first 512 tokens) plus RoPE config enables the 5M-token setting, though the model card flags this path as immature.
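The sparse routing described in the list above can be illustrated with a toy sketch. Only the counts (256 experts, 8 selected per token, one shared expert) come from the article; everything else is an assumption for illustration, including the width `D_MODEL`, the elementwise "experts", the softmax over the selected logits, and all function names. Quasar's actual router may differ.

```python
import math
import random

NUM_EXPERTS = 256  # from the article
TOP_K = 8          # from the article
D_MODEL = 16       # hypothetical toy width, not Quasar's real size


def make_expert(rng: random.Random):
    """Toy expert: an elementwise scaling (real experts are MLP blocks)."""
    scale = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
    return lambda x: [s * xi for s, xi in zip(scale, x)]


def route(x, router_weights):
    """Return [(expert_index, gate)] for the TOP_K highest-scoring experts."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    # Softmax over the selected logits only (one common choice, assumed here).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, shared_expert, router_weights):
    """Shared expert always runs; routed experts are added with their gates."""
    out = shared_expert(x)
    for idx, gate in route(x, router_weights):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


rng = random.Random(0)
experts = [make_expert(rng) for _ in range(NUM_EXPERTS)]
shared_expert = make_expert(rng)
router_weights = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

selected = route(token, router_weights)
output = moe_forward(token, experts, shared_expert, router_weights)
print(f"experts run for this token: {len(selected) + 1} of {NUM_EXPERTS + 1}")
```

In this sketch only 9 of the 257 expert blocks execute for any given token, which is the mechanism that lets a large checkpoint keep a small active-parameter footprint.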
The design targets specialized massive-context workloads. The team has hinted that this release is an experiment and a reveal of the long-context architecture, with a tech report and a 10T-token version still to come.
Benchmarks and early pushback

The release includes a table comparing Quasar-Alpha (the lineage leading to Preview) against Covenant-72B, Youtu-LLM-2B, Qwen3-4B-Base, and Gemma-3-4B-PT across MMLU, ARC, PIQA, HellaSwag, OpenBookQA, and MATH-500. Quasar-Alpha looks competitive or better in several categories despite fewer active params and limited training.
The X community flagged a few caveats fast: the comparators are older or differently sized models; long-context-specific evals (needle-in-haystack at scale) are absent; and every Quasar result is bolded regardless of win or loss.
Critics also pressed on VRAM requirements for actually running the full 5M context, and on tool-calling and agentic capability. The team’s response: the release “is not about evals or beating other models” but about proving the architecture scales. A fuller tech report is promised.