Every LLM Compares Every Word to Every Other Word. That Is Why It Costs So Much.

AI Infrastructure · Architecture
The quadratic attention problem has been taxing enterprise AI budgets since 2017. A Miami startup says it solved it. The independent benchmarks are worth taking seriously.
12M: token context window (SubQ research build)
56×: speed over FlashAttention (Appen independent eval)
$8 vs. $2,600: cost to run the RULER 128 benchmark (vendor-supplied, unaudited)
$29M: seed funding raised at launch, May 2026
Key Takeaway

Dense attention's quadratic scaling was never a feature. Enterprise teams absorbed it because there was no credible alternative. Subquadratic's SubQ model is the first to post independent third-party benchmarks suggesting the constraint is breakable at frontier quality. The debate about whether it beats GPT misses the point.

Draw a circle. Mark ten dots around its edge. Now draw a line between every possible pair of dots. You end up with 45 lines. Double the dots to twenty and you get 190 lines. Double again to forty and the number reaches 780. This is not a geometry exercise. It is how every large language model you have ever used processes text.
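
The dot-and-line count follows the standard formula for the number of pairs among n items, n(n-1)/2. A minimal sketch to check the arithmetic; the function name `pair_count` is illustrative and not taken from any library:

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Reproduce the circle example: ten, twenty, and forty dots.
for dots in (10, 20, 40):
    print(dots, "dots ->", pair_count(dots), "lines")
```

Each doubling of the dots multiplies the line count by a little more than four, which is the pattern the rest of this piece is about.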

Each dot is a token, roughly a word or part of a word. Each line is a computation the model runs to figure out how one token relates to another. The technical term is dense attention. The practical result is that every time you double the length of the document you feed an LLM, you roughly quadruple the attention compute required. That relationship, quadratic scaling, is the reason long-context inference is expensive, slow, and power-hungry at scale. It has been baked into the transformer architecture since Google researchers published "Attention Is All You Need" in 2017.
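
The "double the length, quadruple the compute" rule can be checked with a short derivation. This is a sketch using the pair count from the circle example, where C(n) is notation introduced here for the number of token pairs. Real implementations typically compute a full n-by-n score matrix, or the causal half of it, so the constant differs but the ratio behaves the same way:

```latex
\[
C(n) = \frac{n(n-1)}{2}, \qquad
\frac{C(2n)}{C(n)} = \frac{2n(2n-1)}{n(n-1)} = 4 + \frac{2}{n-1}
\;\longrightarrow\; 4 \quad \text{as } n \to \infty .
\]
```

For ten tokens the ratio is about 4.2; for document-scale token counts it is indistinguishable from 4.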

Enterprise AI teams have spent years building around it. Retrieval-augmented generation pipelines, document chunking, context compression layers, summarization middleware: almost all of it exists to manage the quadratic ceiling, not because it is the right architecture for the problem but because there was no other option. That may be changing.

The Fix Is Conceptually Simple. Executing It Has Not Been.

Sparse attention is the idea that not every token needs to be compared to every other token. If you are reading a contract, the word "indemnification" in paragraph three probably has a meaningful relationship to the word "liability" in paragraph twenty-two. It almost certainly has no meaningful relationship to the word "the" in paragraph eight. Dense attention runs both comparisons anyway. Sparse attention skips the ones that do not matter.

The problem is deciding which comparisons matter. Previous sparse attention implementations used fixed patterns: always compare token one to token five, always skip token one to token three. Language does not work that way. The relevant relationships in a legal document are different from the relevant relationships in a codebase, which are different again from a scientific paper. Fixed patterns could not capture that variation, and the models built on them underperformed dense attention consistently enough that the field largely abandoned the approach.
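
To make the trade-off concrete, here is a minimal sketch comparing dense attention with one common fixed pattern, a sliding window in which each token only looks at its near neighbors. The sliding window is chosen here purely for illustration, and the function names and the token positions 3 and 22 are hypothetical stand-ins for the "indemnification" and "liability" example:

```python
def dense_pairs(n: int) -> list[tuple[int, int]]:
    """Dense attention: every token is compared with every token (n * n pairs)."""
    return [(q, k) for q in range(n) for k in range(n)]


def fixed_window_pairs(n: int, window: int) -> list[tuple[int, int]]:
    """Fixed pattern: token q is compared only with tokens within `window`
    positions of it, regardless of what the text actually says."""
    return [
        (q, k)
        for q in range(n)
        for k in range(max(0, q - window), min(n, q + window + 1))
    ]


n = 30
dense = dense_pairs(n)
sparse = fixed_window_pairs(n, window=4)
print(len(dense), "dense comparisons vs", len(sparse), "windowed comparisons")

# The fixed pattern is cheaper, but it blindly drops the long-range link.
print("pair (3, 22) kept by dense: ", (3, 22) in dense)
print("pair (3, 22) kept by window:", (3, 22) in sparse)
```

The window saves most of the work, and it also throws away exactly the kind of long-distance relationship the contract example depends on.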

"Historically, most mechanisms have used fixed patterns, like always comparing the first word to the fifth. Language is too sophisticated for that."

Subquadratic CTO Alex Whedon, speaking to MIT Technology Review, described the company's approach as dynamic selection: the model calculates on the fly, for each piece of text, which token relationships actually matter, and only runs those computations. The selection criteria change with the content. That is the claimed breakthrough, and it is also the part the company has not published in detail. "That's kind of where the secret sauce is," Whedon said.
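
Because Subquadratic has not published its selection mechanism, the following is not SubQ's method. It is a generic sketch of content-dependent selection, assuming a simple top-k rule: for each query, keep only the k best-scoring keys and run softmax over those. All names here (`topk_sparse_attention`, `dot`) are hypothetical. This naive version still scores every pair in order to pick the top k, so it shows what "dynamic" means without showing how to make the selection itself cheap:

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(queries, keys, values, k):
    """For each query, attend only to the k highest-scoring keys.

    NOTE: scoring every key to choose the top k is itself quadratic.
    This illustrates content-dependent selection only.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [dot(q, key) * scale for key in keys]
        # Selection depends on the content: indices of the k largest scores.
        keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax restricted to the kept keys (subtract the max for stability).
        top = max(scores[i] for i in keep)
        weights = [math.exp(scores[i] - top) for i in keep]
        total = sum(weights)
        outputs.append(
            [sum(w / total * values[i][d] for w, i in zip(weights, keep))
             for d in range(dim)]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
queries = [[2.0, 1.0]]

print(topk_sparse_attention(queries, keys, values, k=1))  # only the best key
print(topk_sparse_attention(queries, keys, values, k=3))  # same as dense here
```

With k equal to the number of keys the result matches ordinary dense attention; shrinking k is where both the savings and the quality risk come from.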

What the Independent Numbers Show

Subquadratic launched SubQ on May 5, 2026, with $29 million in seed funding and a set of claims that the AI research community spent the following weeks arguing about. The initial response on X captured the range: one AI engineer called it either "the biggest breakthrough since the Transformer" or "AI Theranos." Both framings overshoot the available evidence in opposite directions.

The more useful question is what the third-party numbers say. Appen, an independent model evaluation firm, ran SubQ through a series of standard benchmarks. In a raw speed test, SubQ was 56 times faster than models using FlashAttention, a widely used optimized implementation of exact dense attention. On LiveCodeBench, a competitive coding test drawn from real programming contests, SubQ scored 89.7%, placing it alongside other top-tier coding models. On the Needle-in-a-Haystack test, which measures whether a model can retrieve a specific piece of information buried in a large document, SubQ scored 98% with context windows of both six million and twelve million tokens.

That last number deserves attention. Most frontier models today operate with context windows around one million tokens. SubQ's research build handles twelve million. The company has publicly targeted fifty million tokens by the end of 2026.

The cost claim is harder to verify because SubQ is not yet widely available. Dangel told MIT Technology Review that running Anthropic's Opus 4.6 through RULER 128, a benchmark developed by NVIDIA to test large-context retrieval, costs $2,600. SubQ completed the same benchmark for $8. Those figures are vendor-supplied and unaudited. They are also the kind of difference, if it holds, that restructures enterprise AI budgeting conversations entirely.
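
For scale, the gap between the two quoted figures works out as follows. This is simple arithmetic on the vendor-supplied, unaudited numbers above, not an independent measurement, and the variable names are illustrative:

```python
opus_cost_usd = 2600.0  # vendor-supplied figure for Opus 4.6 on RULER 128
subq_cost_usd = 8.0     # vendor-supplied figure for SubQ on the same benchmark

ratio = opus_cost_usd / subq_cost_usd
savings_pct = (1 - subq_cost_usd / opus_cost_usd) * 100

print(f"claimed cost ratio: {ratio:.0f}x")
print(f"claimed saving per run: {savings_pct:.1f}%")
```

That is a claimed 325-fold difference, which is why the figure needs independent auditing before it enters a budget model.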

The Weights Question Is Legitimate and Should Not Be Dismissed

The most credible skepticism about SubQ concerns how it was built. Rather than training from scratch on its new architecture, Subquadratic reused weights from Qwen, an open-source model from Alibaba. Weights are the values set during training that determine how a model behaves. The skeptical argument, articulated by independent AI researcher Will Depue, is that reusing weights from a dense-attention model limits what SubQ can actually claim about its architecture. The training cost of quadratic attention is already encoded in the Qwen weights. A sparse-attention layer applied on top does not erase that.

Subquadratic has not published a full rebuttal. Whedon acknowledged that reusing weights is "a common approach for model makers to take" but insists the architectural changes are genuine. The company says it plans to release models trained from scratch on the sparse-attention architecture, which would settle the question more cleanly.

Until then, Depue's read is the right posture: "They may have built something real and useful. But the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck." That is a narrower critique than "AI Theranos." It is also the correct one.

Key Takeaway

The weights debate matters for the architecture claim. It matters less for the enterprise cost question. If SubQ produces frontier-quality output at a fraction of the inference cost, the procurement case does not depend on whether the training was quadratic or not.

The RAG Tax Is the Real Enterprise Story

Enterprise AI teams have built entire infrastructure categories around the quadratic ceiling. Retrieval-augmented generation, or RAG, is the most widely deployed. The pattern: because you cannot feed a model an entire knowledge base at once, you build a retrieval layer that fetches the most relevant chunks and passes only those to the model. It works. It also adds latency, introduces retrieval errors, requires its own indexing and maintenance pipeline, and produces answers that are only as good as the retrieval step.
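
The retrieval step described above can be sketched in a few lines. This is a toy version under stated assumptions: fixed-size word chunks and a keyword-overlap score stand in for the embedding search a production pipeline would use, and the function names and sample contract text are invented for illustration:

```python
def chunk(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words (the last chunk may be shorter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query: str, passage: str) -> int:
    """Toy relevance score: count of distinct query words found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, chunks: list[str], top_k: int) -> list[str]:
    """Return the top_k chunks by score; everything else never reaches the model."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]


document = (
    "The supplier shall provide indemnification for third party claims. "
    "Payment is due within thirty days of invoice. "
    "Total liability under this agreement is capped at fees paid."
)
chunks = chunk(document, size=8)
context = retrieve("indemnification liability cap", chunks, top_k=1)

print(len(chunks), "chunks; passed to the model:", context)
```

With `top_k=1` the indemnification chunk is retrieved and the liability clause is silently dropped, a small instance of the retrieval errors mentioned above: the model can only answer from what the retrieval layer hands it.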

Chunking strategies, context compression algorithms, summarization preprocessing: the same logic applies across all of them. They are engineering workarounds for a compute constraint, not architectural choices made because they produce better results. The constraint drove the design.

A model that can hold twelve million tokens in a single context window, and eventually fifty million, does not eliminate RAG for every use case. But it eliminates the cases where RAG exists only because the document did not fit. Entire codebases, full contract histories, complete regulatory filing sets: those become single-pass inference problems instead of retrieval-and-stitch problems. The middleware complexity goes with them.

In a demo for MIT Technology Review, Whedon asked SubQ to reason across four hundred documents simultaneously. The model responded in seconds. Perplexity, given the same task, failed to load all four hundred documents.

Five hundred enterprise customers are already on the early access waitlist. Most have not yet used the model.

The Transformer Is Not Dead. The Assumption That It Has No Competition Probably Is.

Dangel's prediction, that nobody will be building on transformers in a few years, is almost certainly too aggressive. The transformer architecture has enormous institutional momentum: billions of dollars in training runs, entire inference infrastructure stacks, years of optimization work from OpenAI, Google DeepMind, and Anthropic. That does not evaporate because one startup posts good benchmarks.

The more measured version of the claim is also more useful: the assumption that quadratic scaling is a permanent cost of frontier AI quality has not been rigorously tested at production scale. SubQ is the first model to seriously test it with third-party evaluation. The results do not prove the assumption wrong. They provide enough evidence that the assumption deserves scrutiny it has not previously received.

That is a different thing than a breakthrough. It is also not nothing.

CIO / CTO Viability Question

Before SubQ or any subquadratic architecture earns a production slot, ask your team one question: how much of your current AI infrastructure exists because of context limits, not because it produces better results? Audit your chunking pipelines, retrieval layers, and summarization preprocessing to see which pieces are architectural choices and which are workarounds for a compute ceiling that may not be permanent. If SubQ's benchmarks survive broader scrutiny and the from-scratch training delivers, the workarounds get retired. The architectural choices stay. Knowing which is which before the market moves is the decision that matters now.

Disclaimer: This blog reflects my personal views only. Content does not represent the views of my employer, Info-Tech Research Group. AI tools may have been used for brevity, structure, or research support. Please independently verify any information before relying on it.