The hallucination problem was never a surprise. Anyone who used large language models in production before 2022 already knew that a general-purpose model trained on the whole internet would, with confidence, make things up. The surprise was how long the enterprise AI industry treated it as a footnote.
Microsoft announced Critique on March 30, 2026, as part of its Microsoft 365 Copilot Researcher agent. The mechanism is straightforward: OpenAI's GPT drafts a research response, and Anthropic's Claude reviews it for factual accuracy, citation quality, and completeness before the answer reaches the user. Two competing models from rival vendors, one checking the other's work, embedded into a single enterprise workflow.
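The draft-then-review sequence can be sketched as a simple pipeline. To be clear about what follows: the function names and review logic below are stand-ins of my own invention, not Microsoft's actual implementation or API; in production, each function would be a call to the respective vendor's model.

```python
# Sketch of a sequential draft-and-review pipeline in the spirit of
# Critique. Both "model" functions are local stand-ins, NOT real
# GPT or Claude API calls.

def drafting_model(question: str) -> str:
    """Stand-in for the drafting model (GPT, in Critique's design)."""
    return f"Draft answer to: {question} [citation needed]"

def reviewing_model(question: str, draft: str) -> dict:
    """Stand-in for the reviewing model (Claude, in Critique's design).
    Flags unsupported claims before the answer reaches the user."""
    issues = []
    if "[citation needed]" in draft:
        issues.append("unsupported claim: missing citation")
    return {"approved": not issues, "issues": issues}

def research_pipeline(question: str) -> dict:
    """One direction only: draft first, then review, as shipped today."""
    draft = drafting_model(question)
    review = reviewing_model(question, draft)
    return {"answer": draft, "review": review}

result = research_pipeline("What drove Q3 revenue growth?")
print(result["review"])  # the reviewer blocks the unsupported draft
```

The structural point is the sequencing: the second model sees the first model's output before the user does, which is what distinguishes this from simply offering both models side by side.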
The benchmark result is notable. Copilot's Researcher with Critique enabled scored 57.4 on the DRACO deep-research benchmark, outperforming standalone tools from OpenAI, Anthropic, Google, and Perplexity. The 13.8% improvement over single-model approaches is not trivial. And because DRACO was originally built to evaluate Perplexity's own deep-research system, Microsoft cannot be accused of grading its own homework.
Two Different Theories of the Accuracy Problem
What Microsoft built with Critique is a probabilistic verification layer on top of a probabilistic generation system. It is an improvement. But it is worth setting it alongside a different approach that has been in production for years.
Chata.ai, through its AutoQL platform, takes a deterministic path. Rather than using a general large language model and then checking its output, Chata trains a custom, private language model on the specific schema and data of each customer's database. The query engine translates natural language into structured query language commands that run against the actual data. The query either returns a correct answer or fails outright. There is no plausible-sounding fabrication in between, because the system is not predicting plausible text; it is executing a query against real records. I covered this architecture in depth in a conversation with Taisa Noetzold, VP of Growth at Chata.ai, published earlier this month.
Probabilistic means the system predicts the most likely answer based on patterns it learned during training. It is not running a calculation. It is making an educated guess. Most large language models, including GPT and Claude, work this way. The guess is often right, but it can be confidently wrong, and the system generally cannot tell you which one it is.
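The distinction can be made concrete with a toy example. The schema and the question-to-query mapping below are invented for illustration and are not Chata's actual design; the point is the failure behavior of a deterministic layer, which either executes a real query or refuses, with no guessing in between.

```python
import sqlite3

# Toy illustration of the deterministic path: a natural-language
# question mapped to SQL and executed against real records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 120.0), ("west", 80.0), ("east", 50.0)])

# A deterministic translation layer: either the question maps to a
# known query shape, or the system fails loudly. No in-between guess.
QUERY_MAP = {
    "total sales in west":
        ("SELECT SUM(amount) FROM sales WHERE region = ?", ("west",)),
}

def answer(question: str) -> float:
    try:
        sql, params = QUERY_MAP[question]
    except KeyError:
        # Outside the trained domain: refuse rather than fabricate.
        raise ValueError("question outside trained domain")
    return conn.execute(sql, params).fetchone()[0]

print(answer("total sales in west"))  # 200.0 -- computed, not predicted
```

A question outside the mapped domain raises an error instead of producing plausible text, which is exactly the scope trade-off discussed below: narrower reach in exchange for the structural exclusion of fabricated answers.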
Chata's model runs on central processing units rather than graphics processing units, which changes the cost profile of scaling deterministic accuracy in enterprise environments.
The architectural bet is fundamentally different from what Microsoft is doing. Chata is not reducing the probability of hallucination. It is building a system where hallucination as a failure mode is structurally excluded, at least within the domain the model is trained to cover. The trade-off is scope. A deterministic system constrained to your database schema cannot answer questions about competitive markets, regulatory trends, or anything outside the data it was trained on. A system like Critique, built on general-purpose models, can range much more widely, but must manage the risk that comes with that range.
The Manual Workaround That Preceded This
Practitioners have been doing the multi-model check informally for some time. Ask one model a question. Paste the answer into a second model from a different vendor. Ask it to identify errors, missing context, or unsupported claims. This workflow works, but it is slow, it depends on the practitioner's discipline to maintain it, and it does not scale into automated enterprise pipelines.
What Critique does is operationalize that workflow inside the product. Nicole Herskowitz, corporate vice president of Microsoft 365 and Copilot, framed it to Reuters as making competing models actively collaborate rather than simply offering multiple model options. That is a meaningful distinction. Offering model choice transfers the judgment burden to the user. Embedding sequential review inside the workflow removes it.
The eventual direction, per Microsoft's announcement, is bidirectional. GPT will also review drafts generated by Claude. When that becomes available, the architecture resembles a structured adversarial peer review rather than a simple draft-and-check sequence. Whether that further improves output quality, or simply increases latency and cost, will depend on implementation details that have not been disclosed.
The Adoption Signal Underneath the Feature Announcement
Microsoft reported 15 million paid Copilot seats in January 2026. Against a commercial Microsoft 365 user base of 450 million, that is approximately 3.3% penetration. The distance between availability and adoption is not primarily a feature gap. It is a trust gap. Organizations that have watched colleagues paste hallucinated legal citations into court filings, or make financial decisions based on AI-generated numbers that no one verified, are slow to hand research workflows to AI tools.
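The penetration figure above follows directly from the two reported numbers:

```python
# Adoption math from Microsoft's reported figures:
# 15 million paid Copilot seats against a 450 million
# commercial Microsoft 365 user base.
paid_seats = 15_000_000
user_base = 450_000_000
penetration = paid_seats / user_base * 100
print(f"{penetration:.1f}%")  # prints 3.3%
```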
Features like Critique are positioned to close that gap. But closing it requires more than a benchmark score on DRACO. Enterprise buyers who have already been burned by AI inaccuracy will want to see the failure cases, not just the success rate. They will want to know what Critique misses, how often, and in what categories of queries. A 57.4 benchmark score against competitors does not answer the question that matters most to a risk-averse chief information officer: what is the worst thing that can happen, and how often does it happen.
Where the Two Models Diverge in Practice
For organizations whose accuracy requirement centers on their own structured data, the deterministic approach Chata represents offers something Critique does not: a provable chain of logic from question to data to answer. A sales forecast generated by a query against actual transaction records can be audited. A sales forecast generated by a large language model that drew on its training data, web sources, and internal documents, reviewed by a second large language model, is harder to audit and harder to explain to a regulator or a board.
For organizations whose research tasks span broad, unstructured information domains, such as competitive intelligence, regulatory monitoring, or technology assessment, the general-purpose model approach with verification layers is the only viable path. There is no deterministic model that can be trained to cover the entire web.
The practical answer for large enterprises is that both architectures will coexist. Structured, high-stakes data queries route to deterministic systems. Open-ended research routes to verified general-purpose models. The integration challenge, which no vendor has fully solved, is making those two systems appear seamless to the end user.
Microsoft Critique improves the odds that a research answer is correct. It does not change the fundamental architecture of a system that generates text by predicting plausible sequences. The question a chief information officer should be asking is not whether Critique produces better benchmarks than the competition. It is whether Microsoft will disclose the failure categories, the failure rate, and the conditions under which sequential model review still produces confident, wrong answers, and whether that rate is acceptable for the specific decisions the organization will be making with it.
Sources
"Microsoft Critique Explained: How Copilot Now Uses GPT and Claude Together for Deep Research." Knowledge Hub Media, 30 Mar. 2026, knowledgehubmedia.com.
"Microsoft Pairs GPT with Claude to Reduce AI Hallucinations." Technobezz, 30 Mar. 2026, technobezz.com.
"AI Hallucination Statistics: Research Report 2026." Suprmind, Mar. 2026, suprmind.ai.
"AI Hallucination Rates Across Different Models 2026." AboutChromebooks, Feb. 2026, aboutchromebooks.com.
"AutoQL Technology." Chata.ai, chata.ai/autoql/technology.
Bellamkonda, Shashi. "The Same Answer Every Time Using AI: A Conversation with Taisa Noetzold of Chata.ai." shashi.co, 25 Mar. 2026, shashi.co/2026/03/the-same-answer-every-time-using-ai.html.
