Microsoft Critique and the Accuracy Problem AI Has Been Avoiding

Analyst Note · Microsoft · Enterprise AI
Shashi Bellamkonda  ·  March 31, 2026
Key figures:
57.4 (Critique DRACO score)
13.8% (benchmark improvement over single-model approaches)
3.3% (Copilot seat adoption rate)

The hallucination problem was never a surprise. Anyone who used large language models in production before 2022 already knew that a general-purpose model trained on the whole internet would, with confidence, make things up. The surprise was how long the enterprise AI industry treated it as a footnote.

Microsoft announced Critique on March 30, 2026, as part of its Microsoft 365 Copilot Researcher agent. The mechanism is straightforward: OpenAI's GPT drafts a research response, and Anthropic's Claude reviews it for factual accuracy, citation quality, and completeness before the answer reaches the user. Two competing models, one checking the other's work, embedded in a single enterprise workflow.
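The draft-then-review sequence can be sketched in a few lines. Microsoft has not published Critique's internals, so the function names and the review logic below are purely illustrative stand-ins for the two model calls:

```python
# Illustrative sketch of a draft-and-review pipeline in the style of Critique.
# The model calls are stubbed; Microsoft has not disclosed the actual API.

def draft_model(question: str) -> str:
    # Stand-in for the drafting model (GPT in Critique's case).
    return f"Draft answer to: {question} [cites: source-A]"

def review_model(question: str, draft: str) -> dict:
    # Stand-in for the reviewing model (Claude in Critique's case), which
    # checks factual accuracy, citation quality, and completeness.
    issues = []
    if "[cites:" not in draft:
        issues.append("missing citations")
    return {"approved": not issues, "issues": issues}

def answer_with_review(question: str) -> str:
    draft = draft_model(question)
    review = review_model(question, draft)
    if review["approved"]:
        return draft
    # In a production system the draft would be revised and re-reviewed,
    # not simply rejected.
    return "Needs revision: " + ", ".join(review["issues"])
```

The point of the sketch is the control flow, not the stubs: the user only ever sees output that has passed a second, independently trained model's check.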

Plain Language: AI Hallucination
When an AI model states something confidently that is simply not true, it is called a hallucination. The model is not lying. It has no concept of lying. It is generating text that looks statistically plausible given everything it was trained on, but happens to be wrong. The danger is that it sounds exactly the same whether the answer is correct or fabricated.

The benchmark result is notable. Copilot's Researcher with Critique enabled scored 57.4 on the DRACO deep-research benchmark, outperforming standalone tools from OpenAI, Anthropic, Google, and Perplexity. The 13.8% improvement over single-model approaches is not trivial. And because DRACO was originally built to evaluate Perplexity's own deep-research system, Microsoft cannot be accused of grading its own homework.

Plain Language: AI Benchmarks
A benchmark is a standardized test used to compare AI models. DRACO (Deep Research Accuracy and Completeness on Open-ended tasks) scores how well a model handles complex research questions: does it find the right information, cite it correctly, and avoid errors? Higher scores mean fewer mistakes on that test. Benchmarks are useful for comparison but limited: a model that scores well on one test can still fail on the specific task you actually need it for.

Two Different Theories of the Accuracy Problem

What Microsoft built with Critique is a probabilistic verification layer on top of a probabilistic generation system. It is an improvement. But it is worth setting it alongside a different approach that has been in production for years.

Chata.ai, through its AutoQL platform, takes a deterministic path. Rather than using a general large language model and then checking its output, Chata trains a custom, private language model on the specific schema and data of each customer's database. The query engine translates natural language into structured query language commands that run against the actual data. The answer is either correct or it fails. There is no plausible-sounding fabrication in between, because the system is not predicting plausible text. It is executing a query against real records. I covered this architecture in depth in a conversation with Taisa Noetzold, VP of Growth at Chata.ai, published earlier this month.
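The deterministic principle can be illustrated with a toy example. AutoQL's internals are proprietary; the hand-written question-to-SQL mapping below is a stand-in for Chata's trained per-customer model, but the key behaviors are the ones described above: answers come from executing a query against real records, and out-of-scope questions fail rather than producing plausible text:

```python
import sqlite3

# Toy illustration of the deterministic pattern: natural language maps to a
# SQL query that runs against actual records. The mapping here is a
# hand-written stand-in for a trained translation model like AutoQL's.
QUERY_MAP = {
    "total sales last quarter": "SELECT SUM(amount) FROM sales",
}

def deterministic_answer(question: str, conn: sqlite3.Connection) -> float:
    sql = QUERY_MAP.get(question.lower())
    if sql is None:
        # Out of scope: the system refuses instead of guessing.
        raise ValueError("Question not covered by the trained schema")
    return conn.execute(sql).fetchone()[0]

# Demo database with real records to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(100.0,), (250.5,)])
print(deterministic_answer("Total sales last quarter", conn))  # 350.5
```

Same question, same answer, every time, and an unrecognized question raises an error instead of returning a fabrication. That refusal behavior is the structural difference from a probabilistic generator.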

Plain Language: Deterministic vs. Probabilistic
Deterministic means the system runs a fixed, logical process against your actual data and returns an exact result. Same question, same answer, every time. Think of a calculator: 2 + 2 always returns 4.

Probabilistic means the system predicts the most likely answer based on patterns it learned during training. It is not running a calculation. It is making an educated guess. Most large language models, including GPT and Claude, work this way. The guess is often right, but it can be confidently wrong, and the system generally cannot tell you which one it is.

Chata's model runs on central processing units rather than graphics processing units, which changes the cost profile of scaling deterministic accuracy in enterprise environments.

Plain Language: CPU vs. GPU
A central processing unit (CPU) is the standard chip inside every computer and server. A graphics processing unit (GPU) was originally designed for rendering video game images but is now the primary chip used to train and run AI models because it can handle enormous amounts of math in parallel. GPUs are expensive and in short supply. Running a system on standard CPUs rather than GPUs means lower infrastructure costs and easier deployment at scale, but only works if the system is designed to operate without the kind of heavy computation that large language models require.

The architectural bet is fundamentally different from what Microsoft is doing. Chata is not reducing the probability of hallucination. It is building a system where hallucination as a failure mode is structurally excluded, at least within the domain the model is trained to cover. The trade-off is scope. A deterministic system constrained to your database schema cannot answer questions about competitive markets, regulatory trends, or anything outside the data it was trained on. A system like Critique, built on general-purpose models, can range much more widely, but must manage the risk that comes with that range.

The Manual Workaround That Preceded This

Practitioners have been doing the multi-model check informally for some time. Ask one model a question. Paste the answer into a second model from a different vendor. Ask it to identify errors, missing context, or unsupported claims. This workflow works, but it is slow, it depends on the practitioner's discipline to maintain it, and it does not scale into automated enterprise pipelines.

What Critique does is operationalize that workflow inside the product. Nicole Herskowitz, corporate vice president of Microsoft 365 and Copilot, framed it to Reuters as making competing models actively collaborate rather than simply offering multiple model options. That is a meaningful distinction. Offering model choice transfers the judgment burden to the user. Embedding sequential review inside the workflow removes it.

The eventual direction, per Microsoft's announcement, is bidirectional. GPT will also review drafts generated by Claude. When that becomes available, the architecture resembles a structured adversarial peer review rather than a simple draft-and-check sequence. Whether that further improves output quality, or simply increases latency and cost, will depend on implementation details that have not been disclosed.

The Adoption Signal Underneath the Feature Announcement

Microsoft reported 15 million paid Copilot seats in January 2026. Against a commercial Microsoft 365 user base of 450 million, that is approximately 3.3% penetration. The distance between availability and adoption is not primarily a feature gap. It is a trust gap. Organizations that have watched colleagues paste hallucinated legal citations into court filings, or make financial decisions based on AI-generated numbers that no one verified, are slow to hand research workflows to AI tools.

According to research from Deloitte, 47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024. The cost of hallucination mitigation runs approximately $14,200 per employee per year.

Features like Critique are positioned to close that gap. But closing it requires more than a benchmark score on DRACO. Enterprise buyers who have already been burned by AI inaccuracy will want to see the failure cases, not just the success rate. They will want to know what Critique misses, how often, and in what categories of queries. A 57.4 benchmark score against competitors does not answer the question that matters most to a risk-averse chief information officer: what is the worst thing that can happen, and how often does it happen?

Where the Two Models Diverge in Practice

For organizations whose accuracy requirement centers on their own structured data, the deterministic approach Chata represents offers something Critique does not: a provable chain of logic from question to data to answer. A sales forecast generated by a query against actual transaction records can be audited. A sales forecast generated by a large language model that drew on its training data, web sources, and internal documents, reviewed by a second large language model, is harder to audit and harder to explain to a regulator or a board.

For organizations whose research tasks span broad, unstructured information domains, such as competitive intelligence, regulatory monitoring, or technology assessment, the general-purpose model approach with verification layers is the only viable path. There is no deterministic model that can be trained to cover the entire web.

The practical answer for large enterprises is that both architectures will coexist. Structured, high-stakes data queries route to deterministic systems. Open-ended research routes to verified general-purpose models. The integration challenge, which no vendor has fully solved, is making those two systems appear seamless to the end user.

Viability Question

Microsoft Critique improves the odds that a research answer is correct. It does not change the fundamental architecture of a system that generates text by predicting plausible sequences. The question a chief information officer should be asking is not whether Critique produces better benchmarks than the competition. It is whether Microsoft will disclose the categories, rates, and conditions under which sequential model review still produces confident, wrong answers, and whether that failure rate is acceptable for the specific decisions the organization will be making with it.

Sources
Bishop, Todd. "GPT Drafts, Claude Critiques: Microsoft Blends Rival AI Models in New Copilot Upgrade." GeekWire, 30 Mar. 2026, geekwire.com.
"Microsoft Critique Explained: How Copilot Now Uses GPT and Claude Together for Deep Research." Knowledge Hub Media, 30 Mar. 2026, knowledgehubmedia.com.
"Microsoft Pairs GPT with Claude to Reduce AI Hallucinations." Technobezz, 30 Mar. 2026, technobezz.com.
"AI Hallucination Statistics: Research Report 2026." Suprmind, Mar. 2026, suprmind.ai.
"AI Hallucination Rates Across Different Models 2026." AboutChromebooks, Feb. 2026, aboutchromebooks.com.
"AutoQL Technology." Chata.ai, chata.ai/autoql/technology.
Bellamkonda, Shashi. "The Same Answer Every Time Using AI: A Conversation with Taisa Noetzold of Chata.ai." shashi.co, 25 Mar. 2026, shashi.co/2026/03/the-same-answer-every-time-using-ai.html.
Disclaimer: This blog reflects my personal views only. Content does not represent the views of my employer, Info-Tech Research Group. AI tools may have been used for brevity, structure, or research support. Please independently verify any information before relying on it.