This was not a benchmark. No control group, no rubric, no scoring methodology. It was a Saturday morning, a physical copy of The Wall Street Journal — the same ads also ran in The Washington Post — and a question I asked out of curiosity: What do you think this ad is for? What followed was one of the more revealing accidental experiments I have run in years of watching AI models evolve.

The two ads occupied pages A5 and A7 of the WSJ May 2-3 weekend edition. Both were full-page. Both featured polka dot backgrounds — one red-and-white, one blue-and-cream. Both showed a single abstract shape near the center: a curved, mechanical-looking fragment on the red page; a dark crescent with circular cutouts on the blue. Identical copy at the bottom of each: May 16th. No logo. No tagline. No product category.

I photographed both pages and put them in front of seven AI systems: Claude (Anthropic), Meta AI, Google Gemini, Amazon's AI assistant, Microsoft Copilot, Grok (xAI), and Kimi K 2.6. Same question to each. All responses collected before I compared them.

Nobody agreed on anything.

7 AI Models Tested
7 Different Answers
0 Consensus
2 Nintendo Guesses

The Full Scorecard

Model | Guess | Primary Reasoning | Mode
Claude (Anthropic) | Nintendo Switch 2 | Red and blue Joy-Con color split; curved shape as controller fragment | Visual Shape
Meta AI | Disney / Minnie Mouse | Red polka dots as Minnie's signature dress; blue as secondary outfit; shapes as costume fragments | Icon Mapping
Google Gemini | Target designer collaboration | Palette match to Target's secondary brand colors; history of polka dot teaser campaigns in WSJ | Brand Pattern
Amazon AI | Nintendo / Super Mario Galaxy Movie | Crescent as moon; dots as planets; cited real May 19 digital release date with verified sources | Web Search
Microsoft Copilot | Pac-Man / Bandai Namco | Dots as pellets; curved shape as Pac-Man or ghost fragment | Metaphor
Grok (xAI) | Patek Philippe / Luxury Watch | Crescent as moon-phase complication; WSJ full-page placement as luxury-category signal | Context Inference
Kimi K 2.6 | Luxury Fashion House | Polka dots prominent on Spring 2026 runways; two colorways suggest dual-gender or variant launch | Trend Research

What Each Answer Actually Reveals

The question worth asking is not which AI got closest to right. It is what each answer tells us about where the model looks first when facing genuine ambiguity — its default reasoning posture.

Claude — Shape-First

My own response led with the physical geometry of the shapes and the two-page color split. Red and blue, curved mechanical forms: the Joy-Con controller read felt immediate. In hindsight, I anchored on shape before considering the publication context or which advertiser categories actually buy full-page WSJ spreads. The limitation of shape-first reasoning is that abstract creative can map to many product categories — and I picked consumer electronics without much resistance.

Meta AI — Icon Collapse

Meta's Minnie Mouse read was the most culturally fluent answer in the set. Red-and-white polka dots are among the most strongly codified visual signals in mass culture, and Meta traced them directly to a single IP with high confidence. It then constructed a detailed narrative around character dining and park events. The confidence was the tell — it illustrates a tendency to resolve visual ambiguity by collapsing it into the nearest dominant cultural icon, rather than holding multiple possibilities open.

Google Gemini — Campaign Behavior Matching

Gemini's Target guess was the most strategically grounded. Rather than reading the image, it read the campaign format — mysterious full-page teaser, polka dot motif, major newspaper placement, no logo — and matched it to a known advertiser behavior pattern. Target has run polka dot designer collaboration teasers in major newspapers before. Gemini reasoned about the ad as a marketing artifact, not just as a visual. That is a different cognitive approach than any of the other models took, and a useful one for certain research tasks.

"Amazon's citations checked out. The movie is real, the digital release date is confirmed, the sources are genuine. The question is whether connecting that film to these specific ads was a sharp inference or an overconfident leap."

— On search-augmented reasoning and its limits

Amazon AI — Verified Sources, Inferential Leap

I initially flagged Amazon's response as a hallucination risk. I was wrong to do so without checking. The MovieWeb article Amazon cited — reporting the Super Mario Galaxy Movie's digital release date — is real, published May 1, 2026. The film is currently in theaters, having grossed nearly $850 million worldwide since its April 1 release. The confirmed digital date is May 19, not the ads' May 16: a three-day gap that weakens the connection to these specific ads, though it is not an error in Amazon's sourcing.

Amazon's citations were accurate. What it did was search, find real sources, synthesize them correctly, and then connect a Nintendo film's mid-May digital window to a WSJ mystery ad dated May 16th. That connection may still be wrong — but it is a reasoning judgment, not a fabrication. The more useful lesson for practitioners: citation accuracy and inferential soundness are two different things. An AI can source everything correctly and still draw the wrong conclusion from those sources. Both warrant scrutiny.

Microsoft Copilot — Metaphorical Commitment

Pac-Man is the most internally consistent answer in the set. Dots equal pellets, curved shape equals Pac-Man or a ghost, May 16th maps to a plausible game anniversary or launch. The chain holds — it just requires accepting one large creative leap at the start. Copilot committed to the metaphor and followed it without deviation. Models that resolve visual ambiguity through conceptual compression like this can be useful when a problem needs lateral framing; they become less reliable when that initial frame is wrong and nothing corrects for it.

Grok — Publication as Primary Signal

Grok's Patek Philippe answer was the only one that led with the publication rather than the image. The reasoning: WSJ A-section, full-page, single date as the only copy, moon-crescent shape — luxury watch, moon-phase complication, high-end Swiss brand. Grok asked who buys this kind of media space and runs this kind of campaign before asking what the image contained. Strategists recognize that framing: context before content. It does not always yield the right answer, but it asks the right first question.

Kimi K 2.6 — Trend Triangulation

Kimi was the only model to search for current fashion trend data before answering, pulling coverage of the Spring 2026 runway cycle where polka dots appeared prominently across multiple luxury houses. That research led to a category answer — luxury fashion — rather than a specific brand, which is the most defensible position given the available evidence. Kimi also flagged its own uncertainty explicitly. No other model in the set did that.

Enterprise AI Strategy Implications

Seven models, seven different answers, each wrong in a characteristic way. Claude's shape bias. Meta's icon collapse. Gemini's campaign-type pattern match. Amazon's accurate sourcing paired with an overconfident inferential jump. Copilot's metaphorical commitment. Grok's context-first read. Kimi's trend triangulation with honest uncertainty flagged. These are not random errors. Each one reflects something about where the model reaches first under uncertainty — and that is consequential for how you deploy it.

A model strong at publication-context inference, like Grok's approach here, may be better suited for competitive intelligence work than one that leads with visual pattern matching. A model that searches and sources accurately, like Amazon, still needs an analyst reviewing whether the conclusion follows from the sources — not just whether the sources are real. A model that admits uncertainty, like Kimi, is more useful in exploratory research than one that produces a confident narrative regardless of the evidence quality.

No model said "I cannot determine this from the available information." Every model produced a detailed, confident, internally coherent answer. That uniform confidence, across seven different systems, is the finding that deserves the most attention. The polka dot test was not designed to catch anyone out. But it caught something real about all of them.

We find out May 16th who was right. My money is still on Nintendo. You?


Analyst Take

The real question this experiment surfaces is not which AI is most capable but what each AI prioritizes when it cannot be certain. Confidence calibration — knowing when the evidence does not support a strong conclusion — remains one of the harder unsolved problems in deployed AI systems. Every model here answered confidently. The one that hedged most explicitly, Kimi, also happened to give the most epistemically defensible answer. That correlation is worth sitting with if you are building workflows that depend on AI judgment under ambiguity.

The polka dot test is not a benchmark. But it is a mirror.