I pay for five AI subscriptions: Claude Pro, Gemini Pro, Amazon Quick Plus, Grok (through X Pro), and access to GPT-5.5 and others through Abacus AI's router LLM. I also use Meta AI, though I have not found a clear business use for it yet.
Rather than ranking them by vibes or benchmark tables, I ran a test with a known answer. The results tell you more about these models than any leaderboard.
The Verification Test
On May 3, 2026, I asked every model the same question: how many posts did I publish on shashi.co in the past 48 hours?
I knew the answer. It was 8.
None of them got it right.
| Model | Posts Found | Found / Actual (8) | Method | Qualification |
|---|---|---|---|---|
| GPT-5.5 (Abacus) | 11 | 137% | Found and read the Blogger RSS feed | Said "11 posts" with specifics |
| Grok | 5-6 | ~69% | Visited homepage and /2026/05/ archive | Acknowledged there might be more |
| Amazon Quick | 5-6 | ~69% | Real-time search of live URLs | Acknowledged there might be more |
| Claude | 3-4 | ~44% | site:shashi.co via search index | Said "confirmed count is 3, possibly 4" |
| Gemini | 2 | 25% | Google Search index only | Stated 2 as a definitive fact |
What the Test Reveals
GPT-5.5 was the most resourceful. It was the only model that thought to check the RSS feed instead of just reading the homepage. It overcounted, probably pulling in a scheduled post or something just outside the 48-hour window. But the method was the smartest in the group.
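The RSS method is also the easiest one to reproduce yourself. Here is a minimal Python sketch that counts feed entries published in the last 48 hours; the feed URL assumes the standard Blogger path (/feeds/posts/default) and the feedparser library, so treat both as assumptions to adjust for your own setup.

```python
# Count posts published in the last 48 hours from a Blogger Atom feed.
# FEED_URL assumes the standard Blogger feed path; adjust for your platform.
import calendar
from datetime import datetime, timedelta, timezone

import feedparser  # pip install feedparser

FEED_URL = "https://www.shashi.co/feeds/posts/default"  # assumed Blogger default feed path
WINDOW = timedelta(hours=48)


def count_recent_posts(feed_url: str, window: timedelta) -> int:
    feed = feedparser.parse(feed_url)
    cutoff = datetime.now(timezone.utc) - window
    count = 0
    for entry in feed.entries:
        published = entry.get("published_parsed")  # UTC time.struct_time from feedparser
        if published is None:
            continue
        published_at = datetime.fromtimestamp(calendar.timegm(published), tz=timezone.utc)
        if published_at >= cutoff:
            count += 1
    return count


if __name__ == "__main__":
    print(count_recent_posts(FEED_URL, WINDOW))
```

A homepage scrape only sees what the template renders above pagination; the feed lists every recent entry, which is why this method got closest to the real count.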
Grok and Amazon Quick tied. Both visited the site, read what was visible, listed titles, and honestly said there could be more behind pagination. Neither pretended to have the complete picture.
Claude underperformed. I have praised Claude for using site:shashi.co to get around Gemini's indexing limitation. That is real. But in this test, it still only found 3 or 4 out of 8. The site: operator queries Google's index. If Google has not indexed a post yet, Claude does not see it either. Better than Gemini, but the same underlying dependency on search indexing.
Gemini was the most dangerous. Not because it found only 2, but because it presented 2 as the definitive answer. "You have published two posts." No hedging, no disclaimer, no acknowledgment that there might be more. A user who does not know the real number walks away confident in a wrong answer. This is the specific hallucination pattern that has pushed me away from using Gemini for verification tasks.
And this is the core problem with Gemini for content creators: it does not visit URLs. It queries Google Search's index. If your content is not indexed yet, it does not exist. For anyone publishing multiple posts a day, this makes Gemini unreliable for anything involving your own recent output.
What the Benchmarks Say
The formal benchmarks tell a different story than my daily experience, and that gap is worth understanding.
On Humanity's Last Exam, the hardest public reasoning benchmark available:
| Model | Accuracy |
|---|---|
| Gemini 3.1 Pro | 45.9% |
| Claude Opus 4.6 | 34.4% |
| GPT-5 Pro | 31.6% |
| Llama 4 Maverick | 5.7% |
| Nova Pro | 4.4% |
Gemini leads. By a wide margin. On coding benchmarks (SWE-bench Verified), Claude leads at 82.1%. On scientific reasoning (GPQA Diamond), Gemini leads again at 94.3%. Claude holds the top Arena Code Elo at 1548.
Nova Pro, the model behind Amazon Quick, sits near the bottom on raw benchmarks. That is a fact I am not going to hide.
So why did I upgrade to Quick Plus?
Benchmarks Measure Capability. Workflow Measures Output.
I do not use AI to solve PhD-level reasoning problems. I use it to research topics, draft analysis, generate schema markup and structured data, create images for posts, and publish. Every day. Often multiple times a day.
Here is what I actually need and how each model performs in my workflow:
Research and real-time verification. Amazon Quick's research feature pulls from multiple sources and synthesizes them into structured analysis. What used to take me 3 to 4 hours now takes about 1 hour. The real-time search is genuine. It visits live URLs, not just a cached index.
Memory and preferences. I told Quick once that I do not want em dashes in my content. It remembers. I told it my headings are always pasted separately into Blogger. It remembers. I told it I am on my phone and need full edited code, not instructions to edit. It remembers. I rarely repeat myself. With other models, I am re-explaining preferences at the start of every session or managing custom instructions that have character limits.
Code, schema, and structured data. Every post I publish gets schema markup, pagemaps, and structured data. Quick generates these correctly and consistently. I do not have to explain the format each time.
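For concreteness, this is the shape of the BlogPosting JSON-LD I mean. It is a minimal sketch with hypothetical values, not my actual template or a complete markup set.

```python
# Minimal sketch: BlogPosting JSON-LD wrapped in a standard script tag.
# Every value below is a placeholder, not a real post.
import json


def blog_posting_jsonld(title: str, url: str, published_iso: str, author: str) -> str:
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": title,
        "url": url,
        "datePublished": published_iso,
        "author": {"@type": "Person", "name": author},
    }
    return '<script type="application/ld+json">\n' + json.dumps(data, indent=2) + "\n</script>"


print(blog_posting_jsonld(
    "Example post title",                               # placeholder
    "https://www.shashi.co/2026/05/example-post.html",  # hypothetical URL
    "2026-05-03T09:00:00Z",
    "Example Author",                                   # placeholder
))
```

The snippet itself is trivial. The value is that Quick outputs this format consistently without being re-told the field names each session.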
Images in the same thread. I can generate images without leaving the conversation. It is slower than Grok or dedicated image tools. But staying in one thread and not context-switching to another app has value when you are trying to maintain a line of thinking.
Staying in one place. This is the real argument. I am not switching between Claude for writing, Grok for images, Gemini for search, and a code tool for schema. One thread handles the full publishing workflow from research through structured data. The reduction in context-switching is where the time savings come from.
Where Each Model Fits in My Stack
Claude remains the best writer and coder. When I need polished prose or complex code, Claude is where I go. The 82.1% SWE-bench score is real and you feel it. But it lacks native real-time search, does not carry memory across sessions the same way, and cannot generate images in the same conversation.
Grok is fast and gives you real-time validation. Images are quick to generate. But there is no structured way to work with it. No persistent memory, no research orchestration, no schema generation workflow. The images lean toward social media and memes rather than business content. If prompted well, it gives good results, but you are doing the orchestration yourself.
Gemini has the highest raw reasoning scores on paper. If you are doing scientific analysis or complex logic problems, the benchmarks say it should be your first choice. But for content verification and anything involving your own recently published work, the Google Search index dependency is a dealbreaker. I was on Gemini Pro. I still am. But it has moved from my primary tool to a specialized one.
GPT-5.5 (via Abacus AI) showed the most creative problem-solving in the verification test. The RSS feed approach was something no other model attempted. Abacus AI's router LLM, which picks the best model for each task, is an interesting approach that aligns with what the benchmark data suggests: no single model dominates every task. I note that Amazon's Nova models are not available on Abacus, but Llama is.
Amazon Quick does not win any single benchmark category. Nova Pro scores 4.4% on Humanity's Last Exam. That is near the bottom. But I upgraded to Plus anyway because it does the most things well enough in one place. Research, real-time search, memory, code generation, structured data, images. The compound effect of not switching tools is where the value sits.
Meta AI rounds out the list. I use it, but I have not found a consistent business or content creation use case for it yet. If that changes, I will write about it.
The Honest Summary
If you want the best reasoning engine, the benchmarks point to Gemini. If you want the best code and writing, Claude. If you want the fastest real-time results with images, Grok. If you want model routing that picks the right tool for each job, Abacus AI.
If you want one place that handles the full workflow of researching, writing, generating code and structured data, creating images, and remembering how you work, that is where Amazon Quick sits for me. It is not the smartest model in the room. It is the most useful tool on my desk.
That is a different thing. And for someone publishing 8 posts in 48 hours, the difference matters.

