Gemini 3.1 Pro Review: Better, But Still Behind Where It Matters
Gemini 3.1 Pro is a clear improvement over Gemini 3 Pro, but in real workflows it still trails on tool calling, long coherence, and feedback incorporation even as it shines in SVG and vision tasks.
Gemini 3.1 Pro improves on its predecessor. Nobody's arguing that. But using it day to day, I can't shake the feeling it's still bringing a really fast horse to what's become a car race.
I've been putting Gemini 3.1 Pro through real work. Not synthetic benchmarks, not toy demos. Just the messy, ordinary prompts I throw at models every day. And my honest take? It's a solid step forward. It is not a leap.
I know that sounds harsh. It's not meant as a dunk. Rewind a year and this model would've blown my mind. But the ground moves so fast now that "impressive" has a shelf life of about two weeks.
Theo Browne nailed it with a line I can't stop thinking about: Gemini feels like a faster horse while everyone else is building cars. That's exactly how it lands in practice.
Benchmarks Are Real. So Is Benchmaxing.
Look, I'm not here to trash Gemini's benchmark numbers. They matter. They capture real capability shifts, and they're useful signal when you know how to read them.
But let's be honest. "Benchmaxing" is a thing. You can absolutely juice your leaderboard scores while quietly fumbling the stuff power users actually care about: holding a thread together over a long task, rolling with feedback gracefully, chaining tool calls without losing the plot halfway through.
Those are the make-or-break moments in my actual workflow. Not whether a model can ace a multiple-choice exam.
For Everyday Chat? We're Hitting "Good Enough."
Here's a strange realization I keep bumping into: for plain old chat use, raw model intelligence is starting to matter less.
When I tested Gemini 3.1 Pro, I genuinely struggled to dream up prompts that any modern frontier model couldn't handle. Think about that for a second. That's a weird milestone, and an important one. If every top model clears a high bar, then the race shifts. It stops being about who's smartest and starts being about who's most available. Best defaults. Deepest integrations. Broadest distribution. Strongest trust.
Google is scarily well-positioned for that world. They don't need to own the absolute number one spot every quarter. They need to own the product stack. And they just might.
Where the Real Battle Is: Tool Calling and Long Coherence
The frontier right now, the stuff that actually separates models in hard, real work, boils down to two things:
- Tool calling that doesn't fall apart under pressure. Not demo-quality tool use. Production-quality, end-to-end, things-go-wrong-and-it-adapts tool use. There's a minimal sketch of what I mean right after this list.
- Long coherence. Can the model stay sharp across a dozen turns of clarifications, revisions, and course corrections?
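To make "things-go-wrong-and-it-adapts" concrete, here's the shape of the loop I test with, stripped down to a minimal sketch. Everything in it is a hypothetical stand-in, not any vendor's actual API: `call_model` takes a message history and returns either a tool request or a final answer, and the whole point is the error branch at the bottom.

```python
# Minimal agent loop sketch. `call_model` is a hypothetical stand-in:
# given a message history, it returns {"tool": name, "args": {...}}
# or {"answer": text}. No real SDK is assumed here.

def run_tool(name: str, args: dict) -> str:
    """Dispatch a tool call; raise on unknown tools so failures are visible."""
    tools = {"search": lambda a: f"top results for {a['query']!r}"}
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](args)

def agent_loop(call_model, task: str, max_steps: int = 10) -> str:
    """Drive the model to an answer, feeding tool errors back instead of crashing."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)
        if "answer" in reply:
            return reply["answer"]
        try:
            result = run_tool(reply["tool"], reply["args"])
            history.append({"role": "tool", "content": result})
        except Exception as exc:
            # The make-or-break moment: surface the failure and watch
            # whether the model adapts its plan or loses the plot.
            history.append({"role": "tool", "content": f"ERROR: {exc}"})
    return "step budget exhausted"
```

What I'm grading is that `except` branch: when a tool errors out mid-chain, does the model change course, or does it keep hammering the same broken call until the step budget runs dry?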
This is where I still feel daylight between the players. OpenAI's models are best at incorporating feedback and iterating when something goes sideways. Anthropic's models have what I'd call response EQ. They read intent better. They match your energy. They sound more like a thoughtful collaborator than a completion engine.
Gemini 3.1 Pro? It can absolutely nail a first pass. But it drifts sooner. And when you need a tight correction after a miss, which is constant, it's less reliable.
That matters more than people think. Half of advanced prompting is cleaning up your own vague instructions. The best models aren't the ones that never whiff. They're the ones that recover fast, with minimal back-and-forth.
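If you want to feel the recovery difference yourself, the probe is almost embarrassingly simple. Here's a hypothetical sketch, reusing the same `call_model` stand-in as above (chat-only this time, so it always returns an answer): run a task, flag one specific miss, and diff the two drafts.

```python
# Hypothetical two-turn correction probe, using the same `call_model`
# stand-in as the loop above (here it always returns {"answer": text}).

def correction_probe(call_model, task: str, correction: str) -> tuple[str, str]:
    """Run a task, then a pointed correction; return both drafts for diffing."""
    history = [{"role": "user", "content": task}]
    first = call_model(history)["answer"]
    history += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": correction},
    ]
    second = call_model(history)["answer"]
    return first, second

# What I look for: does the second draft fix only what was flagged,
# or does it also rewrite (and sometimes re-break) parts that were fine?
```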
Credit Where It's Due: Gemini's Genuine Bright Spots
I don't want this to read like "Gemini bad, everyone else good." That's lazy, and it's wrong.
Gemini is still the model I reach for first on certain SVG generation tasks. Its vision capabilities in practical, real-world scenarios? Genuinely excellent. Often best-in-class.
These aren't footnotes. For anyone doing serious multimodal work, these are meaningful differentiators that most benchmark Twitter threads completely ignore.
The CLI Gap Is Starting to Hurt
Here's the thing that honestly surprises me most: Gemini's CLI story still feels miles behind what power users now expect.
Right now, the serious builder energy, the people shipping real things with AI, is clustering around Claude Code and Codex-style environments. And that's dangerous for Google, because developer mindshare compounds. Once the tooling, the docs, the community patterns, the muscle memory all build up around a particular ecosystem, that flywheel is really hard to reverse.
If Google wants Gemini at the center of agentic coding workflows, the CLI and tool-calling reliability can't be a "we'll get to it" priority. It needs to be a "right now" priority.
The Bigger Picture: Adoption Is Its Own Freight Train
One more thing I keep coming back to: model research and societal adoption run on completely different clocks.
Even if core LLM progress hit a wall tomorrow, we'd still have years of work ahead rewiring products, workflows, education, and institutions around what already exists today. The capability is ahead of the integration. Way ahead.
And from where I'm sitting, research doesn't look stalled at all. So we're watching two waves build at the same time: models getting better and the world finally catching up to what's already here.
That's what makes this moment electric.
If you're seeing different results with Gemini 3.1 Pro, I'd genuinely love to compare notes. Send me the hardest real-world prompts you've got. What broke. What worked. What surprised you. Let's figure this out together.