
Bruce Hart

coding · AI · LLMs · opinion

GPT-5.4 Is My Daily Driver. Claude Opus 4.6 Is Still My Specialist.

Bruce Hart
7 min read

GPT-5.4 is still my default model. But "default" and "best at the weird hard part" are not the same thing.

I have been using GPT-5.4 as my daily driver and I genuinely think it is excellent. Fast, capable, good in Codex CLI, good at writing, good at code, usually good enough that I do not need to think twice.

But this week I ran into two cases where Claude Opus 4.6 was just better.

Not a little better. Clearly better.

That matters because model conversations are still too flattened. People want one leaderboard, one winner, one answer to "what is the best model?" Real usage does not look like that. It looks more like having a very strong general-purpose tool on your desk, plus one or two specialists you bring in when the problem shape changes.

A daily driver is not the same thing as the best teacher

First example: I had some code that solved a fairly complex problem, and I wanted to understand the math behind why it worked.

This is an underrated use case for LLMs. Not "write the code for me," but "teach me the thing my code is already doing."

I asked GPT-5.4 to turn that code into a blog post that would teach me the mathematics. It was technically correct. That is important. But it felt mechanical. The explanation was orderly without being especially intuitive. It moved through the concepts the way a diligent documentation generator would move through them.

Then I gave the same task to Opus 4.6.

Opus wrote the explanation the way a good math teacher would. It anticipated confusion. It picked better framing. It explained not just the formal steps, but why the steps should feel natural. It was the difference between reading a competent solution key and sitting with someone who knows where students get lost.

As a sanity check, I ran the same prompt through GPT-5.4 Pro, GPT-4.5, and Gemini 3.1 Pro. Opus stood out. Same task, same target, noticeably better teaching.

That does not mean GPT-5.4 is bad at explanation. It means "correct" and "illuminating" are different bars.

The last 10 percent is where models stop looking interchangeable

The second example was more practical.

I recently built a userscript, a color cycler for New York Times Connections. It lets me mark words with colors while playing, so I can sketch possible groups before committing a guess.

GPT-5.4 in Codex CLI understood the feature immediately and got about 90 percent of the way there. That is still impressive. The problem was the last 10 percent: getting the script's state management to coexist cleanly with the New York Times site's own JavaScript.

That kind of bug is annoying because the obvious fix is usually not the real fix. The page re-renders, your event handlers are technically attached but semantically stale, your state exists but not in the lifecycle the page actually cares about, and each small patch creates one more strange edge case.
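The pattern that tends to survive this kind of environment is to stop storing state on DOM nodes at all: key it by something stable (the word text), delegate events, and re-apply marks after every re-render via a MutationObserver. A minimal sketch of that idea, not the actual userscript; the `data-word` attribute and selector are hypothetical:

```javascript
// Sketch of a re-render-safe color cycler. State lives in a Map keyed by
// word text, never on DOM nodes, so the page can throw elements away and
// re-create them without losing anything.

const COLORS = ["yellow", "green", "blue", "purple"]; // NYT group colors
const marks = new Map(); // word text -> index into COLORS

// Advance a word's color; after the last color, clear the mark entirely.
function cycle(word) {
  const next = marks.has(word) ? marks.get(word) + 1 : 0;
  if (next >= COLORS.length) {
    marks.delete(word);
    return null;
  }
  marks.set(word, next);
  return COLORS[next];
}

// Paint whatever nodes exist right now. Safe to call after any re-render.
function reapply(root) {
  for (const el of root.querySelectorAll("[data-word]")) {
    const idx = marks.get(el.dataset.word);
    el.style.background = idx === undefined ? "" : COLORS[idx];
  }
}

// In the userscript itself, the page is free to re-render at will:
// new MutationObserver(() => reapply(document.body))
//   .observe(document.body, { childList: true, subtree: true });
// document.addEventListener("click", (e) => {
//   const el = e.target.closest("[data-word]");
//   if (el) { cycle(el.dataset.word); reapply(document.body); }
// });
```

Delegating the click handler to `document` sidesteps the stale-handler problem too: no listener is ever attached to a node the page might replace.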

After several iterations, GPT-5.4 still could not quite land it.

Opus 4.6 dug in, found the underlying issue, and fixed it in one pass. Later, when I added more features and GPT-5.4 again got most of the way there but introduced fresh state problems it could not clean up, Opus came in and fixed that too.

This is not a benchmark story. It is a hostile real-world JavaScript environment story.

Plenty of models can write the first version of a userscript. Fewer can reason cleanly about state, event lifecycles, and how their code has to coexist with someone else's fast-moving front end.

Some models are better at crossing the gap from syntax to intuition

These two examples feel related to me.

In one case, the job was "explain this math in a way that changes my mental model." In the other, the job was "understand the real system, not just the code I wish existed."

Those are both translation problems.

The first translates formal structure into intuition. The second translates a local code change into system behavior.

That may be one reason Opus felt stronger here. It did not just continue the pattern in front of it. It seemed better at inferring the missing shape around the problem. Where is the user confused? Where is the system brittle? Where is the hidden interaction that actually matters?

That is a harder skill to fake than raw fluency.

"Best model" is becoming a workflow question

I still reach for GPT-5.4 first. It is my daily driver for a reason. It is fast, solid, and unusually dependable across a wide range of tasks. If I am writing, coding, exploring, or using an agent loop, it is the model I want open most of the day.

But I am increasingly convinced that this whole "pick one winner" framing is too simplistic.

A better framing is something like this: which model do I want for broad, repeated work? Which model do I want when I need deep explanation? Which model do I want when the code has to survive contact with a messy live system?

Right now, for me, GPT-5.4 wins the first question and Opus 4.6 still has a real claim on the other two.

That is a useful distinction, especially for people building tooling around these systems. If your product depends on first-pass correctness in messy edge cases, or on explanations that actually teach, model selection is not a generic procurement decision. It shapes the whole user experience.

Cost is probably hiding part of the story

The main thing that keeps me from using Opus 4.6 more is cost.

If it were priced closer to GPT-5.4, I would reach for it more often. The fact that I do not makes me wonder whether Opus is simply a much larger model, or at least a model running with a very different compute budget behind it.

I do not know that from the outside, and I do not want to pretend I do. But it would fit the behavior. Opus often feels like there is more raw horsepower available, while GPT-5.4 feels more like a model that has been shaped to be reliable, useful, and economically manageable at huge scale.

That would also line up with how OpenAI seems to have been shipping lately. My guess is that they are optimizing harder for a model people can actually deploy broadly, not just one that wins the most dramatic head-to-head screenshot.

Maybe they have a Ferrari in the garage. Maybe they are waiting for more datacenter capacity, or for the right moment to line up a bigger release.

OpenAI also has different constraints. It serves a much larger user base, has major distribution and partnership considerations, and may want its biggest future releases to line up with larger company milestones too, potentially including a public offering. When a model is meant to be a product at massive scale, business reality matters almost as much as raw model quality.

Progress is real, but the edges still matter

One reason I find this encouraging rather than disappointing is that the gap is no longer "one model can code and the others cannot."

The gap is subtler now. Taste in explanation. Persistence in debugging. Ability to find the actual root cause instead of circling the symptom. Those are higher-class problems.

That is progress.

But it is also why I do not think the market settles into one obvious champion anytime soon. As models get stronger overall, the interesting differences move into the edges, and the edges are exactly where serious users spend their time.

If you care about real output, not just demos, those edges matter a lot.

GPT-5.4 has absolutely earned daily-driver status for me. Claude Opus 4.6 has also earned something important: a permanent spot as the model I call when I want the best explanation in the room, or when a bug refuses to die.

If you have seen similar split-brain behavior across models, I would love to hear about it. The interesting question is not just who wins overall. It is which model you trust for which kind of hard.