
Bruce Hart


GPT-5.4 Feels Like the Start of the Stability Era


GPT-5.4 does not feel like just another model release. It feels like the frontier is changing shape.

OpenAI released GPT-5.4 on March 5, 2026, and my first reaction was pretty simple: this thing is good enough across enough tasks that I can imagine using it for almost everything.

That is a bigger shift than it sounds.

A lot of recent model launches have felt specialized. One model is great at coding but clunky for writing. Another is fast but flaky. Another is thoughtful but too slow to sit in the middle of real work. GPT-5.4 feels closer to a default model. Fast, accurate, strong at coding, strong at writing, and, importantly, comfortable inside agentic workflows where the model has to keep context straight over multiple steps.

Claude Opus 4.6 deserves a mention here too because it is excellent. In some cases it feels genuinely elite. But it is expensive enough that I do not reach for it as casually, and in my own use it still feels a bit less consistent than GPT-5.4 at following instructions exactly. That does not make it worse across the board. It just makes GPT-5.4 feel more like the model I can default to all day.

For me, that is the real tell. We are getting close to the point where the interesting question is no longer "can the model do the task?" It is "can the system stay reliable long enough to do twenty tasks in a row?"

Coding is starting to look solved for normal people

I do not mean "solved" in some absolute computer science sense. There will still be hard bugs, architecture mistakes, and big engineering problems that require real judgment.

But for a huge amount of everyday work, coding now feels functionally solved.

Not perfect, but solved enough to change behavior.

If I can hand a model a medium-sized implementation task, get back good code quickly, iterate without much friction, and trust it not to go off the rails every other prompt, then the limiting factor is no longer raw code generation. The limiting factor is orchestration. It is tool use. It is whether the model can operate inside a loop without slowly degrading the quality of its own work.

That is why GPT-5.4 feels important to me. It is not just better at producing nice snippets. It feels more usable as a working system component.

The next race is stability, not just benchmark flexing

There is still plenty of capability runway left. OpenAI has been signaling a faster update cadence, and Anthropic has also been communicating, in its own way, that frontier progress is far from tapped out.

That matters because it suggests we are not near some clean plateau. We are still climbing.

But the climb is changing. The sexy part used to be whether a model could write code, reason through a hard prompt, or beat the last benchmark. The more practical question now is whether a model can stay coherent over time. Can it remain useful after hundreds of turns, tool calls, edits, retries, and changing objectives? Can it recover from small mistakes instead of amplifying them?

That is a different kind of problem.

Not "make it smarter once," but "make it dependable for hours."

Agentic use exposes what benchmarks hide

A lot of models look better in a one-shot demo than they do in an actual agent loop.

Once a model starts calling tools, reading files, revising plans, dealing with partial failures, and carrying context over long stretches, weaknesses show up fast. Small hallucinations compound. Overconfidence becomes expensive. A model that seems brilliant in isolation can become annoying the second it has to operate like a teammate.
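That compounding effect is easy to see in a toy simulation. The sketch below is purely illustrative: it does not model GPT-5.4 or any real agent framework, and every name and number in it is an assumption. It just shows why a per-step verification-and-retry loop keeps errors roughly flat while an unchecked loop lets them accumulate in context.

```python
import random

def step(state, error_rate):
    """One simulated tool call or edit. Returns (new_state, ok)."""
    if random.random() < error_rate:
        return state + ["bad"], False  # a small mistake enters the context
    return state + ["ok"], True

def run(turns, error_rate, recover):
    """Run a toy agent loop; count how many mistakes survive in context."""
    random.seed(0)  # deterministic for comparison
    state = []
    for _ in range(turns):
        state, ok = step(state, error_rate)
        if not ok and recover:
            state.pop()  # verification catches the bad step; redo it once
            state, ok = step(state, error_rate)
    return state.count("bad")

# Without recovery, mistakes pile up in context; with a single verify-and-
# retry pass, only back-to-back failures survive (error_rate squared).
print("no recovery:", run(2000, 0.05, recover=False))
print("with recovery:", run(2000, 0.05, recover=True))
```

The point of the sketch is the asymmetry: the unchecked loop degrades linearly with the number of turns, while even one cheap verification step drops the surviving error rate to roughly the square of the per-step rate. That is the "recover from small mistakes instead of amplifying them" property in miniature.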

GPT-5.4 feels unusually promising here.

That does not mean it is flawless. It means it feels stable enough that I want to give it more leash. And that is a pretty meaningful threshold. Agentic systems do not become broadly useful when they hit a benchmark score. They become useful when users stop hovering over them every minute.

Self-learning is the obvious next obsession

If coding is getting cheap and competent, then the next big unlock is not just more raw intelligence. It is systems that improve themselves safely.

That could mean better memory. Better reflection loops. Better error correction. Better use of external tools and feedback. Eventually it may mean controlled forms of self-learning where models refine strategies over time without drifting into nonsense.

That is where things get really interesting.

A model that is 5 percent smarter on a benchmark is nice. A model that can run for a long time, learn from outcomes, and stay aligned with the task is a platform shift.

I suspect that is where a lot of the frontier labs are aiming now. Not because raw intelligence is done, but because persistence multiplies the value of intelligence you already have.

It is kind of wild how fast this moved

What gets me most is the timeline.

It has only been about a year since o3 reset expectations around what these systems could do on harder reasoning and coding tasks. Now we are already talking about models that feel much closer to general-purpose workhorses than flashy research artifacts.

That pace is hard to internalize.

You can still find people talking about AI as if we are waiting for the first really useful coding model. That story already feels out of date. The conversation now is about reliability, productization, and whether we can build systems around these models that are stable enough to trust for long stretches.

That is a much better problem to have.

We may be entering the era of default intelligence

The biggest compliment I can give GPT-5.4 is that it feels less like a novelty and more like infrastructure.

That is when technology starts to matter most. Not when it shocks you once, but when it quietly becomes the thing you reach for first.

If OpenAI really is moving toward more frequent updates, and if Anthropic and others keep pushing hard from their side too, the next year could be absurdly interesting. Not because one model will magically solve everything, but because the baseline of what counts as normal is rising very fast.

I am very impressed by GPT-5.4. More than that, I think it is a glimpse of where the whole field is headed: less drama around whether models can be useful, more pressure on whether they can be trusted to keep going.