
Bruce Hart


GPT-5.5 Feels Like an Incremental Upgrade That Actually Matters


GPT-5.5 is not a different species from GPT-5.4. It is the kind of incremental upgrade that starts to matter when the agent has to stay useful for a long time.

I have been using GPT-5.5 in Codex, and the short version is: I am impressed.

Not in a "throw out everything else" way. GPT-5.4 is still very good, and if you are using the API, I suspect there are a lot of cases where GPT-5.4 is the correct default. Maybe most cases. If the job is bounded, well-specified, and easy to verify, the extra cost of GPT-5.5 may not buy you enough to matter.

But there are also cases where the upgrade shows up in a way that is hard to ignore. The model feels faster, more token-efficient, and more capable at staying oriented across a messy task. That combination matters for agent work because the failure mode is usually not one big dramatic mistake. It is a slow drift into confusion, wasted steps, or brittle assumptions.

GPT-5.5 seems better at avoiding that drift.

The default API answer is probably still GPT-5.4

The boring but useful take is that GPT-5.5 should not automatically replace GPT-5.4 in every workflow.

If you are classifying support tickets, summarizing short documents, generating routine structured output, or doing one-shot transformations where the input and expected output are clear, GPT-5.4 is probably fine. More than fine. In those cases, the important engineering question is not "what is the strongest model I can call?" It is "what is the cheapest model that clears the quality bar with enough margin?"

That sounds obvious, but it is easy to forget when a new model feels good. Capability is not the same as fit. A slightly better answer is not always worth a materially higher bill.

The places where GPT-5.5 feels more interesting are the places where the model has to hold a plan together, use tools, recover from weirdness, and produce something inspectable at the end. Agentic work. Browser work. Data wrangling. Long tasks with enough state that a small mistake early can compound.

That is where the incremental improvement starts to feel less incremental.

Agent stamina is the thing I notice

The phrase I keep coming back to is agent stamina.

Not intelligence in the abstract. Not benchmark score as a personality trait. Stamina: the ability to keep a multi-step task coherent without making me babysit every turn.

Speed matters here, but not just because faster is nicer. When the model burns fewer tokens and gets to the right next action sooner, the whole loop gets cheaper and less fragile. It has more room to inspect, verify, and adjust. It can spend budget on the task instead of spending it on finding its place again.

That is the part of GPT-5.5 that has stood out to me so far. It feels like a model that wastes fewer moves.

Chrome remote debugging unlocked a new kind of Codex task for me

The example that really sold me was a fantasy football rabbit hole.

I gave Codex CLI access to Google Chrome with remote debugging turned on. Then I logged into my Yahoo Fantasy Football league from last year and asked GPT-5.5 to scrape the season's matchup data into a SQLite database. After that, I asked it to write a report showing the average point differential by position across the season for matchups involving each team.
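The setup side of this is mostly starting Chrome with its remote debugging flag and talking to the DevTools HTTP endpoint it exposes. Here is a minimal sketch of discovering debuggable tabs; the port is Chrome's conventional example, and the payload is an illustrative sample shaped like Chrome's `/json` target list, not captured from my session:

```python
import json

# Chrome started with: google-chrome --remote-debugging-port=9222
# exposes an HTTP endpoint that lists debuggable targets:
DEVTOOLS_URL = "http://localhost:9222/json"

# Illustrative payload shaped like Chrome's /json response. A real
# client would fetch it, e.g. urllib.request.urlopen(DEVTOOLS_URL).
sample_targets = json.dumps([
    {
        "id": "ABC123",
        "type": "page",
        "title": "Yahoo Fantasy Football",
        "url": "https://football.fantasysports.yahoo.com/",
        "webSocketDebuggerUrl": "ws://localhost:9222/devtools/page/ABC123",
    },
    {
        "id": "DEF456",
        "type": "service_worker",
        "title": "Some extension",
        "url": "chrome-extension://whatever",
    },
])

def find_page_targets(payload: str) -> list[dict]:
    """Return only real page tabs (skip extensions, workers, etc.)."""
    return [t for t in json.loads(payload) if t.get("type") == "page"]

pages = find_page_targets(sample_targets)
print(pages[0]["webSocketDebuggerUrl"])
```

An agent then drives the tab over that WebSocket URL, which is why the existing login state comes along for free: it is the same browser profile I was already signed into.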

That is a very different task from "write a script."

It had to work through a live browser session that already had my login state. It had to inspect Yahoo's pages, figure out where the matchup data lived, collect it without losing track of weeks and teams, normalize the results into tables, and then query the database for an actual answer. It also had to deal with the boring operational reality that Yahoo had throttling in place.

What surprised me was that Codex did not just slam into the throttling and fail. It noticed the constraint and slowed down the request pace on its own. That is exactly the sort of small, practical adjustment that makes agent work feel different from script generation. It was not trying to be clever. It was just respecting the system it was interacting with and continuing the job at a sane speed.
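That pacing behavior is easy to sketch in code. A minimal backoff loop, assuming a `fetch` callable that raises a hypothetical `Throttled` error on an HTTP 429; these names are my illustration, not what Codex actually wrote:

```python
import time

class Throttled(Exception):
    """Raised by fetch() when the site returns HTTP 429."""

def fetch_with_backoff(fetch, url, base_delay=1.0, max_retries=5,
                       sleep=time.sleep):
    """Retry a throttled request, doubling the wait each time."""
    delay = base_delay
    for _ in range(max_retries):
        try:
            return fetch(url)
        except Throttled:
            sleep(delay)   # respect the site: wait before retrying
            delay *= 2     # exponential backoff
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

The `sleep` parameter is injected so the loop is testable without actually waiting; the point is the shape of the behavior, not the exact delays.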

The end result was not a demo. It was a database and a report I could actually use.

The best part was learning I was wrong

The report told me I lost the most ground last season in defensive matchups: about -3 points per game.

That surprised me. I thought defense was a strength of my team. Or at least not a real problem. But across the season, in the matchups involving my team, that position group was quietly costing me.
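Once the data is in SQLite, a number like that falls out of one grouped query. A sketch with a made-up schema and sample rows; the table and column names are my assumptions, not the ones Codex actually chose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matchup_positions (
    week        INTEGER,
    team        TEXT,   -- the team whose matchups we are analyzing
    position    TEXT,   -- QB, RB, WR, TE, DEF, K
    my_points   REAL,
    opp_points  REAL
);
-- Illustrative sample rows, not real league data.
INSERT INTO matchup_positions VALUES
    (1, 'My Team', 'DEF',  4.0,  9.0),
    (1, 'My Team', 'QB',  22.0, 18.0),
    (2, 'My Team', 'DEF',  6.0,  7.0),
    (2, 'My Team', 'QB',  25.0, 20.0);
""")

# Average point differential by position across my team's matchups.
rows = conn.execute("""
    SELECT position, AVG(my_points - opp_points) AS avg_diff
    FROM matchup_positions
    WHERE team = 'My Team'
    GROUP BY position
    ORDER BY avg_diff
""").fetchall()

for position, avg_diff in rows:
    print(f"{position}: {avg_diff:+.1f} per game")
```

The `ORDER BY avg_diff` puts the worst position group at the top, which is exactly the kind of report that surfaces a weakness you did not know you had.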

This is the fun part of making personal data easier to query. You get to test the story in your head against the numbers.

Fantasy football is mostly a toy problem, but the pattern is not. A lot of useful software is just a bridge between "I wonder if..." and "here is the answer." The bridge is usually annoying enough that we do not build it. Log in. Click around. Copy tables. Clean names. Make a schema. Write the queries. Fix the weird week where the page layout changed.

With Codex, Chrome remote debugging, and a stronger model in the loop, that bridge got short enough that I actually crossed it.

The frontier is tool access plus judgment

The important thing here is not that GPT-5.5 can scrape a fantasy football site. That is too narrow a lesson.

The bigger lesson is that browser access changes the shape of what an agent can do. A lot of useful work lives behind authenticated sessions, dynamic pages, and awkward UI flows. APIs are great when they exist and expose what you need. But plenty of personal workflows are trapped in websites.

Chrome remote debugging gives Codex a way into that world without pretending the web is cleaner than it is. The model can inspect the page, try an approach, notice when it is getting rate limited, slow down, persist intermediate data, and produce an artifact outside the browser.

That last part matters. The real output was not a scraped page. It was a SQLite database and a report. Browser interaction was just the way to get there.

GPT-5.5 is a routing decision, not a moral victory

My current take is pretty simple: GPT-5.5 is impressive, but I would still route deliberately.

Use GPT-5.4 when the task is cheap to check, narrow in scope, and unlikely to benefit from a stronger long-horizon agent. Reach for GPT-5.5 when the task has messy state, tool use, browser interaction, large context, or expensive mistakes.
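If you would rather hold that routing rule in code than in your head, it is a one-function decision. Everything here, the signal names and the model strings, is my own sketch of the heuristic above:

```python
def pick_model(messy_state=False, tool_use=False, browser=False,
               large_context=False, expensive_mistakes=False):
    """Route to the stronger model only when the task profile warrants it."""
    needs_stamina = any([messy_state, tool_use, browser,
                         large_context, expensive_mistakes])
    return "gpt-5.5" if needs_stamina else "gpt-5.4"
```

A bounded, easy-to-verify job gets `pick_model()` and stays cheap; a browser-driving scrape with persistent state gets `pick_model(browser=True, messy_state=True)`.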

That is not as exciting as saying "new model, everything changes." But it is probably more useful.

For me, GPT-5.5 has already earned a place in the jobs where I want Codex to act less like autocomplete and more like a patient data assistant. The fantasy football experiment was small and goofy, but it showed me something real: once an agent can use the browser well, slow down when the site asks it to, and turn messy pages into a database, a lot of personal analysis projects become one prompt away from worth trying.

I am going to do more of this before the season starts. There is probably a lot hiding in last year's matchups, waiver moves, and weekly lineup decisions.

And apparently I need to rethink my defense strategy.