
LLM Coding Power Rankings (January 2026)

Bruce Hart
8 min read

Power rankings for LLM coding, January 2026: the gap is real, but it’s not where you think.

I’ve been coding long enough to remember when “pair programming” meant another human being and not a probabilistic autocomplete monster that occasionally decides your file should be named final_final_v2_REAL.ts.

Anyway: it’s January 2026, the LLMs have basically turned coding into a high-velocity craft, and I’ve got takes. These are the models I actually use, the ways they win (and lose), and the stuff that still makes me clutch my pearls and git reset --hard.

The criteria (aka my extremely scientific rubric)

I don’t care about benchmark flexing. I care about “can this thing get me to a clean PR without making me babysit it like a toddler near a pool?”

My scorecard:

  • Time to first useful patch: how fast it gets to something runnable.
  • Instruction obedience: does it follow the actual prompt, or invent a different sport.
  • Refactor stamina: can it keep a big change coherent over 30–60 minutes.
  • Coachability: when you say “no, not like that,” does it adjust.
  • Context muscle: docs, diffs, long files, weird formats.
  • Trust tax: the hidden cost of checking its work.
  • Cost per win: not just price, but price times how much you have to re-do.
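
To make “cost per win” concrete, here’s the back-of-the-napkin math I actually do. This is a toy sketch: the dollar amounts and the rework multiplier are invented, and the point is just that raw token price is only half the story.

    // Toy cost-per-win math (illustrative numbers, not real pricing).
    // Effective cost = what you spent on the task, inflated by how much
    // of the output you had to re-do or re-verify afterwards.
    function costPerWin(tokenSpendUsd: number, reworkFraction: number): number {
      // reworkFraction: 0 = shipped as-is, 1 = you basically rewrote it
      return tokenSpendUsd * (1 + reworkFraction);
    }

    // A “cheap” model that needs most of its output redone can cost more
    // per win than a pricier model you barely have to touch.
    console.log(costPerWin(2, 1));    // 4: cheap tokens, but you rewrote most of it
    console.log(costPerWin(3, 0.25)); // 3.75: pricier tokens, but it mostly shipped as-is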

Also, let’s be honest: the biggest variable is still you. The better your prompts, the better the outcome. (Yes, this is me blaming the user for everything. It’s a hobby.)

Tier 1: The Go-Tos

These are the two models where I go, “Okay, now we’re doing the real thing.”

1a. Claude Opus 4.5 — the “why is this so fast?” experience

Opus 4.5 is the model everyone’s raving about right now, and for good reason: it’s insanely quick and it has this “I’ll just do the whole thing” energy that makes you feel like you hired a very caffeinated senior engineer for the afternoon.

When Opus is cooking, it gets to a sensible architecture fast, it’s great at chaining tasks without losing the plot, and it’s weirdly good at the “glue” work: tests, edge cases, little refactors that make the rest easy.

But here’s the thing — and this is why it’s 1a instead of unanimous winner — the trust tax can spike.

It’s expensive. Not “I won’t buy it” expensive, but “this better not hallucinate a non-existent API” expensive. And occasionally it makes a mistake that’s so odd it feels like you caught a trusted teammate doing something baffling in public.

Not constant. Not deal-breaking. Just enough that I don’t fully trust it on autopilot yet.

The Opus paradox: it saves you time by moving fast, then it steals a little time back by making you verify the one weird thing it did at 2:17am.

1b. GPT‑4.2 Codex — the rock-solid baseline that shows up every day

GPT‑4.2 Codex is the one I use the most. It’s steady, it’s consistent, and it delivers on almost every problem I throw at it. It’s the model equivalent of the veteran engineer who never gets flustered and somehow always ships a clean patch, with no drama.

Why it’s my default: it follows instructions really well, it’s excellent at “do the boring thing correctly” coding (migrations, endpoints, refactors, wiring), and it tends to produce code that feels maintainable, not just “it passes once.”

And honestly? Value matters. The usage limits on an OpenAI subscription are generous, and my Pro subscription feels like one of those rare “this is actually worth it” internet purchases.

Downsides: speed can be noticeably slower than Opus 4.5, and the subagent gap is real. Opus feels like it uses subagents more effectively — it can split work, explore, come back with a coherent answer. Codex is more like: “I will do this, sequentially, with vibes.”

I’m optimistic that gets better this year (maybe even this month). The direction is obvious.

Not flashy, but dependable. And in coding, dependable is basically half of genius.

Tier 2: Diet Dr Pepper Award (sneaky good and something different)

This is the model that doesn’t always win the title, but it does that one thing nobody else does, and suddenly you’re building your workflow around it.

2. Gemini 3 Flash — the document whisperer who sometimes ignores the brief

I know this is heresy in some circles, but I actually prefer Flash to Pro in a lot of real situations.

Gemini 3 Flash is fast and it’s the best I’ve used for handling large amounts of data. If you hand it long docs, multiple files, weird tables, and layouts that make other models sweat, it’s like, “Got it,” and you’re already on the next step.

There are legit use cases where I turn to Gemini 3 on purpose: “Here’s a spec, tell me what I’m missing,” “summarize this messy doc and propose an implementation plan,” or “extract structure from something that isn’t clean text.”
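
For the “extract structure from something that isn’t clean text” case, I’ve had better luck when I pin down the shape I want before pasting in the doc. A hypothetical example of the kind of target schema I hand it (the field names are just illustrative, not any tool’s API):

    // Hypothetical target shape for “read this messy spec and structure it”.
    // Giving the model an explicit schema up front cuts down on free-form rambling.
    interface SpecDigest {
      title: string;
      requirements: string[];                                 // one per line, imperative voice
      openQuestions: string[];                                // anything ambiguous or contradictory
      suggestedPlan: { step: string; dependsOn: string[] }[]; // rough implementation order
    }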

But the frustration is also real. It can give answers that are off the mark and don’t follow instructions exactly, and even worse, it doesn’t always handle feedback well.

You know the moment. You say, “Please revise the code to do X instead of Y,” and it’s like, “Absolutely,” and then it… does the same thing again. It’s the project equivalent of agreeing in the meeting, then shipping a different plan.

When Flash is on, it’s special. When it’s not, it’s like arguing with someone who is technically listening but spiritually elsewhere.

Tier 3: Kirkland Signature Award (almost as good, way cheaper)

These are the “I can’t believe this is the price” models. Not quite Tier 1, but close enough that you’ll start doing cost math like a budget wonk.

3. GLM 4.7 — the OpenCode value pick

GLM 4.7 works really well in OpenCode (and yes: I love the OpenCode UI). It’s not quite on the Tier 1 level, but it’s much cheaper while still being perfectly capable of knocking out simple problems.

Where it shines: straightforward bug fixes, small feature adds, cleaning up repetitive code, and “make this function less gross.”

When things get complex — ambiguous requirements, gnarly refactors, multi-step reasoning — that’s when I need Opus or Codex to come in and clean it up.

And honestly? That’s fine.

In a world where I had to pay for tokens all the time and didn’t have a Codex subscription, I’d use GLM 4.7 a lot more. It’s the “reliable backup who can cover a shift” of LLMs.

4. Minimax M2.1 — the ‘how is this not a bigger story?’ speed demon

Minimax M2.1 is another really good model. It plays the Gemini Flash role next to GLM: very fast, very cheap, and a great reminder of how quickly the bar has risen.

The wild part is that it isn’t talked about more. A year ago, this would’ve easily been state of the art and blowing people away.

That’s the thing about this whole era: the middle class is getting richer. The “Tier 3” models of 2026 would have won the top spot in 2024.

And that should make you excited (or terrified) about what January 2027 looks like.

The subagent gap is the new coordination layer

Here’s my big mental model for 2026: raw IQ isn’t enough anymore.

The models that feel best to code with aren’t just “smart.” They’re organized. They break work into chunks, explore options without derailing, come back with a plan, execute it, and then they actually verify.
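
If you squint, that “organized” behavior is just a loop. Here’s a rough sketch of the shape I mean; every function in it is a placeholder for whatever the agent tooling actually does, not anyone’s real subagent API.

    // Conceptual sketch of “plan, delegate, then actually verify”.
    // plan/delegate/verify are stand-ins, not a real framework.
    type Task = { description: string };
    type Result = { task: Task; patch: string; testsPassed: boolean };

    async function runOrganized(
      goal: string,
      plan: (goal: string) => Promise<Task[]>,
      delegate: (task: Task) => Promise<Result>,
      verify: (results: Result[]) => Promise<boolean>,
    ): Promise<Result[]> {
      const tasks = await plan(goal);               // break the work into chunks
      const results: Result[] = [];
      for (const task of tasks) {
        results.push(await delegate(task));         // explore/execute each chunk in isolation
      }
      if (!(await verify(results))) {               // then check the whole thing, not just the parts
        throw new Error("verification failed: back to the plan");
      }
      return results;
    }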

Opus 4.5 feels ahead here. Codex is still excellent, but it’s a little more linear — like watching a great performer who hasn’t fully learned the new workflow yet.

And yes, I’m sure this is partly tooling. But it shows up in the experience, and the experience is the product.

My actual workflow (the ‘don’t get cute’ plan)

If you want the boring truth of how I use these, it’s pretty simple. I start with Codex for most day-to-day coding because it’s my safe baseline. I bring in Opus when I need speed, I’m stuck, or the problem is sprawling. I use Gemini Flash when the input is huge or document-y and I need comprehension more than cleverness. And I use GLM / Minimax for the “just do the thing” tasks when I’m cost-sensitive or want a quick draft.
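
If I turned that habit into code, it would look something like this. The categories and the model labels are mine, not anyone’s API; it’s just the decision tree in my head written down.

    // Toy version of my “which model gets this task” decision tree.
    // Model names are labels for my own habits, not identifiers in any SDK.
    type TaskShape = {
      complexity: "simple" | "sprawling";
      inputSize: "small" | "huge";      // long docs, many files, weird formats
      costSensitive: boolean;
    };

    function pickModel(task: TaskShape): string {
      if (task.inputSize === "huge") return "Gemini 3 Flash";        // comprehension over cleverness
      if (task.complexity === "sprawling") return "Claude Opus 4.5"; // speed and refactor stamina
      if (task.costSensitive) return "GLM 4.7 / Minimax M2.1";       // "just do the thing" drafts
      return "GPT-4.2 Codex";                                        // safe daily baseline
    }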

And then I do the most important step, the one nobody tweets about:

I run the code. I read the diff. I assume the model is lying until proven otherwise.

Not because it’s evil. Because it’s a model.

My extremely confident prediction (that I reserve the right to delete later)

By the end of 2026, the conversation won’t be “which model is smartest?” It’ll be “which model is the best teammate,” “which one can own a task end-to-end without me babysitting,” and “which one handles feedback like a grown-up.”

Also: prices will keep dropping, context windows will keep expanding, and we’ll all keep pretending we saw it coming.

Your turn: what’s in your Tier 1?

If you’re using something that’s quietly crushing it — or you think I’m wildly wrong about any of these — I genuinely want to hear it.

Send me your own power rankings and the weird little workflow trick you swear by. I’m collecting them like early-2000s forum rumors.