A 4th of July With Baseball, Firmware, and Codex

The most surprising part of my July 4 weekend was not that an AI model helped reverse engineer an old game. It was that the model was effectively playing an audio-only game, changing the experiment, listening to the result, and using that feedback to understand a piece of 1980s hardware.

America's 250th anniversary felt like a good excuse to do some very American things: grill out, spend time with family, enjoy the holiday, and, apparently, ask an AI agent to dig through the firmware of an old electronic baseball game.

To quote a song my school class performed in third grade: "What's more American than baseball?"

The game was Parker Brothers' Starting Lineup Talking Baseball, an electronic baseball simulator I loved as a kid. It runs on an old Intel MCS-51-family microcontroller, talks through a tiny speaker, takes input from keypads, and barely has a visual interface at all. The game is mostly sound, timing, memory, and baseball rules hidden in ROM.

Back in March, I used Codex and a half-finished open-source emulator for that microcontroller to build a full WASM-based emulator of the game. That was already a big step. The game was playable in the browser and felt like the original.

But I could not understand the ROM well enough to expose useful overlay data in the browser: score, inning, count, current batter, runners, and the other state that would make the emulator easier to follow without relying entirely on the speaker.

Four months is not a long time in normal life.

It is a long time in LLM time.

So over the July 4, 2026 weekend, I spent a couple of hours trying again with GPT-5.5 xhigh and a much better tool setup. This time the workflow crossed an interesting line. The model was not just reading disassembly. It was driving the emulator, modifying the harness, capturing audio, running ASR, and using what the game said back to guide the next experiment.

It felt less like debugging a program and more like archaeology.

Not archaeology in the grand sense, obviously. But the small personal kind: trying to recover details about a toy, a software design, and a play experience that have mostly faded from common knowledge over the last 35 years.

The agent was not just analyzing the game. It was playing it

The big change was that Codex had tools it could actually use.

I wired it into a small research bench: Ghidra running inside a Sprite VM, the existing WASM emulator running in a dynamic-analysis harness, local scripts for parsing 8051 instructions, and a Whisper transcription path through Replicate that cost less than a cent per decode.

That meant the model could do something much more interesting than stare at a ROM and guess. It could drive setup and gameplay flows, capture the DAC audio, turn the game's speech into text, compare that text against the roster and manual, patch the harness, rerun a scenario, and check whether the next result fit the hypothesis.

That loop is the thing that stuck with me.

For a normal game, you might inspect pixels or scrape UI state. Talking Baseball barely gives you that. The primary output is speech. So the model had to listen.

The workflow became: press buttons, hear the game, transcribe the audio, inspect memory, revise the map, repeat.

That is a very different kind of software assistant from one that only writes code. It is closer to an experimental partner with access to instruments.

ASR turned the speaker into a debug port

The audio system was both the coolest part and the most awkward part.

The first approach was a headless 8051 audio harness. Codex copied the local emulator core, built a Talking Baseball-specific runner, and captured Port 1 DAC output from the Timer0 audio interrupt. That confirmed several important hardware facts: Timer0 drives the audio sample path, Port 1 is the 8-bit DAC, and the sample rate is about 9615 Hz.

It also showed that the speech data is not simple raw PCM.

That was useful, but isolated speech-token injection did not work the way I hoped. Some immediate commands produced waveforms, but normal phrase tokens often stayed silent or depended on surrounding firmware state. The speech tokens were not simple phrase IDs. They were commands inside a larger speech engine.

So the better path was more human: play the game for real.

Codex used the web emulator to drive actual setup and gameplay flows, captured the audio, normalized the WAV files, and sent them through Whisper on Replicate. The ASR was not perfect, but it was enough to turn the speaker into a rough debug port.
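The capture step can be sketched in a few lines. This is a minimal illustration, not the harness code from the session: it assumes the Port 1 bytes are unsigned 8-bit samples centered near 0x80, which is the layout an 8-bit WAV file expects, and it uses the approximate 9615 Hz rate noted earlier. The function name is hypothetical, and any level normalization would happen in a separate step.

```python
import io
import wave

SAMPLE_RATE_HZ = 9615  # approximate Timer0-driven rate reported in this post


def dac_samples_to_wav(samples: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Wrap raw 8-bit DAC samples in a mono WAV container.

    Assumption: each byte is one unsigned sample written to Port 1.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # single speaker, so mono
        wav.setsampwidth(1)       # 1 byte per sample; 8-bit WAV data is unsigned
        wav.setframerate(sample_rate)
        wav.writeframes(samples)
    return buf.getvalue()
```

The returned bytes can be written to disk or handed to whatever transcription client is in use.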

One pass heard Wade Boggs as "Hawks." A person who knows the roster would immediately be suspicious of that. The model did the same kind of reasoning. Given the team, the player table, the game flow, and the likely spoken phrase, it inferred that the game was almost certainly saying Boggs. Carlton Fisk was not always heard cleanly either, but the constrained roster made those transcription errors useful instead of fatal.

That was the moment the project started to feel different. The model was treating noisy audio as evidence, not truth. It was doing the kind of cross-checking a person would do: this transcript is weird, but this roster slot and this memory value and this known player make one interpretation much more likely.
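One way to picture that cross-check is a small scoring function. This is only a sketch of the idea, not what the model actually ran: the roster subset, the `memory_hint` parameter, and the 0.5 bonus are all illustrative assumptions. String similarity alone would not get from "Hawks" to Boggs, so the sketch gives extra weight to the player that the memory value points at.

```python
from difflib import SequenceMatcher


def resolve_player(heard, roster, memory_hint=None, hint_bonus=0.5):
    """Pick the roster name best supported by the transcript plus a memory hint.

    heard:       the word the ASR produced
    roster:      candidate names for the active team
    memory_hint: name implied by the player ID read from memory (hypothetical input)
    """
    def score(name):
        similarity = SequenceMatcher(None, heard.lower(), name.lower()).ratio()
        bonus = hint_bonus if name == memory_hint else 0.0
        return similarity + bonus

    return max(roster, key=score)


# Illustrative subset of names mentioned in this post, not the full roster.
roster = ["Boggs", "Fisk", "Henderson", "Murray"]
print(resolve_player("Hawks", roster, memory_hint="Boggs"))  # prints "Boggs"
```

The point of the design is that neither signal is trusted alone: the transcript narrows the options, and the memory state breaks the tie.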

The ROM had to be viewed like the hardware sees it

The ROM image is sp17208-002.bin, a 128 KiB program ROM. If you load that as one flat file and start labeling addresses, you can get a lot of confident nonsense.

The game uses an 80C31-style setup with banked ROM. The useful mental model was this:

logical 0x0000..0x7FFF = fixed ROM from physical 0x18000..0x1FFFF
logical 0x8000..0xFFFF = one selected 32 KiB bank

So instead of asking Ghidra to understand one big flat image, we generated four logical 64 KiB views, one per upper bank.

That sounds like a small detail, but it mattered a lot. Calls, tables, references, and state-machine paths started to make sense once the disassembly matched the CPU's actual view of memory.
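The mapping above can be sketched as follows. The fixed region at physical 0x18000 comes from the mapping itself; the assumption that bank n starts at physical n * 0x8000 is mine for illustration, and the function names are hypothetical rather than taken from the actual export scripts.

```python
ROM_SIZE = 128 * 1024
BANK_SIZE = 0x8000
FIXED_BASE = 0x18000  # physical start of the fixed lower half


def logical_to_physical(addr: int, bank: int) -> int:
    """Translate a 16-bit logical address to an offset in the flat ROM file."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("logical address must fit in 16 bits")
    if not 0 <= bank < ROM_SIZE // BANK_SIZE:
        raise ValueError("bank out of range")
    if addr < 0x8000:
        return FIXED_BASE + addr
    # Assumption: bank n occupies physical n * 0x8000 .. n * 0x8000 + 0x7FFF.
    return bank * BANK_SIZE + (addr - 0x8000)


def build_logical_view(rom: bytes, bank: int) -> bytes:
    """Build the 64 KiB image the CPU would see with the given upper bank selected."""
    if len(rom) != ROM_SIZE:
        raise ValueError("expected a 128 KiB ROM image")
    if not 0 <= bank < ROM_SIZE // BANK_SIZE:
        raise ValueError("bank out of range")
    upper = bank * BANK_SIZE
    return rom[FIXED_BASE:FIXED_BASE + BANK_SIZE] + rom[upper:upper + BANK_SIZE]
```

Calling `build_logical_view` once per bank on the contents of sp17208-002.bin would produce the four images to load into the disassembler.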

This is where the agent workflow felt much stronger than my March attempt. The model did not need me to hand-author every helper. It wrote Ghidra export scripts, inspected instruction tables, summarized XRAM access, and noticed when older labels looked suspicious.

That matters because reverse engineering is full of tempting false names. A byte changes during a play, so you name it after the visible thing that also changed. Then every future conclusion inherits that mistake.

The more disciplined approach is to name a byte only when the code path, runtime behavior, and external evidence all point in the same direction.

The game state started to become visible

The most important correction was around the memory region near 0x0580.

Earlier analysis had treated several of those bytes as generic state or counters. The new pass showed something more specific: they were player identity fields.

The emulator-facing map now looks much better:

0x056F           current inning
0x0570           batting/offense team index
0x0571           team 0 score
0x0572           team 1 score
0x0561           batting order cursor for team 0
0x0562           batting order cursor for team 1
0x0580           current batter/player ID
0x0589           runner/player ID on first base
0x058A           runner/player ID on second base
0x058B           runner/player ID on third base
0x0598           game-active / continue-game flag
0x05B1..0x05B3   proposed next base-runner IDs
0x05B9           base occupancy or affected-runner mask

That changes how the browser emulator should think about baserunners.

They are not just occupied/unoccupied flags. The firmware stores player IDs in the base slots, with 0xFF representing an empty base. That matters if we want the overlay to say who is on first, not just whether first base is occupied.

It also corrected a bad assumption: 0x0573 is probably not the inning. The real inning byte is 0x056F. The nearby 0x0573..0x0576 cluster appears to belong to pitch, count, or play-status calculations.
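A browser overlay could read that map along these lines. This is a hedged sketch: it assumes the emulator can hand over an indexable snapshot of XRAM, it returns raw byte values without guessing at their encoding (for example, whether the inning is zero-based), and the names `ADDR` and `read_overlay` are hypothetical.

```python
EMPTY_BASE = 0xFF  # firmware marker for an unoccupied base

# Addresses taken from the map above.
ADDR = {
    "inning": 0x056F,
    "batting_team": 0x0570,
    "score_team0": 0x0571,
    "score_team1": 0x0572,
    "batter_id": 0x0580,
    "first": 0x0589,
    "second": 0x058A,
    "third": 0x058B,
}


def read_overlay(xram):
    """Decode overlay fields from an XRAM snapshot (any indexable byte sequence)."""
    def runner(addr):
        value = xram[addr]
        return None if value == EMPTY_BASE else value

    return {
        "inning": xram[ADDR["inning"]],
        "batting_team": xram[ADDR["batting_team"]],
        "score": (xram[ADDR["score_team0"]], xram[ADDR["score_team1"]]),
        "batter_id": xram[ADDR["batter_id"]],
        # Player ID per base, or None when the slot holds 0xFF.
        "runners": {base: runner(ADDR[base]) for base in ("first", "second", "third")},
    }
```

Returning player IDs rather than booleans keeps the door open for showing who is on base once the IDs are resolved against the active team's roster.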

The agent was helpful because it kept testing labels against behavior. Is the byte initialized like an inning? Is it compared against 9? Is it used by the Challenge Game setup? Is it passed through a pitch-location table instead?

That is slow, evidence-based work. The impressive part is that the model could keep the loop moving without losing the thread.

Baseball knowledge became part of the analysis

The player records turned out to be one of the most satisfying parts of the session.

The built-in player data appears in 17-byte records:

American player record = ROM[0x10030 + internal_id * 0x11]
National player record = ROM[0x101A2 + internal_id * 0x11]
record length          = 0x11 bytes

The IDs are team-relative. So 0x0C can mean Rickey Henderson in the American team context and Tim Raines in the National team context.

That team-relative detail matters because the same byte is not a universal player. The model had to pair the ID with the active team, then check whether the resulting name fit the roster, the spoken audio, and the surrounding game state.
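The record lookup can be written down directly from those formulas. In this sketch the table bases are treated as offsets into the flat ROM file, which is how the formulas read; the function names and the team keys are hypothetical.

```python
RECORD_LEN = 0x11  # 17 bytes per player

# Table bases from the formulas above, as offsets into the flat ROM image.
TABLE_BASE = {"american": 0x10030, "national": 0x101A2}


def record_offset(team: str, internal_id: int) -> int:
    """Return the ROM offset of a player's record for a team-relative ID."""
    return TABLE_BASE[team] + internal_id * RECORD_LEN


def player_record(rom: bytes, team: str, internal_id: int) -> bytes:
    """Slice one 17-byte player record out of the ROM image."""
    start = record_offset(team, internal_id)
    record = rom[start:start + RECORD_LEN]
    if len(record) != RECORD_LEN:
        raise ValueError("record runs past the end of the ROM image")
    return record


# The same ID lands on different records depending on the team.
print(hex(record_offset("american", 0x0C)))  # 0x100fc
print(hex(record_offset("national", 0x0C)))  # 0x1026e
```

Requiring the team as an argument makes it hard to forget that an ID means nothing on its own.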

The record layout is now understood well enough to be useful. There are defensive-position bytes, displayed batting stats, power-like fields, speed-like fields, and a tail section that becomes especially interesting for pitchers.

The coolest part was watching the model use ordinary baseball knowledge the way a person would. It understood that Rickey Henderson should look fast, that Tim Raines and Vince Coleman should probably look fast too, and that Eddie Murray or Darryl Strawberry should show up differently from a light-hitting speedster.

When Rickey Henderson, Tim Raines, Eric Davis, Tony Gwynn, Steve Sax, and Ozzie Smith all carried high speed-like values, while sluggers like Eddie Murray, Jack Clark, Andre Dawson, Darryl Strawberry, George Bell, and Eric Davis carried high power-like values, the memory map started to feel much less arbitrary.

That does not prove every field by itself. Baseball intuition is not a substitute for disassembly or traces. But it is a useful consistency check. If the inferred speed byte says Rickey Henderson is slow, something is probably wrong.

Pitchers had their own hidden story. The last bytes of the player record look like pitcher-specific ratings. The strongest read is that +0x10 acts like an endurance or role byte. Starters such as Jack Morris, Roger Clemens, and Bret Saberhagen show high values. Relievers like Dave Righetti, Todd Worrell, and Dan Quisenberry show low values.

The exact pitcher-effectiveness formula is still not proven. But the table is no longer just mysterious bytes. It is a testable model.

This felt like recovering a lost design conversation

There is something oddly moving about this kind of project.

Somebody, decades ago, designed a small baseball simulation that had to fit inside tight memory, speak through a simple DAC, use keypad input, and still feel like baseball. They had to encode rosters, batting stats, pitcher roles, base runners, inning state, play results, speech prompts, and LED behavior into a tiny embedded system.

Most of that design process is gone now. There is no source code sitting in front of me. No design doc. No engineer explaining why one table is shaped this way and another routine commits runners that way.

What remains is the artifact.

The ROM. The sound. The manual. The timing. The state changes. The names the game says out loud.

That is why the ASR loop felt so powerful. The model was not just reading dead code. It was interacting with the artifact on its own terms. The game talks, so the agent listened. The firmware changes memory, so the agent traced memory. The roster encodes baseball assumptions, so the agent used baseball knowledge to check the table.

That is the archaeology feeling: brushing away just enough uncertainty to see the shape of the original design.

The failure list got more useful

The session did not solve everything.

Balls, strikes, and outs are still not fully pinned down. The candidate region is smaller now, especially around 0x0573..0x0576 and 0x0583..0x0587, but those labels need dynamic traces tied to SCORE-button announcements and controlled play outcomes.

The exact score increment path still needs runtime confirmation. Routine 0x4599 looks like a strong run-accounting candidate, but I would rather prove it with traces than put a confident label on it too early.

The speech format is still not decoded. We know much more about the DAC path, speech queue behavior, and phrase outputs, but the compressed speech data and full token dictionary remain open.

The pitcher effectiveness formula is also still unresolved. We know where the likely rating bytes are. We have a strong hypothesis about endurance. We do not yet know exactly how those bytes influence outcome probabilities.

That might sound like a lot of unfinished work, and it is.

But it is a much better kind of unfinished work. The problem is now smaller, sharper, and instrumented.

The next emulator should be a lab

The next step is to make the emulator better at explaining itself.

I want a dynamic trace mode that logs PC, bank, XRAM reads and writes, keypad input, speech tokens, and DAC state. Then I can run controlled scripts: press SCORE, throw a ball, take a strike, swing and miss, record an out, hit a single, score a run, trigger a half-inning transition.
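As a rough sketch of what such a trace could look like, the record below carries the fields listed above, with a helper that pulls out writes to a candidate region such as 0x0573..0x0576. Nothing here exists in the emulator yet; the `TraceEvent` shape, the `kind` strings, and the helper names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    pc: int          # program counter when the event fired
    bank: int        # selected upper ROM bank
    kind: str        # e.g. "xram_read", "xram_write", "key", "speech", "dac"
    addr: int = 0    # XRAM address for memory events, otherwise unused
    value: int = 0   # byte read/written, key code, token, or DAC level


def writes_in_range(events, lo, hi):
    """Keep only XRAM writes whose address falls in lo..hi inclusive."""
    return [e for e in events if e.kind == "xram_write" and lo <= e.addr <= hi]


def summarize_writes(events, lo, hi):
    """Map each written address to the ordered list of values stored there."""
    summary = {}
    for event in writes_in_range(events, lo, hi):
        summary.setdefault(event.addr, []).append(event.value)
    return summary
```

Running one scripted action at a time and summarizing the writes it causes would tie each candidate byte to a specific, announced game event.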

Challenge Game setup should be especially useful as a calibration tool, because it lets the game start from known innings, scores, and scenarios.

After that, mutation testing becomes possible. Change one pitcher byte at a time, simulate a bunch of pitches or plate appearances, and measure whether outcomes move. That should turn the pitcher-rating guesses into actual evidence.
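The sweep itself is simple to sketch, even though the simulator behind it does not exist yet. In the code below, `simulate` is a hypothetical hook that runs the emulator on a patched ROM with a given seed and returns a numeric outcome (say, 1 if the batter reached base); the function name and the trial count are likewise assumptions.

```python
def mutation_sweep(rom, offset, values, simulate, trials=1000):
    """Patch one ROM byte to each candidate value and average the outcome.

    simulate(rom_bytes, seed) is a caller-supplied function (hypothetical here)
    that returns a number for one simulated pitch or plate appearance.
    """
    results = {}
    for value in values:
        patched = bytearray(rom)   # copy, so the original image is never modified
        patched[offset] = value
        total = sum(simulate(bytes(patched), seed) for seed in range(trials))
        results[value] = total / trials
    return results
```

For the endurance hypothesis, `offset` would be a pitcher's record start plus 0x10. If the averages stay flat as that byte changes, the label is probably wrong.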

This is the part I am most excited about trying next with GPT-5.6 if it lands soon. The model jump matters, but the bigger question is whether the agent can run tighter loops: play, listen, patch, trace, measure, update.

That is the workflow I want.

Not an oracle. An experimental system that can keep interacting with the thing it is trying to understand.

The bigger lesson is about agents with feedback

This project made me more optimistic about coding agents, but not because the agent magically reverse engineered a ROM.

It did something more useful.

It closed feedback loops.

It wrote scripts I would have procrastinated on. It set up Ghidra views. It inspected emulator code. It built audio harnesses. It drove real gameplay. It captured speech. It used ASR when the game would not expose state any other way. It kept track of which claims were proven, which were likely, and which were still guesses.

That is the middle layer where a lot of real technical work lives.

Most hard projects do not go from unknown to known in one clean jump. They move through partial maps, failed experiments, better instruments, corrected assumptions, and increasingly specific questions.

LLMs are getting good at that messy middle, especially when they can act on the system and observe the result.

And for a July 4 weekend project about a talking baseball game from the 1980s, that feels fitting. A little nostalgia, a little hardware archaeology, a little software tooling, and a reminder that the best use of a new tool is often to go back and finish something you could not quite finish before.

There is still a lot to learn about Starting Lineup Talking Baseball.

But the game is less of a black box now. And the next time I hear it talk, I should be a little closer to knowing exactly what the firmware is thinking.

Stay tuned -- I will hopefully have more updates on this project soon!