Some of my AI coding sessions produce great work. Others produce garbage.

The difference isn’t the model. It’s not the tool. It’s what I give the agent before it starts.

You might have heard this called context engineering. It’s an umbrella term with many interpretations. Here’s what it means to me.

How my prompts evolved

When I started using coding agents with Cursor, I’d prompt like this: “Implement the OpenAI Responses API.”

Then I’d play the bingo machine. Allow, allow, allow. The agent would dive in, write code, declare victory. And the code would be wrong in ways that only became obvious when I tried to use it.

That phase didn’t last long. What changed it was Cursor’s @ mentions. Being able to point at specific files forced me to think about which files mattered. That was the first shift: from “just do it” to “here’s where to look.”

With Claude Code, I noticed I needed fewer file references. The agent was better at finding things on its own. I could talk more about high-level architecture instead of just pointing at files.

It still missed patterns, though. I’d have to say “write controller specs like admin_spec” to get consistent test structure. The agent could find files, but it couldn’t always infer which ones to use as templates.

Then Codex came along with GPT-5, and that gap shrank again. It spends real time exploring the codebase before proposing changes. I find myself pointing at files less and giving direction more. Less “look at this file” and more “here’s the constraint, here’s what done looks like.”

The layers build on each other: file references, architecture guidance, constraints and goals. Better agents handle more of the lower layers themselves.

But even the best search can’t surface what’s in your head. The reasons behind the architecture. The edge cases you’re worried about. The definition of “done” that matters for this specific feature.

That’s still on you to provide.

What a real prompt looks like

Now my prompts look more like this:

We need to add support for OpenAI’s new Responses API.

Quick context: our LLMProvider abstraction lets us swap providers without changing the Task model. Each provider implements send_message, and right now they all work the same way: we send the full message history every call.

The Responses API breaks that assumption. It’s stateful and session-based. OpenAI holds the conversation server-side, so we don’t resend history.

This means LLMProvider needs a session concept. But here’s the constraint: it should only affect OpenAIProvider. Anthropic and the others still work statelessly. So the session logic needs to be encapsulated in a way that doesn’t leak into Task or Message, where state currently lives.

Before you start coding, write a plan to docs/backlog/openai_responses_api.md. I want to see how you’re thinking about the abstraction boundary.

This is a convincing argument, not just instructions. It explains why the change matters, states the constraint, and asks the agent to show its thinking before it writes code. It’s the same walkthrough I’d give a coworker before they start on a feature.
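
To make the abstraction boundary concrete, here’s roughly the shape that prompt is gesturing at. The class and method names come straight from the prompt; the method bodies are my illustration, not the actual code:

```ruby
class LLMProvider
  # Callers (the Task model) always hand over the full message history.
  def send_message(messages)
    raise NotImplementedError
  end
end

class OpenAIProvider < LLMProvider
  # The session lives entirely in here; Task and Message never see it.
  def send_message(messages)
    new_input = messages.drop(@sent_count.to_i)  # only what OpenAI hasn't seen yet
    reply = post_responses(new_input, previous_response_id: @previous_response_id)
    @previous_response_id = reply[:id]           # chain the next turn to this response
    @sent_count = messages.size
    reply
  end

  private

  # HTTP call elided. previous_response_id is how the Responses API continues
  # a server-held conversation without resending history.
  def post_responses(input, previous_response_id: nil)
    raise NotImplementedError
  end
end
```

Anthropic and the rest keep the stateless default: resend everything on every call. The session detail stays locked inside OpenAIProvider, which is the boundary I want the plan to defend.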

That’s one level of context engineering: what you put in the prompt. There’s another level: the tools.

The tool level

Your prompts can be perfect, but if the tools the agent uses aren’t designed for how agents work, you’re fighting an uphill battle.

I’ve mostly avoided connecting MCP servers to my coding agents. The auth model is clunky and the overhead isn’t worth it yet. The exception is browser automation like Playwright. For everything else, CLI tools work better. Coding agents are surprisingly good at learning CLIs quickly, and CLIs naturally express jobs rather than granular operations.

Here’s a pattern I’ve landed on: I’ll notice a repetitive sequence of commands the agent uses to accomplish something. Three or four CLI calls it runs every time it needs to test a webhook, or deploy a branch, or reset the database to a known state.

So I run a “meta session.” I ask the agent to script that sequence into a single command and document it. Now instead of watching it chain docker compose down && docker compose up -d && rails db:seed && cloudflared... every time, it just runs bin/reset-dev.
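
The script itself is nothing clever. A hypothetical bin/reset-dev could be a tiny Ruby (or shell) wrapper around that same chain:

```ruby
#!/usr/bin/env ruby
# bin/reset-dev: the command chain above, wrapped into one job.
# Illustrative sketch; the real script is whatever your agent keeps repeating.

def run!(cmd)
  puts "==> #{cmd}"
  system(cmd) || abort("failed: #{cmd}")
end

run! "docker compose down"
run! "docker compose up -d"
run! "rails db:seed"
# The tunnel step ("cloudflared ...") is project-specific, so it stays elided here.
```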

Same thing with GitHub code reviews. The gh CLI can do everything, but the commands are verbose. Fetching PR comments, filtering by status, responding to a review thread. Each one is its own incantation. So in a meta session, the agent wrote bin/fetch-review and bin/address-feedback. One command instead of four. The next agent run is cleaner, with less back-and-forth, and the context stays concise, leaving room for actual work.
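
As a sketch, a hypothetical bin/fetch-review can be one gh api call plus some formatting; the exact fields and filters depend on your review workflow:

```ruby
#!/usr/bin/env ruby
# bin/fetch-review: print a PR's review comments in one call instead of
# several verbose gh invocations. Illustrative sketch, not the real tool.
require "json"

pr = Integer(ARGV.fetch(0) { abort "usage: bin/fetch-review PR_NUMBER" })

# gh fills in {owner}/{repo} from the current repository.
comments = JSON.parse(`gh api repos/{owner}/{repo}/pulls/#{pr}/comments`)

comments.each do |c|
  location = [c["path"], c["line"]].compact.join(":")
  puts "#{location} (#{c.dig("user", "login")})"
  puts c["body"]
  puts "-" * 40
end
```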

The pattern: jobs, not steps. A scripted tool that does one thing well beats a chain of commands the agent has to orchestrate every time.

Managing context

This matters because context is finite.

Playwright is powerful, but it’s context-expensive. A single browser session can eat thousands of tokens in DOM snapshots and action logs.

Ideally, the agent that writes a feature would also run the browser tests and fix what it finds. That’s how I work when I’m coding manually. I test my own stuff end-to-end before calling it done.

But with current context limits, I split it up. One session writes the feature and its test script. A separate session executes the tests and reports back. If there are bugs, the first session fixes them with the test output as input.

It’s a workaround for context bloat, not how I’d design it if context were infinite. But it works, and it keeps each session focused.

The common thread

Both levels come down to the same question: what does the agent need to do this job well?

At the prompt level: the lay of the land, the constraints, what success looks like.

At the tool level: jobs instead of steps, scripted workflows instead of command chains, focused sessions that keep context lean.

Where this is going

The agents keep getting better. Their native search improves. Their context windows expand. The gap between “what the agent can figure out” and “what I need to provide” keeps shrinking.

My job is to figure out how to get the most out of them at each stage. Better prompts. Better tools. Better session design. It’s not about extracting value from a limited system. It’s about helping the agent operate at its full potential.

And that potential keeps growing.