The agent's native toolbelt

A terminal agent session showing tool-use calls in the side panel

This is the fourth post in our series on the AI engineering toolkit. The first one moved you into the terminal, the second taught the agent your project’s conventions, and the third covered the context window and how to keep it from filling up with noise. This one is about what the agent can already do before you wire up a single extra tool. Most people treat it like a smarter chat box for longer than they should, handing it files and pasting back output, without realizing it already ships with a set of tools it reaches for on its own. How far does that native set actually go? What does it cover, and where does it stop? And what happens when the system commands it relies on are not installed on your machine?

One practical note before going further: exact command names and flags vary across Claude Code, OpenCode, and Codex, sometimes noticeably. The concepts map cleanly across all three, but a quick check of the docs for whichever you use is worth five minutes before you rely on a specific flag.

Reading the repo

The most-used tools in the native set are the filesystem ones. The agent can read files, match file names against a glob pattern, search file contents with grep or ripgrep, and write or patch files it has already read. These are the tools that let it find things you never told it about.

When you ask it to find every place a function is called and check whether any call passes None, it does not wait for you to paste anything. Starting from nothing but the function name you gave it, it greps the repository, reads the matches, and answers. Those tools were there before you changed a single setting.

Glob and grep do different jobs and the agent uses both. Glob finds files by name pattern: it is how the agent locates all the test files, or all the Python files under a particular directory, without you listing them. With grep, the agent searches inside files for a string or pattern, tracing a function to its callers or checking whether a particular import already exists before adding it again. In practice the agent chains them, using glob to narrow the search space and grep to find the specific lines. You can watch it do this in the tool-use log, which shows each call as it happens.
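
To make the glob-then-grep chain concrete, here is a minimal Python sketch of the same two steps. It is not how any of these agents implement their tools; the function name find_matches and the load_config pattern in the usage note are hypothetical, chosen only for illustration.

```python
import re
from pathlib import Path


def find_matches(root, name_pattern, content_regex):
    """Glob for candidate files, then grep inside them line by line."""
    regex = re.compile(content_regex)
    hits = []
    # Step 1 (glob): narrow the search space by file name.
    for path in sorted(Path(root).rglob(name_pattern)):
        if not path.is_file():
            continue
        # Step 2 (grep): scan each candidate file for the pattern.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno, line.strip()))
    return hits
```

Calling find_matches(".", "*.py", r"\bload_config\(.*\bNone\b") would list every line in a Python file where a call to the made-up load_config function passes None on that same line. A call that spans several lines would slip through, which is part of why the agent reads the matching files instead of trusting the grep alone.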

A few of these depend on system binaries being installed. Ripgrep (rg) and fd are the fast paths for content search and file discovery. The agent will try to fall back when they are missing, but the fallback is slower, costs more tokens, and produces lower-quality results. Watch for this in the tool-use log, the panel that shows every tool call the agent makes during a response. If you see the agent trying rg and getting a command-not-found, install it. On macOS, brew install ripgrep fd covers both.

The tool-use log is worth getting comfortable with early. The response text is what the agent says to you; the log shows what it is actually doing: the files it opened, the commands it ran, what it searched for. When something goes wrong or the agent seems confused, the log usually shows why before the response does. Failed binary lookups, greps that returned nothing, file reads that hit a path the agent guessed wrong: all of it shows up there.

Running commands

The agent can run shell commands and read what they print, all in the same turn. The first post already showed this with pytest: the agent runs the suite, reads the traceback, fixes the code, and runs it again, without you copying anything between steps. The command output feeds straight back into context.

That feedback loop is what makes the shell tool more than a convenience. When the agent runs a command and reads the output, it can act on what the command actually printed. A linter that flags an unexpected pattern, a build that fails on a dependency the agent did not know about: all of these show up in the output, and the agent can adjust on the same turn.

Say the agent is adding a new dependency and runs the build to check it. The build fails because the package is not available in the current conda environment. The agent reads the error, installs the package, and runs the build again. The failed command output told it what was missing, and the correction happened in the same session without you reading the error and re-prompting.
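
The loop in that example can be sketched in a few lines of Python. This is an illustration of the idea, not the agent's actual shell tool: run_and_capture and missing_module are hypothetical helpers, and hypothetical_pkg_xyz is a made-up package name standing in for the missing dependency.

```python
import re
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def missing_module(output):
    """Return the module named in a 'No module named ...' error, if any."""
    match = re.search(r"No module named '([^']+)'", output)
    return match.group(1) if match else None


# A deliberately failing "build": importing a package that is not installed.
code, output = run_and_capture([sys.executable, "-c", "import hypothetical_pkg_xyz"])
if code != 0:
    # The error text itself names what is missing, so the next step is clear.
    print("missing:", missing_module(output))
```

The point is the shape of the loop: the command's own output carries the information needed for the next step, so nobody has to read the error and re-prompt.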

The same caveat about missing binaries applies here, and it comes up more often than with the filesystem tools. For example, agents routinely reach for wget on macOS, where it is not installed by default. curl is the macOS standard, and the agent will usually recover, but it wastes a turn improvising and occasionally gets it wrong. Watch the tool-use log for failed lookups and install what is missing before it becomes a pattern. The same goes for tools like jq or the language-specific CLIs the agent might assume are present. A few minutes with brew install after the first session on a new machine saves a lot of improvised workarounds later.
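
If you would rather check up front than wait for a failed lookup in the log, a short script can tell you which of these binaries are on your PATH. This is a minimal sketch using only the standard library; the tool list comes from the names mentioned in this post, and missing_binaries is a hypothetical helper name.

```python
import shutil

# Tools named in this post; extend the list for your own stack.
TOOLS = ["rg", "fd", "jq", "curl", "wget"]


def missing_binaries(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_binaries(TOOLS)
if missing:
    print("not on PATH:", ", ".join(missing))
else:
    print("all present")
```

One caveat: package names and binary names do not always match. On Debian and Ubuntu, for instance, the fd-find package installs the binary as fdfind, so a miss on fd there does not necessarily mean the tool is absent.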

Verbose command output goes straight into the context window, as the last post covered. A test suite that prints ten thousand lines on failure is a context problem as much as a noise problem, and the same discipline applies: keep an eye on what the agent is pulling in and reach for /compact when a fat run has crowded the window.
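
One way to keep a fat run from crowding the window in the first place is to trim the output before the agent ever sees it, for example by wrapping a noisy command in a small script. The sketch below is an assumption about how you might do that, not a feature of any of the three tools; trim_output and its default line counts are hypothetical.

```python
def trim_output(text, head=20, tail=40):
    """Keep the first `head` and last `tail` lines, noting how many were cut."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        # Short enough already: return it untouched.
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    kept = lines[:head] + [marker] + lines[len(lines) - tail:]
    return "\n".join(kept)
```

Keeping more of the tail than the head is a deliberate choice here: test runners and build tools tend to put the summary and the final traceback at the end, which is usually what the agent needs.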

Searching the web

The agent can issue a web search and fetch a URL without you setting anything up. Most people who have used these tools for a while still open a browser tab when a question touches something outside the repo, because nothing in the interface announces the capability is there.

There are two distinct capabilities here. Search returns result summaries and URLs from a live index. Fetch retrieves a specific page and hands its content to the agent. A practical case: the agent is mid-task and needs to check what a library’s current API looks like. It issues a search, picks the relevant docs page, fetches and reads it, and continues without you opening a tab or pasting anything.

The point is that when the agent has a concrete information gap mid-task, it can close that gap on its own. It is not a replacement for a search engine: asking it the questions you would normally type into one tends to get you worse results than the search engine would, and it misses what actually makes the tool useful.

A more concrete example: you ask the agent to add retry logic using the tenacity library, which it has not used before in this session. It searches for the current API, finds the docs page, reads the decorator signature and the relevant parameters, and writes the code. The work continues without a gap.

Codex is the exception. Network access is off by default due to its OS-level sandbox, which the first post described. If you are on Codex and the agent is not reaching the web, you need to enable network access explicitly, or the web tools do nothing.

Picking up where you left off

The third post covered the human-facing mechanics of session resume: /clear, /compact, the session list, what carries over, and when to start fresh. From the agent’s side, resuming recovers a working state built up over the previous session, which is a different thing from replaying a conversation.

When the agent picks up a session, it reloads the full conversation transcript from that session, which contains every file it read, every command it ran, and every dead end it hit. That transcript is the working picture. Starting a new session and re-explaining yesterday’s context in your own words gives the agent something close to what it knew, but built from your recollection and not the actual record. The difference is usually small on a short task and meaningful on a long one, where the agent spent real turns building up a picture of the code before it started writing.

Think about what that picture actually contains. The agent may have read a dozen files to understand a single function, traced a call chain across four modules before finding the right entry point, and worked through two approaches that did not pan out. None of that shows up in the code it wrote. If you start a new session and describe the task again, the agent starts that exploration over from scratch. On a long debugging session or a large refactor, that saved exploration is real time.

Here’s a concrete case. You spend an afternoon debugging a subtle race condition. The agent traces the execution path through several modules, rules out two likely culprits, and is midway through a third lead when you have to stop for the day. The next morning you resume the session, and the agent picks up exactly where it left off, with its full working picture of what it already checked and why. Describing the problem fresh from memory gets you something workable, but the agent will likely retrace paths it already ruled out before finding its footing again.

The mechanics differ by tool. Claude Code has --continue to pick up the most recent session and --resume to open a picker or resume a specific one by ID. OpenCode uses --continue (or -c) for the last session and --session <id> (or -s <id>) for a specific one. Inside the TUI, /sessions lists them and lets you jump to any. Codex has a resume subcommand: codex resume opens a picker, and codex resume --last picks up where you left off without prompting. It is worth a quick check of the docs for whichever you use, since the flag names vary.

What’s not in the box

The native toolbelt covers the repository, the shell, the web, and session state. What it does not reach is your databases, internal APIs, issue trackers, or the Slack thread where someone explained two years ago why a particular architectural decision was made.

That last category is the hardest to work around. The filesystem tools are good at recovering what the code does, but the reasoning behind it usually lives in a pull request description, a design doc, or a conversation that happened before the code was written. The agent can read the code and make a reasonable guess about why, though that guess may be confidently wrong.

Some of that gap is filled by commercial MCP servers. Some employers provide proprietary internal ones that connect to their actual systems: Jira, Confluence, Google Docs, Google Sheets, Slack, and more. At Anaconda, we built an internal MCP server called Sesame (full disclosure: I wrote it) that connects agents to internal tooling and can surface things like Slack discussions. That sounds mundane until the agent finds the three-year-old thread that explains exactly why the code you are about to delete was written that way. No amount of filesystem grepping recovers that context.

MCP and custom tools get their own posts later in the series.

Conclusions

Learn the native toolbelt before bolting anything on. The agent decides when to reach for these tools, and it can choose wrong, so knowing what is available also means you can steer it when it reaches for the wrong one or misses one it should have used. A lot of the friction people attribute to the model is actually the agent trying to improvise around a missing binary or fetching a page it could have searched for first.

The tool-use log is the thing most people ignore and most benefit from reading. It shows how the agent reached its conclusion, including the dead ends. Glancing at it when something looks off tends to surface the actual cause faster than re-reading the response text.

The web search capability is probably the one that surprises people most. It is also the one most likely to be quietly broken on Codex until you enable network access, which means you might have been doing things the hard way for a while without knowing it. If none of this surprised you, you were already ahead of the curve. Congrats!