Components of a Coding Agent

(magazine.sebastianraschka.com)

64 points | by MindGods 4 hours ago

8 comments

  • beshrkayali 1 hour ago
    > long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info)

    I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.

    I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, then produces a build plan toml where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3]

    1: https://github.com/ossature/ossature

    2: https://github.com/beshrkayali/chomp8

    3: https://github.com/ossature/ossature-examples

    • Yokohiii 39 minutes ago
      I like it a lot. I find the chat-driven workflow very tiring, and a lot of information gets lost in translation until the LLM just refuses to be useful.

      How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready-to-generate state? How high is the success/error rate when generating code from tasks? Do LLMs forget or mess things up, or does it feel better?

      The spec-driven approach is potentially better for writing things from scratch; do you have any plans for existing code?

      • beshrkayali 2 minutes ago
        Thanks!

        > How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready to generate state?

        Yes, the flow is: you write specs, then validate them with `ossature validate`, which parses them and checks that they are structurally sound (no LLM involved). Then you run `ossature audit`, which flags gaps or contradictions in the content and produces a toml build plan that you can read and edit directly before anything is generated. You can reorder tasks, add notes for the LLM, adjust verification commands, or skip steps entirely. So by the time you run `ossature build` to generate, the structure is already something you have signed off on.

        > The spec driven approach is potentially better for writing things from scratch, do you have any plans for existing code?

        Right now it is best for greenfield, as you said. I have been thinking about a workflow where you generate specs from existing code and then let Ossature work from those, but I am honestly not sure that is the right model either. The harder case is when engineers want to touch both the code and the specs, and keeping those in sync through that back-and-forth is something I want to support but have not figured out a clean answer for yet. It's on the list; if you have any thoughts, please feel free to open an issue! First, though, I want to get through some of the issues I am seeing with just the spec-editing workflow (and re-audit/re-planning), specifically around how changes cascade through dependent tasks.

        Regarding success rate: each task requires a verification command that must run and pass after generation, and if it fails, a separate fixer agent tries to repair the code using the error output. The number of retry attempts is configurable. I did notice that the more concise and clear the spec is, the more likely capable models are to generate code that works (obviously), but that is what auditing is supposed to help with. One interesting case in the CHIP-8 emulator I mentioned above: even naming the correct solution to a specific problem was not enough; I had to spell out the concrete algorithm in the spec (I wrote more details here[1]). But the full prompt and response for every task is saved to disk, so when something does go wrong you can read the exact prompt/response, and the fix-attempt prompts/responses, for each task.
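        The generate/verify/fix cycle described here can be sketched generically. This is a simplified illustration of the pattern, not Ossature's actual implementation; the `generate` and `fix` callables and the retry count are hypothetical stand-ins for the LLM calls.

```python
import subprocess

def run_verification(cmd: str) -> tuple[bool, str]:
    """Run a task's verification command; return (passed, error output)."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.returncode == 0, result.stderr

def build_task(generate, fix, verify_cmd: str, max_retries: int = 3) -> bool:
    """Generate code for one task, then let a fixer agent retry until the
    verification command passes or the retry budget is exhausted."""
    generate()  # initial generation from the spec sections the task declares
    for _ in range(max_retries + 1):
        passed, errors = run_verification(verify_cmd)
        if passed:
            return True
        fix(errors)  # a separate fixer pass repairs using the error output
    return False
```

        The key property is that verification is an external command, so "working" is defined by something the LLM cannot talk its way around.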

        I wrote more details in an intro post[2] about Ossature, if useful.

        1: https://log.beshr.com/chip8-emulator-from-spec/

        2: https://ossature.dev/blog/introducing-ossature/

    • peterm4 22 minutes ago
      This looks great, and I’ve bookmarked to give it a go.

      Any reason you’ve opted for custom markdown formats with the @ syntax rather than using something like frontmatter?

      Very conscious that this would prevent any markdown rendering in github etc.

  • MrScruff 1 hour ago
    > This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.

    Unless I'm misunderstanding what's being described here, running Claude Code with different backend models is pretty common.

    https://docs.z.ai/scenario-example/develop-tools/claude

    It doesn't perform on par with Anthropic's models in my experience.

    • kamikazeturtles 1 hour ago
      > It doesn't perform on par with Anthropic's models in my experience.

      Why do you think that is the case? Are Anthropic's models just better, or do they train the models to somehow work better with the harness?

      • mmargenot 1 hour ago
        It is more common now to improve models in agentic systems "in the loop" with reinforcement learning. Anthropic is [very likely] doing this in the backend to systematically improve the performance of their models specifically with their tools. I've done this with Goose at Block with more classic post-training approaches because it was before RL really hit the mainstream as an approach for this.

        If you want to look at some of the tooling and process for this, check out verifiers (https://github.com/PrimeIntellect-ai/verifiers), hermes (https://github.com/nousresearch/hermes-agent) and accompanying trace datasets (https://huggingface.co/datasets/kai-os/carnice-glm5-hermes-t...), and other open source tools and harnesses.

      • MrScruff 1 hour ago
        It's a good question, I've wondered that myself. I haven't used GLM-5 with CC, but I've used GLM-4.7 a fair amount, often swapping back and forth with Sonnet/Opus. The difference is fairly obvious: on occasion I've mistakenly left GLM enabled when I thought I was using Sonnet, and could tell pretty quickly just from the gap in problem-solving ability.
      • esafak 1 hour ago
        They're just dumber. I've used plenty of models. The harness is not nearly as important.
        • vidarh 20 minutes ago
          If anything, the harness matters more with those other models, precisely because of how much dumber they are... You can compensate for some of the stupidity (but by no means all) with a harness that tries to compensate in ways that e.g. Claude Code does not, because it isn't necessary for Anthropic's own models.
  • armcat 2 hours ago
    I still find it incredible how much power was unleashed by surrounding an LLM with a simple state machine and giving it access to bash
    • HarHarVeryFunny 1 hour ago
      At its heart it's prompt/context engineering. The model has a lot of knowledge baked into it, but how do you get it out (and make it actionable for a semi-autonomous agent)? You craft the context to guide generation and maintain state (while still interacting with a stateless LLM), and provide (as part of the context) skills/tools to "narrow" model output into tool calls that inspect and modify the code base.
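      The loop being described, a stateless LLM wrapped in a small state machine with a bash tool, can be sketched in a few lines. This is a minimal illustration only; `call_llm` and its dict-based reply format are hypothetical stand-ins for a real chat-completion/tool-calling API, and real agents add safety checks, output truncation, and richer tools.

```python
import subprocess

def run_bash(command: str) -> str:
    """The single tool: execute a shell command and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

def agent_loop(call_llm, user_request: str, max_steps: int = 20) -> str:
    """State lives entirely in the message list; the LLM itself is stateless.
    call_llm(messages) is assumed to return either {"bash": "<command>"}
    or {"answer": "<final text>"}."""
    messages = [
        {"role": "system",
         "content": "You may run bash to inspect and modify the code base."},
        {"role": "user", "content": user_request},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "answer" in reply:
            return reply["answer"]
        output = run_bash(reply["bash"])
        # Feed the tool result back into the context: this append is the
        # entire "state machine" the comment above refers to.
        messages.append({"role": "assistant", "content": f"ran: {reply['bash']}"})
        messages.append({"role": "tool", "content": output})
    return "step limit reached"
```

      Everything else in a production agent (permissions, diffs, sub-agents) is elaboration on this loop.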

      I suspect that more could be done in terms of translating semi-naive user requests into the steps that a senior developer would take to enact them, maybe including the tools needed to do so.

      It's interesting that the author believes the best open-source models may already be good enough to compete with the best closed-source ones, given an optimized agent and maybe a bit of fine-tuning. I guess the bar isn't really matching the SOTA model, but being close to competent human level: a fixed bar, not a moving one. Adding more developer expertise by having the agent translate/augment the user's request/intent into execution steps would certainly seem to have potential to lower the bar for what the model needs to be capable of one-shotting from the raw prompt.

    • Yokohiii 1 hour ago
      That is why I am currently looking into building my own simple, heavily isolated coding agent. The bloat is already scary, but the bad decisions should make everyone shiver. Ten years ago people would rant endlessly about anything with more than one sharp edge that required a glimpse of responsibility to use. Now everyone seems to be either in panic or hype mode, ignoring all good advice just to stay somehow relevant in a chaotic timeline.
    • stanleykm 1 hour ago
      unfortunately all the agent cli makers have decided that simply giving it access to bash is not enough. instead we need to jam every possible functionality we can imagine into a javascript “TUI”.
      • HarHarVeryFunny 50 minutes ago
        If all you want is a program that calls the model in a loop and offers a bash tool, then ask Claude Code to build that. You won't like it though!

        For a preview of what it'd be like, just tell your AI chat app that you'll run bash commands for it, and please change the app in your "current directory" to "sort the output before printing it", or some such request.

        • senko 16 minutes ago
          Claude Code with Opus 4.6 regularly uses sed for multi-line edits, in my experience. On top of that, Pi famously exposes only 4 tools, which is more than just Bash but far more constrained than CC's 57 or so tools.

          So, yes, it can work.

        • Yokohiii 21 minutes ago
          I think you've got him wrong? He is already concerned about "bash on steroids", and current tools add concerning amounts of steroids to everything.
        • stanleykm 18 minutes ago
          i did.. and that's what i use. obviously it's a little more than just a tool that calls bash, but it is considerably less than whatever they are doing in coding agents now.
    • esafak 2 hours ago
      Tools gave humans the edge over other animals.
      • Yokohiii 19 minutes ago
        And those tools regularly burnt cities to ashes. Took a long time to get it under control.
  • Yokohiii 1 hour ago
    The example is really lean and straightforward. I don't use coding agents, but this is a good overview and should help everyone understand that coding agents may have sophisticated outcomes, but the raw interaction isn't magical at all.

    It's also a good example that you can turn any useful code component that requires 1k LOC into a mess of 500k LOC.

  • crustycoder 1 hour ago
    A timely link - I've just spent the last week failing to get a ChatGPT Skill to produce a reproducible management-reporting workflow. I've figured out why, and this article pretty much confirms my conclusions about the strengths and weaknesses of "pure" LLMs, and how to work around them. The article is for a slightly different problem domain, but the general problems and the architecture needed to address them seem very similar.