On Spec-Driven Development and AI

Engineering is hard

Let me start with a couple stories.

A team picks up notification preferences, the kind of feature where users choose how they want to be contacted. Brief discussion, general agreement, shipped in a couple weeks. Two days into production, it falls apart. The preference model didn’t account for how two downstream services resolve conflicts, and edge cases start cascading. Some users get duplicate notifications. Others get none. The fix isn’t a patch, it’s a re-architecture, a data migration, and a rollback. Two weeks of development, six weeks of cleanup, eroded trust, and a PM who no longer believes engineering estimates.

Second story. A manager says “we need to clean up the auth flow.” The engineer hears it and knows exactly what to do. No questions, no hesitation, the meaning is obvious to them. They go off, do the work, come back, and the manager goes “that’s not what I meant at all.” But to the engineer there was no ambiguity. It never crossed either of their minds that those words could mean anything else. They were both certain, just about different things.

This probably rings a bell. Most of you have seen it, maybe been on one or both sides of it. And some of you are also thinking, wait, this sounds a lot like working with AI. You ask for something simple, it overcomplicates it. You give it clear instructions, it runs off confidently in the wrong direction.

That’s exactly the point. If humans and AI fail in the same ways, the problem isn’t the person or the AI. It’s that nothing is in place to keep everyone aligned on the basics: the scope of the work, the definition of done, what’s deliberately out.

That’s what spec-driven development is about.

What is spec-driven development

The idea isn’t new. The automotive, aeronautics, and medical device industries have been doing some form of this for decades. In some of those fields it’s legally required. You’re accountable before a regulatory body, and the obvious reason is that when you’re designing, for example, an implant that goes inside a patient’s skull, the acceptable number of bugs is exactly zero. Ask me how I know.

The reasons it works go well beyond safety-critical systems though. The problems it solves are the ones we just talked about: people misaligned on what the work is, context trapped in one person’s head, late and expensive course corrections, decisions no one can trace or explain six months later.

In most software projects the path goes straight from idea to code. Maybe a Jira ticket with a few bullet points sits in between. The result is a chronic lack of context, clarity, and direction. Add AI to that picture and it gets worse. An AI agent given vague instructions will confidently produce code that compiles, maybe passes a few tests, but completely misses the point. The less context it has, the more it fills in the blanks with assumptions. And in software, assumptions are where most bugs come from.

SDD fixes that with a layered workflow. Each layer is more concrete than the one before it, and each one acts as guardrails for the next.

Layer             Objective
Spec              What are we building, and why?
Plan              How are we going to build it?
Tasks             Who does what, in what order?
Implementation    Write the code

Before any code gets written, everyone reviews and signs off on the first three layers. That’s the alignment checkpoint, the explicit moment where everyone confirms in writing that they’re building the same thing.

Why it works

Shared context. In most projects, context lives in people’s heads. It’s scattered across Slack messages, meeting notes, half-remembered conversations. Anyone stepping into the work, a new team member, a contractor, an AI agent, has to piece things together from fragments. The spec and plan centralize all of that. Everyone operates from the same source of truth.

Traceability. When something breaks six months from now, or someone questions a decision, you want to be able to trace from the code back to the reasoning behind it. In a spec-driven workflow that trail exists. Code points to a task, the task to a plan, the plan to a spec. You can reconstruct the full chain without hunting down the one person who remembers why. Assuming they’re still with the company.
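
One way to picture that trail is as a chain of records, each pointing one level up. The sketch below is only an illustration: the identifiers (FR-7, PLAN-3, T-12, abc123) and the trace helper are hypothetical, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Artifact:
    """One node in the trail: a commit, a task, a plan section, or a requirement."""
    ident: str
    kind: str
    summary: str
    parent: Optional[str] = None  # identifier of the artifact one level up


# Hypothetical records; in practice they would come from your repo and tracker.
ARTIFACTS = {
    a.ident: a
    for a in [
        Artifact("FR-7", "spec", "Users can mute a channel per device"),
        Artifact("PLAN-3", "plan", "Store mutes in a per-device table", parent="FR-7"),
        Artifact("T-12", "task", "Add device_mutes migration", parent="PLAN-3"),
        Artifact("abc123", "commit", "Create device_mutes table", parent="T-12"),
    ]
}


def trace(ident: str) -> list[str]:
    """Walk from an artifact up to the spec, returning each step as 'kind:id'."""
    chain = []
    current = ARTIFACTS.get(ident)
    while current is not None:
        chain.append(f"{current.kind}:{current.ident}")
        current = ARTIFACTS.get(current.parent) if current.parent else None
    return chain


print(" -> ".join(trace("abc123")))
# commit:abc123 -> task:T-12 -> plan:PLAN-3 -> spec:FR-7
```

The useful property is that the chain can be rebuilt from the artifacts alone, without asking anyone.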

Better AI output. AI agents are only as good as the context you give them. The same rule applies to humans, with the added charm that the AI will never tell you the request was unclear. It just guesses, confidently, and ships its guess. A one-line prompt produces something generic that may or may not fit your codebase. A structured spec and plan with clear constraints and acceptance criteria produce output you can actually merge. The difference isn’t subtle.

Cheaper to get right. Fixing a spec costs almost nothing, you’re changing a document. Fixing code after it’s been written, reviewed, merged, deployed, and delivered to a customer is a different story. A misunderstood requirement that makes it to production (say, a payment service that rounds currency wrong, or an API that exposes fields it shouldn’t) means a hotfix, a rollback, regression testing, possibly a security disclosure, and a few uncomfortable meetings. Beyond the direct cost, there’s the human side. When things break, people look for someone to blame, even unconsciously. With SDD, if something was missed, it was missed collectively, at the document stage. The dynamic shifts from blame to correction, which is the only direction post-mortems should ever go.

The hard conversations happen early. This is the most important one. Writing a spec and a plan forces PMs, EMs, architects, and engineers to confront ambiguity, disagreement, and gaps in understanding before any code exists. What happens in this edge case? Which team owns this data? Do we even agree on what the feature is supposed to do? In a code-first workflow, those conversations happen during implementation or code review, when the cost of changing direction is high and people are already attached to code they’ve written. Or worse, they happen after the proverbial you-know-what has hit the fan. Move them upstream, to the spec stage, and changing your mind is free.

If you can’t write it down clearly, you don’t understand it well enough to build it.

SDD in practice

So that’s the theory. Here’s what it looks like when AI is in the loop.

The first thing that changes is how you start. In a typical workflow you get an ask, open your editor, and start coding. Maybe you sketch something on a whiteboard first, but pretty quickly you’re writing code. With SDD, you don’t open your editor at all. You open a document, and you talk.

Call this the spec phase. You take the problem and you start asking domain questions, not implementation ones. What are we actually dealing with here? What are the assumptions? What are the unknowns? You have a back-and-forth with your coworkers, possibly with AI too, and the goal is to surface ambiguity and risk before you commit to anything. We sometimes do this already, but not often enough, not systematically enough, and without a framework to make it stick.

What you find, almost every time, is that there are questions nobody has answered yet. Edge cases nobody thought about. Assumptions two people on the team would answer differently. That’s exactly what you want to discover now, not three weeks into implementation.

As the discussions happen, you naturally write the spec. The spec answers what you’re building and why. It captures requirements, constraints, edge cases, and what’s explicitly out of scope. A good spec isn’t a novel. It’s specific enough that two people reading it independently would build the same thing. It’s also where you define what success looks like: acceptance criteria, performance targets, security constraints. If it affects whether the work is done correctly, it belongs in the spec.

Keep it small. Five or fewer user stories, twenty or fewer functional requirements, under 500 lines. You’re not writing a hundred-page requirements document. You’re writing user stories and functional requirements specific enough to be unambiguous, but that don’t dictate implementation. You care about the what and the why. The how belongs in the plan. Every word matters, because if a junior engineer or AI is going to work from this spec, contradictions and vagueness will show up in the output.
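
Those limits are easy to check mechanically. Here is a minimal sketch, assuming the spec marks stories with "User Story N" and requirements with "FR-NNN"; those markers, and the check_spec_size helper, are my assumptions for illustration, so adjust the patterns to whatever your template actually produces.

```python
import re

# Limits taken from the text above; the markers are assumptions about spec layout.
MAX_STORIES = 5
MAX_REQUIREMENTS = 20
MAX_LINES = 500  # the budget is "under 500 lines"

STORY_RE = re.compile(r"^\s*#*\s*User Story \d+", re.IGNORECASE)
REQUIREMENT_RE = re.compile(r"^\s*(?:[-*]\s+)?\**FR-\d+")


def check_spec_size(text: str) -> list[str]:
    """Return a human-readable list of size limits the spec exceeds."""
    lines = text.splitlines()
    stories = sum(1 for line in lines if STORY_RE.match(line))
    requirements = sum(1 for line in lines if REQUIREMENT_RE.match(line))
    problems = []
    if stories > MAX_STORIES:
        problems.append(f"{stories} user stories (limit {MAX_STORIES})")
    if requirements > MAX_REQUIREMENTS:
        problems.append(f"{requirements} functional requirements (limit {MAX_REQUIREMENTS})")
    if len(lines) >= MAX_LINES:
        problems.append(f"{len(lines)} lines (must be under {MAX_LINES})")
    return problems


# A deliberately oversized sample: one story, twenty-two requirements.
sample = "\n".join(
    ["User Story 1 - Mute a channel"]
    + [f"- FR-{n:03d}: requirement text" for n in range(1, 23)]
)
print(check_spec_size(sample))
# ['22 functional requirements (limit 20)']
```

A failing check is a prompt to split the feature, not to trim words until the count passes.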

Iterate in tight loops. Write a draft, review it, possibly with the help of AI, then discuss it with your team. If a teammate or the AI misunderstands something, your spec isn’t clear enough. If they ask questions you can’t answer, you need to go back to research. The back-and-forth is cheap, you’re fixing a document, not refactoring code.

Once the spec is solid, you move to the plan. The plan answers how. Architecture decisions, data models, technology choices and rejections, integration points. A good plan makes trade-offs explicit. If you picked approach A over B, say why you picked A and also why you didn’t pick B. If there’s a known risk, name it. This is also where you catch feasibility problems early. Something that sounded perfectly reasonable in the spec might turn out to be impractical once you actually think through the implementation. Sometimes the plan feeds back into the spec and a requirement turns out to need rethinking. That’s fine, it’s the process working.

Then you break the plan into tasks. Each task is small enough to be manageable, with enough context that whoever picks it up doesn’t need to reverse-engineer the intent. Dependencies are explicit. Sequencing is deliberate. For a human team, that means a shorter critical path. When AI is doing the implementation, sequencing matters even more. The agent won’t stop and ask, so your task order is the execution path. Think of it less as a backlog and more as a set of rails for the agent.
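
Explicit dependencies are what make the order computable. The sketch below uses Python's standard graphlib module to turn a dependency map into an execution order. The task names are hypothetical, and this is only an illustration of the idea, not how Spec Kit sequences its task list.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key depends on the tasks in its set.
TASKS = {
    "T1 data model": set(),
    "T2 migration": {"T1 data model"},
    "T3 API endpoint": {"T2 migration"},
    "T4 spec tests": {"T3 API endpoint"},
    "T5 docs": {"T1 data model"},
}


def execution_order(tasks: dict[str, set[str]]) -> list[str]:
    """Return an order in which every task comes after its dependencies."""
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError as err:
        # A cycle means the task list itself is inconsistent: fix the plan first.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


for step, task in enumerate(execution_order(TASKS), start=1):
    print(step, task)
```

Tasks with no dependency between them (T5 and T2 here) can land in either order, which is exactly the slack a human team can use to work in parallel.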

And then, before anyone writes a line of code, everything gets reviewed. The spec, the plan, the tasks. Everyone involved signs off. This is the alignment checkpoint, and it’s not a formality. It’s the moment you make sure the thing you’re about to build is the exact thing everyone agreed to build. Without this checkpoint, all the upstream work was for nothing.

Only then does implementation start. If everything upstream was done well, implementation becomes the easy part. The engineer or the AI has all the context, constraints, and acceptance criteria they need. There’s nothing left to guess at. And because everything traces back, when something doesn’t look right, you know exactly where to look. You don’t just fix the code, you fix the spec or the plan and the code follows.

The code is still what gets shipped, but the spec and the plan are now the source of truth, and the code follows from them. If the result isn’t what you expected, don’t go straight to patching the code. That risks drifting out of alignment with what was agreed, and drifting out of alignment can be very expensive. Fix the document, then reimplement.

Spec tests

Letting AI write the implementation on its own, even with a solid spec and plan, can feel uncomfortable. Maybe even scary. Things tend to feel scary when they’re unknown though. Spec tests close that gap.

Spec tests are not unit tests or integration tests. Unit tests check internal behavior, integration tests check the seams between components, and spec tests check that the implementation does what the spec says it should. Every functional requirement gets a corresponding test, derived directly from the spec, expressing behavior and outcomes, not internal code paths. If the spec says the API returns a 403 when the user lacks permission, there’s a test for exactly that.
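
To make the 403 example concrete, here is a minimal sketch of what such a test could look like. Everything in it is hypothetical: the requirement ID FR-012, the delete_report function standing in for a real endpoint, and the permission names. The point is the shape: the test name and docstring point at the requirement, and the assertion is about observable behavior only.

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass(frozen=True)
class User:
    name: str
    permissions: frozenset[str]


def delete_report(user: User, report_id: int) -> HTTPStatus:
    """Stand-in for the real endpoint; only the status code matters here."""
    if "reports:delete" not in user.permissions:
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.NO_CONTENT


def test_fr_012_returns_403_without_permission():
    """FR-012 (hypothetical): deleting a report without permission returns 403."""
    viewer = User("viewer", frozenset({"reports:read"}))
    assert delete_report(viewer, report_id=1) == HTTPStatus.FORBIDDEN


def test_fr_012_allows_delete_with_permission():
    """The other half of FR-012: the permitted path must still succeed."""
    admin = User("admin", frozenset({"reports:read", "reports:delete"}))
    assert delete_report(admin, report_id=1) == HTTPStatus.NO_CONTENT
```

Nothing in the test knows how the permission check is implemented, so the implementation can be rewritten freely as long as the behavior holds.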

This closes the loop. Implementation isn’t complete until there’s 100% spec test coverage and every test passes. It doesn’t matter whether a human wrote the code or an AI generated it. The bar is the same, and it’s objective.
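
"100% spec test coverage" can itself be checked by a script. A minimal sketch, assuming requirements are labelled FR-NNN in the spec and tests are named test_fr_NNN_...; both conventions are my assumptions for illustration, not something Spec Kit enforces.

```python
import re

FR_ID = re.compile(r"FR-\d+")
TEST_NAME = re.compile(r"test_fr_(\d+)")


def uncovered_requirements(spec_text: str, test_source: str) -> list[str]:
    """Return requirement IDs that appear in the spec but in no test name."""
    required = set(FR_ID.findall(spec_text))
    # Test names use underscores, so rebuild the ID in spec form before comparing.
    # Zero-padding must match between the spec and the test names.
    covered = {f"FR-{number}" for number in TEST_NAME.findall(test_source)}
    return sorted(required - covered)


spec = (
    "FR-012: The API returns 403 when the user lacks permission.\n"
    "FR-013: Deleted reports are archived for 30 days.\n"
)
tests = "def test_fr_012_returns_403_without_permission(): ..."
print(uncovered_requirements(spec, tests))
# ['FR-013']
```

This only proves a test exists for each requirement, not that the test is any good. That part still belongs to review.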

The fear of AI writing code comes from the feeling you’re giving up control. Spec tests give that control back. You defined what the system should do, and you wrote the tests that verify it. The AI just fills in the middle.

Yes, but

Three objections come up almost every time SDD is presented. They’re worth addressing.

“The documentation phase takes too long.” True, it does take time. But consider what that time is buying you. The documents themselves aren’t the point. The point is the conversations they force: ambiguities surfaced, disagreements made visible, gaps exposed. Those problems exist whether you write a spec or not. The only question is when you discover them. Now, when fixing them costs a paragraph, or later, when fixing them costs a sprint, a rollback, and a few difficult meetings. Ignoring problems doesn’t make them disappear. The upstream time genuinely is longer, but because you catch misunderstandings before they become hotfixes and rollbacks, SDD shortens the distance between “we agreed” and “it’s in production, working as intended.”

“Spec language feels stiff.” Engineers who aren’t used to reading specifications find them unnatural compared to a casual Slack message. Fair, it does feel different. There’s a reason for it though. Specification language is precise because it has to be. Every sentence is written to close the door on misinterpretation. There are no gaps to fill in with assumptions, no room for “well, I thought it meant…”. That’s the whole point. It reads the way it reads because natural language is full of ambiguity, and ambiguity is exactly what we’re trying to eliminate. For what it’s worth, if you’ve ever tried reading a patent, you know it can get much worse. Spec language is the friendly version.

“I lose control of my code.” With AI writing the code, it can feel like you’re no longer in the driver’s seat, and in the literal sense that you aren’t typing it, that’s true. With SDD, you’re still the one steering. Control just shifted left, to the documentation phase. You define what gets built in the spec. You define how in the plan. You challenge both during review. The discussions and alignment checkpoints are where the real engineering decisions happen. When you write C and compile it, you don’t agonize over the assembly that comes out. You trust the process because you controlled the input. The same principle applies here.

When SDD doesn’t apply

There are cases where the full SDD workflow is overkill, and it’s worth being honest about them.

If you’re prototyping or exploring, trying to figure out whether an idea even makes sense, writing a full spec before touching code will slow you down for no good reason. You don’t need alignment on something you might discard by Friday.

The same goes for small, well-understood changes. If the task is fixing a typo or a straightforward bug where the problem and solution are both obvious, running it through spec, plan, and tasks is ceremony for ceremony’s sake.

If production is on fire, you patch first and document after. SDD prevents fires. It isn’t paperwork to fill out while the building burns.

And if you’re the domain expert working on complex, specialized code you know inside out, SDD can get in the way more than it helps. The whole point is to surface assumptions and align people. If there’s only one person and the assumptions live in their head with high fidelity, the cost can outweigh the benefit.

As a rule of thumb, the more people or agents are involved, the more ambiguity exists, and the higher the cost of getting it wrong, the more SDD pays for itself. A solo developer hacking on a personal project doesn’t need a spec. A team of twelve building a payment system absolutely does. Most real work falls somewhere in between, and the judgment call is about how much process the situation actually warrants.

The toolkit

We settled on Spec Kit and OpenCode, and we’ve been using them for about a month with good results.

Spec Kit is an open-source toolkit built specifically for this workflow. It gives your AI agent a set of structured commands that map directly to the SDD phases. It’s agent-agnostic, currently the most mature option in the space, and it does a good job keeping the AI in its lane, in the right phase at the right time. It’s extensible and customizable, which we’ve taken advantage of.

For the coding agent we use OpenCode. Claude Code and OpenCode are comparable in capability, both terminal-based, both handle multi-file edits and agentic workflows well. We picked OpenCode because it’s model-agnostic, you plug in whichever provider fits your needs, and because it supports plugins, which makes it easy to extend. Claude Code locks you into Anthropic’s ecosystem. That’s fine, we just value the flexibility.

Two more tools sit alongside these. Sesame is an internal (to Anaconda) MCP server I built to expose Anaconda-specific context to the AI flow. A well-written ticket lets it surface broad context, including Slack “watercooler” discussions, and that’s typically the critical missing piece when working with AI. We’ve found this kind of context tool essential to the workflow, so if you’re outside Anaconda you’ll want to build your own equivalent or wire up existing MCP servers for Jira, Confluence, Slack, and the rest. The off-the-shelf ones work, though not as well as something purpose-built.

oh-my-openagent is a heavily customized OpenClaw fork that adds an orchestration layer with four top-level agents and six sub-agents. It’s OpenCode-only, but if you’re already on OpenCode it’s worth turning on.

A warning on Cursor: it doesn’t play well with Spec Kit. It has persistent issues recognizing Spec Kit’s commands, and the experience is unreliable. Terminal-based agents are a much better fit for this workflow.

Some Spec Kit commands generate files. Treat them as a starting point, not a finished product. Discussions with the team, possibly with the AI, and a few iterations are what make them final.

/review deserves its own callout because it shows up at several points in the workflow. It’s a slash command that comes with OpenCode and Claude Code, not with Spec Kit. In our context, its job is narrow and useful: surface gaps and conflicts in whatever documents you’ve produced so far. Use it after /speckit.clarify to pressure-test the spec, after /speckit.analyze to catch what slipped through, and after /speckit.implement to check the code against the spec and plan. It’s cheap, it’s worth running more than once, and it tends to find things a single pass missed.

Static analysis

AI-generated code tends to carry a high rate of the issues that automated tooling catches easily, things like type mismatches, convention violations, and security anti-patterns. Configure linters, formatters, and type checkers to be strict. Treat warnings as errors in CI.

The agent uses these checks as a real-time feedback loop. The linter or type checker flags problems as soon as they appear, and the agent corrects course before moving on. Cranking these up to maximum strictness can be abrasive when humans are writing the code, but the AI doesn’t get frustrated and it doesn’t push back. It just fixes the issue. Every skipped rule is a class of bugs the AI can quietly introduce. Consider adding SAST tools like CodeQL or Semgrep for security-relevant patterns. Automate every check you’d otherwise run by hand, then never think about it again.
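
As a rough sketch of "automate every check", the script below runs a list of commands and reports every one that exits non-zero. The tool choices (ruff and mypy in strict mode) are assumptions for a Python project, not a description of our actual setup; swap in whatever your stack uses.

```python
import subprocess
import sys

# Assumed tool choices for a Python project; replace with your own.
CHECKS = [
    ["ruff", "check", "."],
    ["ruff", "format", "--check", "."],
    ["mypy", "--strict", "."],
]


def run_checks(checks: list[list[str]]) -> list[str]:
    """Run every check and return the ones that failed (non-zero exit)."""
    failed = []
    for cmd in checks:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            # A missing tool counts as a failure: a skipped check is a silent gap.
            failed.append(" ".join(cmd) + " (not installed)")
            continue
        if result.returncode != 0:
            failed.append(" ".join(cmd))
    return failed


def main() -> int:
    """Print each failing check and return a CI-friendly exit code."""
    failures = run_checks(CHECKS)
    for name in failures:
        print(f"FAILED: {name}")
    return 1 if failures else 0
```

In CI you would end the script with sys.exit(main()). Running every check instead of stopping at the first failure gives the agent the full list to fix in one pass.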

The constitution

The constitution is the foundational document for your project, the non-negotiable principles your AI coding agent must respect at all times. Spec Kit has a /speckit.constitution command that helps you draft an initial version, but the constitution isn’t something you regenerate. You write it once and refine it over time. A good constitution isn’t specific to one project, it’s something you carry from one to the next.

This is where you encode what matters. Engineering principles like DRY, explicit over implicit, fail early and loudly. Quality gates: testing conventions, linting, type checks, review requirements. Hard prohibitions: type suppressions, swallowed exceptions, hardcoded values. A good constitution is mostly restrictive, not aspirational. It tells the AI what it cannot do, because left to its own devices it defaults to whatever patterns are most common in its training data. If your project deviates from those, you need to say so.
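
Hard prohibitions work best when a machine enforces them. The sketch below flags two of the ones named above in Python source: type suppressions and swallowed exceptions. It is deliberately crude (a "# type: ignore" inside a string literal would be a false positive), and the find_violations helper is hypothetical; a real project would more likely lean on existing linter rules.

```python
import ast


def find_violations(source: str) -> list[str]:
    """Flag type suppressions and exception handlers that silently do nothing."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "# type: ignore" in line:
            violations.append(f"line {lineno}: type suppression")
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            # A handler whose whole body is `pass` discards the error.
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                violations.append(f"line {node.lineno}: swallowed exception")
    return violations


sample = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        pass
    return None  # type: ignore
'''
print(find_violations(sample))
# ['line 7: type suppression', 'line 5: swallowed exception']
```

Wiring a check like this into CI turns a sentence in the constitution into something the agent cannot quietly ignore.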

It’s a living document, but it evolves through deliberate improvement, not replacement. As your team works with the AI, you’ll notice gaps. Patterns you assumed were obvious that the AI keeps getting wrong. Conventions the team has internalized but never wrote down. Go back and update it. That iteration is expected.

The workflow, step by step

Step 1: the ticket. The leads co-author it. EM, PM, tech lead. Include detailed background, especially when there’s no parent epic or sibling tickets for context. Every functional requirement, every acceptance criterion, every deliverable. Make them precise, unambiguous, and AI-consumable. A well-written ticket is what lets Sesame pull the right context out of Confluence, Google Docs, and Slack.

Step 2: draft the spec. Run /speckit.specify to create a branch and a directory where all SDD documents for this feature will live. We’ve modified Spec Kit to require a Jira ticket as part of the name, which makes tracking much easier across tools. The spec is the most important document in the entire workflow. It captures what the program is supposed to accomplish and why. Everything downstream flows from it. A good spec contains user stories (objectives, not technical steps), acceptance criteria (scenarios, failure modes, edge cases), and functional requirements (testable, unambiguous, what not how).
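
To give an idea of what requiring a ticket in the name buys you, here is a sketch that extracts the Jira key from a branch name and rejects names without one. The naming pattern (ticket key, then a lowercase slug) and the example key NOTIF-123 are hypothetical; this is not our actual Spec Kit modification.

```python
import re

# Hypothetical convention: <JIRA-KEY>-<short-slug>, e.g. "NOTIF-123-mute-per-device".
BRANCH_RE = re.compile(
    r"^(?P<ticket>[A-Z][A-Z0-9]+-\d+)-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)$"
)


def ticket_from_branch(branch: str) -> str:
    """Return the Jira key embedded in a branch name, or raise if there is none."""
    match = BRANCH_RE.match(branch)
    if match is None:
        raise ValueError(f"branch {branch!r} must start with a Jira ticket key")
    return match.group("ticket")


print(ticket_from_branch("NOTIF-123-mute-per-device"))
# NOTIF-123
```

With the key recoverable from the branch, every spec directory, commit, and pull request can be linked back to the ticket without manual bookkeeping.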

Step 3: edit the spec. Review it line by line, word by word. If you have spare time anywhere in the process, this is where to spend it. Let the AI make the changes rather than editing manually. It handles downstream consequences and retains the change in context.

Step 4: clarify and review. Run /speckit.clarify, then accept, modify, or answer its questions and recommendations. Then run /review to surface gaps and conflicts. You can run it more than once, one pass won’t always catch everything. Address all suggestions, not just the critical ones. Minor issues compound once you’re in implementation. Run /review again every time the spec is updated.

Step 5: the plan. Run /speckit.plan to turn the spec’s intent into a concrete technical design covering architecture, data models, technology choices, API contracts, integration points, trade-offs, and risks. The plan and spec inform each other, and that back-and-forth is where a lot of real engineering happens. Review research.md and plan.md with the same rigor as the spec. Review data_model.md and api.md if they exist. The other generated files only need a glance, but commit them all, Spec Kit’s scripts depend on them being present.

Step 6: the tasks. Run /speckit.tasks to generate a sequenced task list with explicit dependencies. The resulting tasks.md only needs a quick glance.

Step 7: verify alignment. Run /speckit.analyze. It checks that tasks align with the plan, the plan aligns with the spec, and nothing has drifted. It lists issues ranked by criticality and offers to fix only the most critical ones. Fix them all. Ask the AI to suggest remediations, then approve or iterate. Even the smallest issue has a tendency to snowball into a big problem by the end of implementation. Run /review again after changes.

Step 8: final review. This is the real alignment checkpoint. Consult everyone concerned, expect discussions, and have all leads sign off. You’re trying to catch issues before implementation (or worse, production), and to make sure the AI-assisted implementation is on rails.

Step 9: implementation. Run /speckit.implement and the agent works through the task list, building and testing as it goes. You’re still the human in the loop, you can intervene at any point, but the goal is that intervention is the exception. Run /review after the agent is done, and run it again after every substantive change. Minor cosmetic fixes by hand are fine. For anything substantive, update the spec or plan first, then re-run /speckit.implement. Re-implementing is cheap, and drifting out of alignment is expensive.

Conclusion: What SDD is and isn’t

SDD isn’t AI. AI is a tool that benefits enormously from SDD, but you can do SDD without ever touching an LLM. SDD’s job, when AI is in the picture, is to keep the agent on rails. A vague prompt produces vague code, a well-reviewed spec and plan produce code that solves the actual problem.

SDD isn’t about generating documents. The artifacts exist to surface risks, ambiguities, and disagreements while they’re still cheap to fix. The point is the conversations it triggers, not the files it produces. The gates work the same way: the clarify pass, the review pass, the sign-off all exist because each forces a conversation that would otherwise happen too late, or not at all. As Eisenhower put it: “Plans are useless, but planning is indispensable.”

SDD isn’t mandatory. A one-line bug fix or a throwaway prototype doesn’t need a four-layer review process. Use it when ambiguity, multiple stakeholders, or production stakes enter the picture. That’s where it earns its weight.

SDD isn’t set in stone. The workflow we use today isn’t the one we used six months ago, and it won’t be the one we use six months from now. The constitution gets updated, templates change, steps get added or dropped as we learn what’s actually load-bearing. Treat it as a living practice your team owns, not a checklist handed down from above.

SDD isn’t a solo exercise. The leads co-author the ticket and the spec gets reviewed by everyone concerned. The plan triggers discussions, and the final review is a real checkpoint with real sign-off. If one person is writing the spec alone in a corner, you’ve reproduced the original problem under a different name.

One last thing: SDD is invisible when it works. No firefighting, no rollbacks, no last-minute discoveries, which can make it look like the process wasn’t necessary. It’s the same trap as questioning a vaccine when you don’t get sick. The absence of problems is proof it’s working, not evidence you could have skipped it.