My first Spec-Driven Development project
Previous: Introducing FrozenDB
“Look ma, no hands” – I wrote FrozenDB without writing a single line of code myself. Instead, I used spec-driven development to generate all of the user stories, map them carefully to technical requirements, and have AI generate the code. I’ll go into SDD more deeply later, but for this post I wanted to share some of the lessons I learned over the 40 specs I wrote as part of the process.
Start small
My first few specs took a lot of development cycles. I spent a lot of time in the /speckit.plan stage refining the research and the approach I wanted to take. Then, when I let the AI implement the specs, it went off and did something I didn’t want it to do. So I would have to go back to an earlier step (usually the spec.md), refine the requirements, and update every other document. This took a lot of time and context switching. Eventually I realized my user stories were too big. For example, my first spec covered creating the database file, defining a header, and making it append-only. It took a very long time to get right and involved a lot of tuning at every step of the spec.
After doing this for a while, my rule of thumb is: 5-10 functional requirements per spec is the sweet spot. More than 20 requirements is a definite sign you’re doing too much in one spec. The easiest way to reduce functional requirements is to simplify or remove user stories. I would suggest starting with one or two user stories per spec until you get better at the process.
Every word matters, so say less
“Why is this error class here? How dare you!” was a common reaction I had to many of the specs. FrozenDB has structured errors, so in the spec phase I was particular about which types of errors should be thrown, and under what conditions. What was happening was that spec-kit was generating a quickstart.md file. However, the spec-kit template gives very few instructions about what actually goes in this file, so the output tended to be very verbose, and it made up new patterns along the way. I wasn’t really reviewing the file, since its contents were supposed to be redundant with my API contract (until they weren’t). So, I removed it: I changed the spec-kit templates to never reference quickstart.md, and I deleted all of those files from my repository.
This highlights a key tradeoff with LLMs: the more text you provide, the greater your control over the output. But context has no inherent hierarchy, so every additional sentence also gives the LLM more latitude in choosing which parts of the context to attend to, especially if your text contains inconsistencies.
Write just enough to define the behavior you care about, but no more. Identify common mistakes and duplication the AI makes, and modify your templates/skills/commands to remove these redundancies.
Here are some of the changes I made:
Avoid full code snippets in specs
Writing full code implementations in your specs is a bad idea. It locks the implementation into one specific path without ever ensuring the code compiles or even makes logical sense. Then every future edit has to deal with the tension of incorrect code sitting in the specification. It’s very tempting at first to think this is a good idea, since it feels like you can control exactly what the AI generates. In practice it’s a crutch, and a sign that your written documentation doesn’t properly define what you want. I solved this tendency with template instructions like: “Exclude implementation details that limit implementation flexibility,” and I repeated this type of instruction for each of the document types being produced.
Don’t repeat yourself
Oftentimes the data-model, research, and api.md files would all contain the same things (including quickstart usage, implementations, and more). So my templates now contain language like: “Do not include error handling patterns or usage examples (put them in api.md instead)” or “Do not include redundant documentation of existing codebase structure in research.md.”
Over time, this has cut down on the word count of my specs without sacrificing the accuracy of the implementation. It saves me time when creating and reviewing the specs, and it helps me make sure my documentation is clear, concise, and focused on the details that matter.
All tests green – the cake is a lie
The constitution, plan, and tasks will all say “create tests for this feature.” And then the AI will do the dumbest things to avoid actually writing coherent tests. Some failure patterns I saw with FrozenDB include:
- Stubbing out tests with a TODO and never coming back to them
- Writing tests that don’t have any assertions
- Writing one or two tests, then checking all of the tasks that say “implement tests” as completed
- Mocking out the entire implementation, so the tests always pass
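To make the “no assertions” failure pattern concrete, here is a minimal Go sketch (parseHeader and the "FRZN" magic bytes are hypothetical stand-ins, not FrozenDB’s actual format): the first call merely exercises the code, so it can never fail; the second actually asserts on behavior.

```go
package main

import "fmt"

// parseHeader is a hypothetical stand-in for real database logic:
// it accepts a header only if it starts with the "FRZN" magic bytes.
func parseHeader(b []byte) (string, error) {
	if len(b) < 4 || string(b[:4]) != "FRZN" {
		return "", fmt.Errorf("bad magic")
	}
	return string(b[:4]), nil
}

func main() {
	// The hollow "test" an AI might write: calls the code, checks nothing.
	// This always "passes", even if parseHeader accepted garbage.
	parseHeader([]byte("FRZN0001"))

	// A real test: fails loudly when the behavior is wrong.
	if _, err := parseHeader([]byte("XXXX0001")); err == nil {
		panic("expected an error for a corrupt header magic")
	}
	fmt.Println("assertions ran")
}
```

The difference is invisible in a green test run, which is exactly why these hollow tests survive unless you read them.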
In the world of SDD, tests become more important because it’s the primary way of giving feedback to the AI implementation loop.
Here’s what I found to work for FrozenDB in order to produce high-quality tests that actually verify the functional requirements of the specs:
- When authoring the spec.md, make sure all the functional requirements can actually be verified through some form of test, and iterate on them until they are verifiable. I had to do the most work here on performance-based features, and often derived a specific proxy metric for the AI to use for correctness
- I created a new term, the “spec test,” and defined it: every functional requirement must have a test that validates it through explicit assertions
- I put mentions of spec tests everywhere, including AGENTS.md, constitution.md, and the task template
- This usually caused there to be one task per functional requirement to implement its spec test
- I broke up the implementation into pieces, like this:
/speckit.implement Implement just the spec tests for User Story X. Once those were implemented, I would take a quick browse over the tests to ensure they were proper tests, before running /speckit.implement again to implement the remaining code
It’s worth calling out the last point. Depending on your agentic setup, I found the AI is not really willing to follow tasks in order as directed. The willingness to skip steps is roughly proportional to the length of the task list, so I would suggest limiting the number of tasks to 15 or fewer at a time. If your tasks.md contains 50 tasks, you should either remove or consolidate steps, or run it in 3-4 chunks (e.g. /speckit.implement tasks 1-15, and so on).
Pay attention to implementation difficulties
One of the primary differences between vibe coding and SDD is that SDD has direct supervision from the developer. Primarily this comes through the specs, but even so, specs get things wrong. You can detect and correct problems by looking at the AI’s reasoning process in addition to the final code output. For example, several times during development, I saw the reasoning step of an implementation look like this: “The user wants ABC but wait there’s XYZ. Oh I know! Let me do 123. But wait, the user wants CDE”. If you’ve done enough AI programming, you’ve seen this play out many times. Even if the AI figures out an answer, this type of looped reasoning points to a flaw in your specs that you should correct. Additionally, when looking at the final output, it’s important to watch for any coding patterns the AI implemented that were not defined during the planning process. These are signs of unexpected complexity.

For instance, I created a spec to add fswatcher notifications, so that FrozenDB would be updated when a different writer updated the database. During implementation, I noticed an oddity. The code was supposed to be using structured updates to know when the file size changed, but instead it was reading directly from disk:
func (fm *FileManager) Size() int64 {
	// In READ mode, we need to get the actual file size from the OS because
	// external writers may have appended to the file. The cached currentSize
	// is only accurate for WRITE mode where this FileManager controls all writes.
	if fm.mode == MODE_READ {
		file := fm.file.Load().(*os.File)
		if file == nil {
			// File is closed, return cached size
			return int64(fm.currentSize.Load())
		}
		// Get actual file size from OS via stat
		stat, err := file.Stat()
		if err != nil {
			// On error, fall back to cached size
			return int64(fm.currentSize.Load())
		}
		return stat.Size()
	}
	// In WRITE mode, use the cached atomic size which is accurate
	return int64(fm.currentSize.Load())
}
The red flag is that file.Stat() call. The AI did all this work to wire up notifications, piped the data all the way through, and then realized the code didn’t work, so it fell back to file.Stat(). Seeing this implementation helped me realize there was a coupling problem. So I stopped working on that branch and created a new spec to decouple the logic of detecting file-size changes from the write path. Afterwards, the code implementing read detection was much simpler, and the AI was able to implement the feature successfully. The new logic was just about the fswatcher, because the earlier refactoring had made consuming updates simple.
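The rough shape of that decoupling can be sketched like this (the names here are illustrative, not FrozenDB’s actual types): size observation becomes its own small piece of state, fed from exactly two places, so the read path never reaches around to Stat() on its own.

```go
package main

import (
	"fmt"
	"os"
	"sync/atomic"
)

// trackedSize answers "how big is the file right now?" and nothing else.
// It is updated from exactly two places: our own write path, and the
// fswatcher goroutine reacting to external writers. Readers only ever
// call Size(), and never touch the OS directly. (Illustrative sketch,
// not FrozenDB's real implementation.)
type trackedSize struct {
	n atomic.Int64
}

// Size is the only method the read path sees.
func (t *trackedSize) Size() int64 { return t.n.Load() }

// OnWrite is called by the write path after a successful append.
func (t *trackedSize) OnWrite(newSize int64) { t.n.Store(newSize) }

// OnChange is called by the fswatcher goroutine when another writer
// touches the file; only here do we consult the OS via Stat.
func (t *trackedSize) OnChange(path string) {
	if st, err := os.Stat(path); err == nil {
		t.n.Store(st.Size())
	}
}

func main() {
	var ts trackedSize
	ts.OnWrite(128) // our writer appended up to byte 128
	fmt.Println(ts.Size()) // prints 128
}
```

With the size source isolated like this, the fswatcher feature reduces to calling OnChange from the event loop, which matches how much simpler the second attempt turned out to be.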
This type of subtle but important observation is what makes the difference between a well-crafted program that happens to use AI for development and a toy that solves a problem but is difficult to maintain or evolve. It is extremely tempting to gloss over the code implementation, but you must resist that temptation. Instead, you need a mental model of what you expect the code to do, and you need to become an expert at quickly identifying the places where the implementation diverges from it, or where there are critical points that must work correctly.