The WhatsApp Chatbot That Took 34 Iterations (And Why That’s Normal)
We are on version 3.05 of a WhatsApp chatbot we built for a legal services client.
Version 1.0 was a skeleton. It worked in the sense that it didn’t crash. It didn’t work in the sense that it would say things we hadn’t asked it to say, ignore things we needed it to do, and occasionally take conversational turns that made perfect sense to a language model and made no sense to a human trying to get legal help at 11pm on a Wednesday.
Between version 1.0 and version 3.05: 34 iterations.
And before you assume we made a mess of the initial build — we didn’t. The 34 iterations weren’t the product of sloppy engineering. They were the product of the fundamental nature of AI conversational agents.
This is the part nobody tells you when you decide to build a chatbot.
The Assumption That Gets Everyone in Trouble
When most businesses decide they want a chatbot, they’re picturing a scripted flow. A decision tree. If the user says X, the bot says Y. If the user selects option 3, show them the option 3 content. Press 1 for sales, press 2 for support. You’ve been through that menu. You know it.
Scripted flows are deterministic. You can test every path. You know every possible output before you launch. When something goes wrong, it goes wrong in a predictable way and you can patch the specific branch.
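To make "you can test every path" concrete, here is a minimal sketch of a scripted flow as a lookup table. The menu contents and the names MENU and all_paths are made up for illustration; the point is only that a finite tree can be walked exhaustively.

```python
# Hypothetical phone-menu style flow: every node lists its prompt and
# which key press leads to which next node. Nodes with no options are endpoints.
MENU = {
    "start": {"prompt": "Press 1 for sales, press 2 for support.",
              "options": {"1": "sales", "2": "support"}},
    "sales": {"prompt": "A sales rep will call you back.", "options": {}},
    "support": {"prompt": "Press 1 for billing, press 2 for technical help.",
                "options": {"1": "billing", "2": "technical"}},
    "billing": {"prompt": "Connecting you to billing.", "options": {}},
    "technical": {"prompt": "Connecting you to technical help.", "options": {}},
}


def all_paths(node="start", keys=()):
    """Yield (key presses, final node) for every route through the menu."""
    options = MENU[node]["options"]
    if not options:
        yield keys, node
        return
    for key, next_node in options.items():
        yield from all_paths(next_node, keys + (key,))


if __name__ == "__main__":
    for keys, endpoint in all_paths():
        print(" > ".join(keys), "->", endpoint)
```

Because the list of paths is finite, a test suite can cover all of them before launch.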
AI conversational agents are not scripted flows. They’re language models. They don’t follow a decision tree. They generate a response based on everything they’ve been told (the system prompt, the conversation history, their training), sampling each next word from a probability distribution over likely continuations rather than looking up a fixed answer. Every conversation is a new inference. The same input does not always produce the same output.
That’s the non-determinism problem. And it’s not a bug. It’s how the technology works.
But it means that testing an AI chatbot is fundamentally different from testing a scripted flow. You can’t enumerate every path. You can’t guarantee behavior through exhaustive testing. What you can do is define the system prompt tightly, test for edge cases relentlessly, and iterate based on what you find in real conversations.
Which is what we did. 34 times.
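One way to picture what testing looks like under non-determinism: instead of asserting a single expected reply, you send the same message many times and measure how often the reply passes a check. The sketch below uses a toy stand-in for the model (toy_bot, with invented replies and weights), not a real API, and stays_in_scope is a deliberately crude check. It is an illustration of the idea, not our actual test harness.

```python
import random

# Toy stand-in for a language model: for one fixed input it samples a reply
# from a weighted set of candidates. Replies and weights are invented.
CANDIDATE_REPLIES = {
    "Yes, we cover your area. Can I take a few details?": 0.7,
    "We do! Also, here is some general advice on your dispute.": 0.2,
    "Great question! Have a wonderful evening!": 0.1,
}


def toy_bot(user_message: str, rng: random.Random) -> str:
    """Return one sampled reply; the same input can yield different outputs."""
    replies = list(CANDIDATE_REPLIES)
    weights = list(CANDIDATE_REPLIES.values())
    return rng.choices(replies, weights=weights, k=1)[0]


def stays_in_scope(reply: str) -> bool:
    """Crude check: no out-of-scope advice, no cheerful non-answer."""
    text = reply.lower()
    return "advice" not in text and "wonderful evening" not in text


def pass_rate(user_message: str, check, runs: int, seed: int = 0) -> float:
    """Send the same message `runs` times and report the share that pass."""
    rng = random.Random(seed)
    passed = sum(check(toy_bot(user_message, rng)) for _ in range(runs))
    return passed / runs


if __name__ == "__main__":
    rate = pass_rate("do you operate in my area?", stays_in_scope, runs=1000)
    print(f"in-scope rate over 1000 runs: {rate:.1%}")
```

The number you get is a rate, not a guarantee, which is why the target for each iteration is to move that rate rather than to prove a path correct.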
What Actually Changed Between Versions
Version 1.0 gave us a system prompt and a basic personality. It could answer questions about the service area, take basic intake information, and handle simple routing. That’s all it was trying to do.
And it was immediately apparent that “trying to do” and “actually doing reliably” are two different things.
The chatbot would occasionally be helpful in ways we hadn’t asked for — answering adjacent legal questions that were outside our scope, being expansive when it should have been concise. It would sometimes be unhelpful in ways we hadn’t anticipated — misinterpreting informal language, or responding to frustration with a cheerful non-answer that made things worse.
Each iteration addressed specific failure modes. Tightened the system prompt. Added constraints. Defined the boundaries more explicitly. Tested again. Found new failure modes. Iterated again.
This is not a process that has a natural endpoint. There is no version of the chatbot that you finish. You ship something that works well enough to be useful, you run it, you watch it, you fix what you find, and you keep going. Version 3.05 works meaningfully better than version 1.0. Version 4 will work meaningfully better than 3.05.
That’s the rhythm of AI development. Iteration isn’t a sign of failure. It’s the methodology.
Why This Is Hard to Explain to Clients
Here’s the conversation I’ve had more than once.
Client: “Is the chatbot done?”
Me: “It’s in production and it’s handling conversations well.”
Client: “But is it done?”
And I understand the question. Software, in the traditional sense, has a done state. You spec it, you build it, you test it, you launch it. Done. The ongoing investment is maintenance and feature additions — discrete projects with beginnings and endings.
AI agents don’t have a done state in the same way. They have a functional enough state, and then an ongoing refinement process. The initial build gets you functional enough. The refinement process gets you from functional enough to genuinely good. And “genuinely good” keeps moving as you see more real conversations.
This is a mindset shift that not everyone is ready to make. If you’re evaluating an AI chatbot project expecting to pay for a build and then own a finished product, you’re going to be frustrated. The build is the beginning, not the end.
What you’re actually paying for — when you do this properly — is iteration velocity. How fast can we find the problems? How fast can we fix them? How quickly can we get from “this sometimes works” to “this works reliably”?
That’s the service. Not the version number.
What Makes a Good Iteration Cycle
From what we’ve learned building this thing: the quality of an iteration depends almost entirely on the quality of the feedback going into it.
Vague feedback produces marginal improvements. “The bot feels off sometimes” — I cannot do much with that. “The bot answered a question about property law when the user clearly meant to ask about employment law, here’s the transcript” — now we have something.
The best feedback comes from watching real conversations. Not test conversations where someone on the team plays the role of a user — actual users, with actual needs, with informal language and imperfect phrasing and frustration when things go wrong.
The first few weeks in production are the most valuable testing environment you’ll ever have. The production conversations teach you things no amount of pre-launch testing can. They show you the full range of how real people phrase their needs, the edge cases you didn’t imagine, the moments where the bot’s response was technically correct but humanly wrong.
Watch those conversations. Every single one, in the early stages. Build a log of failure modes. Prioritize by frequency and severity. Iterate in batches: don’t change one thing and re-launch. Change a cluster of related things so you’re making meaningful progress per cycle.
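A failure-mode log does not need to be elaborate. Here is a minimal sketch, assuming each logged failure gets a mode label, a 1 to 5 severity score, and a transcript reference. The field names, the severity scale, and the "count times worst severity" ranking are our illustrative assumptions; any scoring that weighs both frequency and severity would serve.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    mode: str           # short label, e.g. "wrong_practice_area"
    severity: int       # assumed scale: 1 (cosmetic) to 5 (harmful)
    transcript_id: str  # pointer back to the real conversation


def prioritize(failures):
    """Rank failure modes by (occurrences x worst severity), highest first.

    Returns a list of (mode, occurrences, worst_severity) tuples.
    """
    counts = Counter(f.mode for f in failures)
    worst = {}
    for f in failures:
        worst[f.mode] = max(worst.get(f.mode, 0), f.severity)
    # Ties are broken alphabetically so the ordering is stable.
    ranked = sorted(counts, key=lambda m: (-counts[m] * worst[m], m))
    return [(m, counts[m], worst[m]) for m in ranked]


if __name__ == "__main__":
    # Made-up entries for illustration only.
    log = [
        Failure("wrong_practice_area", 4, "t1"),
        Failure("cheerful_non_answer", 3, "t2"),
        Failure("wrong_practice_area", 4, "t3"),
    ]
    for mode, count, severity in prioritize(log):
        print(f"{mode}: seen {count}x, worst severity {severity}")
```

The top few entries of the ranked list become the batch for the next iteration.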
And — this is important — track what you fix. Version numbering exists for a reason. When you can say “version 3.02 fixed the property/employment law confusion and reduced irrelevant-answer rate by X percent,” you have evidence that the iteration process is working. That evidence matters when you’re explaining the investment to the people who sign the invoices.
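The "reduced by X percent" figure can come from something as simple as reviewer labels on a sample of conversations before and after a release. The sketch below assumes each reviewed conversation is marked True if the answer was irrelevant; the function names are hypothetical and the counts in the usage example are invented, not results from this project.

```python
def irrelevant_rate(reviews):
    """Share of reviewed conversations marked irrelevant.

    `reviews` is a list of booleans, True where a reviewer flagged the answer.
    """
    if not reviews:
        raise ValueError("no reviewed conversations")
    return sum(reviews) / len(reviews)


def relative_reduction(before_rate, after_rate):
    """Percent reduction from before_rate to after_rate (negative if worse)."""
    if before_rate == 0:
        raise ValueError("before_rate is zero; nothing to reduce")
    return (before_rate - after_rate) / before_rate * 100


if __name__ == "__main__":
    # Invented sample: 10 reviewed conversations per release.
    before = [True, True] + [False] * 8
    after = [True] + [False] * 9
    change = relative_reduction(irrelevant_rate(before), irrelevant_rate(after))
    print(f"irrelevant-answer rate changed by {change:.0f}% (relative)")
```

With samples this small the number is only indicative, so it helps to report the sample size alongside the percentage.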
Setting Expectations Before You Build
The biggest thing we’ve changed as a result of this project is how we scope AI chatbot work in discovery.
We now spend a significant amount of the scoping conversation on iteration expectations. Not just what the bot will do, but what happens after launch. How do we handle feedback? What does the review process look like? Who watches the conversations? How often do we iterate? What does a successful 3-month post-launch period look like?
Getting these questions answered before the contract is signed means the 34 iterations don’t come as a surprise. They come as evidence that the process is working.
A client who understands that version 3.05 is a good outcome — not a sign that version 1.0 was a failure — is a client you can do genuinely good work with. They’re not measuring success by whether the bot was perfect at launch. They’re measuring it by whether the bot is getting meaningfully better over time.
That reframe changes everything.
The Honest Assessment of Where AI Chatbots Are Right Now
Good at: handling a high volume of consistent, similar queries with reasonable accuracy. Being available at 11pm. Collecting intake information without a human having to do it. Handling the initial qualifying layer before a human takes over.
Not good at: complex, multi-step reasoning across a long conversation. Situations where the stakes are high and getting it wrong has serious consequences. Anything that requires genuine nuance or professional judgment.
For legal services — and for a lot of professional services — the right model is handoff, not replacement. The chatbot handles the front end. A qualified human handles anything that requires real expertise. The chatbot’s job is to make sure the human’s time is spent on the high-value conversations, not on “do you operate in my area?” at 11pm on a Wednesday.
Version 3.05 does that job well. We got there in 34 iterations. If someone had told us at the start that it would take 34 iterations, we probably would have said that sounds like a lot.
It’s not a lot. It’s what good looks like.
