The WhatsApp Chatbot That Took 34 Iterations (And Why That’s Normal)

We're on version 3.05 of a WhatsApp chatbot built for a legal services client — after 34 iterations. This isn't a story about sloppy engineering. It's a story about the fundamental nature of AI conversational agents, and why iteration isn't a sign of failure.

What You Need to Know

The fast version. All killer. No filler.

Why do AI chatbots require so many iterations after launch?

AI conversational agents are non-deterministic — the same input doesn't always produce the same output. Unlike scripted flows where you can enumerate every path, AI chatbots generate responses based on context and training. You can't test exhaustively before launch. Real conversations reveal failure modes that no pre-launch testing can anticipate. Iteration is the methodology, not evidence of failure.

What makes a good iteration cycle for an AI chatbot?

Quality feedback. Vague feedback produces marginal improvements. Watching real conversations produces specific, actionable findings. Prioritize by frequency and severity. Iterate in batches: change a cluster of related things per cycle, not one thing at a time. Track version numbers and what each version fixed so you have evidence the process is working.

What should clients expect after an AI chatbot launches?

An ongoing refinement process, not a finished product. The build gets you to 'functional enough.' Post-launch iteration gets you to 'genuinely good.' What you're paying for is iteration velocity — how fast problems are found and fixed. Scope the post-launch process before signing the contract so the 34 iterations are expected, not a surprise.

We are on version 3.05 of a WhatsApp chatbot we built for a legal services client.

Version 1.0 was a skeleton. It worked in the sense that it didn’t crash. It didn’t work in the sense that it would say things we hadn’t asked it to say, ignore things we needed it to do, and occasionally take conversational turns that made perfect sense to a language model and made no sense to a human trying to get legal help at 11pm on a Wednesday.

Between version 1.0 and version 3.05: 34 iterations.

And before you assume we made a mess of the initial build — we didn’t. The 34 iterations weren’t the product of sloppy engineering. They were the product of the fundamental nature of AI conversational agents.

This is the part nobody tells you when you decide to build a chatbot.

The Assumption That Gets Everyone in Trouble

When most businesses decide they want a chatbot, they’re picturing a scripted flow. A decision tree. If the user says X, the bot says Y. If the user selects option 3, show them the option 3 content. Press 1 for sales, press 2 for support. You’ve been through that menu. You know it.

Scripted flows are deterministic. You can test every path. You know every possible output before you launch. When something goes wrong, it goes wrong in a predictable way and you can patch the specific branch.

AI conversational agents are not scripted flows. They’re language models. They don’t follow a decision tree. They take everything they’ve been given (the system prompt, the conversation history, what they learned in training), estimate how likely each possible continuation is, and sample from those likelihoods rather than always picking the single most probable one. Every conversation is a new inference. The same input does not always produce the same output.

That’s the non-determinism problem. And it’s not a bug. It’s how the technology works.

But it means that testing an AI chatbot is fundamentally different from testing a scripted flow. You can’t enumerate every path. You can’t guarantee behavior through exhaustive testing. What you can do is define the system prompt tightly, test for edge cases relentlessly, and iterate based on what you find in real conversations.

Which is what we did. 34 times.
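The contrast between a scripted flow and a sampled response can be sketched in a few lines of Python. This is a toy illustration, not how our bot is built: the menu, the candidate replies, and their probabilities are all made up, and a real language model samples token by token rather than picking from three canned sentences.

```python
import random

# Scripted flow: a fixed lookup table, so the same input always gives the same reply.
SCRIPTED_MENU = {
    "1": "Connecting you to sales.",
    "2": "Connecting you to support.",
}


def scripted_reply(choice: str) -> str:
    return SCRIPTED_MENU.get(choice, "Sorry, please press 1 or 2.")


# Toy stand-in for a language model (hypothetical replies and weights).
# The point is only that sampling makes the output vary between runs.
CANDIDATE_REPLIES = [
    ("Yes, we cover your area. Can I take a few details?", 0.6),
    ("We do! What kind of legal issue is this about?", 0.3),
    ("Here is a general overview of property law...", 0.1),  # off-scope reply
]


def sampled_reply(rng: random.Random) -> str:
    replies = [text for text, _ in CANDIDATE_REPLIES]
    weights = [weight for _, weight in CANDIDATE_REPLIES]
    return rng.choices(replies, weights=weights, k=1)[0]


def distinct_replies(n_runs: int, seed: int = 0) -> set:
    """Ask the 'same question' n_runs times and collect the different answers."""
    rng = random.Random(seed)
    return {sampled_reply(rng) for _ in range(n_runs)}


if __name__ == "__main__":
    print(scripted_reply("1"))          # identical on every call
    for reply in distinct_replies(200):
        print(reply)                    # more than one distinct reply shows up
```

Even the low-probability off-scope reply will surface eventually if enough people ask. That is why a handful of pre-launch test conversations can look clean while production still turns up surprises.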

What Actually Changed Between Versions

Version 1.0 gave us a system prompt and a basic personality. It could answer questions about the service area, take basic intake information, and handle simple routing. That’s all it was trying to do.

And it was immediately apparent that “trying to do” and “actually doing reliably” are two different things.

The chatbot would occasionally be helpful in ways we hadn’t asked for — answering adjacent legal questions that were outside our scope, being expansive when it should have been concise. It would sometimes be unhelpful in ways we hadn’t anticipated — misinterpreting informal language, or responding to frustration with a cheerful non-answer that made things worse.

Each iteration addressed specific failure modes. Tightened the system prompt. Added constraints. Defined the boundaries more explicitly. Tested again. Found new failure modes. Iterated again.

This is not a process that has a natural endpoint. There is no version of the chatbot that you finish. You ship something that works well enough to be useful, you run it, you watch it, you fix what you find, and you keep going. Version 3.05 works meaningfully better than version 1.0. Version 4 will work meaningfully better than 3.05.

That’s the rhythm of AI development. Iteration isn’t a sign of failure. It’s the methodology.

Why This Is Hard to Explain to Clients

Here’s the conversation I’ve had more than once.

Client: “Is the chatbot done?”

Me: “It’s in production and it’s handling conversations well.”

Client: “But is it done?”

And I understand the question. Software, in the traditional sense, has a done state. You spec it, you build it, you test it, you launch it. Done. The ongoing investment is maintenance and feature additions — discrete projects with beginnings and endings.

AI agents don’t have a done state in the same way. They have a functional enough state, and then an ongoing refinement process. The initial build gets you functional enough. The refinement process gets you from functional enough to genuinely good. And “genuinely good” keeps moving as you see more real conversations.

This is a mindset shift that not everyone is ready to make. If you’re evaluating an AI chatbot project expecting to pay for a build and then own a finished product, you’re going to be frustrated. The build is the beginning, not the end.

What you’re actually paying for — when you do this properly — is iteration velocity. How fast can we find the problems? How fast can we fix them? How quickly can we get from “this sometimes works” to “this works reliably”?

That’s the service. Not the version number.

What Makes a Good Iteration Cycle

From what we’ve learned building this thing: the quality of an iteration depends almost entirely on the quality of the feedback going into it.

Vague feedback produces marginal improvements. “The bot feels off sometimes” — I cannot do much with that. “The bot answered a question about property law when the user clearly meant to ask about employment law, here’s the transcript” — now we have something.

The best feedback comes from watching real conversations. Not test conversations where someone on the team plays the role of a user — actual users, with actual needs, with informal language and imperfect phrasing and frustration when things go wrong.

The first few weeks in production are the most valuable testing environment you’ll ever have. The production conversations teach you things no amount of pre-launch testing can. They show you the full range of how real people phrase their needs, the edge cases you didn’t imagine, the moments where the bot’s response was technically correct but humanly wrong.

Watch those conversations. Every single one, in the early stages. Build a log of failure modes. Prioritize by frequency and severity. Iterate in batches — don’t change one thing and re-launch, change a cluster of related things so you’re making meaningful progress per cycle.
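A failure-mode log like this does not need special tooling. Here is a minimal sketch, assuming a 1-to-3 severity scale and a simple "frequency times worst severity" score. Both are illustrative choices rather than the scoring we actually use, and the log entries are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureObservation:
    conversation_id: str
    failure_mode: str  # short label chosen by whoever reviewed the transcript
    severity: int      # assumed scale: 1 (cosmetic) to 3 (serious)


def prioritize(observations):
    """Rank failure modes by frequency multiplied by worst observed severity."""
    counts = Counter(o.failure_mode for o in observations)
    worst = {}
    for o in observations:
        worst[o.failure_mode] = max(worst.get(o.failure_mode, 0), o.severity)
    scored = [
        (mode, counts[mode], worst[mode], counts[mode] * worst[mode])
        for mode in counts
    ]
    return sorted(scored, key=lambda row: row[3], reverse=True)


# Hypothetical log entries, for illustration only.
LOG = [
    FailureObservation("c-001", "off_scope_answer", 3),
    FailureObservation("c-004", "off_scope_answer", 2),
    FailureObservation("c-009", "off_scope_answer", 3),
    FailureObservation("c-002", "cheerful_non_answer", 2),
    FailureObservation("c-007", "cheerful_non_answer", 2),
    FailureObservation("c-005", "misread_informal_language", 1),
]

if __name__ == "__main__":
    for mode, count, severity, score in prioritize(LOG):
        print(f"{mode}: seen {count}x, worst severity {severity}, score {score}")
```

The top few rows of that ranking are a natural candidate for the next batch of related fixes.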

And — this is important — track what you fix. Version numbering exists for a reason. When you can say “version 3.02 fixed the property/employment law confusion and reduced irrelevant-answer rate by X percent,” you have evidence that the iteration process is working. That evidence matters when you’re explaining the investment to the people who sign the invoices.
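Tracking this can be as simple as one record per version with what it fixed and a couple of counts from reviewed conversations. The sketch below is one way to do it. The field names are assumptions, the versions are labeled "A" and "B" on purpose, and the counts are placeholders rather than figures from this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionRecord:
    version: str
    fixes: tuple                 # plain-language notes on what this version changed
    conversations_reviewed: int
    irrelevant_answers: int      # reviewed conversations flagged as off-target

    @property
    def irrelevant_rate(self) -> float:
        return self.irrelevant_answers / self.conversations_reviewed


def relative_reduction(before: VersionRecord, after: VersionRecord) -> float:
    """Fractional drop in irrelevant-answer rate from one version to the next."""
    return (before.irrelevant_rate - after.irrelevant_rate) / before.irrelevant_rate


# Placeholder data only: hypothetical versions and made-up counts.
VERSION_A = VersionRecord("A", ("baseline",), 200, 20)
VERSION_B = VersionRecord("B", ("tightened scope wording in the system prompt",), 200, 10)

if __name__ == "__main__":
    drop = relative_reduction(VERSION_A, VERSION_B)
    print(f"Irrelevant-answer rate: {VERSION_A.irrelevant_rate:.1%} -> "
          f"{VERSION_B.irrelevant_rate:.1%} ({drop:.0%} relative reduction)")
```

The rate is only as trustworthy as the review behind it, so it helps to count the same way from one version to the next.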

Setting Expectations Before You Build

The biggest thing we’ve changed as a result of this project is how we scope AI chatbot work in discovery.

We now spend a significant amount of the scoping conversation on iteration expectations. Not just what the bot will do, but what happens after launch. How do we handle feedback? What does the review process look like? Who watches the conversations? How often do we iterate? What does a successful 3-month post-launch period look like?

Getting these questions answered before the contract is signed means the 34 iterations don’t come as a surprise. They come as evidence that the process is working.

A client who understands that version 3.05 is a good outcome — not a sign that version 1.0 was a failure — is a client you can do genuinely good work with. They’re not measuring success by whether the bot was perfect at launch. They’re measuring it by whether the bot is getting meaningfully better over time.

That reframe changes everything.

The Honest Assessment of Where AI Chatbots Are Right Now

Good at: handling a high volume of consistent, similar queries with reasonable accuracy. Being available at 11pm. Collecting intake information without a human having to do it. Handling the initial qualifying layer before a human takes over.

Not good at: complex, multi-step reasoning across a long conversation. Situations where the stakes are high and getting it wrong has serious consequences. Anything that requires genuine nuance or professional judgment.

For legal services — and for a lot of professional services — the right model is handoff, not replacement. The chatbot handles the front end. A qualified human handles anything that requires real expertise. The chatbot’s job is to make sure the human’s time is spent on the high-value conversations, not on “do you operate in my area?” at 11pm on a Wednesday.
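In code terms, the handoff model boils down to a routing rule that defaults to the human. This is a deliberately simplified sketch: the topic labels, the frustration flag, and the assumption that something upstream has already classified the message are all hypothetical.

```python
from enum import Enum


class Route(Enum):
    BOT = "bot"
    HUMAN = "human"


# Hypothetical list of topics the bot is allowed to handle on its own.
BOT_TOPICS = {"service_area", "opening_hours", "intake"}


def route(topic: str, user_frustrated: bool = False) -> Route:
    """Keep the bot on the narrow front-end topics and hand off everything else."""
    if user_frustrated or topic not in BOT_TOPICS:
        return Route.HUMAN
    return Route.BOT


if __name__ == "__main__":
    print(route("service_area"))                  # Route.BOT
    print(route("employment_dispute"))            # Route.HUMAN
    print(route("intake", user_frustrated=True))  # Route.HUMAN
```

The design choice worth noting is the default: anything the rule does not explicitly recognize goes to a person, so new kinds of questions fail toward expertise rather than toward a confident guess.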

Version 3.05 does that job well. We got there in 34 iterations. If someone had told us at the start that it would take 34 iterations, we probably would have said that sounds like a lot.

It’s not a lot. It’s what good looks like.

Frequently Asked Questions

How is an AI chatbot different from a scripted decision-tree chatbot?

Scripted chatbots follow deterministic paths — every output is predictable and testable before launch. AI chatbots generate responses dynamically based on context and training. You can't enumerate every possible conversation path. They're more flexible and natural, but require a different testing and iteration approach.

Is 34 iterations on a chatbot too many — does it mean something went wrong?

No. Iteration count is not a quality signal in AI development. It's evidence of a functioning process. Version 3.05 working well is the outcome. The 34 iterations are how you get there. A chatbot launched and never updated is almost certainly not working as well as it should.

What role should a WhatsApp chatbot play in a legal services business?

Triage and handoff, not replacement. The chatbot handles high-volume, consistent queries — availability, intake information, basic routing — so the qualified humans can focus on conversations that require real expertise. The chatbot's job is to make the human's time more valuable, not to replace the human.

How do you set client expectations for AI chatbot projects?

In discovery, before the contract is signed. Discuss what happens after launch: how feedback is collected, who watches conversations, how often iterations happen, what a successful 3-month post-launch period looks like. A client who understands upfront that version 3.05 is a good outcome is a client you can do genuinely good work with.

What are the current limitations of AI chatbots for professional services?

They're not good at complex multi-step reasoning across long conversations, high-stakes situations where errors have serious consequences, or anything requiring genuine professional judgment. For legal, medical, or financial services, the right model is human handoff: the bot qualifies and routes, the professional advises.

How do you track whether an AI chatbot is actually improving over time?

Version numbering with documented changes. When you can say 'version 3.02 fixed the property/employment law confusion and reduced irrelevant-answer rate by X%,' you have evidence the iteration process is working. That evidence matters when explaining the ongoing investment to decision-makers.

Mark Smith

April 20, 2026
