When the answer is always 'yes, automate it,' you stop noticing the cases where the right answer is 'no, instrument it' or 'just write a Python script.' Most failed AI projects we audit started with a workflow that should have been left alone or replaced with rules, not with an LLM.
The cost of a wrong yes is not zero. It is six months of build, an ongoing API bill, a debug surface no one wanted, and a precedent that says 'this team ships things that do not need to ship.' The cost of a thoughtful no is one meeting.
Five questions. Pick the closest answer; you can always Back. The verdict at the end is one of seven, with the rationale and next steps built in.
{
"start": "freq",
"nodes": {
"freq": {
"type": "choice",
"question": "How often does this workflow run?",
"hint": "Frequency drives whether automation has a payback period at all.",
"options": [
{"label": "Multiple times a day", "next": "judgment"},
{"label": "Daily or weekly", "next": "judgment"},
{"label": "Monthly or rarely", "next": "result_low_volume"}
]
},
"judgment": {
"type": "choice",
"question": "How much human judgment does each instance need?",
"hint": "Same shape every time = automation candidate. Heavy judgment = humans + better tools.",
"options": [
{"label": "Same shape every time, just tedious", "next": "rule"},
{"label": "Varied, but pattern-recognizable", "next": "stakes"},
{"label": "Heavy human judgment per case", "next": "result_keep_human"}
]
},
"rule": {
"type": "choice",
"question": "Could a Python script with a few rules handle this today?",
"hint": "If yes, automate it without an LLM. Rules are cheaper, faster, and debuggable.",
"options": [
{"label": "Yes, just rules", "next": "result_rule_based"},
{"label": "No, the variance is too wide", "next": "stakes"}
]
},
"stakes": {
"type": "choice",
"question": "What happens when the system gets it wrong?",
"hint": "An LLM will get some cases wrong. Wrong needs to be recoverable.",
"options": [
{"label": "Mild annoyance, easy to fix", "next": "data"},
{"label": "Customer-visible, costs money", "next": "result_review"},
{"label": "Regulated, irreversible, or safety-critical", "next": "result_keep_human"}
]
},
"data": {
"type": "choice",
"question": "Do you have eval examples for this task?",
"hint": "If you cannot test it, you cannot ship it.",
"options": [
{"label": "Yes, 30 or more examples", "next": "result_yes"},
{"label": "Some, fewer than 30", "next": "result_yes_with_evals"},
{"label": "None yet", "next": "result_build_evals"}
]
},
"result_low_volume": {
"type": "result",
"tone": "negative",
"verdict": "Not yet. The math does not work.",
"rationale": "Low frequency means automation cost rarely pays back. Build it manually for now and revisit when the volume crosses daily.",
"next_steps": [
"Document the workflow so it is portable when volume grows.",
"Set a tripwire: if it crosses N runs per week, revisit this question.",
"Spend the freed time on something with a higher payback ratio."
]
},
"result_keep_human": {
"type": "result",
"tone": "caution",
"verdict": "Keep it human. Instrument around it.",
"rationale": "High judgment or high stakes is exactly where LLMs fail in expensive ways. Do not automate the call. Automate the surrounding tasks so the human's decisions are easier and faster.",
"next_steps": [
"Build dashboards or briefing docs that surface the inputs to the decision.",
"Use AI for summaries, retrieval, and drafting. Not for the final call.",
"Revisit only after eval coverage exists for the entire decision space."
]
},
"result_rule_based": {
"type": "result",
"tone": "positive",
"verdict": "Automate it. No LLM needed.",
"rationale": "If a few rules handle it, an LLM is the wrong tool. Rules are cheaper, deterministic, debuggable, and orders of magnitude faster.",
"next_steps": [
"Sketch the rules in pseudo-code first.",
"Decide where the rules live: cron job, webhook handler, scheduled script.",
"Add a fallback path for the cases the rules miss."
],
"cta": {"label": "Take the AI Readiness Quiz", "href": "../tools/ai-readiness-quiz.html"}
},
"result_review": {
"type": "result",
"tone": "caution",
"verdict": "Automate it. With a human in the loop.",
"rationale": "Customer-visible mistakes need a review step. The LLM drafts, a human approves. You still get the speed; you absorb the variance.",
"next_steps": [
"Define the approval interface (Slack thread, queue, dashboard).",
"Set a target rate for human-required overrides; iterate prompts to lower it.",
"Track approve/reject patterns as fresh eval data."
]
},
"result_build_evals": {
"type": "result",
"tone": "caution",
"verdict": "Build evals first. Then automate.",
"rationale": "Without eval coverage you have no way to know if a prompt change made things better or worse. Shipping LLM features without evals is shipping a black box. Pause for a week.",
"next_steps": [
"Pull 30-50 historical examples of the task with the correct output.",
"Build a simple grading harness (exact match, structured rubric, or LLM-as-judge).",
"Then come back to this question."
]
},
"result_yes_with_evals": {
"type": "result",
"tone": "positive",
"verdict": "Automate it. Expand evals as you go.",
"rationale": "Less than 30 examples is enough to start, not enough to ship at full scale. Build the workflow on the partial evals and let production traffic feed the eval set.",
"next_steps": [
"Build the workflow on the existing evals.",
"Capture every production input + output as a candidate eval.",
"Block scaling on hitting 100+ evals with stable scores."
]
},
"result_yes": {
"type": "result",
"tone": "positive",
"verdict": "Strong candidate. Ship it.",
"rationale": "High frequency, pattern-recognizable judgment, recoverable stakes, eval data ready. This is the canonical LLM workflow target.",
"next_steps": [
"Pick the model. Start with Sonnet 4.6, try Haiku 4.5 for cost-sensitive paths.",
"Build the prompt against your evals.",
"Ship to a small slice of traffic, measure, expand."
],
"cta": {"label": "Take the AI Readiness Quiz", "href": "../tools/ai-readiness-quiz.html"}
}
}
}
How you arrived at the question matters more than the answer. Two people can ask 'should we automate this?' and mean opposite things. Quick self-check:
Neither end of this is right or wrong. But knowing your bias before you run the tree saves you from rationalizing toward a verdict you already had.
Even after the tree gives you a verdict, here are the three signals we see most often that the verdict is wrong, regardless of which way it leaned:
The tree is a starting point, not a verdict generator. The real value is forcing the conversation to happen out loud, in front of a single concrete prompt, instead of in the swirl of a roadmap meeting.