
Designing an Evaluation Loop for AI Coaching Products

A lightweight evaluation model to keep AI coaching experiences useful, safe, and measurable after launch.

  • Evaluation
  • Coaching
  • Metrics

AI coaching products usually fail quietly: the model sounds supportive, but user behavior does not improve.

That happens when teams evaluate response quality in isolation. “Helpful tone” and “coherent output” are necessary, but they are not sufficient. A coaching product only works if it changes what users do after the message.

The loop

  1. Evaluate generation quality.
  2. Evaluate user action after seeing the response.
  3. Evaluate repeated use over time.
  4. Feed failures back into prompt and UX updates.

This keeps evaluation tied to outcomes rather than aesthetics.
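The loop above can be sketched as a funnel evaluation over logged sessions. This is a minimal illustration, not a prescribed implementation: the session representation and the three check functions (quality, action, return) are assumptions standing in for whatever instrumentation your product actually has.

```python
from dataclasses import dataclass, field

@dataclass
class LoopResult:
    """One pass through the evaluation loop over a batch of coaching sessions."""
    generation_quality: float  # share of sessions passing generation-quality review
    action_rate: float         # share of sessions where the user acted on the response
    repeat_use_rate: float     # share of sessions where the user came back
    failures: list = field(default_factory=list)  # (stage, session) pairs to triage

def run_loop(sessions, quality_check, action_check, return_check):
    """Evaluate each stage in order; a session that fails a stage drops out,
    and its failure is collected so it can feed prompt/UX updates (step 4)."""
    failures, q, a, r = [], 0, 0, 0
    for s in sessions:
        if not quality_check(s):
            failures.append(("generation", s))
            continue
        q += 1
        if not action_check(s):
            failures.append(("action", s))
            continue
        a += 1
        if return_check(s):
            r += 1
        else:
            failures.append(("retention", s))
    n = len(sessions)
    return LoopResult(q / n, a / n, r / n, failures)
```

Treating later stages as conditional on earlier ones is deliberate: a user cannot act on a response they never understood, so counting stages independently would hide where the loop actually breaks.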

Scorecard example

Dimension        | Signal                               | Threshold
Response clarity | User understood next step            | > 85%
Actionability    | User completes suggested step        | > 60%
Trust            | User reports response as useful/safe | > 80%
Retention        | User returns within 7 days           | > 40%

Thresholds vary by product stage, but explicit thresholds matter. Otherwise, teams drift into subjective quality arguments.
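One way to make thresholds explicit rather than negotiable is to encode the scorecard directly. The dimension names and values below are taken from the table; the function name and metric keys are illustrative, not part of any real API.

```python
# Scorecard thresholds from the table above, as fractions.
THRESHOLDS = {
    "response_clarity": 0.85,  # user understood next step
    "actionability": 0.60,     # user completes suggested step
    "trust": 0.80,             # user reports response as useful/safe
    "retention_7d": 0.40,      # user returns within 7 days
}

def evaluate_scorecard(metrics, thresholds=THRESHOLDS):
    """Return the dimensions that fall below their threshold,
    as {dimension: (measured, required)}. Missing metrics count as failing."""
    return {
        dim: (metrics.get(dim, 0.0), floor)
        for dim, floor in thresholds.items()
        if metrics.get(dim, 0.0) < floor
    }
```

A run like `evaluate_scorecard({"response_clarity": 0.90, "actionability": 0.55, "trust": 0.82, "retention_7d": 0.41})` flags only actionability, which tells the team exactly which argument to have.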

Where teams misdiagnose

Many “model quality” complaints are actually product design failures:

  • unclear context collection before generation
  • no affordance for editing user goals
  • weak handoff from advice to concrete action
  • no safe fallback when confidence is low

Prompt tuning alone cannot fix those gaps.

Implementation pattern

Instrument the coaching loop as a sequence, not a snapshot:

  • context captured
  • recommendation delivered
  • action accepted/rejected
  • follow-through completed or abandoned

This makes weak links visible and actionable.
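The sequence above can be instrumented as a simple stage funnel. The stage names mirror the list; the event format (one ordered list of reached stages per session) is an assumption about how your analytics pipeline logs sessions.

```python
from collections import Counter

# Hypothetical funnel stages, in the order a session should pass through them.
STAGES = [
    "context_captured",
    "recommendation_delivered",
    "action_accepted",
    "follow_through_completed",
]

def funnel_counts(session_events):
    """Count how many sessions reached each stage.

    session_events: iterable of per-session lists of reached stage names.
    Returns [(stage, count), ...] in funnel order, so the largest drop
    between adjacent stages points at the weakest link.
    """
    reached = Counter()
    for events in session_events:
        for stage in STAGES:
            if stage in events:
                reached[stage] += 1
    return [(stage, reached[stage]) for stage in STAGES]
```

Reading the output as a snapshot (one pass rate per stage) would hide the drop-off; reading it as a sequence shows, for example, that recommendations are delivered fine but follow-through is where sessions die.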

Bottom line

AI coaching should be treated as behavior infrastructure, not conversational theater.

If your evaluation loop measures only response fluency, you will optimize for persuasive copy. If it measures behavioral outcomes, you can improve real user progress while keeping safety and trust visible.


Best,
Oli
July 9, 2025