Oli Cheng · Systems
Designing an Evaluation Loop for AI Coaching Products
A lightweight evaluation model to keep AI coaching experiences useful, safe, and measurable after launch.
- Evaluation
- Coaching
- Metrics
AI coaching products fail quietly when teams evaluate model responses only in isolation.
You need an evaluation loop that measures user outcomes, not just output quality.
The loop
- Evaluate generation quality
- Evaluate user action after seeing the response
- Evaluate repeated use over time
- Feed failures back into prompt and UX updates
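The four stages above can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: every function, field name, and placeholder check here is an assumption, and a real product would source these signals from its own analytics events.

```python
def evaluate_generation(response: str) -> bool:
    """Stage 1: offline response-quality check (e.g. a rubric or judge model).
    Placeholder: treats any non-empty response as passing."""
    return len(response.strip()) > 0

def evaluate_action(event: dict) -> bool:
    """Stage 2: did the user act on the suggested step? (hypothetical field)"""
    return event.get("suggested_step_completed", False)

def evaluate_repeat_use(event: dict) -> bool:
    """Stage 3: did the user return within the retention window? (hypothetical field)"""
    return event.get("returned_within_7_days", False)

def collect_failures(events: list[dict]) -> list[dict]:
    """Stage 4: interactions that failed a user-outcome stage become the
    backlog that drives prompt and UX updates."""
    return [
        e for e in events
        if not (evaluate_action(e) and evaluate_repeat_use(e))
    ]
```

The point of the sketch is that stages 2-4 operate on user behavior, not model output, so the failure backlog naturally surfaces UX problems alongside prompt problems.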
Scorecard example
| Dimension | Signal | Threshold |
|---|---|---|
| Response clarity | User understood next step | > 85% |
| Actionability | User completes suggested step | > 60% |
| Trust | User reports response as useful/safe | > 80% |
| Retention | User returns within 7 days | > 40% |
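A scorecard like the one above is easy to automate as a pass/fail gate per dimension. A minimal sketch, assuming the thresholds from the table; the metric keys are illustrative names, not a real schema:

```python
# Thresholds from the scorecard table, expressed as fractions.
THRESHOLDS = {
    "response_clarity": 0.85,
    "actionability": 0.60,
    "trust": 0.80,
    "retention": 0.40,
}

def score(metrics: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail per dimension; a missing metric counts as failing."""
    return {dim: metrics.get(dim, 0.0) > t for dim, t in THRESHOLDS.items()}

# Example: a release where actionability misses the bar.
result = score({
    "response_clarity": 0.90,
    "actionability": 0.55,
    "trust": 0.82,
    "retention": 0.45,
})
```

Running the gate per release (or per prompt variant) turns the scorecard from a one-off review into a regression check.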
Implementation note
Do not isolate prompt tuning from product design. Many “model issues” are actually UX issues: unclear context, poor affordances, or no recovery path.