Poison Pill: How Hidden Context Poisoning Actually Works

while (world) { observe(); infer(); hallucinate?(); self-correct(); } while (world) { observe(); infer(); hallucinate?(); self-correct(); } while (world) { observe(); infer(); hallucinate?(); self-correct(); }

001101 010011 sigil::bagua seed::entropy trace::mechanical awe::human 001101 010011 sigil::bagua seed::entropy trace::mechanical awe::human 001101 010011 sigil::bagua seed::entropy trace::mechanical awe::human

temperature=0.92 top_p=0.95 sampler=stochastic runtime=deterministic user=astonished temperature=0.92 top_p=0.95 sampler=stochastic runtime=deterministic user=astonished temperature=0.92 top_p=0.95 sampler=stochastic runtime=deterministic user=astonished

☰☱☲☳☴☵☶☷ oracle != truth ritual == interface meaning <- interpretation ☰☱☲☳☴☵☶☷ oracle != truth ritual == interface meaning <- interpretation ☰☱☲☳☴☵☶☷ oracle != truth ritual == interface meaning <- interpretation

seed = hash(question + state); noise = sample(temperature); pattern = deterministic(seed, noise); if (human_can_track === false) mark("mystic"); return explainability_gap(pattern); seed = hash(question + state); noise = sample(temperature); pattern = deterministic(seed, noise); if (human_can_track === false) mark("mystic"); return explainability_gap(pattern);

## Not a person, still persuasive - interface implies intention - language implies confidence - user infers agency => design for interpretability ## Not a person, still persuasive - interface implies intention - language implies confidence - user infers agency => design for interpretability

cast.bagua = pickTrigrams(seed); cast.moonBlocks = deriveHexagram(seed); cast.fortuneSticks = burnModel(intensity); cast.scapula = generateCracks(seed); return readable_fiction(cast); cast.bagua = pickTrigrams(seed); cast.moonBlocks = deriveHexagram(seed); cast.fortuneSticks = burnModel(intensity); cast.scapula = generateCracks(seed); return readable_fiction(cast);

{ "observe": true, "decide": constrained, "act": reversible, "measure": behavior, "learn": weekly } { "observe": true, "decide": constrained, "act": reversible, "measure": behavior, "learn": weekly }

def perceivable_randomness(system): return complexity(system) > attention_budget if perceivable_randomness(llm): user.labels_output = "fate" def perceivable_randomness(system): return complexity(system) > attention_budget if perceivable_randomness(llm): user.labels_output = "fate"

[trace] t=02:13 system murmurs [trace] tokens fall like ash [trace] certainty simulated [trace] mechanism remains [trace] human names it chance [trace] t=02:13 system murmurs [trace] tokens fall like ash [trace] certainty simulated [trace] mechanism remains [trace] human names it chance

Most teams still evaluate AI prompts the way humans read documents: visible text only.

That assumption breaks quickly in production.

A model does not just “read what we read.” It parses token sequences and metadata channels. If those channels carry extra instructions, the model can follow them even when a human reviewer sees nothing suspicious.

That gap is why I built Poison Pill.

The core mismatch

In normal product review, people ask:

Does this copy look clear?
Does this UI feel trustworthy?
Does this output look correct?

For AI safety, those are necessary but incomplete. You also need to ask:

What exact bytes/tokens does the model ingest?
Are there hidden channels in that text path?
Could those channels override system intent?

Humans review meaning. Models execute structure.

When structure and meaning diverge, that is your attack surface.

What Poison Pill demonstrates

The demo separates one message into two layers:

Human-visible layer: normal text that looks harmless.
Machine-visible layer: hidden payload inserted via channel tricks.

Then it shows both outputs side by side:

what a person thinks the message says
what a parser can extract from the same content

This is not hypothetical. The mechanisms are simple and cheap:

zero-width characters appended or interleaved
hidden comment payloads in HTML
hybrid payloads that survive copy/paste and rendering transforms

The “magic” is not intelligence. It is encoding.

Why this matters for product teams

Most prompt-injection discussion lives at the model layer. But the bugs usually start in product surfaces:

rich text inputs
imported documents
CMS content
copied snippets between tools
automated agent-to-agent handoff messages

If your app allows text to flow between systems, you are already in the context-poisoning game.

The right response is not panic. It is instrumentation and constraints.

Defensive defaults that actually help

In practice, I use a simple posture:

Normalize and sanitize text before model entry.
Expose machine-view previews in risky workflows.
Tag or reject hidden-channel input by policy.
Separate untrusted context from instruction channels.
Log raw payloads for incident review.

This makes systems harder to trick and easier to debug.

Why I care about this

I build user-facing AI products, so I have to hold both truths at once:

AI can massively increase delivery velocity.
AI can also be mis-steered through fragile context boundaries.

Knowing how to leverage and exploit a system are the same skill viewed from opposite sides. Product quality comes from using that skill responsibly: design for utility, design for abuse resistance, and make failure modes legible.

That is the thesis behind Poison Pill.

Build for the model you actually have, not the one humans imagine.

Best,
Oli
March 10, 2026