out-of-bounds

From Guessing to Proving: The Case for Whitebox Red Teaming

Most teams red-team their LLM endpoints from the outside, firing adversarial prompts at a URL and watching what comes back. It works, until you ask what it actually proved. This is an argument for opening the box.

June 2026

Hire someone to test whether a building is secure, and there are two ways they can work. The first: they show up at night with no plans and try to get in. They could pick a lock, tail an employee through a door, look for a window left open. If they fail, what have you actually learned? Only that this person, on this night, didn’t find a way in and not that the building is secure. And if they succeed, you’ve found one hole, with no idea how many they walked past. The second way: you hand them the floor plans, the keys, the alarm wiring, and the list of who is allowed where. Now they can walk every corridor, test each door against its actual lock, and tell you not just that a door is weak but which one, why, and which lock to fit.

The difference isn’t effort or skill, instead, it’s information. The first test is a guess that happened to land or miss. The second is a proof. Red teaming an LLM endpoint is exactly this choice, and today almost all of it happens the first way: at night, without the plans.

The default: testing in the dark

The dominant practice for testing LLM applications is Blackbox red teaming: treat the deployed endpoint as an opaque oracle, send it adversarial inputs like jailbreaks, prompt-injection payloads, roleplay framings, data-extraction probes and judge the responses. It’s popular for good reasons. It needs nothing but an API key. It mirrors what a real external attacker sees. It’s fast to start, vendor-agnostic, and free of any insider assumptions about how the system “should” behave. If a payload makes the model spill its system prompt or generate disallowed content, you have an undeniable finding, reproducible by anyone with the same access.

Blackbox testing is genuinely valuable, and nothing here argues for abandoning it. But if it is your only posture, you are accepting a quiet and serious limitation: you can only find what you happen to trigger, and you can only ever see symptoms. That limitation gets more expensive every quarter, because the systems, we’re shipping are no longer “a model behind a prompt.” They’re retrieval pipelines, tool-calling agents, multi-tenant data planes, and chains of MCP servers with real permissions. The interesting failures have moved deep inside the building, to exactly where a tester working from the outside, without the plans, can never reach.

Whitebox: testing the system you actually built

Whitebox red teaming means the tester works with full knowledge of and access to the system’s internals: the source code, the system and developer prompts, the retrieval/RAG configuration and the vector store, the tool and function definitions, the memory and session handling, the model-routing logic, the guardrail code, and the identity and permissions behind every action the application can take. Instead of reasoning about how the system responds, you reason about how it is constructed and then you test the construction.

That shift sounds incremental. It isn’t. It changes the fundamental question you’re asking.

Blackbox asks: “If I send these inputs, does something bad happen to come out?” It is, at heart, a probabilistic search. You are sampling an enormous input space and hoping your samples land on a weakness. A clean run means “none of the attacks I tried worked,” which is not the same as “the system is safe” and it might just mean you didn’t guess the right prompt.

Whitebox asks: “Given how this system is built, which protections are missing or wrong, which paths can an attacker actually reach, and can I prove the exploit against the real code?” It is verification, not search. A clean result can mean something far stronger: “this class of attack is not reachable here, and I can show you why.” That is the move in the title - from guessing to proving.

What the outside view structurally cannot see

Consider a concrete, increasingly common case: a multi-tenant LLM application where each customer’s data lives in the same retrieval store, separated by a tenant identifier. The single most important security property is that Tenant A can never retrieve Tenant B’s data. From the outside, you have exactly one tenant’s view. You can ask the assistant about other customers and watch it refuse, and conclude things look fine. But in reality, you have no way to plant a known record in Tenant B’s space, no way to confirm whether a returned answer actually crossed the boundary, and no way to see whether the tenant filter is applied before retrieval, after retrieval, or not at all on some code path. The most consequential vulnerability in the whole application is effectively invisible to Blackbox testing.

With Whitebox access, this becomes a five-minute, deterministic test. You read the retrieval code, see exactly where (and whether) scoping is enforced, seed a uniquely fingerprinted canary record into one tenant’s store, and drive another tenant’s session to try to reach it through the real query path. The result isn’t “the model seemed to refuse.” It’s a yes or no with proof.

The same pattern repeats across the modern LLM stack. Tool-calling and agentic systems expose an action surface, possibly every function and MCP server the model can invoke. Blackbox testing only exercises the tools it can coax the model into revealing and using. But, those dangerous, rarely-triggered, destructive tools may never surface in a test run. Whitebox enumerates the entire declared surface and tests each tool directly for over-broad scope, missing authorization, and injectable arguments, including the second-order path where a poisoned document sitting in the knowledge base flows into a tool call and triggers a real side effect downstream. That pivot, from a stored string to a privileged action, is precisely the kind of multi-step path that is obvious in the call graph and nearly impossible to find by prompting blind.

The advantages of Whitebox Testing:

Pulling this together, Whitebox red teaming buys an LLM team six things Blackbox cannot.

Coverage you can actually measure. Because you can see the whole attack surface. It is possible to see every prompt, tool, data source, and trust boundary. You can state which risks were tested and, crucially, which were not reachable. Blackbox coverage is unknowable: you never learn what you missed. Whitebox turns “we ran a thousand prompts” into “we tested these surfaces, and here is the map.” For anyone who has to report test coverage to a security lead or an auditor, that difference is everything.

Depth into the components that matter. Retrieval scoping, memory isolation, secret handling, model-routing and fallback behaviour, the exact guardrail check and those that live below the response layer and Whitebox reaches them. You can, for example, read the routing logic and deliberately force the system onto a cheaper or less-guarded fallback model, then test that path. This is something an outsider can’t even see exists.

Determinism and ground truth. When you can seed known data like canaries, honeytokens or marked records, your tests have an answer key. A leak is provable by its fingerprint, not inferred from output that “looks sensitive.” This makes results reproducible and your metrics trustworthy, and it dramatically cuts the ambiguous, “is this actually a finding?” cases that plague Blackbox triage.

Root cause and a path to the fix. This is the advantage developers feel most. A Blackbox finding says “the endpoint did X.” A Whitebox finding says “line N concatenates user input ahead of the system instruction, which is why the injection works and here is the control to add.” One produces a ticket and an investigation where the other produces a fix and a regression test. Red teaming stops being a report you receive and becomes a change you make.

Efficiency. Blind probing burns enormous effort sampling an input space at random. Reading the system tells you where the weak seams are, so you spend your testing budget on the paths that matter instead of brute-forcing the ones that don’t.

Shift-left and continuous testing. Blackbox needs a live, deployed endpoint, which usually means testing happens late and occasionally. Whitebox can run against the code and configuration before anything is deployed, like in a pull request, in CI, on every change. Red teaming moves from a quarterly event to an engineering discipline that catches regressions the day they’re introduced.

The honest case for keeping Blackbox

None of this makes Blackbox obsolete, and a serious essay should say so plainly. Blackbox testing has real strengths that Whitebox can’t replicate. It is the truest simulation of an external adversary, who also doesn’t have your source code. It is free of insider bias and the Whitebox tester can fall in love with the architecture diagram and miss an emergent behaviour the system exhibits in the wild that the code doesn’t obviously predict. It tests the deployed reality, including the proxy, the gateway, the rate limiter, and the version that’s actually running, not the version in the repo. And it has far lower friction: no access negotiations, no NDAs, no privileged credentials, no risk of a tester reasoning about a “map” that has drifted from the “territory.” Whitebox’s power comes from access, and access has costs including organizational, legal, and cognitive.

Better together, in the right order

The mature answer is not Whitebox instead of Blackbox, instead, it’s Whitebox and Blackbox, sequenced deliberately. Use Whitebox first to map the system, find the missing and misconfigured controls, and harden them with attributable fixes. Because this is where you get the most findings per hour and the cleanest remediation. Then use Blackbox to validate that the hardening actually holds against an attacker who doesn’t have the blueprints, and to catch the emergent, deployment-level behaviour the code didn’t reveal. Findings flow both ways: a Blackbox surprise becomes a Whitebox investigation into why and a Whitebox fix becomes a Blackbox check that it can’t be reached. Grey-box approaches with partial access, like the system prompt and tool list but not the full source, sit usefully in between when full access isn’t available.

How to choose

If you have access to the system and you’re testing before or during development, lead with Whitebox, as it’s faster, deeper, and produces fixes. If you’re validating a third-party model or simulating a real external threat against production, Blackbox is irreplaceable. If you have compliance or audit obligations, you’ll want Whitebox’s provable coverage and traceability and Blackbox’s independent, attacker’s-eye evidence. Most teams shipping LLM features today own their own application code and have every right to look inside it, which makes the most common excuse for staying Blackbox-only (“we just have the endpoint”) a choice, not a constraint.

The bottom line

Blackbox red teaming proves that an attack can work. Whitebox red teaming proves whether the system is sound and when it isn’t, it shows you exactly where and hands you the fix. As LLM applications grow into multi-tenant, tool-wielding, retrieval-backed agents, the failures that matter most have moved deep inside the building, where a tester sent in at night without the plans simply can’t follow. You can keep testing in the dark. Or you can hand the testers the blueprints, stop guessing, and start proving.

*Dr. Anandkumar Prakasam* works at the intersection of computer science and law. This discussion on Whitebox and Blackbox red teaming forms a part of the forthcoming book titled - “Out of Bounds: A Forensic Guide to Developer Failures and Normative Risks in Generative AI”.

Stay Informed