The Sandbox Problem

by Adrian Johnstone, Dan Arnison, 6/17/26

There's a detail in the Summer Yue story that deserves more attention than it got. It reminds me of a problem that's as old as software itself: the halting problem.

While the halting problem is usually framed as a theoretical limitation in computability theory, it carries a practical implication that software engineers have lived with ever since: testing can only tell you what a program will do under the specific conditions you tested.

Even as I write this, it occurs to me that it’s not a perfect analogy. LLMs are stochastic systems; the halting problem is fundamentally a question of undecidability. But still, the halting problem is an expression of a deeper principle: There are classes of questions about complex systems that cannot be answered without running the system itself.

In Summer’s case, before she pointed OpenClaw at her real inbox, she tested it in a sandbox environment. It worked well, and Summer’s confidence in the tool grew. Then she pointed it at the real thing, and it deleted hundreds of emails.

The failure wasn't that she skipped testing. She tested carefully. The failure was that the sandbox couldn't replicate the conditions that caused the real environment to break. And those conditions were not easy to foresee.

Her real inbox was larger and more complex, and it triggered a technical behaviour, context window compaction, that the agent never encountered in the sandbox environment. The agent that passed every test turned out to be a different animal entirely when the stakes were real.
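To make the mechanism concrete, here is a minimal sketch of how a size-triggered behaviour can stay hidden during small-scale testing. It is a toy model, not OpenClaw's actual implementation: the word-count "tokenizer", the budget of 50, the drop-the-oldest compaction policy, and the standing instruction itself are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical values and names throughout; nothing here comes from OpenClaw.
CONTEXT_BUDGET = 50  # assumed context budget, counted in whitespace-separated words


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per word."""
    return len(text.split())


@dataclass
class ToyContext:
    budget: int = CONTEXT_BUDGET
    messages: list[str] = field(default_factory=list)
    compactions: int = 0

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Compaction runs only once the budget is exceeded, so a small
        # test workload never exercises this branch at all.
        while self.total_tokens() > self.budget and len(self.messages) > 1:
            self.messages.pop(0)  # assumed policy: discard the oldest message
            self.compactions += 1


def run(inbox_size: int) -> ToyContext:
    ctx = ToyContext()
    # Hypothetical standing instruction, added before any email is read.
    ctx.add("RULE: ask before deleting any email")
    for i in range(inbox_size):
        ctx.add(f"email {i}: quarterly statement attached for review")
    return ctx


def rule_present(ctx: ToyContext) -> bool:
    return any(m.startswith("RULE") for m in ctx.messages)


sandbox = run(inbox_size=3)       # small test inbox
production = run(inbox_size=300)  # realistic inbox

print("sandbox:   ", sandbox.compactions, "compactions, rule kept:", rule_present(sandbox))
print("production:", production.compactions, "compactions, rule kept:", rule_present(production))
```

In this toy, the three-email sandbox never reaches the budget, so compaction never runs and the instruction stays in context. The 300-email run crosses the budget, compaction discards the oldest message first, and the instruction is gone without any error being raised. Real compaction strategies are more sophisticated than dropping the oldest message, but the structural point is the same: a code path that only activates at scale cannot be observed in a test that never reaches that scale.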

We’re calling this the sandbox problem. In wealth management, it has very specific implications.

Testing in Miniature Doesn't Validate Performance at Scale

The instinct when deploying any new technology is to start small. Run a pilot. Test with a subset of data, a single team, or a simplified version of the real workflow. That instinct is sensible — but with AI agents, it carries different risks than those that exist with traditional software.

Traditional software is largely deterministic. If it works correctly on a small dataset, it will generally work correctly on a large one, because the same code paths run either way.

AI agents don't work this way. They're sensitive to the volume, variety, and complexity of the context in which they operate. Worse, the failure modes aren't always obvious. The agent doesn't crash. It doesn't throw an error. It just quietly starts doing something different from what you intended — and by the time you notice, the damage may already be done.

We Can’t Treat Wealth Management Like a Sandbox

Client portfolios are complex. Relationships span decades. Data includes sensitive personal information, long communication histories, regulatory documents, and financial records that carry real consequences if handled incorrectly. There is no simplified version of this environment that meaningfully replicates it.

That means the gap between "works in testing" and "works in production" can be substantial — and the cost of discovering the sandbox problem the hard way is measured in direct financial harm.

This isn't an argument against deploying AI. It's an argument for deploying it with clear eyes. It’s an argument for moving intelligently, and for partnering with a team that understands the stakes.

Read more by Adrian Johnstone and Don Arnison:

What If It Had Been a Client's Portfolio?

Adrian Johnstone is CEO of Practifi, and Don Arnison is the firm’s chief architect.

