The Accountability Gap

December 2025 · 8 min read · AI Infrastructure

Last month, a returns agent at an e-commerce company approved a $12,000 refund it shouldn't have, and the strange thing was that nothing looked wrong. The logs showed no errors, the model returned a valid response, every validation check passed. The problem wasn't that the agent failed. It was that no one could explain why it made that call in the first place. This is becoming the new failure mode for AI systems: not crashes, not exceptions, but decisions that technically work and yet can't be justified to anyone after the fact.

Every serious software company is eventually going to run two backends, and I think most people building in this space haven't fully internalized what that means. The first backend is the one we all know: request comes in, code runs, response goes out. We've spent fifty years building tooling for it, compilers, databases, observability, IAM, all of it resting on the same assumption: humans make decisions, encode them in code, and machines execute them exactly as instructed.

The second backend handles reasoning, and that's where everything we've built starts to break down. It doesn't follow instructions in the traditional sense. It makes judgment calls. The same input can produce different outputs depending on context, and there's no stack trace that explains why the model picked one path over another, because the reasoning itself is the product, not just a means to an output.

Authority Without Accountability

When you hand off a decision to a person, authority and accountability travel together. You can't really use one without bearing the other. This has been true as long as organizations have existed, and most of our management structures, legal frameworks, and audit practices are built on this assumption without even thinking about it.

AI breaks this arrangement in a way that I don't think we've fully grappled with yet. These agents can approve loans, route support tickets, authorize refunds, flag transactions. They process more calls in an hour than a human team handles in a month. But they can't be held responsible in any meaningful sense, because responsibility requires something they don't have: understanding, intention, the capacity to have chosen otherwise.

This is the structural problem we're dealing with: we're building agents that make judgment calls without being able to answer for those calls. Authority gets passed down but accountability stays behind, distributed across the humans who trained, deployed, and supervised the system.

What we actually need are chains of provenance that trace every call back to the humans who authorized it, audit records that show not just what was decided but why, and identity systems that make handoffs explicit and verifiable.
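As a sketch of what one link in that chain might look like, here is a minimal Python version. Every field name, the hashing scheme, and the example identifiers are illustrative assumptions, not a proposed standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class ProvenanceRecord:
    """One link in a chain tracing an agent's action back to a human grant of authority."""
    actor: str            # who acted, e.g. an agent or a person
    action: str           # what was done
    authorized_by: str    # the human or policy that delegated this authority
    rationale: str        # why, stated at the time of the decision
    parent_hash: str = "" # hash of the previous record, chaining the handoffs
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def digest(self) -> str:
        """Content hash so downstream records can reference this one verifiably."""
        payload = json.dumps(self.__dict__, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

# A human grants authority; the agent's later action references that grant.
grant = ProvenanceRecord(
    actor="ops-lead@example.com",
    action="delegate.refunds",
    authorized_by="finance-policy",
    rationale="Refunds under the policy cap may be auto-approved",
)
action = ProvenanceRecord(
    actor="returns-agent",
    action="refund.approve",
    authorized_by="ops-lead@example.com",
    rationale="Item returned unopened within the return window",
    parent_hash=grant.digest(),
)
```

Walking the `parent_hash` chain from any action answers the question an auditor actually asks: which human, under which policy, handed this agent the authority it just exercised.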

The Deliberation Record

When a doctor makes a diagnosis, the chart doesn't just say "prescribed X." It captures the thought process: what symptoms pointed where, which tests ruled out what, why one treatment was chosen over another. The record is about the thinking, not just the decision.

Software has never needed this because code executes deterministically. If you want to know what happened, you read the logs, and logs tell you everything because the machine did exactly what you told it to do. There's no gap between instruction and action that needs explaining.

AI reasons differently, and this is where things get interesting. The model considers possibilities, weights them, settles on an answer, but unless you designed for it, that consideration vanishes the moment the response is generated. You're left with an outcome disconnected from the process that produced it, like a patient chart that says "prescribed X" with no reasoning at all.

Production AI is going to need what I've started calling a deliberation record: a structured capture of what was considered, what weight each factor carried, and why the final judgment was reached. Logs are linear: event A, then B, then C. But deliberation is networked because factors relate to each other and trade off against each other.

The unit of execution is an operation. The unit of judgment is a consideration. That distinction matters more than most people realize.
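One way to make that distinction concrete is to store considerations as a small graph rather than a log line. This is a minimal sketch with invented field names and weights, assuming signed weights where positive pushes toward approval and negative against:

```python
from dataclasses import dataclass, field

@dataclass
class Consideration:
    """One factor in a judgment, with its pull and its relations to other factors."""
    factor: str
    weight: float                                       # signed: how hard this factor pulled
    supports: list[str] = field(default_factory=list)   # factors this one reinforces
    conflicts: list[str] = field(default_factory=list)  # factors this one trades off against

@dataclass
class DeliberationRecord:
    """What was considered, how it was weighed, and what was concluded."""
    question: str
    considerations: list[Consideration]
    judgment: str
    rationale: str

record = DeliberationRecord(
    question="Approve this $12,000 refund?",
    considerations=[
        Consideration("customer_tenure", 0.3, supports=["order_history"]),
        Consideration("order_history", 0.2),
        Consideration("amount_vs_policy_cap", -0.7, conflicts=["customer_tenure"]),
    ],
    judgment="escalate",
    rationale="Amount exceeds the policy cap; goodwill factors do not outweigh it.",
)
```

Unlike a linear log, the record keeps the trade-offs: you can see which factors pulled which way, and which factor overruled the others, long after the response itself has been generated.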

Confidence as Product

Type systems, assertions, unit tests: every tool we've built in the last fifty years exists to guarantee certainty. If the output is ambiguous, something went wrong and you need to fix it.

AI inverts this completely. A model that says "80% chance of fraud" is actually more useful than one that says "fraud" or "not fraud" because the uncertainty is the signal. Suppress it and you've thrown away information; force a binary classification and you've hidden the cases where the model was basically guessing.

This requires different architecture than what we're used to building. Instead of code that fails when confidence is low, you need code that routes on confidence: high-confidence calls go straight through while low-confidence ones get more reasoning, different models, or human review. Confidence becomes a resource you budget, like compute or memory.
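A minimal sketch of that routing, with made-up thresholds and tier names; in practice the cutoffs would be tuned per task against the cost of a wrong call versus the cost of escalation:

```python
def route_on_confidence(confidence: float,
                        high: float = 0.9,
                        low: float = 0.6) -> str:
    """Route a model call by how sure the model is, not just by what it said.

    Thresholds are illustrative defaults, not recommendations.
    """
    if confidence >= high:
        return "auto_approve"          # confident enough to act directly
    if confidence >= low:
        return "second_model_review"   # spend more reasoning, or a stronger model
    return "human_review"              # the model is close to guessing: escalate

assert route_on_confidence(0.95) == "auto_approve"
assert route_on_confidence(0.75) == "second_model_review"
assert route_on_confidence(0.40) == "human_review"
```

The point is the shape, not the thresholds: low confidence triggers a more expensive path instead of an error, which is exactly what treating confidence as a budgeted resource means.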

The observability we need isn't about uptime or latency, it's about calibration. Does this model know what it doesn't know?
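Calibration can be measured with a simple binning pass over (confidence, outcome) pairs, the same computation behind a reliability diagram. This is a bare-bones sketch, not a production metric:

```python
def calibration_gaps(predictions: list[tuple[float, bool]],
                     bins: int = 10) -> list[tuple[float, float]]:
    """Compare stated confidence to observed accuracy, per confidence bin.

    A well-calibrated model's 80%-confidence calls should be right ~80% of
    the time. Returns (mean_confidence, observed_accuracy) per non-empty bin.
    """
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(bins)]
    for conf, correct in predictions:
        idx = min(int(conf * bins), bins - 1)  # clamp conf == 1.0 into the top bin
        buckets[idx].append((conf, correct))
    out = []
    for bucket in buckets:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            out.append((mean_conf, accuracy))
    return out
```

A model whose 0.9-confidence calls are right only 60% of the time is overconfident, and that gap, not latency or uptime, is the dashboard this section is arguing for.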

What Comes Next

Companies are going to automate judgment. Every company with more calls than people to make them will eventually reach this conclusion because the economics are too compelling and the technology has crossed the threshold of usefulness.

The choice is whether to deploy with accountability tooling or without it.

Most teams are deploying without, building on frameworks designed for prototyping and running in production environments built for deterministic code. Some of those gaps won't matter. Others will surface as audit failures, unexplainable calls, accountability vacuums when something goes wrong and nobody can figure out why.

The agents that make judgment calls need tooling that can account for them. That's the work.
