
The Work AI Left for Humans
In the first piece I wrote about what AI does to the individual: the compulsion loops, the vanishing stopping cues, the way the cognitive load migrates upward until the nature of the fatigue changes. In the second, I wrote about what it does to the team: the daily rituals that stopped fitting, the meaning of “done” when nothing forces you to stop, the quiet shift in what code review is actually checking for. Those pieces were about the containers of work, the structures we inherited and the ways they are breaking.
This one is about what is inside the containers. Building production infrastructure for AI agents, systems where autonomous components make consequential decisions, route work to each other, and execute for hours without human involvement, forces a confrontation with every question those earlier pieces raised, because each observation about stopping cues and accountability and knowledge transfer eventually becomes a design decision that has to be made in code. The answers feel provisional because the level at which these questions matter is itself shifting; what requires human judgment today may be confidently automated tomorrow, and the organizational structures designed around current capability will need redesigning as capability grows.
The Sprint Needs a New Trigger

The sprint had a hidden function more valuable than its visible one. On the surface it time-boxed execution into manageable chunks, but underneath it forced a reflection cycle, a moment every two weeks when the team stopped building and asked whether they were building the right thing, whether the direction was still correct, what they had learned. That periodic forced reflection matters more now than it ever has, because the volume of consequential decisions per unit time has increased by an order of magnitude.
A workflow trace makes this concrete in a way the abstract never can. Forty-three agent-to-agent handoffs completing between midnight and six in the morning, each one a consequential judgment executed autonomously: agents evaluating inputs, routing decisions to other agents, producing structured outputs that trigger downstream workflows. The execution trace renders as a directed acyclic graph, clean and traceable, every decision logged with its reasoning. It also represents roughly three weeks of what used to be human decision-making, compressed into six hours. Evaluating whether those decisions were correct means making more architectural calls before lunch than a team used to make in a sprint.
When a single overnight execution compresses three sprints' worth of consequential decisions into a few hours, the reflection cycle cannot wait eleven more days; the reasoning behind decision thirty-seven has already evaporated from working memory by decision forty-two. The infrastructure makes this mismatch visible because the decision density can be graphed over time, and the graph shows a cadence that the two-week sprint was never designed for. The reflection mechanism needs to trigger on decision density rather than elapsed time: when the system shows that a threshold of consequential choices has been crossed, that is when you stop and ask whether you are still building the right thing.
That threshold is where the interesting question lives, because it is specific to each team. A team operating in a high-stakes domain with tight regulatory constraints will cross it faster than a team building internal tooling with generous error margins; a team of three senior engineers will absorb decision density differently than a team of ten with mixed experience; a team whose AI infrastructure produces structured reasoning at every step will have a different evaluation burden than a team working with opaque model outputs. The threshold emerges from the intersection of domain complexity, judgment capacity, and the maturity of the AI systems the team operates, which means it cannot be standardized. The sprint was a universal cadence that worked because the underlying work was roughly uniform in its demand for human attention. The replacement will be specific to each team's relationship with its autonomous systems, closer to a vital sign than a calendar event, something the infrastructure monitors and surfaces rather than something a project manager schedules.
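A density-based trigger of this kind can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: the `Decision` record, the `consequential` flag, and the `reflection_due` function are hypothetical names, and the threshold value is something each team would have to calibrate for itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Decision:
    timestamp: datetime
    consequential: bool  # hypothetical flag, set by the team's own criteria


def reflection_due(decisions, last_reflection, threshold):
    """Return True once enough consequential decisions have accumulated
    since the last reflection, regardless of elapsed calendar time."""
    count = sum(
        1
        for d in decisions
        if d.consequential and d.timestamp > last_reflection
    )
    return count >= threshold


# Illustrative data: 43 handoffs spread across a six-hour overnight run.
start = datetime(2025, 1, 1, 0, 0)
overnight = [
    Decision(start + timedelta(minutes=8 * i), consequential=True)
    for i in range(43)
]

# The last reflection happened three days earlier, well inside a two-week sprint.
last_reflection = start - timedelta(days=3)
trigger_now = reflection_due(overnight, last_reflection, threshold=40)
```

With an assumed threshold of 40, the overnight run alone is enough to call for a reflection, even though only three calendar days have passed. The same data against a threshold of 50 would not trigger, which is the point: the number is a property of the team, not of the calendar.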
As AI capability grows and confidence in autonomous operation increases, what qualifies as a “consequential decision” will itself migrate upward. Today it might be an architectural choice or a routing decision that demands human evaluation; as systems earn trust at that level, the consequential decisions will be feature-level or product-level choices, and the density threshold will need to recalibrate accordingly. The organizational cadence will never fully stabilize. The teams that build their reflection cycles as adaptive systems rather than fixed processes will navigate the transition more gracefully than those waiting for a new standard to replace the old one.
Knowledge Lives in the Trajectory

When one AI-augmented engineer builds an entire subsystem in a few days, the output is readable, the architecture navigable, the tests comprehensive. By every traditional metric of knowledge transfer the code speaks for itself. What the code cannot convey is why it exists in this form instead of the seventeen other forms that were explored and discarded in the conversation between the builder and the machine. The architectural variations tried mid-dialogue, the constraints that surfaced during exploration and were held in mind but never written down, the trade-offs that felt obvious at the time but would take an hour to reconstruct: all of it lives in a conversation that was never designed to be a knowledge artifact and is rarely saved as one.
The irony is visible in the infrastructure being built for these systems. An agent designed to make a consequential decision requires a structured output definition: action, confidence, reasoning. The reasoning field exists because six months from now, when someone asks why the system approved that claim or flagged that transaction, the reasoning is the only artifact that will matter. The execution trace captures the full trajectory: which agents were consulted, what data was evaluated, what confidence score triggered the decision, what the structured reasoning said at each step. The infrastructure preserves this decision landscape automatically for every machine decision.
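As a rough sketch, a structured output definition with those three fields might look like the following. The class name `AgentDecision`, the assumed confidence range of 0 to 1, and the example values are illustrative assumptions rather than the schema of any specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentDecision:
    action: str        # e.g. "approve" or "flag" (illustrative values)
    confidence: float  # assumed here to lie in [0.0, 1.0]
    reasoning: str     # the justification someone may read months later

    def __post_init__(self):
        # Reject outputs that would be useless as a future audit record.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.reasoning.strip():
            raise ValueError("reasoning must not be empty")


decision = AgentDecision(
    action="flag",
    confidence=0.62,
    reasoning="Transaction amount is far above this account's usual range.",
)
```

Making the reasoning field mandatory and non-empty is the design choice that matters here: a decision that cannot explain itself is rejected at construction time rather than discovered six months later.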
For the human decisions that shaped the system, no equivalent exists. The engineer who spent three days in conversation with an AI exploring seventeen architectural approaches before settling on the one that shipped has no comparable trace. The commit history shows what changed but reveals nothing of the decision landscape that was traversed; the conversation transcript, if it was saved at all, is an unstructured wall of text that lacks the context to be useful. The constraints held in mind, the directions tried and abandoned, the reasoning that made the chosen path feel inevitable in the moment, all of it evaporated when the conversation ended.
A team that inherits such a system inherits fully traceable machine decision history and zero record of why the human designed it that way rather than any of the other ways they considered. They can maintain the system and fix bugs in it, but the moment someone needs to make a significant architectural decision about that subsystem, they are operating without the context that informed its design, which is exactly the context they need most. What documentation should capture in an AI-first world is what else was considered and why it was ruled out; the negative space of decisions, the paths explored and discarded, may carry more value than the decisions themselves. The infrastructure built for autonomous agents already captures this automatically. Nothing equivalent exists for the humans who design the agents.
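To make the idea of recording the negative space concrete, here is one hypothetical shape such a record could take for a human design decision. Every name in it (`DesignDecision`, `RejectedOption`, `negative_space`) and every example value is an assumption made for illustration; it is a sketch of what a human-side equivalent might hold, not an existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class RejectedOption:
    name: str
    ruled_out_because: str


@dataclass
class DesignDecision:
    chosen: str
    reasoning: str
    constraints: list[str] = field(default_factory=list)
    rejected: list[RejectedOption] = field(default_factory=list)

    def negative_space(self):
        """Map each discarded option to the reason it was ruled out."""
        return {r.name: r.ruled_out_because for r in self.rejected}


# Illustrative record; the options and reasons are invented for the example.
record = DesignDecision(
    chosen="message queue between agents",
    reasoning="Handoffs must survive an agent restart.",
    constraints=["overnight runs last several hours"],
    rejected=[
        RejectedOption("direct function calls", "state is lost on restart"),
        RejectedOption("shared database polling", "adds latency to every handoff"),
    ],
)
```

The structure deliberately mirrors the machine-side trace: a chosen action, its reasoning, and alongside them the alternatives that would otherwise evaporate when the conversation ends.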
The granularity at which this problem matters will shift as AI capability grows. Today the meaningful trajectory is at the architectural decision level, because that is where humans still exercise judgment with enough frequency and consequence that losing the reasoning trail is dangerous. As AI systems earn confidence at that level, the consequential human reasoning will migrate upward to feature-design conversations and product-strategy explorations, and the infrastructure for capturing decision trajectories will need to follow it there. The problem recurs at each new level of abstraction.
Fewer Minds, More Judgment

Most of the code in an AI-first system is specification: schemas that define what a decision looks like, confidence thresholds that define when a human must intervene, routing rules that govern how work flows between autonomous components, and policies that define which agents can call which functions, enforced by the infrastructure rather than by trust. The implementation, the function bodies and API calls and data transformations, is largely handled by agents. The human contribution is the boundary: the articulation of where autonomy ends and what “good enough” means for each class of decision.
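A small sketch shows how little of this is implementation and how much is boundary. The decision classes, agent names, function names, and threshold values below are all hypothetical, chosen only to illustrate a confidence threshold and a call policy enforced in code.

```python
# Hypothetical specification: per-class confidence thresholds below which
# a human must review the decision.
ESCALATION_THRESHOLD = {
    "refund": 0.90,
    "routing": 0.70,
}

# Hypothetical policy: which agent may call which function.
ALLOWED_CALLS = {
    "triage_agent": {"classify", "route"},
    "payments_agent": {"issue_refund"},
}


def needs_human(decision_class, confidence):
    """True when confidence falls below the threshold for this class."""
    return confidence < ESCALATION_THRESHOLD[decision_class]


def authorize(agent, function):
    """Raise PermissionError unless the policy explicitly allows the call."""
    if function not in ALLOWED_CALLS.get(agent, set()):
        raise PermissionError(f"{agent} may not call {function}")


refund_escalates = needs_human("refund", 0.85)
routing_escalates = needs_human("routing", 0.85)
authorize("triage_agent", "route")  # allowed by the policy, so no exception
```

The same confidence of 0.85 escalates a refund but not a routing choice, because the two thresholds encode different judgments about what "good enough" means. Unknown agents get an empty permission set, so the policy denies by default.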
The craft has migrated from execution to specification, and the interesting question is less about headcount than about what kind of cognitive work remains when AI absorbs the execution layer. The tasks that got automated were the most legible and transferable: writing the function, building the component, wiring the integration. What remains is the work that was always harder to distribute and impossible to commoditize: specification and evaluation and architectural reasoning, the judgment about whether the system should exist in the form it is taking. A team of three doing what ten used to do is three people whose every specification has disproportionate leverage and disproportionate blast radius. A poorly calibrated confidence threshold produces a category of bad decisions that compounds silently until something breaks in production; a routing architecture that misunderstands failure modes passes testing and fails at scale, when agents coordinate around the wrong assumptions and the consequences are proportional to the judgment rather than the effort.
This is why team sizing modeled on output capacity, which is how nearly every engineering organization plans today, probably needs to shift toward something modeled on judgment bandwidth: how many consequential decisions this group can make well per unit time before the quality degrades. The distinction matters because judgment bandwidth is shaped by fundamentally different factors than output capacity. It depends on cognitive recovery time, the complexity of the domain, the quality of the observability infrastructure that supports evaluation, and the individual's ability to maintain calibration under sustained decision load. It is a deeply personal and team-specific constraint, and it produces very different answers about how to structure a team than throughput metrics ever did.
The specification level itself is rising. Today the meaningful specifications are schemas and confidence thresholds; as AI systems mature and earn confidence at this layer, the human contribution will migrate further upward to higher abstractions, to the product-level intent or the organizational strategy that the specification serves. The pattern repeats at each level: the most recently automated layer becomes commodity, and the meaningful human work migrates one level of abstraction higher.

Which brings me to the question the second piece ended with: what does seniority mean when the skills that historically defined it are being automated? The traditional hierarchy in engineering organizations mapped to depth of execution experience. Ten years of writing distributed systems gave you pattern recognition that two years could not, and that pattern recognition was legible, demonstrable, and organizationally valuable. When execution is automated, that mapping breaks. Seniority in an AI-first world maps to something different: the quality of constraints you can specify, the calibration of your judgment about where autonomy should end, the ability to recognize when the system's confidence is misplaced. The senior engineer's schema is tighter; their thresholds are better calibrated; their routing architectures reveal deeper understanding of failure modes and recovery paths.
But these are capacities that develop through a different kind of practice, closer to the judgment a physician develops through years of diagnosis than to the skill a programmer develops through years of writing code, and they do not accumulate the way execution experience did. Whether organizational hierarchies designed around execution-depth seniority can adapt to judgment-quality seniority is a question most organizations have not yet confronted. It becomes more urgent as the abstraction level rises, because each upward shift automates the previous layer's specification work and demands a new kind of judgment at the level above. The question of what seniority means in this landscape is recursive: it will need to be re-answered at every new level of capability, and the organizations that build adaptable structures rather than fixed hierarchies will handle that recursion better than those looking for a permanent new org chart.
I suspect software is, again, just where this is visible first. Every industry with a production pipeline built on human specialization and sequential handoffs will face a version of this question as AI absorbs enough of the execution layer, from legal teams to financial operations to clinical workflows. Writing a new organizational playbook turns out to be a fundamentally different capability from executing the old one.
The physicist in me recognizes the shape of this work. It is closer to specifying the Hamiltonian of a system, the constraints that govern how it evolves, than to solving the equations of motion. The craft after AI is the specification of the space within which the output must live.
But the person practicing this craft is still subject to the dynamics described in the first two pieces. The specification that needs one more refinement is always visible, always specific, and always appears closable with one more attempt: the schema could be tighter, the threshold better calibrated, the routing extended to handle one more edge case. The craft migrated, and the compulsion followed it.
Those dynamics, and the physics of where the work sustains and where it shatters, are where this series goes next.

Santosh Kumar Radha
Physicist & CTO at agentfield.ai — building AI infrastructure for the future.