June 28, 2026 · iri bone

The real metric for AI-assisted work: time between human decisions

The common way to measure AI's effect on knowledge work is "how much faster did the machine do the task?" — lines of code per hour, tickets closed, words drafted. It's the easy number to reach for, and it's mostly the wrong one.

On real projects, the bottleneck is rarely raw production speed. It's the waiting. Work stalls at the points where it needs a person — to approve a direction, resolve an ambiguity, make a judgment call only a human with context can make. The useful question isn't "how fast can the AI type?" It's:

How long can the work run, unattended, before it next requires a human decision — and how good is that stretch of work when you check it?

Why this is the number that matters

Every handoff back to a human carries fixed costs: context-switching, scheduling, the latency of a busy person's attention. Shave a task from 60 minutes to 6 and you've saved 54 minutes — but if a human still has to weigh in every few minutes, you've barely moved the project's wall-clock timeline. The person is still the rate-limiter.
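To make the arithmetic concrete, here is a minimal sketch of a toy model. It assumes elapsed time is simply working time plus a fixed wait per handoff. The function name, the 12 handoffs, and the 30-minute wait are illustrative assumptions, not measurements.

```python
def wall_clock_minutes(work_minutes, handoffs, wait_per_handoff):
    """Toy model: elapsed time = working time + time spent waiting on a person.

    All parameters are hypothetical; real handoff costs vary widely.
    """
    return work_minutes + handoffs * wait_per_handoff


# Assumed scenario: 12 human decisions, each costing 30 minutes of waiting.
baseline = wall_clock_minutes(work_minutes=60, handoffs=12, wait_per_handoff=30)

# Make the machine 10x faster but keep every handoff.
faster_machine = wall_clock_minutes(work_minutes=6, handoffs=12, wait_per_handoff=30)

# Keep the original speed but need only 2 human decisions.
fewer_handoffs = wall_clock_minutes(work_minutes=60, handoffs=2, wait_per_handoff=30)

print(baseline, faster_machine, fewer_handoffs)
```

Under these assumed numbers, the tenfold speedup takes the timeline from 420 minutes to 366, while cutting the handoffs from 12 to 2 takes it to 120 with no speedup at all.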

Lengthen the interval between necessary human decisions — from "every few minutes" to "every few hours" to "once a day" — and something different happens. The human stops being a bottleneck and starts being a decision-maker: fewer, higher-leverage calls, each one better-informed because more work has accumulated underneath it.

What it means in practice

Optimizing for this metric changes how you build. You invest in the things that let work run safely without supervision: clear specifications up front, strong tests and verification the agent can check itself against, tight feedback loops, and well-chosen guardrails so that "unattended" doesn't mean "unaccountable." You design the human checkpoints deliberately, putting them where judgment genuinely adds value rather than where the tooling happens to interrupt.

It also reframes what good AI-assisted work looks like. A system that produces output quickly but needs constant correction is worse, in this light, than one that's slightly slower but runs clean for an hour. The second one is the one that actually compresses your timeline.

The honest caveat

Longer intervals are only a win if the quality holds. Stretching the leash on work that quietly goes wrong just means you discover the problem later, with more to unwind. So the metric is really two numbers held together: the length of the interval and how far the work produced during it can be trusted. Chasing the first while ignoring the second is how AI projects generate impressive demos and disappointing outcomes.
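One way to keep the two numbers together is to record them side by side for every unattended stretch. The sketch below is only an illustration: the `Run` record, its field names, and the choice of median and acceptance rate as summaries are assumptions, and the sample data is made up.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One unattended stretch of work (hypothetical record format)."""
    minutes_unattended: float
    accepted_on_review: bool  # did the work hold up when a human checked it?


def summarize(runs):
    """Return (median interval in minutes, share of runs accepted on review)."""
    if not runs:
        raise ValueError("need at least one run")
    interval = median(r.minutes_unattended for r in runs)
    trust = sum(r.accepted_on_review for r in runs) / len(runs)
    return interval, trust


# Made-up sample: the longest run is the one that failed review.
runs = [Run(30, True), Run(90, True), Run(240, False), Run(60, True)]
interval, trust = summarize(runs)
print(f"median interval: {interval} min, accepted: {trust:.0%}")
```

Reporting the pair, rather than the interval alone, makes it harder to celebrate a longer leash while the acceptance rate is slipping.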

This is the thread running through much of Grimalkin's work — figuring out, for a given product or team, where the durable leverage is, and designing the checkpoints so the humans spend their attention on the decisions that deserve it.