Velocity Metrics Critique is the Goldratt simulation's application of constraint theory to the measurement frameworks dominating contemporary software development. Story points, sprint velocity, feature counts, deployment frequency — the standard metrics of Agile and DevOps culture — measure the rate at which engineering produces output. When coordination was the constraint, these metrics had some validity as indirect proxies for system throughput: engineering output was gated by coordination, so rising output signaled that the constraint itself was being relieved. In the AI era, the metrics have become measurements of the non-constraint, and their celebration produces the exact pattern Goldratt spent his career diagnosing: locally improving metrics, systemically degrading outcomes.
There is a parallel reading in which velocity metrics represent not local optimization but democratic accountability in organizations where judgment has historically been a site of unexamined power. Story points and sprint velocity emerged not merely as coordination tools but as mechanisms to make engineering work legible to non-technical stakeholders, to create shared language between functions, and to resist the tyranny of gut-feel prioritization by executives whose "judgment" often encoded market prejudices, personal preferences, or political convenience rather than user insight.
The critique assumes judgment can be cleanly separated from production and elevated above it, but the actual workflow of software development is iterative discovery through building. Users don't know what they want until they see it; product-market fit is found through rapid experimentation, not prolonged evaluation. Velocity metrics, for all their flaws, keep organizations shipping — and shipping is how technology companies learn. The risk in the AI era is not that we'll build too much, but that we'll return to a pre-Agile world where small groups of "visionaries" control roadmaps through assertion rather than evidence, where judgment becomes another word for executive prerogative, and where the difficulty of measuring evaluation depth becomes an excuse not to measure anything at all. Goldratt's framework is clear, but so is organizational history: unmeasured processes drift toward capture by whoever holds positional authority. Velocity metrics, however imperfect, created space for engineering voice in product decisions. Their replacement must come with guarantees that judgment won't simply mean "what the VP believes."
The critique applies with specific force to sprint velocity tracking. Sprint velocity measures how many story points a team completes in a sprint. Teams that increase their velocity are celebrated as improving; teams whose velocity declines are examined for dysfunction. The measurement assumes that completed story points correspond to system value — that more points completed means more value delivered. This assumption holds only if the features represented by the points have been evaluated, validated, and found worth building. In AI-augmented teams, where generation capacity vastly exceeds evaluation capacity, the assumption breaks. Velocity can increase while system throughput — the rate at which genuine value reaches users — stagnates or declines.
The specific mechanism is the one Goldratt diagnosed repeatedly in manufacturing: local optimization of a non-constraint produces inventory, not throughput. AI-augmented engineers can generate features faster than product managers can evaluate them, faster than QA can test them, and faster than users can absorb them. The features ship — velocity increases — but the downstream capacity to validate their value has not scaled. The system accumulates features of uncertain value, creating cognitive inventory and product incoherence that will eventually manifest as maintenance burden, user confusion, and competitive vulnerability.
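A minimal sketch of that dynamic, assuming invented per-sprint numbers (the generation_rate and evaluation_capacity values below are hypothetical, chosen only to make the pattern visible):

```python
# Toy model of Goldratt's inventory mechanism in an AI-augmented team.
# All rates are hypothetical illustrations, not empirical measurements.

def simulate(sprints: int, generation_rate: int, evaluation_capacity: int) -> None:
    """Each sprint, engineering ships `generation_rate` features (velocity),
    but downstream evaluation can validate at most `evaluation_capacity`."""
    inventory = 0  # shipped-but-unvalidated features: cognitive inventory
    for sprint in range(1, sprints + 1):
        inventory += generation_rate                   # velocity: what gets celebrated
        validated = min(inventory, evaluation_capacity)
        inventory -= validated                         # throughput: what Goldratt counts
        print(f"sprint {sprint}: velocity={generation_rate} "
              f"throughput={validated} inventory={inventory}")

# Coordination-constrained era: evaluation keeps pace, velocity tracks throughput.
simulate(sprints=5, generation_rate=8, evaluation_capacity=10)
# AI era: generation has tripled, evaluation has not. Velocity rises,
# throughput stays pinned at 10, and inventory grows without bound.
simulate(sprints=5, generation_rate=24, evaluation_capacity=10)
```

The toy makes only one point: any metric tracking generation_rate improves every sprint, while validated throughput, the number Goldratt would actually count, never exceeds the evaluation constraint.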
The critique extends to deployment frequency, pull request counts, code review throughput, and virtually every quantitative metric of engineering activity. Each measures the rate of a non-constraint and implicitly assumes the constraint is elsewhere. The measurement frameworks were designed for an era when they were approximately correct. In the AI era, they are systematically misleading. Organizations celebrating them are celebrating what Goldratt would immediately recognize as the wrong thing.
The alternative the simulation proposes is explicit measurement of the judgment constraint: decision quality, evaluation depth, system coherence, product-user fit. These dimensions are harder to measure precisely, which is why organizations default to the easier proxies. But Goldratt's framework is clear: measure the wrong thing and you will optimize for the wrong thing. The difficulty of measuring judgment does not justify measuring velocity; it justifies the harder work of building measurement systems adequate to the actual constraint. An organization that measures judgment imperfectly is in better shape than an organization that measures velocity precisely — because the first is aiming at the right target, however badly, while the second is aiming at the wrong target, however accurately.
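The claim that a bad measure of the right target beats a precise measure of the wrong one can be made concrete with a small numerical sketch. Everything here is an invented assumption: the value distribution, the noise level, and the realized_value helper exist only for illustration.

```python
import random

random.seed(0)

# Hypothetical portfolio: each candidate feature has a true, unobservable value.
features = [random.gauss(0, 1) for _ in range(200)]

# Judgment metric: the right target, measured badly (noise twice the signal).
judgment = [v + random.gauss(0, 2) for v in features]

# Velocity-style metric: measured exactly, but it tracks effort rather than
# value (modeled here as independent of value entirely).
velocity = [random.gauss(0, 1) for _ in features]

def realized_value(scores, k=20):
    """Total true value of the top-k features selected by a given metric."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(features[i] for i in ranked[:k])

print("select by noisy judgment: ", round(realized_value(judgment), 1))
print("select by precise velocity:", round(realized_value(velocity), 1))
```

Even with noise twice the size of the signal, selecting by the noisy judgment score recovers substantially more true value than selecting by the precisely measured but value-blind proxy: aiming badly at the right target wins.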
The critique synthesizes Goldratt's long-standing attack on local-optimization metrics with the specific constraint migration produced by AI. It draws on the Berkeley study's documentation of task seepage and the broader empirical record of AI-augmented work intensification.
Velocity measures the non-constraint. Story points, sprint velocity, and feature counts measure engineering output, which is no longer the system's binding constraint.
Local optimization of velocity produces inventory. Features generated faster than they can be evaluated accumulate as cognitive and product inventory — liabilities masquerading as assets.
The measurement framework was built for a prior era. Agile metrics were approximately right when coordination was the constraint; they are systematically wrong now that judgment is.
Judgment metrics are harder but necessary. Decision quality, evaluation depth, and product coherence are difficult to measure, but measuring them badly is superior to measuring velocity precisely.
Organizational culture resists the critique. Velocity metrics are embedded in performance reviews, compensation, and professional identity — making their replacement a political project as much as a technical one.
The core insight—that velocity measures the non-constraint in AI-augmented environments—is approximately 85% correct for the specific case of feature-complete products in mature markets, where the question is "what should we build" rather than "can we build fast enough." The critique accurately diagnoses the inventory accumulation pattern when generation vastly exceeds validation capacity. But the weighting shifts dramatically (to perhaps 40% Goldratt, 60% velocity-defense) in true discovery contexts: early-stage products, new market entry, rapid experimentation scenarios where the cost of building is now so low that building-to-learn beats analyzing-to-decide.
The political economy concern—that unmeasured judgment drifts toward executive capture—is legitimate and worth roughly 30% weight in any implementation strategy. The answer isn't to preserve velocity metrics but to build judgment transparency: decision logs, evaluation frameworks, outcome tracking that makes product choices as legible as story points made engineering work. This is genuinely harder, which is why organizations resist it, but the difficulty is surmountable rather than fundamental.
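One hedged sketch of what judgment transparency could look like as a data structure. The DecisionRecord schema and its field names are hypothetical, invented here to illustrate the idea, not drawn from the source:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DecisionRecord:
    """One entry in a decision log: a hypothetical schema that makes a
    product judgment as auditable as a story point made engineering work."""
    decision: str                # what was chosen
    rationale: str               # why, stated before the outcome is known
    alternatives: list[str]      # what was considered and rejected
    expected_outcome: str        # a falsifiable prediction
    review_on: date              # when the prediction gets scored
    actual_outcome: Optional[str] = None   # filled in at review time
    held_up: Optional[bool] = None         # did the judgment prove correct?

def hit_rate(log: list[DecisionRecord]) -> float:
    """Fraction of reviewed decisions whose predictions held up: a crude but
    trackable decision-quality number, the judgment analog of velocity."""
    reviewed = [r for r in log if r.held_up is not None]
    return sum(r.held_up for r in reviewed) / len(reviewed) if reviewed else 0.0
```

The design choice that matters is the falsifiable expected_outcome recorded before results are known: it is what turns "what the VP believes" into a scored prediction rather than an assertion.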
The synthesis the territory itself suggests is constraint-regime awareness: organizations need measurement systems that shift with the constraint. In discovery mode, velocity matters more (though still not exclusively). In optimization mode, judgment depth matters more. In scaling mode, coherence matters more. The mistake isn't velocity metrics per se—it's velocity metrics everywhere, always, without recognizing what you're actually constrained by right now. Goldratt's framework doesn't say "never measure non-constraints"—it says "don't mistake non-constraint metrics for system throughput." An organization sophisticated enough to know which regime it's in can use velocity when appropriate, judgment when appropriate, and switch between them deliberately rather than defaulting to whichever is easier to measure.
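As a closing sketch, constraint-regime awareness expressed as configuration rather than prose. The regime names follow the paragraph above; the metric lists and the 2x backlog heuristic in infer_regime are invented assumptions, not prescriptions:

```python
from enum import Enum

class Regime(Enum):
    DISCOVERY = "discovery"        # constraint: learning speed
    OPTIMIZATION = "optimization"  # constraint: judgment / evaluation depth
    SCALING = "scaling"            # constraint: system coherence

# Hypothetical mapping from regime to the metrics that track its constraint.
PRIMARY_METRICS = {
    Regime.DISCOVERY:    ["experiment cycle time", "validated learnings per quarter"],
    Regime.OPTIMIZATION: ["decision hit rate", "evaluation depth per shipped feature"],
    Regime.SCALING:      ["cross-feature coherence review score", "defect escape rate"],
}

def infer_regime(generated_per_sprint: int, evaluated_per_sprint: int) -> Regime:
    """Crude two-way heuristic: when generation far outruns evaluation, judgment
    is the binding constraint. The 2x threshold is an invented illustration, and
    detecting the scaling regime would need coherence signals not modeled here."""
    if generated_per_sprint > 2 * evaluated_per_sprint:
        return Regime.OPTIMIZATION
    return Regime.DISCOVERY

regime = infer_regime(generated_per_sprint=24, evaluated_per_sprint=10)
print(regime.value, "->", PRIMARY_METRICS[regime])
```

The deliberate switching the paragraph calls for lives in the lookup: the primary metric is a function of the current regime, not a constant of the organization.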