Scientific judgment is Polanyi's term for the evaluative capacity that distinguishes genuine scientific expertise from mechanical competence in following procedures. It is the ability to recognize which problems matter, which data are significant, which hypotheses are worth testing, which results are surprising, which explanations are elegant. This judgment operates tacitly: the experienced researcher feels when an experiment is on the right track, when a result is too good to be true, when a theory has been stretched beyond its domain of validity. The judgment cannot be reduced to explicit rules—two equally competent scientists examining the same evidence may reach different evaluations, both grounded in genuine expertise, because their tacit grounds differ. What makes scientific judgment reliable is not its conformity to rules but its grounding in the researcher's years of embodied engagement with the domain—the accumulated sensitivity to patterns, anomalies, and the boundaries between the understood and the mysterious. AI systems lack this judgment entirely: they produce outputs consistent with training-data patterns but cannot assess whether those outputs represent genuine advances in understanding or merely probable recombinations of existing knowledge.
Polanyi developed the concept of scientific judgment in part to explain how peer review actually works. The official account presents reviewers as applying explicit methodological standards: did the researcher follow proper experimental protocols? Do the data support the conclusions? Is the reasoning valid? But Polanyi observed that the most consequential evaluations scientists make concern significance—does this work advance understanding in a way that matters?—and significance cannot be assessed by explicit criteria. The reviewer must exercise connoisseurial judgment, drawing on deep tacit understanding of what the field knows, what it needs to know, and what directions are promising. This judgment is fallible, personal, and irreplaceable. No algorithm can substitute for it, because the judgment concerns the relationship between the work and the field's tacit understanding of its own frontiers.
The AI research community faces a version of this problem in evaluating its own models. Benchmarks provide explicit measures of performance—accuracy on standard tests, perplexity scores, human preference ratings. But these metrics cannot capture the tacit dimension of what makes a model genuinely better rather than merely different. When do improvements on benchmarks represent real advances in capability, and when do they represent overfitting to the specific distributions the benchmarks measure? The judgment requires the tacit sensibility of researchers who have spent years working with models and have developed an intuitive feel for the difference between genuine capability and sophisticated pattern-matching. The field's best practitioners possess this judgment; the weakest substitute benchmark performance for understanding, optimizing metrics without asking whether the metrics measure what matters.
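To make the overfitting worry concrete, consider a deliberately simplified sketch (hypothetical toy models and synthetic data, not any real benchmark): two models post identical scores on an in-distribution benchmark, yet only one has learned anything that survives a shift in the input distribution. Nothing in the benchmark number itself distinguishes them; that discrimination still depends on the researcher's judgment about what the benchmark does and does not sample.

```python
# Hypothetical illustration: two toy "models" for the task "is x positive?"
# One learns the underlying rule; the other memorizes the narrow input range
# the benchmark happens to draw from. Both ace the benchmark.
import random

random.seed(0)

def label(x):
    """Ground truth for the toy task."""
    return x > 0

def model_general(x):
    """Model that learned the actual rule."""
    return x > 0

def model_overfit(x):
    """Model that memorized the benchmark's input range (roughly 0 to 10)."""
    return 0 < x <= 10

def accuracy(model, inputs):
    return sum(model(x) == label(x) for x in inputs) / len(inputs)

# In-distribution benchmark: inputs from the range both models were tuned on.
benchmark = [random.uniform(-10, 10) for _ in range(1000)]
# Shifted evaluation: inputs the benchmark never samples.
shifted = [random.uniform(10, 100) for _ in range(1000)]

print("benchmark  general:", accuracy(model_general, benchmark),
      " overfit:", accuracy(model_overfit, benchmark))   # both ~1.0
print("shifted    general:", accuracy(model_general, shifted),
      " overfit:", accuracy(model_overfit, shifted))     # ~1.0 vs ~0.0
```

The point is not the toy arithmetic but the structural gap: the in-distribution score is identical for both models, so deciding which improvement is "real" requires knowledge that the metric itself does not encode.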
Organizations deploying AI face the scientific judgment problem when deciding which AI-generated analyses to act on. An AI system can produce comprehensive strategic analysis of a market opportunity—data on competitors, customer segments, regulatory environment, technical feasibility. The analysis may be factually accurate and logically organized. But whether the opportunity is worth pursuing requires judgment that no analysis can contain: tacit assessment of organizational capability, intuitive sense of market timing, feel for which risks are acceptable and which are catastrophic. The executive who possesses this judgment can use AI analysis as input. The executive who lacks it—who has risen through organizations where strategic thinking was delegated to consultants and analysis teams—has no ground from which to evaluate whether the AI-generated analysis is genuinely insightful or merely plausible. The analysis gets accepted because it meets explicit standards, and the absence of tacit judgment goes unnoticed until the strategic bet fails.
Scientific judgment appears throughout Polanyi's work but receives extended treatment in "The Republic of Science" (1962) and in the sections of Personal Knowledge concerned with the authority of science. Polanyi argued that science is governed not by method—which any competent technician can follow—but by the collective judgment of the scientific community, whose members have internalized the standards of their disciplines through years of practice. This internalization is what produces the capacity for evaluative judgment that peer review depends on and that no explicit methodology can replace.
Significance defies specification. The capacity to distinguish work that matters from work that is merely competent cannot be reduced to explicit criteria—it is tacit judgment built through deep domain engagement.
Grounded in embodied practice. Scientific judgment develops through years of doing research, encountering successes and failures, building sensitivity to what problems are tractable and what directions are promising.
Irreducibly personal. Two experts may disagree in their evaluative judgments without either being wrong—both draw on tacit grounds shaped by different biographical trajectories through the domain.
Peer review depends on it. The evaluative function of the scientific community—separating genuine advances from incremental work, distinguishing significant contributions from trivial ones—operates through tacit judgment that no rubric captures.
AI cannot develop it. Pattern-matching over training data produces outputs consistent with existing knowledge, but it cannot exercise the tacit sensibility that recognizes whether a pattern represents genuine insight or mere statistical coincidence.