CONCEPT

Brier Score

The quadratic scoring rule — measuring the squared distance between predicted probability and observed outcome — that Tetlock used to evaluate expert forecasts and that AI output urgently requires.

The Brier score, developed by meteorologist Glenn Brier in 1950, quantifies forecast accuracy by measuring the mean squared difference between predicted probabilities and actual outcomes. A perfect forecast receives a score of zero; the worst possible forecast (maximal confidence in the wrong outcome) receives a score of two in Brier's original formulation, which sums the squared error over all outcome categories, or one in the common single-probability form used below. The score is 'proper,' meaning it incentivizes honest reporting of subjective probabilities rather than strategic hedging. Tetlock adopted the Brier score as the primary metric for Expert Political Judgment because it captured both calibration (whether seventy-percent predictions are right seventy percent of the time) and resolution (whether the forecaster distinguishes between more- and less-probable events). The Brier score made expert accountability possible by transforming vague predictions into measurable claims.
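
A one-line check of that 'proper' property, writing p for the forecaster's true subjective probability and f for the reported one (symbols introduced here for illustration, using the single-probability form): the expected penalty is

  p(1 - f)² + (1 - p)f² = (f - p)² + p(1 - p),

which is smallest exactly when f = p. No strategic report can beat honesty in expectation, and the same holds for the two-category form, which simply doubles every term.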

In the AI Story


The mathematical simplicity of the Brier score conceals its conceptual power. For a binary event (will it happen or not?), the score is calculated as: BS = (f - o)², where f is the forecast probability and o is the outcome (1 if the event occurred, 0 if it did not). A forecaster who predicts seventy-percent probability for an event that happens receives a score of 0.09; if the event does not happen, the score is 0.49. Aggregated across many predictions, the average Brier score reveals both how well-calibrated the forecaster is and how much information the forecasts contain. A forecaster who always predicts fifty-fifty is perfectly calibrated but provides zero information. A forecaster whose confidence tracks with outcomes provides both.
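
A minimal sketch of that calculation in Python (the function name is illustrative, not something the sources define):

  def brier_score(forecasts, outcomes):
      """Mean squared difference between forecast probabilities (0-1)
      and binary outcomes (1 = the event occurred, 0 = it did not)."""
      return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

  # The worked example from the text: a seventy-percent forecast scores
  # 0.09 if the event happens and 0.49 if it does not.
  print(brier_score([0.7], [1]))   # ~0.09
  print(brier_score([0.7], [0]))   # ~0.49

  # A forecaster who always says fifty-fifty averages 0.25 whatever happens:
  # perfectly calibrated, but carrying no information.
  print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))   # 0.25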

Tetlock's twenty-year study scored 28,000 expert predictions using Brier scores, creating the most comprehensive evaluation of professional judgment ever conducted. The scoring was unforgiving: there was no partial credit for being 'close,' no interpretive charity for predictions that were 'directionally correct.' The forecast said sixty percent; the outcome was binary; the correspondence was measured. This rigor made the findings credible in ways that qualitative assessments of expert performance could never be. The numbers were what they were, and what they were was humbling: expert Brier scores clustered just slightly better than the 0.5 baseline representing no discriminative ability.

In the AI age, the Brier score acquires new relevance as a diagnostic for AI-assisted judgment. A professional who uses AI to generate forecasts or recommendations should, in principle, be able to track Brier scores for AI-assisted versus unassisted predictions and determine empirically whether the AI is improving judgment or merely amplifying confidence. The infrastructure for this evaluation exists — it requires only the willingness to make predictions specific enough to score and the discipline to track outcomes. The absence of such infrastructure in most organizations means that professionals using AI have no systematic way to know whether the tool is making their judgment better or worse. They have feelings about the tool's value. They do not have measurements.
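
As a sketch of what that infrastructure could look like (a hypothetical prediction log, not a tool described here), a professional might keep scored records and compare the two conditions directly:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class Prediction:
      question: str
      probability: float              # stated forecast, between 0 and 1
      ai_assisted: bool               # was the forecast made with AI help?
      outcome: Optional[int] = None   # filled in later: 1 = happened, 0 = did not

  def mean_brier(predictions):
      scored = [p for p in predictions if p.outcome is not None]
      if not scored:
          return None                 # nothing resolved yet
      return sum((p.probability - p.outcome) ** 2 for p in scored) / len(scored)

  log = [
      Prediction("Deal closes by Q3", 0.8, ai_assisted=True, outcome=1),
      Prediction("Regulator approves the filing", 0.6, ai_assisted=True, outcome=0),
      Prediction("Candidate accepts the offer", 0.7, ai_assisted=False, outcome=1),
      Prediction("Vendor ships on schedule", 0.9, ai_assisted=False, outcome=0),
  ]

  print("AI-assisted:", mean_brier([p for p in log if p.ai_assisted]))       # ~0.20
  print("Unassisted:", mean_brier([p for p in log if not p.ai_assisted]))    # ~0.45

The mechanics are trivial; the discipline of writing questions specific enough to resolve, and recording outcomes when they arrive, is the hard part.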

Origin

Glenn W. Brier, a meteorologist at the U.S. Weather Bureau, introduced the score in his 1950 paper 'Verification of Forecasts Expressed in Terms of Probability' in the Monthly Weather Review. Weather forecasting provided the natural laboratory: predictions were made daily, outcomes were observed reliably, and the volume of data allowed statistical evaluation of forecaster skill. Brier wanted a metric that would reward both calibration and resolution — that would distinguish between a forecaster who was merely well-calibrated at the aggregate level and one who could discriminate between high- and low-probability events. The quadratic penalty structure (squaring the error) accomplished this by punishing confident errors more severely than tentative ones. The score became standard in meteorology and was later adopted in psychology, economics, and intelligence analysis.

Key Ideas

Quadratic penalty for error. Squaring the difference between predicted probability and outcome means that confident wrong predictions are punished more severely than tentative wrong predictions — the score incentivizes honesty.

Calibration and resolution together. A low Brier score requires both that confidence levels match hit rates (calibration) and that the forecaster distinguishes between probable and improbable events (resolution); see the decomposition sketch after this list.

Proper scoring rule. The Brier score is 'proper,' meaning a forecaster's expected score is optimized by reporting their true subjective probability — no strategic hedging improves performance.

Accountability infrastructure. The score transforms vague expertise into measurable claims, enabling the feedback loops that judgment improvement requires.

AI output demands scoring. Without systematic Brier scoring of AI-assisted predictions, professionals have no basis for knowing whether the tool improves judgment or merely amplifies overconfidence.
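
The 'calibration and resolution together' idea can be made concrete with the partition from Murphy (1973), listed under further reading: mean Brier score = reliability - resolution + uncertainty, where reliability measures miscalibration (lower is better), resolution measures discrimination (higher is better), and uncertainty is the variance implied by the base rate. A rough Python sketch, assuming forecasts are grouped by their exact stated probability (names here are illustrative):

  from collections import defaultdict

  def brier_decomposition(forecasts, outcomes):
      """Murphy's partition of the mean Brier score into reliability,
      resolution, and uncertainty. Grouping by the exact stated
      probability makes the identity hold exactly."""
      n = len(forecasts)
      base_rate = sum(outcomes) / n
      bins = defaultdict(list)
      for f, o in zip(forecasts, outcomes):
          bins[f].append(o)
      reliability = sum(len(obs) * (f - sum(obs) / len(obs)) ** 2
                        for f, obs in bins.items()) / n
      resolution = sum(len(obs) * (sum(obs) / len(obs) - base_rate) ** 2
                       for obs in bins.values()) / n
      uncertainty = base_rate * (1 - base_rate)
      return reliability, resolution, uncertainty

  # A forecaster who always says fifty-fifty: perfectly calibrated
  # (reliability = 0) but with zero resolution, so the mean Brier score
  # collapses to the base-rate uncertainty of 0.25.
  rel, res, unc = brier_decomposition([0.5] * 10, [1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
  print(rel, res, unc, rel - res + unc)   # 0.0 0.0 0.25 0.25

A forecaster whose stated probabilities track outcomes shows resolution above zero; applied to a log of AI-assisted predictions, the same partition would indicate whether the tool adds discrimination or merely shifts calibration.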


Further reading

  1. Brier, G.W. (1950). 'Verification of Forecasts Expressed in Terms of Probability.' Monthly Weather Review, 78(1), 1–3.
  2. Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press, Methodological Appendix.
  3. Murphy, A.H. (1973). 'A New Vector Partition of the Probability Score.' Journal of Applied Meteorology, 12, 595–600.