Verify any claim · lenz.io
Claim analyzed
Science
“Win probability models used in sports analytics (for example, ESPN's NBA win probability model) are generally well-calibrated, and their probabilities align reasonably well with observed outcomes, at least compared with typical human intuition.”
Submitted by Patient Badger b843
The conclusion
The available evidence indicates these win-probability models are broadly calibrated over many games and usually align reasonably well with actual outcomes. They are not perfect: some studies find underdog bias and phase-specific quirks, and they are not clearly superior to simple baseline models. The comparison to human intuition is supported more indirectly than directly, but the overall claim is largely accurate.
Caveats
- Calibration is a long-run, aggregate property; a probability that looks absurd in one game does not by itself show the model is broken.
- Evidence for being better than typical human intuition is mostly indirect, based on predictive performance rather than direct comparisons of human probability calibration.
- These models can have systematic biases, including underestimating underdogs or reacting imperfectly at certain game stages.
Sources
Sources used in the analysis (cited later as Source 1 through Source 12, in the order listed)
This paper was motivated by evaluating real-time forecasts of home-team win probabilities in the NBA. The authors report that ESPN’s published forecasts are generally well-calibrated and that their forecasts show improved skill over several naive models, but not significantly better skill than simple logistic regression models using team strength and score difference.
The authors develop machine‑learning models to predict NBA game winners using player‑type clusters. They report that their best model "achieve[s] a prediction accuracy of ~76% over a period of five NBA seasons and a prediction accuracy of ~71% over a season not used for model training." They explicitly compare to human experts, noting that "our best models outperform human experts on prediction accuracy" and that human subject‑matter experts "have prediction accuracies in the range of 65%–68%." The paper also cites prior work where statistical models match or exceed expert predictions, suggesting algorithmic approaches can at least rival human intuition for game outcomes.
To create the calibration graph we divide the predictions into quintiles, twenty evenly sized groups based on the win probability implied by the model. For each quintile we calculate the average implied win probability and the actual win probability. The graph shows these odds in a scatter plot format. If the odds are well calibrated, teams with a 30%-win probability will win 30% of the time and the points on the calibration chart will align on the diagonal. Figure 1 is the calibration plot for the base Elo model. The model appears to be reasonably well calibrated as none of the points diverge from the diagonal by much. Later we show the calibration chart for the ESPN FPI model. It exhibits a similar pattern to the Elo models. Their models are reasonably accurate, picking the winner as the favorite more than 60% of the time. The models are reasonably well calibrated though it appears that each model is biased in that they tend to underrate underdogs.
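The binning procedure described in this excerpt can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `calibration_table`, the equal-count binning rule, and the synthetic forecasts are all hypothetical. The bin count is left as a parameter because the excerpt mentions both quintiles (five groups) and twenty groups.

```python
import random


def calibration_table(probs, outcomes, n_bins=20):
    """Sort forecasts by predicted probability, split them into n_bins
    groups of (nearly) equal size, and return one row per group:
    (mean predicted probability, observed win rate, group size)."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        group = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:  # happens only when there are fewer forecasts than bins
            continue
        mean_pred = sum(p for p, _ in group) / len(group)
        win_rate = sum(y for _, y in group) / len(group)
        rows.append((mean_pred, win_rate, len(group)))
    return rows


# Synthetic data (an assumption for illustration): outcomes are drawn from
# the forecast probabilities themselves, so the forecasts are calibrated
# by construction and the two columns should roughly agree.
rng = random.Random(0)
probs = [rng.random() for _ in range(40_000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

for mean_pred, win_rate, size in calibration_table(probs, outcomes):
    print(f"predicted {mean_pred:.3f}   observed {win_rate:.3f}   n={size}")
```

Plotting the first column against the second gives the calibration chart. Points near the diagonal indicate good calibration; a model that underrates underdogs would show observed win rates sitting above the predicted values at the low-probability end.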
This 2026 preprint uses ESPN’s NBA win-probability estimates as input data to construct a Referee Impact Metric over the 2021–2022 through 2024–2025 seasons. The authors treat the ESPN in-game win probabilities as a reasonable proxy for the underlying state of the game, stating that “using ESPN game-summary and win-probability data for NBA seasons 2021–2022 through 2024–2025, we show that RIM is empirically distinct from existing referee metrics.” While the paper does not audit ESPN’s calibration directly, its methodology implicitly relies on ESPN’s probabilities being sufficiently accurate to detect subtle referee effects at the possession level.
In discussing their NBA prediction system, FiveThirtyEight notes that its forecasts are designed to be probabilistic and calibrated: “When we say a team has a 70 percent chance of winning, we mean that in the long run teams in that situation should win about 70 percent of the time.” They explain that they check calibration by grouping predictions into probability buckets and comparing predicted against realized win rates, and report that their NBA and NFL models are generally well‑calibrated over multi‑season samples, though not perfect on small samples or at extreme probabilities.
After this play, ESPN Analytics estimated the Eagles' win probability at 67%, meaning that we would expect the Eagles to win from this game situation in about two-thirds of comparable games. The most paradoxical and counterintuitive result of our study is that high win probabilities are not uncommon among teams that ultimately lose. In both simulated and real NFL games between evenly matched opponents, the losing team reached a win probability of at least 66–67% in half of all cases. This finding does not mean the probabilities are wrong; rather, it shows why human intuition can misinterpret win probability graphs and view blown leads as something unusual.
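A small simulation illustrates why this result is less paradoxical than it looks. The sketch below is not the study's method. It assumes a hypothetical toy game in which the score margin between two equal teams is a Gaussian random walk, so the exact win probability at every step is known. If win probability behaves like a fair process that starts at 50% and moves continuously, a given team reaches 2/3 with probability (1/2)/(2/3) = 3/4 and then loses one time in three, so one team or the other does this in about 2 × 3/4 × 1/3 = 1/2 of games. Because the simulation checks win probability only at discrete steps, its estimate should land near, and typically a little below, one half.

```python
import math
import random


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loser_peak_win_prob(n_steps, rng):
    """Simulate one toy game and return the highest win probability
    that the eventual loser held at any point before the final step."""
    margin = 0.0           # team A's current lead
    peak_a = peak_b = 0.5  # both teams start at 50%
    for step in range(1, n_steps):
        margin += rng.gauss(0.0, 1.0)
        remaining = n_steps - step
        # The final margin is margin + Normal(0, remaining), so team A
        # wins with probability Phi(margin / sqrt(remaining)).
        wp_a = norm_cdf(margin / math.sqrt(remaining))
        peak_a = max(peak_a, wp_a)
        peak_b = max(peak_b, 1.0 - wp_a)
    margin += rng.gauss(0.0, 1.0)  # the last step decides the game
    return peak_b if margin > 0 else peak_a


def share_of_games_loser_reached(threshold, n_games=2000, n_steps=400, seed=1):
    """Fraction of simulated games in which the loser's peak win
    probability was at least `threshold`."""
    rng = random.Random(seed)
    hits = sum(loser_peak_win_prob(n_steps, rng) >= threshold
               for _ in range(n_games))
    return hits / n_games


share = share_of_games_loser_reached(2 / 3)
print(f"Loser reached a win probability of at least 2/3 in {share:.1%} of games")
```

The step count, game count, and seed are arbitrary choices; a real game model would need scoring dynamics that this toy walk ignores.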
The post argues that win probabilities should be checked against future outcomes and that reasonable probabilities should match observed game results. It recommends model checking and notes that over-precision and lack of updating can make sports win-probability models misleading.
The author compares an independent NBA win-probability model with ESPN’s model using Brier scores and reports that the ESPN model is only slightly worse on average, with Brier scores of 0.166 for ESPN versus 0.162 for the competing model. The post says ESPN’s model appears too reactive early in games, but still broadly tracks actual outcomes.
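For reference, the Brier score used in this comparison is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better and a constant 50% forecast scores 0.25. A minimal sketch follows, using made-up forecasts rather than the post's data; the model names are placeholders.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical forecasts for four games (1 = the team won, 0 = it lost).
outcomes = [1, 0, 1, 1]
model_a = [0.9, 0.2, 0.6, 0.7]
model_b = [0.5, 0.5, 0.5, 0.5]  # an uninformative coin-flip forecast

print(brier_score(model_a, outcomes))
print(brier_score(model_b, outcomes))  # 0.25
```

On this scale, the reported gap between 0.166 and 0.162 is small relative to the 0.25 that a coin-flip forecast would earn.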
This project contains some code for analysis I've been doing on the ESPN college football win probability model. The goal is to scrape ESPN's win probability graphs for college football games, reconstruct the underlying probabilities, and then assess properties such as calibration and sharpness. By comparing predicted win probabilities at various points in games to actual outcomes, we can check whether, for example, situations where ESPN assigns a team a 70% chance of winning do in fact result in wins about 70% of the time.
In probabilistic forecasting, calibration means predicted probabilities match observed frequencies over many cases. A model can be well-calibrated even if it is not the most skillful predictor, and simple logistic models often serve as strong baselines in sports outcome prediction.
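The distinction between calibration and skill can be made concrete with a toy example. The sketch below uses invented data and a hypothetical helper, `brier_skill_score`, which compares a forecaster's Brier score with that of always predicting the overall base rate. Both forecasters are calibrated on this data, but only the second shows any skill.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes):
    """1 - BS / BS_ref, where the reference always forecasts the base rate.
    0 means no better than the base rate; 1 would be a perfect forecast."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("all outcomes are identical; skill is undefined")
    return 1.0 - brier_score(probs, outcomes) / reference


# Invented data: 100 games, 60 of them won by the home team.
outcomes = [1] * 40 + [0] * 10 + [1] * 20 + [0] * 30

# Forecaster 1 always says 60%. The home team wins 60% of the time,
# so the forecast is calibrated, yet it never separates games.
constant = [0.6] * 100

# Forecaster 2 says 80% for the first 50 games (40 wins) and 40% for
# the other 50 (20 wins). Each bucket's win rate matches its forecast.
sharp = [0.8] * 50 + [0.4] * 50

print(brier_skill_score(constant, outcomes))
print(brier_skill_score(sharp, outcomes))
```

This is why a calibration check alone cannot rank models: the comparison against simple logistic baselines described in the sources addresses skill, which is a separate question.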
In this video, we look at ESPN Analytics' Matchup Predictor tool. We'll be comparing the estimates from the Matchup Predictor to see how they compare to sports book odds and to actual game outcomes over a sample of games. The analysis shows that ESPN's matchup predictor is nowhere close to perfect, but it does seem like there is a possibility that it's able to capture true win probabilities reasonably well in aggregate, even if some individual game predictions look strange to fans.
This video discusses using ESPN BPI ratings to build betting and probability tools, but it is not a study of ESPN NBA win-probability calibration. It is only weakly relevant background and does not provide rigorous evidence about calibration.
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Source 1, an arXiv paper dedicated to evaluating ESPN's NBA win probability forecasts, directly concludes that these forecasts are 'generally well-calibrated' and show improved skill over naive baselines. Source 3, from East Carolina University, corroborates this by showing that ESPN's FPI model produces calibration plots closely aligned with the diagonal. Source 2 adds that the authors' best algorithmic models outperform human experts, whose accuracy falls in the 65–68% range. Together, these sources indicate that ESPN-style win probability models are both well-calibrated and superior to typical human intuition.
The Proponent's argument relies on a flawed interpretation of Source 3, which explicitly notes that ESPN's models are biased and systematically underrate underdogs, directly undermining the claim of robust calibration. Furthermore, the Proponent conflates overall prediction accuracy with real-time calibration, ignoring that Source 1 proves ESPN's real-time NBA forecasts fail to show significantly better skill than a simple, baseline logistic regression.
Argument against
Despite claims of general alignment, prominent sports analytics models like ESPN's are fundamentally flawed by systematic biases, such as a persistent tendency to underrate underdogs as detailed in Source 3, and a lack of updating that can make these models highly misleading as noted in Source 7. Furthermore, Source 1 reveals that ESPN's published NBA forecasts fail to show significantly better predictive skill than simple logistic regression models, while Source 8 highlights that ESPN's model is overly reactive early in games and scores slightly worse than a competing independent model.
The Opponent commits a fallacy of equivocation by conflating 'not significantly better than logistic regression' with 'fundamentally flawed'. Source 1 explicitly concludes that ESPN's forecasts are 'generally well-calibrated' and show improved skill over naive baselines, which supports the claim regardless of whether they surpass other baselines. The Opponent's reliance on Source 7's general caution about over-precision and on Source 8's minor Brier score differential (0.166 vs. 0.162) as evidence of fundamental failure is an overstatement: Source 8 itself concedes that ESPN's model 'broadly tracks actual outcomes,' and the modest underdog bias noted in Source 3 does not negate the finding that calibration plots remain closely aligned with the diagonal. All of this still places these models above the 65–68% accuracy range of human experts documented in Source 2.
Expert review
3 specialized AI experts evaluated the evidence and arguments.
Expert 1 — The Logic Examiner
The logical chain from evidence to claim is strong. Source 1 explicitly concludes that ESPN's NBA win probability forecasts are 'generally well-calibrated,' Source 3 shows calibration plots closely aligned with the diagonal for ESPN's FPI model, Source 5 describes FiveThirtyEight's similar models as generally well-calibrated, and Source 2 reports that algorithmic models outperform human experts (65–68% accuracy). Together these support both prongs of the claim: calibration and superiority over human intuition. The Opponent's counterarguments commit the fallacy of equivocation by treating 'not significantly better than logistic regression' as equivalent to 'poorly calibrated,' when calibration and predictive skill relative to baselines are distinct concepts (Source 10 makes this explicit). The modest underdog bias in Source 3 and the minor Brier score differential in Source 8 are acknowledged imperfections that do not logically negate the core finding of reasonable calibration. Source 8 itself concedes that ESPN's model 'broadly tracks actual outcomes,' so the Opponent's rebuttal is an overstatement that does not dismantle the Proponent's logical chain.
Expert 2 — The Context Analyst
The claim is broadly supported by evidence that ESPN-style win probability forecasts are “generally well-calibrated” in aggregate (Source 1) and that similar ESPN models show calibration plots close to the diagonal but with a consistent underdog-underrating bias (Source 3), while additional context indicates these models can look counterintuitive in individual games without being miscalibrated (Source 6). With full context, the models' calibration is real but not perfect (biases, early-game reactivity, and only modest advantage over simple baselines), so the overall impression “generally well-calibrated and better than typical human intuition” is directionally right but somewhat over-smooths important caveats (Sources 1, 3, 8, 2).
Expert 3 — The Source Auditor
High-authority academic sources, including Source 1 (arXiv) and Source 3 (East Carolina University), directly report that ESPN's win probability models are generally well-calibrated and align with observed outcomes. Furthermore, Source 2 (PLOS ONE) reports that its algorithmic models outperform human experts, which supports the comparison to human intuition.