# Scoring rules part 2: Calibration does not imply classification

[This is Part 2 of a three-part series on scoring rules. If you aren’t familiar with scoring rules (and Brier’s quadratic scoring rule in particular), you should read Part 1 before reading this post. If you’d like, you can skip straight to Part 3.]

One of the most important skills of good probabilistic forecasting is calibration. Calibration is the art of not being overconfident or underconfident. You’re calibrated if 10% of the things you assign a 10% chance to actually happen, if 70% of the things you assign a 70% chance to actually happen, and so forth. Early each year, Scott Alexander checks whether his predictions for the previous year were calibrated; I’ve committed to doing the same in early 2021. FiveThirtyEight prides itself on having calibrated forecasts across the board.

Calibration is not enough.

I know almost nothing about basketball, but I could give you perfectly calibrated forecasts for NBA games: I’ll just say 50% no matter what game you ask about. Will the Wizards beat the Rockets? 50% chance. Will the Lakers beat the Knicks? 50%. The events I assign a 50% chance to will happen 50% of the time!

The other part of making useful probabilistic forecasts is expertise, or classification. Classification is the art of discriminating between events, classifying some as likely and others as unlikely. FiveThirtyEight’s NBA forecasts are better than mine not because they’re more calibrated, but because they’re more informed. FiveThirtyEight, unlike me, will tell you that the Wizards will most likely lose to the Rockets, but that the Lakers will probably beat the Knicks.

As the old saying goes, calibration does not imply classification.

The reverse is also true. Maybe you have a lot of basketball knowledge and you know that the Lakers are better than the Knicks, so you say there’s a 99% chance that the Lakers will beat the Knicks. If in practice there’s only an 80% chance — and if you’re consistently overconfident in your predictions — then you’re better than me at classification, but worse at calibration.

So how do we assess whether a forecaster is calibrated and whether a forecaster is informed? If we have a large sample of probabilistic predictions, how can we measure the forecaster’s calibration and classification?

Let’s say, concretely, that a forecaster has made probabilistic predictions for n different events (think of n as being large). And for the sake of simplicity let’s say each probability is a multiple of 1% (so one of 0%, 1%, …, 100%). For each $p \in \{0, .01, \dots, 1\}$, let $n_p$ be the number of events for which the forecaster predicted a probability of p, and let $x_p$ be the number of these events that actually happened.
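To make the bookkeeping concrete, here is a minimal Python sketch of how the counts $n_p$ and $x_p$ could be tallied from a list of forecasts. The function name `tally` and the input format (a list of `(p, happened)` pairs) are assumptions made for illustration, not something defined in this series.

```python
from collections import Counter

def tally(forecasts):
    """Group (p, happened) pairs into buckets.

    Returns {p: (n_p, x_p)}, where n_p is the number of events forecast
    at probability p and x_p is how many of those events happened.
    """
    n = Counter()
    x = Counter()
    for p, happened in forecasts:
        percent = round(100 * p)  # integer key avoids float-equality surprises
        n[percent] += 1
        x[percent] += int(happened)
    return {percent / 100: (n[percent], x[percent]) for percent in n}

# Three forecasts: two at 50% (one happened), one at 90% (it happened).
buckets = tally([(0.5, True), (0.5, False), (0.9, True)])
print(buckets)
```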

So how do we measure calibration? Ideally, for each p, $x_p$ should be roughly a p-fraction of $n_p$, i.e. $x_p \approx pn_p$. So we can measure calibration using the formula $\text{Calib} = \sum_{p \in \{0, .01, \dots, 1\}} \frac{1}{n_p}(pn_p - x_p)^2$, where the sum skips any empty bucket (one with $n_p = 0$).

The closer this number — which we will call the calibration index — is to zero, the more calibrated the forecast. (We divide by $n_p$ so this number scales linearly instead of quadratically with the number of forecasts in bucket p. Otherwise a forecaster can artificially make their calibration score lower (better) by evening out their bucket sizes.)
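As a rough sketch, the calibration index can be computed in a few lines of Python. The function name and the `{p: (n_p, x_p)}` input format are assumptions chosen for this example, and empty buckets are skipped to avoid dividing by zero.

```python
def calibration_index(buckets):
    """Sum over buckets of (p * n_p - x_p)^2 / n_p.

    `buckets` maps each forecast probability p to a pair (n_p, x_p).
    """
    return sum(
        (p * n - x) ** 2 / n
        for p, (n, x) in buckets.items()
        if n > 0  # an empty bucket contributes nothing
    )

# Saying 50% for four events, two of which happen: perfectly calibrated.
print(calibration_index({0.5: (4, 2)}))

# Saying 99% for five events, only four of which happen: overconfident.
print(calibration_index({0.99: (5, 4)}))
```

The first forecaster gets an index of exactly zero, while the overconfident one is penalized for the gap between the 4.95 events they implicitly expected and the 4 that happened.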

And how do we measure classification? A forecast is well-classified if it discriminates well between low- and high-likelihood events, that is, if the probabilities it assigns carry lots of information about whether events are likely to happen. Put another way, a forecast is well-classified if, within each bucket (i.e. each p), either a large majority of the events in the bucket happen or a large majority don’t happen. One way to put a number on this is by taking each bucket, labeling all events that happened with 1 (there are $x_p$ of these) and events that didn’t happen with 0 (there are $n_p - x_p$ of these), and taking the variance of the labels (multiplied by $n_p$, since bigger buckets matter more). The labels in bucket p have mean $\frac{x_p}{n_p}$, so their variance is $\frac{x_p}{n_p}\left(1 - \frac{x_p}{n_p}\right)$, and multiplying by $n_p$ gives us a concrete classification index: $\text{Class} = \sum_{p \in \{0, .01, \dots, 1\}} \frac{x_p(n_p - x_p)}{n_p}$.
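Here is a minimal sketch of the classification index in Python, under the same assumed `{p: (n_p, x_p)}` input format (the function name is hypothetical). Note that the forecast probability p itself never enters the computation: only how cleanly each bucket separates events that happened from events that didn’t.

```python
def classification_index(buckets):
    """Sum over buckets of x_p * (n_p - x_p) / n_p.

    This is n_p times the variance of the 0/1 outcome labels in each bucket.
    """
    return sum(
        x * (n - x) / n
        for _p, (n, x) in buckets.items()
        if n > 0  # an empty bucket contributes nothing
    )

# One 50% bucket that splits evenly: calibrated, but uninformative.
print(classification_index({0.5: (4, 2)}))

# Two buckets that separate outcomes perfectly.
print(classification_index({0.9: (2, 2), 0.1: (2, 0)}))
```

The evenly split bucket scores 1.0 and the perfectly separated buckets score 0.0, matching the intuition that the always-50% forecaster is calibrated but poorly classified.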

Since we care about both calibration and classification, let’s add up these two indices. And remember, the lower these numbers, the better the forecast. $\text{Calib} + \text{Class} = \sum_p \left(\frac{(pn_p - x_p)^2}{n_p} + \frac{x_p(n_p - x_p)}{n_p}\right) = \sum_p \left(p^2 n_p + (1 - 2p)x_p\right)$.
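For readers who want the simplification spelled out, here is the algebra for a single bucket p. The $x_p^2$ terms cancel, and summing the result over all buckets gives the formula above.

$$\frac{(pn_p - x_p)^2}{n_p} + \frac{x_p(n_p - x_p)}{n_p} = \frac{p^2 n_p^2 - 2p n_p x_p + x_p^2 + n_p x_p - x_p^2}{n_p} = p^2 n_p - 2p x_p + x_p = p^2 n_p + (1 - 2p)x_p$$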

Okay, interesting formula. Does it have a nice interpretation? Let’s try rewriting it a little. $\text{Calib} + \text{Class} = \sum_p \left(p^2 n_p + (1 - 2p)x_p\right) = \sum_p \left(p^2 n_p + (1 - 2p + p^2)x_p - p^2 x_p\right) = \sum_p \left(p^2(n_p - x_p) + (1 - p)^2 x_p\right).$

Now remember, $x_p$ is the number of events assigned probability p that happened, and $n_p - x_p$ is the number of events assigned probability p that didn’t happen. So each event that was assigned probability p and came to pass contributes $(1 - p)^2$ to this quantity, and each one that didn’t happen contributes $p^2$.

But if you recall, that is precisely how Brier’s quadratic scoring rule works! If you assign probability p to an event, the quadratic scoring rule penalizes you $(1 - p)^2$ if the event happens and $p^2$ if it doesn’t.
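The identity is easy to check numerically. The sketch below simulates a hypothetical, somewhat overconfident forecaster (the simulation setup, seed, and function names are all assumptions made for illustration) and compares the total quadratic penalty against the sum of the two indices.

```python
import random
from collections import Counter

def indices(forecasts):
    """Return (Calib, Class) for a list of (p, happened) pairs."""
    n, x = Counter(), Counter()
    for p, happened in forecasts:
        k = round(100 * p)  # bucket by integer percent
        n[k] += 1
        x[k] += int(happened)
    calib = sum((k / 100 * n[k] - x[k]) ** 2 / n[k] for k in n)
    classif = sum(x[k] * (n[k] - x[k]) / n[k] for k in n)
    return calib, classif

def quadratic_penalty(forecasts):
    """Total quadratic penalty: (1 - p)^2 if the event happened, else p^2."""
    return sum((1 - p) ** 2 if happened else p ** 2 for p, happened in forecasts)

# Hypothetical forecaster whose stated probabilities are more extreme
# than the true chances (an assumption made purely for this demo).
random.seed(0)
forecasts = []
for _ in range(1000):
    p = random.randint(0, 100) / 100
    true_chance = 0.5 + 0.6 * (p - 0.5)
    forecasts.append((p, random.random() < true_chance))

calib, classif = indices(forecasts)
total = quadratic_penalty(forecasts)
print(calib, classif, calib + classif, total)
```

Up to floating-point rounding, the last two printed numbers agree, whatever forecasts and outcomes are fed in.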

This means that the quadratic scoring rule has a really nice interpretation. A forecaster’s total quadratic penalty is exactly the sum of two metrics: the calibration index and the classification index.

Check out Part 3 of this series, where I discuss my research into which scoring rules are particularly good at incentivizing experts to put effort into their predictions (so as to improve their classification)!