[This is Part 3 of a three-part series on scoring rules. If you aren’t familiar with scoring rules, you should read Part 1 before reading this post. You don’t need to read Part 2, but I think it’s pretty cool.]
In 9th grade I learned the difference between accuracy and precision from a classroom poster. The poster looked something like this:

[Image: a dartboard-style diagram contrasting accuracy (unbiased, but scattered) with precision (tightly clustered, but possibly off-center)]
Accuracy means that you’re unbiased: maybe you’ll never hit the bull’s eye exactly, but you aren’t consistently off in the same direction. Precision means hitting near the same spot (not necessarily the bull’s eye) every time.
Generally speaking, precision without accuracy is pointless. Accuracy without precision… well, it depends. If you’re hunting rabbits, it doesn’t get you very far. If you’re conducting a survey, on the other hand, an accurate (unbiased) estimate is useful even if it’s not precise. Nevertheless, it’s better to be accurate and precise than just accurate.
.
Let’s say you’re forecasting the probability that it will rain one week from today. I don’t know how weather forecasts work, but let’s pretend they work like this: say the true probability that it will rain a week from now is $p$ (you don’t know what $p$ is). You can run a simulation of the coming week’s weather. Each time you run the simulation you change the initial conditions, because you don’t know the precise state of the weather right now. Each simulation you run shows rain with probability $p$ and no rain with probability $1 - p$; the simulations are independent. Your estimate for the probability that it will rain a week from today, then, is just the fraction of your simulations in which it ends up raining.1 You report a probability, and in a week you’re rewarded based on a scoring rule. For example, if the quadratic scoring rule (reward $1 - (1 - x)^2$ if it rains and $1 - x^2$ if it doesn’t, where $x$ is your report) is used and you report a 70% chance of rain, you get a reward of 0.91 if it rains and 0.51 if it doesn’t.
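As a quick sanity check, here’s that rule in code. This is a minimal sketch (the function name is mine), but the payoffs match the 0.91 and 0.51 above:

```python
def quadratic_score(report, it_rained):
    """Quadratic scoring rule: reward 1 - (1 - x)^2 if it rains, 1 - x^2 if not."""
    return 1 - (1 - report)**2 if it_rained else 1 - report**2

print(round(quadratic_score(0.7, True), 2))   # 0.91: you said 70% and it rained
print(round(quadratic_score(0.7, False), 2))  # 0.51: you said 70% and it didn't
```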
If you’re rewarded with a proper scoring rule, you will report honestly; and since on average a fraction $p$ of your simulations will say “rain”, the estimate you report will be an unbiased estimate of the true probability of rain (i.e. its expected value is $p$). That is, you might say something lower than $p$ or you might say something higher than $p$, but on average you’ll say $p$. In this sense, proper scoring rules incentivize accuracy. You may not hit the bull’s eye — you may not say exactly $p$ — but you won’t have a consistent bias in your predictions.
.
In a new research paper — joint work with Georgy Noarov and Matt Weinberg — we asked the following question: how good are scoring rules at incentivizing precision? To be more concrete, let’s consider the simulation above, but now suppose that there’s a cost to running each simulation. Maybe your simulations are expensive because they use a lot of computing power.2 Let’s say you’ve run several simulations so far. There’s a trade-off to running another simulation: on the one hand, your estimate becomes more precise, so your expected reward from the scoring rule gets larger; on the other hand, you have to pay the cost of running an extra simulation. At first you’ll be willing to pay the cost: after all, you learn a lot from the first simulations you run, when you have no idea what the probability of rain is. But after a while, when you already have a pretty good idea of the probability of rain, your expected increase in reward from another simulation falls below the cost of simulation, and you stop simulating.3
So then the question is: which proper scoring rules incentivize a weather forecaster to run the most simulations before returning with an answer?
Let’s formalize this a bit. We’ll say that an expert (i.e. predictor) has a coin which lands heads with probability $p$, where $p$ is drawn uniformly from $[0, 1]$. The expert can flip the coin (this is analogous to running a simulation) as many times as they want, and then outputs their estimate for $p$.4 Each flip costs the expert a small cost $c$. Afterward, the coin is flipped once more, and the expert is rewarded based on the probability they reported and the outcome of the flip.
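Here’s a minimal simulation of this setup. Two details in this sketch are assumptions of mine rather than forced by the setup: the expert updates with the uniform prior (so their estimate after $k$ heads in $n$ flips is $\frac{k+1}{n+2}$; see footnote 4), and they stop greedily the moment one more flip isn’t worth $c$ in expectation. The quadratic rule is used for concreteness, and the names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Expected reward for honestly reporting x under the quadratic rule:
# G(x) = x(1 - (1-x)^2) + (1-x)(1 - x^2) = x^2 - x + 1.
G = lambda x: x**2 - x + 1

def run_expert(p, c):
    """Flip a p-coin while the expected gain in reward from one more
    flip exceeds the per-flip cost c; then report the current estimate."""
    k, n = 0, 0
    while True:
        est   = (k + 1) / (n + 2)   # posterior mean under the uniform prior
        est_h = (k + 2) / (n + 3)   # estimate if the next flip is heads
        est_t = (k + 1) / (n + 3)   # estimate if the next flip is tails
        gain = est * G(est_h) + (1 - est) * G(est_t) - G(est)
        if gain <= c:
            return n, est
        k += rng.random() < p       # pay c and flip once more
        n += 1

flips, estimate = run_expert(p=0.3, c=1e-6)
print(flips, round(estimate, 3))    # roughly 460 flips, estimate near 0.3
```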
.
So we ask: which scoring rules incentivize the expert to flip the coin the most? However, this turns out not to be an interesting question, because it depends on the scale of the scoring rule. Take the aforementioned quadratic scoring rule and multiply it by a million, and suddenly the expert wants to flip the coin way more. This motivates us to normalize scoring rules, and we found that the most natural way of doing so was as follows:
(1) If the expert is perfect, the expected reward received by the expert is 1.5
(2) No matter what $p$ is, if the expert reports $p$ then their expected reward is non-negative, and zero for the worst-case value of $p$.6
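In code, normalizing means replacing $f$ by $af + b$ and choosing $a$ and $b$ to satisfy the two conditions. Since the expert’s expected reward for an honest report then becomes $aG + b$ (where $G$ is defined below), it’s enough to work with $G$. A sketch, with a helper name of my own:

```python
import numpy as np
from scipy.integrate import quad

def normalize(G):
    """Find a, b so that a*G + b integrates to 1 over [0,1] (condition 1)
    and has minimum value 0 (condition 2)."""
    avg = quad(G, 0, 1)[0]
    gmin = min(G(x) for x in np.linspace(0.001, 0.999, 999))
    a = 1.0 / (avg - gmin)
    return a, -a * gmin

# The quadratic rule has G(x) = x^2 - x + 1, which normalizes with a = 12,
# b = -9: the normalized expected reward is 12(x^2 - x + 1) - 9 = 3(2x - 1)^2.
print(normalize(lambda x: x**2 - x + 1))   # ≈ (12, -9)
```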
Thus our new question is: what normalized proper scoring rule incentivizes the expert to flip the most? Or more precisely — so as to capture precision, our true goal — in the limit as $c$ approaches zero, what normalized proper scoring rule minimizes the expected absolute difference between the expert’s prediction and the true probability $p$?
.
If you look at our paper, you’ll find that a lot of it — Appendices A and B — is a detailed analytic argument that proves our result in as much generality as possible (i.e. making as few assumptions as possible about the scoring rule $f$). That said, if you just wanted to skip the rigor and get to the right answer, this is a problem you might be able to get a handle on! Perhaps you’d like to try thinking about it yourself — or if not, keep reading below!
.
.
.
.
.
.
.
The key question we must ask ourselves is: let’s say you’ve flipped the coin $n$ times, $k$ of which have been heads. What is your expected gain in reward from flipping the coin an additional time? We need to figure out when this number falls below $c$.
Let $G(x)$ be the expected reward you get from the scoring rule when you report $x$ (and believe that the probability is $x$). We can express $G$ in terms of the scoring rule’s reward function: if reporting $x$ pays $f(x)$ when the event happens and $f(1-x)$ when it doesn’t, then $G(x) = x f(x) + (1 - x) f(1 - x)$. After flipping $k$ heads out of $n$ coins, your estimate of the probability is $\frac{k}{n}$ (it’s not exactly $\frac{k}{n}$ because the expert takes into account the uniform prior over $p$, but this detail is not crucial). So if you don’t flip the coin anymore, your expected reward is $G\left(\frac{k}{n}\right)$. On the other hand, if you flip the coin again, it comes up heads with probability $\frac{k}{n}$ (in which case your new estimate is $\frac{k+1}{n+1}$), and otherwise comes up tails (in which case your new estimate is $\frac{k}{n+1}$). Thus, your expected increase in reward from flipping an extra time (call it $\Delta$) is

$$\Delta = \frac{k}{n} \cdot G\left(\frac{k+1}{n+1}\right) + \left(1 - \frac{k}{n}\right) \cdot G\left(\frac{k}{n+1}\right) - G\left(\frac{k}{n}\right).$$
A natural thing to do now is to approximate $G\left(\frac{k+1}{n+1}\right)$ in terms of the Taylor expansion of $G$ around $\frac{k}{n}$, and similarly for $G\left(\frac{k}{n+1}\right)$. If you do so, you’ll find that

$$\Delta \approx \frac{\frac{k}{n}\left(1 - \frac{k}{n}\right)}{2n^2} \cdot G''\left(\frac{k}{n}\right).$$
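To see this approximation in action, here’s a quick numerical check (a sketch; I’m using the normalized quadratic rule from above, whose expected reward function is $G(x) = 3(2x-1)^2$, so $G'' = 24$):

```python
G = lambda x: 3 * (2*x - 1)**2    # normalized quadratic rule, G'' = 24

n, k = 1000, 300
x = k / n
exact  = x * G((k+1)/(n+1)) + (1-x) * G(k/(n+1)) - G(x)
approx = x * (1-x) * 24 / (2 * n**2)
print(exact, approx)              # both come out to about 2.5e-6
```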
Assuming $n$ is large enough that $\frac{k}{n} \approx p$, we have that

$$\Delta \approx \frac{p(1-p)}{2n^2} \cdot G''(p).$$
That is, the expected increase in reward is proportional to the second derivative of the expected reward function. Since the expert keeps flipping until the expected increase in reward falls below $c$, the expected number of flips is roughly the value of $n$ that makes $\Delta = c$, i.e.

$$n \approx \sqrt{\frac{p(1-p) \cdot G''(p)}{2c}}.$$
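For concreteness, with the normalized quadratic rule ($G'' = 24$), $p = 0.3$, and a cost of $c = 10^{-5}$ per flip, this formula predicts about 500 flips:

```python
import numpy as np

p, c, G2 = 0.3, 1e-5, 24
print(np.sqrt(p * (1 - p) * G2 / (2 * c)))   # ≈ 502
```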
Now, how far off should we expect an expert to be after this many flips of the coin? Well, the distribution of the expert’s guess after $n$ flips is essentially a normal distribution with mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$. The expected distance from $p$ to a point drawn from this distribution is $\sqrt{\frac{2}{\pi}}$ times the standard deviation, which means that the expected error of the expert is

$$\sqrt{\frac{2}{\pi}} \cdot \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{2}{\pi}} \cdot (2c)^{1/4} \cdot \left(\frac{p(1-p)}{G''(p)}\right)^{1/4}.$$
Finally, since $p$ is drawn uniformly from $[0, 1]$, the overall expected error of the expert is

$$\sqrt{\frac{2}{\pi}} \cdot (2c)^{1/4} \cdot \int_0^1 \left(\frac{p(1-p)}{G''(p)}\right)^{1/4} dp.$$

This means that for fixed, small values of $c$, the expected error of the expert is proportional to

$$\int_0^1 \left(\frac{x(1-x)}{G''(x)}\right)^{1/4} dx.$$

We called this quantity the incentivization index. The smaller this quantity, the better the scoring rule is at incentivizing the expert to give a precise answer.
.
Here are two very natural questions:
(1) How do the incentivization indices of the most frequently used scoring rules (log, quadratic, spherical) compare?
(2) Which proper scoring rule minimizes the incentivization index, i.e. is best at incentivizing precision by our metric? (This was the question we set out to answer.)
The answer to question 1 is: the log scoring rule has incentivization index 0.260. The quadratic rule has index 0.279. The spherical rule has index 0.296. So by our metric, the log scoring rule is the best of the three commonly used rules at incentivizing precision.
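You can reproduce these three numbers with a few lines of numerical integration. A sketch, assuming the symmetric representation used above (reporting $x$ pays $f(x)$ on the event and $f(1-x)$ otherwise); the $G$ and $G''$ formulas below follow from that representation:

```python
import numpy as np
from scipy.integrate import quad

# (G, G'') for each rule, before normalization, where G(x) = x f(x) + (1-x) f(1-x).
rules = {
    'log':       (lambda x: x*np.log(x) + (1-x)*np.log(1-x),
                  lambda x: 1 / (x*(1-x))),
    'quadratic': (lambda x: x**2 - x + 1,
                  lambda x: 2.0),
    'spherical': (lambda x: np.sqrt(x**2 + (1-x)**2),
                  lambda x: (x**2 + (1-x)**2) ** -1.5),
}

for name, (G, G2) in rules.items():
    # Normalize to a*G + b (integral 1, minimum 0). These three rules are
    # symmetric, so the minimum is at x = 1/2; only the scale a affects G''.
    a = 1 / (quad(G, 0, 1)[0] - G(0.5))
    index = quad(lambda x: (x*(1-x) / (a * G2(x)))**0.25, 0, 1)[0]
    print(f'{name}: {index:.3f}')   # log 0.260, quadratic 0.279, spherical 0.296
```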
Before I answer question 2, let me point out that you shouldn’t necessarily expect there to be an answer to this question! Maybe there is no optimal scoring rule — perhaps you can get the incentivization index arbitrarily close to some constant (maybe 0) without ever reaching it.
However, it turns out that question 2 does have an answer! The answer isn’t particularly nice, but sometimes that’s just the way things work out. The precision-optimal proper scoring rule, with an incentivization index of 0.253, is the rule whose expected reward function satisfies

$$G''(x) = C \cdot \frac{(x(1-x))^{1/5}}{\min(x, 1-x)^{8/5}},$$

where $C$ is a normalizing constant. Here’s a graph of the (normalized) logarithmic, quadratic, spherical, and optimal scoring rules:

[Graph: the normalized logarithmic, quadratic, spherical, and optimal scoring rules]
.
A natural follow-up question is: what if you’re instead interested in minimizing the expected squared error of the expert, rather than the expected error? Or for that matter, the expected $b$-th power error? It isn’t hard to generalize the math we did above. The formula for the expected $b$-th power error is

$$M_b \cdot (2c)^{b/4} \cdot \int_0^1 \left(\frac{x(1-x)}{G''(x)}\right)^{b/4} dx,$$

where $M_b$ is the $b$-th (absolute) moment of a standard Gaussian. We can therefore define a generalized version of the incentivization index:

$$\mathrm{Ind}_b(f) = \int_0^1 \left(\frac{x(1-x)}{G''(x)}\right)^{b/4} dx.$$
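In code this is a one-line change from the earlier computation (a sketch; `G2` is the second derivative of the already-normalized $G$):

```python
from scipy.integrate import quad

def index_b(G2, b):
    """Generalized incentivization index of a normalized scoring rule."""
    return quad(lambda x: (x*(1-x) / G2(x)) ** (b/4), 0, 1)[0]

# b = 1 recovers the earlier number for the normalized quadratic rule (G'' = 24):
print(round(index_b(lambda x: 24.0, b=1), 3))   # 0.279
```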
The scoring rule that minimizes the expected $b$-th power of the expert’s error turns out to be the one with

$$G''(x) = C_b \cdot \frac{(x(1-x))^{b/(b+4)}}{\min(x, 1-x)^{8/(b+4)}}$$

(where $C_b$ is a normalizing constant). An interesting particular case is the limit as $b$ approaches infinity. If you want to minimize the expected value of the expert’s error “raised to the infinity-eth power”, that’s equivalent to something like “avoiding large errors at all costs”. That is, this is the scoring rule you’d want to use if you’d rather have the expert always be 1% off than have the expert be exactly right 99% of the time and 1.01% off the remaining 1% of the time. If you take this limit, you’ll find that the optimal rule in this regime is a polynomial:

$$f(x) = \frac{5}{9}\left(48x^4 - 128x^3 + 96x^2 - 11\right).$$

You might be wondering where this polynomial came from. This is the scoring rule such that $G''(x) = x(1-x)$ (times a constant). Recall that the expert’s output is normally distributed around the true probability $p$ with standard deviation proportional to $\left(\frac{p(1-p)}{G''(p)}\right)^{1/4}$. So $G''(x) \propto x(1-x)$ is precisely the scoring rule that makes the expert’s error distributed the same way for all values of $p$. It makes sense that this is the rule that avoids large errors at all costs, because balancing errors across all values of $p$ minimizes the “worst-case error”.
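If you’d like to double-check this property of the polynomial above, a few lines of symbolic algebra (using sympy) confirm that its $G$ has $G''(x)$ proportional to $x(1-x)$ and satisfies both normalization conditions:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.Rational(5, 9) * (48*x**4 - 128*x**3 + 96*x**2 - 11)
G = sp.expand(x * f + (1 - x) * f.subs(x, 1 - x))   # expected reward for honest x

print(sp.simplify(sp.diff(G, x, 2) / (x * (1 - x))))  # a constant: 320/3
print(sp.integrate(G, (x, 0, 1)))                     # 1  (condition 1)
print(G.subs(x, sp.Rational(1, 2)))                   # 0  (condition 2: the minimum)
```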
Another interesting thing to do is to compare how well scoring rules do for different values of $b$. The table below quantifies how good the incentivization index of a given scoring rule is relative to the optimal rule for different values of $b$, and the results are fascinating.7

[Table: how close the log, quadratic, and spherical rules come to the optimal rule, for various values of $b$]
In the first column ($b = 1$) we see what we saw earlier: the log scoring rule is quite good, the quadratic rule is okay, and the spherical rule is not good. But as $b$ increases, the picture changes! For $b$ a bit larger than 1, the logarithmic scoring rule is super impressive — near perfect. Then it starts to get worse, and for moderate values of $b$ the quadratic scoring rule is even more impressive — ridiculously close to perfect. Then the quadratic scoring rule starts doing worse, and for large values of $b$ the spherical scoring rule shines. I have no intuition for why the rules take turns like this, but I think it’s a really cool result.
Here’s this same table, but in chart form, for $b$ between 1 and 200:

[Chart: performance relative to the optimal rule, for $b$ between 1 and 200]
And here’s a zoomed-in chart for $b$ between 1 and 10:

[Chart: performance relative to the optimal rule, for $b$ between 1 and 10]
Pretty cool, isn’t it? If you’re interested in learning more about the details of our research, check out our paper or leave a comment!
.
1. It’s technically not quite that: if you only run one simulation and it says “no rain” then presumably your estimate for the probability of rain isn’t 0. But if the number of simulations is sufficiently large, this is a good approximation.↩
2. I happen to know that weather models take a long time to run. If you go to Tropical Tidbits at the right time, you’ll catch a weather model in the middle of an update. These models update quite slowly — it typically takes a few minutes to simulate a day of weather — and these models are presumably run on clusters of supercomputers.↩
3. You might be wondering whether sometimes it might make sense to take a more long-term strategy: even if simulating once more will in expectation cause the expert’s reward to go up less than the cost of simulation in the short term, maybe it will give the expert an opportunity to realize a larger gain in the future. This is indeed sometimes the case! However, it turns out that all of our results carry through even if the expert pursues the optimal strategy rather than the greedy one.↩
4. If the expert flipped heads $k$ times out of $n$ flips, the expert’s guess for the true probability is $\frac{k+1}{n+2}$ — see this article for an explanation.↩
5. The expected reward of a perfect expert is $\int_0^1 G(p)\,dp$ (in the notation introduced below), so this is equivalent to saying that $\int_0^1 G(p)\,dp = 1$.↩
6. This turns out to be equivalent to saying that $\min_p G(p) = 0$.↩
7. The numbers in the table for a given $b$ and scoring rule $f$ compare the incentivization index of the optimal rule for that $b$ against $\mathrm{Ind}_b(f)$. The better (lower) the incentivization index of $f$, the closer this number is to 1.↩
Shouldn’t the odds of updating after one more flip be based on p, not on the current estimate of p?
Great question — that’s exactly right; the odds of updating are based on the current estimate of p. Eliding many details, this approximation is okay to make in the limit as c goes to zero, because with high probability the expert’s estimate of p will be very close to p. But that’s an incomplete answer and it shouldn’t be obvious that this approximation is in fact okay to make. If you’re interested in digging into the details, check out the paper, particularly Section 3.4.
This was a very interesting read. I understood all of the math transformations/derivations. But I wish you had included an example of what x stands for in a different scenario. Say I estimate a building cost to be 1200 dollars, and it comes out to be 1000: what would x be? My understanding is that x is a random variable of (Real − Predicted)?
Never mind, I had to read this again – x is the probability of occurrence of an event. So in the case where I wanted to score construction estimates from the city contractors, I would have to construct a binary outcome. So for example, I can say x is the probability of the event that the estimated cost is within a 10% range – x then will be provided by the contractor.
Yup! Not sure if you saw Part 1, but if you’re still unclear on what scoring rules are, you might want to check it out: https://ericneyman.wordpress.com/2020/02/14/scoring-rules-part-1-eliciting-truthful-predictions/