Scoring rules part 3: Incentivizing precision

[This is Part 3 of a three-part series on scoring rules. If you aren’t familiar with scoring rules, you should read Part 1 before reading this post. You don’t need to read Part 2, but I think it’s pretty cool.]

In 9th grade I learned the difference between accuracy and precision from a classroom poster. The poster looked something like this:

[Image: three targets illustrating accuracy vs. precision]

Accuracy means that you’re unbiased: maybe you’ll never hit the bull’s eye exactly, but you aren’t consistently off in the same direction. Precision means hitting near the same spot (not necessarily the bull’s eye) every time.

Generally speaking, precision without accuracy is pointless. Accuracy without precision… well, it depends. If you’re hunting rabbits, it doesn’t get you very far. If you’re conducting a survey, on the other hand, an accurate (unbiased) estimate is useful even if it’s not precise. Nevertheless, it’s better to be accurate and precise than just accurate.

.

Let’s say you’re forecasting the probability that it will rain one week from today. I don’t know how weather forecasts work, but let’s pretend they work like this: say the true probability that it will rain a week from now is p (you don’t know what p is). You can run a simulation of the coming week’s weather. Each time you run the simulation you change the initial conditions, because you don’t know the precise state of the weather right now. Each simulation you run shows rain with probability p and no rain with probability 1 - p; the simulations are independent. Your estimate for the probability that it will rain a week from today, then, is just the fraction of your simulations in which it ends up raining.1 You report a probability, and in a week you’re rewarded based on a scoring rule. For example, if the quadratic scoring rule f(x) = 1 - (1 - x)^2 is used and you report a 70% chance of rain, you get a reward of 0.91 if it rains and 0.51 if it doesn’t.
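If you like, here's that reward computation in a few lines of Python (the function name is my own, not from the paper):

```python
def quadratic_score(report, rained):
    """Reward from the quadratic rule f(x) = 1 - (1 - x)^2, where x is the
    probability the forecaster assigned to the outcome that actually occurred."""
    x = report if rained else 1 - report
    return 1 - (1 - x) ** 2

print(round(quadratic_score(0.7, True), 2))   # 0.91 if it rains
print(round(quadratic_score(0.7, False), 2))  # 0.51 if it doesn't
```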

If you’re rewarded with a proper scoring rule, you will report honestly; and since on average p fraction of your simulations will say “rain”, the estimate you report will be an unbiased estimate of the true probability of rain (i.e. p). That is, you might say something lower than p or you might say something higher than p, but on average you’ll say p. In this sense, proper scoring rules incentivize accuracy. You may not hit the bull’s eye — you may not say exactly p — but you won’t have a consistent bias in your predictions.
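You can see the honesty incentive numerically for the quadratic rule: the expected reward p f(x) + (1 - p) f(1 - x) is maximized by reporting x = p. (The grid search below is just my illustration, not a proof.)

```python
def f(x):  # quadratic scoring rule
    return 1 - (1 - x) ** 2

def expected_reward(report, p):
    """Expected reward when the true probability is p and you report `report`."""
    return p * f(report) + (1 - p) * f(1 - report)

p = 0.7
# Search a fine grid of possible reports for the one with the best expected reward.
best = max((i / 1000 for i in range(1001)), key=lambda x: expected_reward(x, p))
print(best)  # 0.7: the honest report is optimal
```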

.

In a new research paper — joint work with George Noarov and Matt Weinberg — we asked the following question: how good are scoring rules at incentivizing precision? To be more concrete, let’s consider the simulation above, but now suppose that there’s a cost to running each simulation. Maybe your simulations are expensive because they use a lot of computing power.2 Let’s say you’ve run several simulations so far. There’s a trade-off to running another simulation: on the one hand, your estimate becomes more precise, so your expected reward from the scoring rule gets larger; on the other hand, you have to pay the cost of running an extra simulation. At first you’ll be willing to pay the cost: after all, you learn a lot from the first simulations you run, when you have no idea what the probability of rain is. But after a while, when you already have a pretty good idea of the probability of rain, your expected increase in reward from another simulation falls below the cost of simulation, and you stop simulating.3

So then the question is: which proper scoring rules incentivize a weather forecaster to run the most simulations before returning with an answer?

Let’s formalize this a bit. We’ll say that an expert (i.e. predictor) has a coin which lands heads with probability p, where p is drawn uniformly from [0, 1]. The expert can flip the coin (this is analogous to running a simulation) as many times as they want, and then output their estimate for p.4 Each flip costs the expert a small amount c. Afterward, the coin is flipped once more, and the expert is rewarded based on the probability they reported and the outcome of the flip.
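With a uniform prior on p, the expert's best estimate after h heads in n flips is \frac{h + 1}{n + 2} (see footnote 4). Here's a direct numerical sanity check, integrating the binomial likelihood against the uniform prior:

```python
def posterior_mean(h, n, steps=100_000):
    """Posterior mean of p given h heads in n flips, uniform prior on [0, 1]."""
    num = den = 0.0
    for i in range(steps):
        p = (i + 0.5) / steps             # midpoint-rule grid over [0, 1]
        like = p**h * (1 - p) ** (n - h)  # likelihood (binomial coefficient cancels)
        num += p * like
        den += like
    return num / den

print(round(posterior_mean(3, 10), 4))  # 0.3333, i.e. (3 + 1) / (10 + 2)
```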

.

So we ask: which scoring rules incentivize the expert to flip the coin the most? However, this turns out not to be an interesting question, because it depends on the scale of the scoring rule. Take the aforementioned quadratic scoring rule and multiply it by a million, and suddenly the expert wants to flip the coin way more. This motivates us to normalize scoring rules, and we found that the most natural way of doing so was as follows:

(1) If the expert is perfect, the expected reward received by the expert is 1.5

(2) No matter what p is, if the expert reports p then their expected reward is non-negative, and zero for the worst-case value of p.6
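To make this concrete: shifting and scaling the quadratic rule to meet both conditions gives f(x) = 12(1/4 - (1 - x)^2). (The constant 12 is my own arithmetic; the check below verifies it.)

```python
def f(x):  # quadratic rule shifted and scaled: f(x) = 12(1/4 - (1 - x)^2)
    return 12 * (0.25 - (1 - x) ** 2)

def R(x):  # expected reward of a perfect expert when the truth is x
    return x * f(x) + (1 - x) * f(1 - x)

# Condition (1): the integral of R over [0, 1] should be 1 (midpoint rule).
steps = 100_000
integral = sum(R((i + 0.5) / steps) for i in range(steps)) / steps
print(round(integral, 6))  # 1.0

# Condition (2): f(1/2) = 0, and R is non-negative with minimum 0 at x = 1/2.
print(f(0.5))                                # 0.0
print(min(R(i / 1000) for i in range(1001)))  # 0.0
```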

Thus our new question is: what normalized proper scoring rule incentivizes the expert to flip the most? Or more precisely — so as to capture precision, our true goal — in the limit as c approaches zero, what normalized proper scoring rule minimizes the expected absolute difference between the expert’s prediction and the true probability p?

.

If you look at our paper, you’ll find that a lot of it — Appendices A and B — is a detailed analytic argument that proves our result in as much generality as possible (i.e. making as few assumptions as possible about the scoring rule f). That said, if you just wanted to skip the rigor and get to the right answer, this is a problem you might be able to get a handle on! Perhaps you’d like to try thinking about it yourself — or if not, keep reading below!

.

.

.

.

.

.

.

The key question we must ask ourselves is the following: let’s say you’ve flipped the coin n times, h of which have been heads. What is your expected increase in reward from flipping the coin an additional time? We need to figure out when this quantity falls below c.

Let R(x) be the expected reward you get from the scoring rule when you report x (and believe that the probability is x). We can express R in terms of f: in particular, R(x) = x f(x) + (1 - x) f(1 - x). After flipping h heads out of n flips, your estimate of the probability is \frac{h + 1}{n + 2} (it’s not exactly \frac{h}{n} because the expert takes into account the uniform prior over p, but this detail is not crucial). So if you don’t flip the coin anymore, your expected reward is R \left( \frac{h + 1}{n + 2} \right). On the other hand, if you flip the coin again, it comes up heads with probability \frac{h + 1}{n + 2} (in which case your new estimate is \frac{h + 2}{n + 3}), and otherwise comes up tails (in which case your new estimate is \frac{h + 1}{n + 3}). Thus, your expected increase in reward from flipping an extra time (call it \Delta) is

\Delta = \frac{h + 1}{n + 2} R \left( \frac{h + 2}{n + 3} \right) + \frac{n - h + 1}{n + 2} R \left( \frac{h + 1}{n + 3} \right) - R \left( \frac{h + 1}{n + 2} \right).

A natural thing to do now is to approximate R \left( \frac{h + 2}{n + 3} \right) in terms of the Taylor expansion of R around \frac{h + 1}{n + 2}, and similarly for R \left( \frac{h + 1}{n + 3} \right). If you do so, you’ll find that

\Delta \approx \frac{h(n - h)}{2n^4} R'' \left( \frac{h + 1}{n + 2} \right).
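You can check this approximation numerically. Below I compare the exact \Delta with the approximation, using the normalized quadratic rule as a stand-in (so R(x) = 12(1/4 - x(1 - x)) and R'' = 24; the choice of rule and the specific h, n are mine):

```python
def R(x):  # normalized quadratic rule, so R''(x) = 24
    return 12 * (0.25 - x * (1 - x))

def delta_exact(h, n):
    """Exact expected gain in reward from one more flip."""
    q = (h + 1) / (n + 2)  # current estimate of the heads probability
    return q * R((h + 2) / (n + 3)) + (1 - q) * R((h + 1) / (n + 3)) - R(q)

def delta_approx(h, n, R2=24.0):
    """Taylor approximation: h(n - h) / (2 n^4) * R''."""
    return h * (n - h) / (2 * n**4) * R2

print(delta_exact(300, 1000))   # ≈ 2.51e-06
print(delta_approx(300, 1000))  # ≈ 2.52e-06
```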

Assuming n is large enough that \frac{h}{n} \approx p, we have that

\Delta \approx \frac{p(1 - p) R''(p)}{2n^2}.

That is, the expected increase in reward is proportional to the second derivative of the expected reward function. Since the expert keeps flipping until the expected increase in reward \Delta falls below c, the expected number of flips is roughly the value of n that makes \frac{p(1 - p) R''(p)}{2n^2} = c, i.e.

n \approx \sqrt{\frac{p(1 - p)R''(p)}{2c}}.
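To see this formula in action, you can simulate a greedy expert directly. With the normalized quadratic rule (R'' = 24) and c = 10^{-7}, the formula predicts about \sqrt{0.25 \cdot 24 / (2 \cdot 10^{-7})} \approx 5477 flips when p = 1/2. (The parameter choices here are mine, just for illustration.)

```python
import random

def R(x):  # normalized quadratic rule: R(x) = 12(1/4 - x(1 - x)), so R'' = 24
    return 12 * (0.25 - x * (1 - x))

def delta(h, n):
    """Exact expected gain in reward from one more flip, given h heads in n flips."""
    q = (h + 1) / (n + 2)  # current estimate of p
    return q * R((h + 2) / (n + 3)) + (1 - q) * R((h + 1) / (n + 3)) - R(q)

def greedy_flips(p, c, rng):
    """Flip while the expected gain from another flip exceeds the cost c."""
    h = n = 0
    while delta(h, n) > c:
        h += rng.random() < p  # True/False adds 1/0
        n += 1
    return n

flips = greedy_flips(0.5, 1e-7, random.Random(0))
print(flips)  # close to the predicted ≈ 5477
```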

Now, how far off should we expect an expert to be after this many flips of the coin? Well, the distribution of the expert’s guess after n flips is essentially a normal distribution with mean p and standard deviation \sqrt{\frac{p(1 - p)}{n}}. The expected distance from p to a point drawn from this distribution is \sqrt{\frac{2p(1 - p)}{\pi n}}, which means that the expected error of the expert is

\mathbb{E}[\text{Err}_c(p)] \approx \sqrt[4]{\frac{8c p(1 - p)}{\pi^2 R''(p)}}.

Finally, since p is drawn uniformly from [0, 1], the overall expected error of the expert is

\mathbb{E}[\text{Err}_c] \approx \int_0^1 \sqrt[4]{\frac{8c x(1 - x)}{\pi^2 R''(x)}} dx.

This means that for fixed, small values of c, the expected error of the expert is proportional to

\text{Ind}(f) := \int_0^1 \sqrt[4]{\frac{x(1 - x)}{R''(x)}} dx.

We called this quantity the incentivization index. The smaller this quantity, the better the scoring rule f is at incentivizing the expert to give a precise answer.

.

Here are two very natural questions:

(1) How do the incentivization indices of the most frequently used scoring rules (log, quadratic, spherical) compare?

(2) Which proper scoring rule minimizes the incentivization index, i.e. is best at incentivizing precision by our metric? (This was the question we set out to answer.)

The answer to question 1 is: the log scoring rule has incentivization index 0.260. The quadratic rule has index 0.279. The spherical rule has index 0.296. So by our metric, the log scoring rule is the best of the three commonly used rules at incentivizing precision.
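These three numbers can be reproduced by direct numerical integration. In the sketch below, each rule is shifted so that f(1/2) = 0 and scaled so that \int_0^1 R(x) dx = 1, per the normalization above; the closed forms for R and R'' in the comments are my own derivations, so treat them as such.

```python
import math

def integrate(g, n=100_000):
    """Midpoint-rule integral of g over [0, 1]."""
    return sum(g((i + 0.5) / n) for i in range(n)) / n

def index(R, R2):
    """Incentivization index, after scaling the rule so that ∫R = 1.
    Scaling R by alpha scales the index by alpha^(-1/4)."""
    alpha = 1.0 / integrate(R)
    return integrate(lambda x: (x * (1 - x) / (alpha * R2(x))) ** 0.25)

# log rule f(x) = ln(2x): R(x) = ln 2 + x ln x + (1-x) ln(1-x), R''(x) = 1/(x(1-x))
ind_log = index(lambda x: math.log(2) + x * math.log(x) + (1 - x) * math.log(1 - x),
                lambda x: 1 / (x * (1 - x)))

# quadratic rule f(x) = 1/4 - (1-x)^2: R(x) = 1/4 - x(1-x), R''(x) = 2
ind_quad = index(lambda x: 0.25 - x * (1 - x), lambda x: 2.0)

# spherical rule f(x) = x/sqrt(x^2 + (1-x)^2) - 1/sqrt(2):
#   R(x) = sqrt(2x^2 - 2x + 1) - 1/sqrt(2), R''(x) = (2x^2 - 2x + 1)^(-3/2)
ind_sph = index(lambda x: math.sqrt(2 * x * x - 2 * x + 1) - 1 / math.sqrt(2),
                lambda x: (2 * x * x - 2 * x + 1) ** -1.5)

print(f"{ind_log:.3f} {ind_quad:.3f} {ind_sph:.3f}")  # 0.260 0.279 0.296
```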

Before I answer question 2, let me point out that you shouldn’t necessarily expect there to be an answer to this question! Maybe there is no optimal scoring rule — perhaps you can get the incentivization index arbitrarily close to some constant (maybe 0) without ever reaching it.

However, it turns out that question 2 does have an answer! The answer isn’t particularly nice, but sometimes that’s just the way things work out. The precision-optimal proper scoring rule, with an incentivization index of 0.253, is

f_\text{OPT}(x) = \begin{cases}\kappa \int_{\frac{1}{2}}^x (t^{-7} (1 - t)^{6})^{1/5} dt & x \le \frac{1}{2} \\\kappa \int_{\frac{1}{2}}^x (t(1 - t)^{-2})^{1/5} dt & x \ge \frac{1}{2} \end{cases}

where \kappa is a normalizing constant. Here’s a graph of the (normalized) logarithmic, quadratic, spherical, and optimal scoring rules:

[Graph: the normalized logarithmic, quadratic, spherical, and optimal scoring rules]

.

A natural follow-up question is: what if you’re instead interested in minimizing the expected squared error of the expert, rather than the expected error? Or for that matter, the expected \ell-th power error? It isn’t hard to generalize the math we did above. The formula for the expected \ell-th power error is

\mathbb{E}[\text{Err}_c^\ell] \approx \mu_\ell \int_0^1 \left( \frac{2c x(1 - x)}{R''(x)} \right)^{\ell/4} dx

where \mu_\ell := \frac{2^{\ell/2} \Gamma \left( \frac{\ell + 1}{2} \right)}{\sqrt{\pi}} is the \ell-th moment of a standard Gaussian. We can therefore define a generalized version of the incentivization index:

\text{Ind}^\ell(f) := \int_0^1 \left( {\frac{x(1 - x)}{R''(x)}} \right)^{\ell/4} dx.

The scoring rule that minimizes the expected \ell-th power of the expert’s error turns out to be

f^\ell_\text{OPT}(x) = \begin{cases}\kappa_\ell \int_{\frac{1}{2}}^x (t^{\ell - 8} (1 - t)^{2\ell + 4})^{1/(\ell + 4)} dt & x \le \frac{1}{2} \\\kappa_\ell \int_{\frac{1}{2}}^x (t^\ell (1 - t)^{2\ell - 4})^{1/(\ell + 4)} dt & x \ge \frac{1}{2} \end{cases}

(where \kappa_\ell is a normalizing constant). An interesting particular case is the limit as \ell approaches infinity. If you want to minimize the expected value of the expert’s error “raised to the infinity-eth power”, that’s equivalent to something like “avoiding large errors at all costs”. That is, this is the scoring rule you’d want to use if you’d rather have the expert always be 1% off than have the expert be exactly right 99% of the time and 1.01% off the remaining 1% of the time. If you take this limit, you’ll find that the optimal rule in this regime is a polynomial:

f^\infty_\text{OPT}(x) = \frac{5}{9}(48x^4 - 128x^3 + 96x^2 - 11).

[Graph: the optimal scoring rule f^\infty_\text{OPT} in the \ell \to \infty limit]

You might be wondering where this polynomial came from. This f is the scoring rule such that R''(x) = x(1 - x) (times a constant). Recall that the expert’s output is normally distributed around the true probability p with standard deviation proportional to \sqrt[4]{\frac{p(1 - p)}{R''(p)}}. So f^\infty_\text{OPT} is precisely the scoring rule that makes the expert’s error distributed the same way for all values of p. It makes sense that this is the rule that avoids large errors at all costs, because balancing errors across all values of p minimizes the “worst-case error”.
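You can check this claim numerically without any calculus, by taking a finite-difference second derivative of R (the step size and test points below are arbitrary choices of mine):

```python
def f(x):  # the l -> infinity optimal rule from the post
    return (5 / 9) * (48 * x**4 - 128 * x**3 + 96 * x**2 - 11)

def R(x):  # expected reward for a truthful report of x
    return x * f(x) + (1 - x) * f(1 - x)

def R2(x, h=1e-4):  # second derivative of R via central finite differences
    return (R(x + h) - 2 * R(x) + R(x - h)) / h**2

# If R''(x) is proportional to x(1 - x), this ratio is the same everywhere:
for x in (0.1, 0.3, 0.5, 0.7):
    print(round(R2(x) / (x * (1 - x)), 2))  # ≈ 106.67 each time
```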

Another interesting thing to do is to compare how well scoring rules do for different values of \ell. The table below quantifies how good the incentivization index of a given scoring rule is relative to the optimal rule for different values of \ell, and the results are fascinating.7

[Table: how close each of the log, quadratic, and spherical rules comes to the optimal incentivization index, for various values of \ell]

In the first column (\ell = 1) we see what we saw earlier: the log scoring rule is quite good, the quadratic rule is okay, and the spherical rule is not good. But as \ell increases, the picture changes! Around \ell = 4, the logarithmic scoring rule is super impressive — near perfect. Then it starts to get worse, and around \ell = 16 the quadratic scoring rule is even more impressive — ridiculously close to perfect. Then the quadratic scoring rule starts doing worse, and around \ell = 100 the spherical scoring rule shines. I have no intuition for why this happens, but I think it’s a really cool result.

Here’s this same table, but in chart form, for \ell between 1 and 200:

[Chart: the same comparison for \ell between 1 and 200]

And here’s a zoomed-in chart for \ell between 1 and 10:

[Chart: the same comparison, zoomed in to \ell between 1 and 10]

Pretty cool, isn’t it? If you’re interested in learning more about the details of our research, check out our paper or leave a comment!

.

1. It’s technically not quite that: if you only run one simulation and it says “no rain” then presumably your estimate for the probability of rain isn’t 0. But if the number of simulations is sufficiently large, this is a good approximation.

2. I happen to know that weather models take a long time to run. If you go to Tropical Tidbits at the right time, you’ll catch a weather model in the middle of an update. These models update quite slowly — it typically takes a few minutes to simulate a day of weather — and they’re presumably run on supercomputer clusters.

3. You might be wondering whether sometimes it might make sense to take a more long-term strategy: even if simulating once more will in expectation cause the expert’s reward to go up less than the cost of simulation in the short term, maybe it will give the expert an opportunity to realize a larger gain in the future. This is indeed sometimes the case! However, it turns out that all of our results carry through even if the expert pursues the optimal strategy rather than the greedy one.

4. If the expert flipped heads h times out of n flips, the expert’s guess for the true probability is \frac{h + 1}{n + 2} — see this article for an explanation.

5. The expected reward of a perfect expert is p f(p) + (1 - p) f(1 - p), so this is equivalent to saying that \int_0^1 (x f(x) + (1 - x) f(1 - x)) dx = 1.

6. This turns out to be equivalent to saying that f \left( \frac{1}{2} \right) = 0.

7. The numbers in the table for a given \ell and scoring rule f are given by the formula

\left( \frac{\text{Ind}^\ell(f^\ell_{\text{OPT}})}{\text{Ind}^\ell(f)} \right)^{1/\ell}.

The better (lower) the incentivization index of f, the closer this number is to 1.
