Yesterday I submitted for publication a paper I’ve been working on for a long time. The paper was on scoring rules, which I think are really interesting. In this three-part series, I’ll tell you a bit about scoring rules and hopefully convey why I find them so cool. In this post I’ll define scoring rules and tell you why some scoring rules are good and others are bad. In the next post I’ll explain a really cool fact I learned the other day about a particular scoring rule. And in my third post I’ll tell you about my paper.
Let’s say you want to know the likelihood of some future event. Maybe you’re trying to figure out whether your company should expand to China this year, and you’d like to make that assessment based on how likely the coronavirus is to spread a lot more. So you think of paying an epidemiologist to assess the probability that the coronavirus outbreak will get really serious this year (a million coronavirus cases by the end of 2020, say).
Now, a problem with this approach, you realize, is that the epidemiologist isn’t incentivized to tell you their true probability: they can just take your money and give you whatever number they want.
So you come up with a solution: at the end of 2020, you will pay the epidemiologist based on their prediction and what actually ended up happening. In particular, if the epidemiologist tells you that there will be a serious outbreak with probability p, you’ll pay them $1000*p if there ends up being a serious outbreak, and $1000*(1 – p) if there doesn’t. So for instance, if they tell you that there’s a 30% chance of a serious outbreak you’ll pay them $300 if there is one and $700 if there isn’t. This makes intuitive sense: you pay the epidemiologist in proportion to the probability they assigned to the event that ended up happening.
So you do just that: you contact an epidemiologist and offer them this payment scheme. They get back to you with their probability: 0%. “That doesn’t seem quite right,” you think to yourself: “they can’t possibly be that confident that there won’t be a serious outbreak.” So what went wrong?
Well, let’s think about this from the perspective of the epidemiologist. Let’s say the epidemiologist thinks there’s a 30% chance of a serious outbreak. If they tell you this, what’s their expected payout? Well, with probability 30% they’ll get $300 and with probability 70% they’ll get $700 — so, $580 in expectation.
Or, the epidemiologist could lie. They could say that there’s a 0% chance of a serious outbreak. Then — from their perspective — with probability 70% they’ll get $1,000 and with probability 30% they’ll get nothing — so, a $700 expected reward.
Well that explains it then: the epidemiologist lied to you because you incentivized them to do just that. So you just wasted hundreds of dollars and didn’t learn very much.
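To make the arithmetic above concrete, here is a minimal Python sketch of the expected payout under the linear payment scheme. The function name `expected_payout_linear` and the grid of candidate reports are illustrative choices, not part of any library.

```python
def expected_payout_linear(belief: float, report: float, scale: float = 1000.0) -> float:
    """Expected payment when the forecaster believes `belief` but reports `report`.

    Linear scheme: pay scale * report if the outbreak happens,
    and scale * (1 - report) if it does not.
    """
    return belief * scale * report + (1 - belief) * scale * (1 - report)


belief = 0.3
honest = expected_payout_linear(belief, report=0.3)  # tell the truth
lie = expected_payout_linear(belief, report=0.0)     # claim 0%

# Try every report from 0% to 100% and keep the most profitable one.
reports = [i / 100 for i in range(101)]
best_report = max(reports, key=lambda x: expected_payout_linear(belief, x))

print(f"honest report: {honest:.2f}")
print(f"reporting 0%:  {lie:.2f}")
print(f"most profitable report: {best_report:.2f}")
```

Under this scheme the expected payout is linear in the report, so the best report is always an extreme (0% or 100%) whenever the belief is not exactly 50%.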
So what could you have done instead? Maybe you could pay the epidemiologist some other way. Let’s say, in particular, that if they tell you that the probability of a serious outbreak is \(p\), you’ll pay them \(f(p)\) if there’s a serious outbreak and \(f(1 - p)\) if there isn’t one. That is, you’ll pay them f(probability they assigned to the event that ended up happening). In the example above, you used the function \(f(x) = 1000x\).
In this context, such functions f are called scoring rules: that’s because they score and reward predictions. A scoring rule f is called proper if it incentivizes honesty: that is, when you think the probability is p, it’s in your best interest to say that the probability is p.
So, what are some proper scoring rules? Let’s use math to find some!
Let’s think, much as we did earlier, about the epidemiologist’s expected reward if they believe the probability of a serious outbreak is \(p\) but they tell you that the probability is \(x\). Their expected reward (which we will write as \(R(x; p)\)) is
\[ R(x; p) = p\,f(x) + (1 - p)\,f(1 - x), \]
because with probability \(p\) there’s an outbreak, in which case they’ll get reward \(f(x)\), and with probability \(1 - p\) there’s no outbreak, in which case they’ll get reward \(f(1 - x)\).
In order for \(f\) to be proper, \(R(x; p)\) had better be maximized at \(x = p\); otherwise the epidemiologist won’t be incentivized to report their probability truthfully. So, let’s take the derivative of \(R(x; p)\) with respect to \(x\):
\[ \frac{\partial}{\partial x} R(x; p) = p\,f'(x) - (1 - p)\,f'(1 - x). \]
This quantity needs to be zero when \(x = p\), for all \(p\). That is, any (differentiable) proper scoring rule \(f\) must satisfy
\[ p\,f'(p) = (1 - p)\,f'(1 - p) \tag{1} \]
for all \(p\). And now we see that \(f(x) = 1000x\), the function you used to reward the epidemiologist, isn’t so great: \(p\,f'(p) = 1000p\) but \((1 - p)\,f'(1 - p) = 1000(1 - p)\), and the two agree only when \(p = 1/2\).
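The first-order condition can also be checked numerically. The sketch below assumes the expected reward has the form p·f(x) + (1 − p)·f(1 − x) and estimates its slope at the honest report x = p with a central difference; for a proper rule that slope should be zero. The helper names `expected_reward` and `slope_at_truth` are hypothetical.

```python
from typing import Callable


def expected_reward(f: Callable[[float], float], p: float, x: float) -> float:
    """Expected reward for believing p and reporting x under scoring rule f."""
    return p * f(x) + (1 - p) * f(1 - x)


def slope_at_truth(f: Callable[[float], float], p: float, h: float = 1e-6) -> float:
    """Central-difference estimate of d/dx expected_reward(f, p, x) at x = p."""
    return (expected_reward(f, p, p + h) - expected_reward(f, p, p - h)) / (2 * h)


def linear(x: float) -> float:
    """The linear payment rule from the example: f(x) = 1000x."""
    return 1000 * x


slope_30 = slope_at_truth(linear, 0.3)
slope_50 = slope_at_truth(linear, 0.5)

# A nonzero slope at x = p means the forecaster gains by shading the report.
print(f"slope at p = 0.3: {slope_30:.1f}")
print(f"slope at p = 0.5: {slope_50:.1f}")
```

For the linear rule the slope is negative at p = 0.3, which matches the incentive to report a lower number than the true belief.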
So, what functions can we think of that satisfy Equation 1? Think for a moment if you can come up with any, then read on.
First, \(f'(x) = c(1 - x)\) satisfies Equation 1 (but \(c\) needs to be positive for the extremum at \(x = p\) to be a maximum). This corresponds to Brier’s quadratic scoring rule, \(f(x) = -(1 - x)^2\), which meteorologist Glenn Brier came up with in 1950.
The quadratic scoring rule can be thought of as a penalty based on the squared distance between the prediction and what ends up happening (i.e. 0 if the event doesn’t happen and 1 if it does). So if you predict 30%, you get penalized \(0.3^2 = 0.09\) if the event doesn’t happen and \(0.7^2 = 0.49\) if it happens.
The quadratic scoring rule is one way to reward the epidemiologist in a way that makes them answer honestly. So for instance, you could give them \(1000\,(1 - (1 - x)^2)\) dollars, where \(x\) is the probability the epidemiologist assigns to the event that ends up happening.
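Here is a small Python sketch of the quadratic rule, assuming it takes the form f(x) = −(1 − x)², where x is the probability assigned to the outcome that actually happened. It reproduces the penalties for a 30% forecast and then searches a grid of reports to see which one maximizes expected reward for a forecaster who believes 30%. The function names are illustrative.

```python
def quadratic_score(x: float) -> float:
    """Quadratic (Brier-style) score for assigning probability x to the realized outcome."""
    return -(1 - x) ** 2


def expected_reward(p: float, x: float) -> float:
    """Expected quadratic score for believing p and reporting x."""
    return p * quadratic_score(x) + (1 - p) * quadratic_score(1 - x)


# Forecast of 30% for the event:
penalty_if_no_event = -quadratic_score(0.7)  # 70% was assigned to "no event"
penalty_if_event = -quadratic_score(0.3)     # 30% was assigned to "event"

# Which report maximizes expected reward when the true belief is 30%?
belief = 0.3
reports = [i / 100 for i in range(101)]
best_report = max(reports, key=lambda x: expected_reward(belief, x))

print(f"penalty if the event does not happen: {penalty_if_no_event:.2f}")
print(f"penalty if the event happens:         {penalty_if_event:.2f}")
print(f"best report for a 30% belief:         {best_report:.2f}")
```

Unlike the linear scheme, the grid search lands on the honest report.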
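The other really simple solution to Equation 1 is \(f'(x) = 1/x\). This corresponds to the logarithmic scoring rule, \(f(x) = \ln x\). So another way you could incentivize the epidemiologist is to give them \(a + b \ln x\) dollars, for constants \(a\) and \(b > 0\) of your choosing.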
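The same kind of check works for the logarithmic rule. The sketch below assumes f(x) = ln x and confirms, for a few example beliefs, that the honest report maximizes expected reward over a grid of reports strictly between 0 and 1. It also shows how steeply the score falls when the realized outcome was given a tiny probability. The beliefs and grid are arbitrary illustrative choices.

```python
import math


def log_score(x: float) -> float:
    """Logarithmic score for assigning probability x to the realized outcome."""
    return math.log(x)


def expected_reward(p: float, x: float) -> float:
    """Expected log score for believing p and reporting x."""
    return p * log_score(x) + (1 - p) * log_score(1 - x)


# Reports of exactly 0 or 1 are excluded because ln(0) is undefined.
reports = [i / 100 for i in range(1, 100)]
beliefs = [0.1, 0.3, 0.5, 0.9]

best_reports = {
    p: max(reports, key=lambda x, p=p: expected_reward(p, x)) for p in beliefs
}
for p, best in best_reports.items():
    print(f"belief {p:.2f} -> best report {best:.2f}")

# The score is unbounded below as the assigned probability approaches 0.
score_for_tiny_probability = log_score(1e-6)
print(f"score for assigning one in a million: {score_for_tiny_probability:.2f}")
```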
The quadratic and logarithmic scoring rules are used very widely. But they aren’t the only proper scoring rules; far from it. In fact, you can take any continuous positive function \(f'\) on the interval \([\tfrac{1}{2}, 1)\) and extend it to \((0, \tfrac{1}{2})\) via Equation 1:
\[ f'(x) = \frac{(1 - x)\,f'(1 - x)}{x}. \]
This is guaranteed to result in a proper scoring rule.
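A rough way to see why this construction works is to build such a derivative in code and look at the slope of the expected reward, p·f′(x) − (1 − p)·f′(1 − x). If the rule is proper, the slope should be positive for reports below the belief and negative for reports above it. In this sketch the starting function g(x) = 1 + x² on [1/2, 1) is an arbitrary example chosen for illustration, and the helper names are hypothetical.

```python
from typing import Callable


def make_fprime(g: Callable[[float], float]) -> Callable[[float], float]:
    """Extend a positive function g on [1/2, 1) to (0, 1) using x f'(x) = (1-x) f'(1-x)."""

    def fprime(x: float) -> float:
        if x >= 0.5:
            return g(x)
        return (1 - x) * g(1 - x) / x

    return fprime


def reward_slope(fprime: Callable[[float], float], p: float, x: float) -> float:
    """Derivative in x of the expected reward p f(x) + (1-p) f(1-x)."""
    return p * fprime(x) - (1 - p) * fprime(1 - x)


def looks_proper(fprime: Callable[[float], float], p: float) -> bool:
    """Check on a grid that expected reward rises up to x = p and falls after it."""
    grid = [i / 100 for i in range(1, 100)]
    rising = all(reward_slope(fprime, p, x) > 0 for x in grid if x < p - 1e-9)
    falling = all(reward_slope(fprime, p, x) < 0 for x in grid if x > p + 1e-9)
    return rising and falling


# Arbitrary continuous positive function on [1/2, 1), for illustration only.
fprime = make_fprime(lambda x: 1 + x ** 2)

results = {p: looks_proper(fprime, p) for p in (0.1, 0.3, 0.5, 0.8)}
print(results)
```

The check never needs f itself, only its derivative, which is why the construction can be stated entirely in terms of f′.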
So, let’s say you want to incentivize the epidemiologist with a proper scoring rule. Which one should you use? Well, it depends on your goals! If you want the epidemiologist to be really good at discriminating between small probabilities, e.g. you really care about the difference between 1% and 2%, then you should probably go with the log scoring rule, since it strongly penalizes assigning a really low probability to the eventual outcome. If you don’t particularly care about the difference between 1% and 2%, maybe you’d prefer to go with the quadratic rule, so the epidemiologist can at least be guaranteed a positive payout.
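To put rough numbers on that trade-off, the sketch below compares how much each rule’s score changes when the probability assigned to the eventual outcome moves from 1% to 2%, and from 50% to 51%. It assumes the forms f(x) = ln x and f(x) = −(1 − x)² with no extra scaling, so the absolute sizes depend on that assumption; the contrast between the two rows is the point.

```python
import math


def log_score(x: float) -> float:
    """Logarithmic score for assigning probability x to the realized outcome."""
    return math.log(x)


def quadratic_score(x: float) -> float:
    """Quadratic score for assigning probability x to the realized outcome."""
    return -(1 - x) ** 2


def score_gap(score, low: float, high: float) -> float:
    """How much better the score is at probability `high` than at `low`."""
    return score(high) - score(low)


log_gap_small = score_gap(log_score, 0.01, 0.02)
quad_gap_small = score_gap(quadratic_score, 0.01, 0.02)
log_gap_mid = score_gap(log_score, 0.50, 0.51)
quad_gap_mid = score_gap(quadratic_score, 0.50, 0.51)

print(f"1% -> 2%:   log gap {log_gap_small:.4f}, quadratic gap {quad_gap_small:.4f}")
print(f"50% -> 51%: log gap {log_gap_mid:.4f}, quadratic gap {quad_gap_mid:.4f}")
```

Under the log rule the gap between 1% and 2% is far larger than the gap between 50% and 51%, while under the quadratic rule the two gaps are of similar size.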
I find myself partial to the log scoring rule, because I think discriminating between 1% and 2% probabilities is a valuable skill. But that doesn’t mean the quadratic scoring rule doesn’t have its merits. Check out Part 2 of this series, where I discuss a really interesting way to think about the quadratic scoring rule. It turns out that this rule neatly captures two different things you might care about when eliciting a prediction: calibration and classification.
Finally, in Part 3 of the series, I talk about my recent research, which dives deeper into the question of “Which scoring rule should you choose?”. In particular, let’s say you really care about the epidemiologist doing as much research as possible before getting back to you. What proper scoring rule most incentivizes the epidemiologist to come up with a precise estimate before getting back to you?