Was Nate Silver’s model wrong?

Nate Silver’s model at FiveThirtyEight gave Biden an 89% chance to win the presidential election. He gave Democrats a 75% chance of taking back the Senate and a 97% chance of keeping the House.

Then the election happened. Biden won, though by a somewhat smaller margin than the model expected: Trump’s 232 electoral votes were a 74th percentile outcome for him. If I had to guess right now, I’d say Republicans are slightly favored to hold the Senate (Democrats would need to win both runoff races in Georgia to have a 50-50 Senate). And while Democrats kept the House, they did so with just a bare majority: 222 of the 435 seats. The FiveThirtyEight model gave Democrats just an 8% chance of doing this poorly.

The model overestimated Democrats’ performance across the board. Hence the question in the title: was Nate Silver’s model wrong?

I should first clarify: what does it even mean for a model to be right or wrong? A model being right isn’t the same thing as the model correctly predicting the eventual outcome. For example, suppose a model is supposed to predict the outcome of a coin flip. It predicts 50% heads, 50% tails. Then the coin lands tails. You wouldn’t call the model wrong just because it failed to predict “tails”. Conversely, models can get the right answer without necessarily being good models.

So here’s my proposed definition: a model is wrong if it can be improved. If it seems like by this definition every real-world model is wrong, well… yeah: as the saying goes, “all models are wrong”. The other complaint you might have about this definition is that it’s a bit circular: doesn’t “improved” just mean “made less wrong”? So, here’s what I mean by “improved”: a model can be improved if it is possible to make money in the long run by making bets against the model. Here are some examples of what I mean:

  1. A model that predicts that a fair coin will come up heads with probability 50% is not wrong because it’s impossible to make money by betting against it. A model that predicts that a fair coin will come up heads with probability 90% is very wrong because there’s a really simple betting strategy that will make you lots of money: betting on tails with 9:1 odds in your favor. The improvement you could make to the model that corresponds to this bet is lowering the probability down from 90%.
  2. A model that predicts how many times Trump will tweet tomorrow by predicting the average number of times he has tweeted daily over the past month is somewhat wrong, because it’s possible to do more research into Trump’s schedule and make bets against the model based on whether Trump is busy tomorrow. You could probably make a lot of money that way, but you have to put in a lot of effort in order to do so. The corresponding improvement to the model would be to incorporate information about Trump’s schedule into the model’s prediction.
  3. A model that predicts the value of the S&P 500 stock market index tomorrow by reporting today’s value is somewhat wrong because you can make money in the long term betting that the price will go up. This is an easy strategy to execute, but your profit accumulation will be slow and inconsistent. A corresponding improvement is to increase the model’s prediction by the historical average daily increase. This new model is still slightly wrong because you could make money betting against it by using more sophisticated knowledge, such as how interest rates affect this rate of increase.
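To put a number on the coin example in item 1, here is a minimal sketch. It assumes the bettor stakes one unit at exactly the odds the model considers fair; the function name `expected_profit_per_unit` is made up for this illustration.

```python
def expected_profit_per_unit(model_prob, true_prob):
    """Expected profit from staking 1 unit on an event at the odds the
    model considers fair, when the event really happens with true_prob."""
    # At fair odds for probability p, a 1-unit stake wins (1 - p) / p units.
    net_win = (1 - model_prob) / model_prob
    return true_prob * net_win - (1 - true_prob)


# The model says heads 90%, so it prices tails at 10%; the coin is actually fair.
bad_model_edge = expected_profit_per_unit(model_prob=0.10, true_prob=0.50)

# The model says tails 50% and the coin is fair: nothing to exploit.
fair_model_edge = expected_profit_per_unit(model_prob=0.50, true_prob=0.50)

print(round(bad_model_edge, 6), round(fair_model_edge, 6))
```

Under these assumptions, betting on tails against the 90%-heads model earns about 4 units per unit staked in expectation, while the bet against the 50% model earns nothing.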

To summarize, here’s what I mean by a model’s wrongness: How wrong a model is is determined by (a) how much money you can make betting against it and (b) how easy this is to do.

Let’s get back to Nate Silver. We’ve already established that all models are wrong, but was his model very wrong, somewhat wrong, or only slightly wrong? If Nate Silver were willing to make bets based on his model’s outputs, would you easily be able to make tons of money off him, or would you only be able to take a little of his money, and only if you did a lot of research?

This seems hard to answer, but let’s discuss some simple ways that you could try to make money off of the FiveThirtyEight model.

(A) Does the model consistently overestimate either Democrats or Republicans? (Could you make money betting at the model’s odds in favor of either Republicans or Democrats?)

No, or at least not obviously. Historically the FiveThirtyEight model has underestimated Democrats and Republicans about equally. Really, this is a matter of polling bias: the model is based on polls, and polls have not historically been consistently better for either side. Judge for yourself (source):

(B) Are the model’s predictions consistently over- or underconfident?

An example of overconfidence is if, in the long run, events that the model assigns a 90% chance to happen only 80% of the time. The reverse of this is underconfidence. If a model is overconfident, you can make money betting on the underdog at the model’s odds; if it’s underconfident, you can make money betting on the favorite.
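A calibration check of this kind can be sketched in a few lines. This is only an illustration: the data is made up to match the 90%-versus-80% example, and `calibration_table` is a hypothetical helper, not anything FiveThirtyEight publishes.

```python
def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into probability buckets and compare the average
    forecast in each bucket with how often the event actually happened."""
    buckets = [[] for _ in range(n_bins)]
    for prob, happened in zip(forecasts, outcomes):
        # A forecast of exactly 1.0 is placed in the top bucket.
        index = min(int(prob * n_bins), n_bins - 1)
        buckets[index].append((prob, happened))

    rows = []
    for bucket in buckets:
        if not bucket:
            continue
        mean_forecast = sum(prob for prob, _ in bucket) / len(bucket)
        hit_rate = sum(happened for _, happened in bucket) / len(bucket)
        rows.append((mean_forecast, hit_rate, len(bucket)))
    return rows


# Made-up data: ten events forecast at 90%, of which eight happened.
forecasts = [0.9] * 10
outcomes = [1] * 8 + [0] * 2

for mean_forecast, hit_rate, count in calibration_table(forecasts, outcomes):
    # This labeling only makes sense for favorites (forecast above 50%).
    label = "overconfident" if hit_rate < mean_forecast else "not overconfident"
    print(f"forecast {mean_forecast:.2f}  observed {hit_rate:.2f}  n={count}  {label}")
```

With real forecasts you would want far more than ten events per bucket before reading anything into the gap between the two columns.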

Are FiveThirtyEight’s political predictions consistently over- or underconfident? Take a look for yourself here! As an example, here’s a calibration plot for their presidential forecasts (not just on election day) in 2008, 2012, and 2016.

Naïvely this plot suggests that the model is underconfident. For example, among the times that they have assigned a ~90% chance to a presidential candidate winning a state, that candidate won the state 95% of the time. However, I believe that this is totally reasonable, and even correct. That’s because outcomes in an election are very highly correlated. You shouldn’t expect one out of ten 10%-likely events to happen in every election; you should expect that almost no such events happen in most elections, but that many such events happen in elections with particularly large polling biases or late shifts in the race. The 2008, 2012, and 2016 presidential races were not volatile by historical standards and featured small- to medium-sized polling errors, so you should expect a well-calibrated model to have looked underconfident in those elections.

(By the way, that’s why the error bars in the plot are so large: when evaluating FiveThirtyEight’s political forecasts for accuracy, it makes more sense to think of the sample size as the number of elections (3 in this case) than as the total number of predictions.)
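Here is a small simulation of the correlation argument. Everything in it is an assumption chosen for illustration: a shared national polling error of 3 points, a state-specific error of 2 points, and 20 states per election in which the favorite is exactly 90% likely to win. It is not a description of how FiveThirtyEight's model works.

```python
import random
from statistics import NormalDist

NATIONAL_SD = 3.0    # assumed shared polling error, in points
STATE_SD = 2.0       # assumed state-specific error, in points
N_STATES = 20        # states per election, each with a 90% favorite
N_ELECTIONS = 2000

total_sd = (NATIONAL_SD ** 2 + STATE_SD ** 2) ** 0.5
# Polling lead that makes the favorite exactly 90% likely to win.
lead = NormalDist().inv_cdf(0.90) * total_sd

rng = random.Random(0)
upsets_per_election = []
for _ in range(N_ELECTIONS):
    national_error = rng.gauss(0, NATIONAL_SD)  # the same for every state
    upsets = sum(
        lead + national_error + rng.gauss(0, STATE_SD) < 0
        for _ in range(N_STATES)
    )
    upsets_per_election.append(upsets)

overall_upset_rate = sum(upsets_per_election) / (N_STATES * N_ELECTIONS)
share_with_no_upsets = sum(u == 0 for u in upsets_per_election) / N_ELECTIONS
independent_benchmark = 0.9 ** N_STATES  # chance of no upsets if states were independent

print(f"upset rate across all elections:  {overall_upset_rate:.3f}")
print(f"elections with zero upsets:       {share_with_no_upsets:.3f}")
print(f"same, if states were independent: {independent_benchmark:.3f}")
```

In this setup the forecasts are calibrated across all elections taken together, yet a large share of individual elections have no upsets at all, so a sample of only three elections can easily make a calibrated model look underconfident.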

So, my answer to whether FiveThirtyEight is clearly over- or underconfident is again “no”.

(C) Does the FiveThirtyEight model have momentum or mean reversion?

A trend has momentum if the fact that it has recently gone up (or down) means that on average it will keep going up (or down). A trend has mean reversion if the opposite is true: recent upward (or downward) movement is predictive of future downward (respectively, upward) movement.
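One rough way to look for either property in a series of forecasts is to ask whether each change tends to have the same sign as the change before it. The sketch below uses made-up series and a hypothetical helper, `change_autocorrelation`; it deliberately does not subtract the mean change, on the assumption that a good forecast's changes should average zero.

```python
def change_autocorrelation(series):
    """Lag-1 autocorrelation of successive changes, without mean-centering.
    Positive values suggest momentum, negative values suggest mean reversion."""
    changes = [later - earlier for earlier, later in zip(series, series[1:])]
    denominator = sum(change * change for change in changes)
    if denominator == 0:
        return 0.0  # the series never moved, so there is nothing to measure
    numerator = sum(prev * nxt for prev, nxt in zip(changes, changes[1:]))
    return numerator / denominator


# Made-up forecast paths (win probability in percent).
trending = [50, 52, 54, 56, 58, 60]   # keeps moving the same way
zigzag = [50, 53, 50, 53, 50, 53]     # every move is undone by the next

momentum_score = change_autocorrelation(trending)
reversion_score = change_autocorrelation(zigzag)
print(momentum_score, reversion_score)
```

With only a handful of elections to draw on, a statistic like this would be very noisy, so it is a way to frame the question rather than to settle it.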

A good model should not have momentum or mean reversion, because if it does then you can improve on it by adjusting the forecast upward or downward based on the recent trend (or equivalently, make money betting against the model by forecasting momentum or mean reversion).1 Polls, on the other hand, are snapshots of the current voter mood rather than forecasts, so this is not true of polls. In fact, polls are slightly mean-reverting, but FiveThirtyEight’s forecast accounts for that. It’s really difficult to answer whether the FiveThirtyEight model does a good job of this because the historical sample size is so small, but from the little data we have, I see no reason to think that it does a poor job. I thought their model may have been mean-reverting in 2016:

But I would have lost money betting on mean reversion in 2020:

So my answer here is: the model doesn’t have momentum or mean reversion as far as I can tell, but it would be pretty hard to tell if it did. And to the extent that I understand the inner workings of the model, I don’t see a reason to criticize it on this front.

(D) Should the model rely more heavily on economic fundamentals?

The Economist’s presidential election model put a large weight on economic fundamentals. Before June it relied exclusively on the state of the economy to predict the election, and until the week before the election, economic fundamentals had a higher weight than polls. This is in stark contrast with the FiveThirtyEight model, which treated fundamentals as secondary to polls.

This point of contention between the models led to some heated discussions on Twitter this summer. My tentative opinion is that the Economist model of economic fundamentals is likely overfit, despite their best efforts (but that is a topic for a separate post). It also doesn’t really make sense to me that you would weight fundamentals as 50% of your forecast a week before the election: to whatever extent people are happy or unhappy about the economy, that’s probably already reflected in the polls by then.

On the other hand, the argument that FiveThirtyEight should put less weight on polls and more weight on other factors such as economic fundamentals got a boost from this year’s larger-than-average polling error. There seems to be an emerging consensus that polls systematically missed “low social trust” voters, who disproportionately backed Trump (even controlling for party registration and demographics, as high-quality polls do). If this theory is right, it may be really hard to correct for this bias in the future. And if you expect polls to continue being mediocre, then it makes sense to put more weight on things other than polls in future election cycles.

But I’d want to see more evidence that polls are doomed to be mediocre for the foreseeable future before discounting them any substantial amount. Polls were great in 2018, so it could just be a coincidence that they were off substantially in the same direction in both 2016 and 2020.

So here again, my answer is: not as far as I can tell; it seems to me that FiveThirtyEight was treating polls and fundamentals basically appropriately.

(E) Could you beat the FiveThirtyEight model by betting in the direction of the conventional wisdom?

If anything, the opposite is true: polls might be biased in the direction of the conventional wisdom, so if you want to predict what will happen, you may want to look at what the polls say and shift your expectation a little bit away from the conventional wisdom. For example, in 2016 polls showed a close race (with Clinton slightly favored) but the conventional wisdom was that Clinton would win in a landslide. Instead, Clinton lost: polls were biased toward what most people thought.

I haven’t seen a statistical analysis of this claim about polling bias, though, so I’m not sure it’s robust. This year’s election is some evidence against it, as the conventional wisdom was that the race would be very close despite polls showing Biden far ahead, and polls ended up being biased toward Biden. The hypothesis “polls are biased in the direction of the conventional wisdom” is complex enough that I’d want to see substantial data — rather than just anecdotes — to back it up.

So for now I’ll say “no” to the original question and “not enough evidence to say” to the opposite question (should you bet against the conventional wisdom).

(F) Could you beat the FiveThirtyEight model by betting in the direction suggested by betting markets?

Or to put this another way, could the FiveThirtyEight model be improved by incorporating betting odds into the probabilities?

Betting markets are extremely stupid. As of 12/6, PredictIt thinks that Trump has a 14% (!) chance of winning the electoral college, even though the electoral college votes in eight days and there’s no sign that any (successful) shenanigans will take place.

They also think that Trump has a 10% chance of winning the popular vote in Nevada, Arizona, Wisconsin, Michigan, Pennsylvania, and Georgia. (I hope to have a blog post soon on why exactly this is; the upshot is that it’s caused by artificial limits on market participation due to regulations.) Other prediction markets, such as FTX and Betfair, which have fewer limits but are illegal to use in the U.S., are less crazy but still pretty crazy. It would be really hard to convince me that betting markets are worth incorporating into your model — even a little bit — if they are subject to such insanity.

But while I haven’t seen anyone defend these numbers, I’ve heard an argument that prices on PredictIt are far more trustworthy for markets that aren’t really active. That’s because random people with no expertise trade in the “will Trump win” market, whereas “will a Democrat win [random House race]” is traded disproportionately by people who have some idea of what’s going on.

I don’t think this hypothesis is crazy, but I don’t have a great way of getting historical data to test it. For what it’s worth, FiveThirtyEight would have done worse in 2016 and 2018 if it had shifted its numbers toward PredictIt’s probabilities, even for low-volume markets (though in 2020 it would have done better). Remember that because election results are so correlated, it makes sense to treat each election as one data point. So basically we just don’t have enough evidence either way on this hypothesis.

But again, the hypothesis “defer a little to betting markets, but only low-volume markets” is complex enough that I’d want to see a lot of data before thinking that it’s probably right. So for now at least, my answer to whether you could substantially beat FiveThirtyEight’s model by factoring in betting market prices is “probably not” (or at least not easily).

These were the simplest strategies I could come up with for trying to beat the FiveThirtyEight model, and my answer to all of them was “this probably wouldn’t work”. Where does that leave us? Well, I would say that the FiveThirtyEight model is in the “slightly wrong” category (along with all of the best models): it seems that you’d need to do something at least moderately sophisticated to beat it, and in any case you probably wouldn’t be able to beat it by a lot.

Maybe you have your own theory for how the FiveThirtyEight model can be improved. For what it’s worth, I’ve spent a while thinking about their model and there’s nothing that jumps out at me — well, except for one thing.

In October, Andrew Gelman (who helped create the Economist model) published a blog post in which he observed that for many pairs of states, the FiveThirtyEight model predicted no correlation, or sometimes a negative correlation, between the polling errors in those states. For example, per the FiveThirtyEight model, how much Biden outperforms (or underperforms) polls in New Jersey tells you nothing about how he will do in Alaska. And conditional on outperforming his polls in Washington State, the model expected Biden to underperform in Mississippi! This leads to really weird effects: for example, in late October, the model gave Trump a 90% chance of winning Mississippi, but just a 31% chance in Mississippi if he somehow won Washington.

(The correlation here is -0.42, if you were wondering.) I think this is unrealistic: when polls miss by a lot, they tend to miss in the same direction. I’ve thought about this some, and I could see an argument for no correlation or even a small negative correlation between some states, but definitely nothing this strong.2

So if I were fine-tuning the FiveThirtyEight model, I would delve into this issue to figure out what’s going on, and probably end up changing the model in a way that got rid of this effect.

On the other hand, if I’ve thought about this model a lot and this is my primary issue with it, that’s a pretty good sign. That’s because this issue is (a) fairly complex and (b) only clear once you start looking at the tails of the model. If I were trying to bet against the model by making bets like “conditional on Trump winning Washington, I bet he will win Mississippi too”, I wouldn’t be able to make very much money (because the condition would hold very rarely). I’ve had much bigger issues with every other election model I’ve looked at closely.
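To see why a conditional bet like this pays so little, here is a back-of-the-envelope sketch. The 31% figure is the model's conditional probability quoted above; the other two inputs are assumptions picked purely for illustration: a 1% chance that Trump wins Washington, and a "true" conditional probability of 90% that he then wins Mississippi. The bet is assumed to be void, with the stake returned, if the condition fails.

```python
def conditional_bet_expected_profit(p_condition, model_conditional_prob, true_conditional_prob):
    """Expected profit per 1-unit conditional bet placed at the model's odds.
    If the condition does not hold, the bet is void and the stake is returned."""
    # At fair odds for probability p, a 1-unit stake wins (1 - p) / p units.
    net_win = (1 - model_conditional_prob) / model_conditional_prob
    edge_if_live = true_conditional_prob * net_win - (1 - true_conditional_prob)
    return p_condition * edge_if_live


expected_profit = conditional_bet_expected_profit(
    p_condition=0.01,             # assumption: chance Trump wins Washington
    model_conditional_prob=0.31,  # from the model, as quoted above
    true_conditional_prob=0.90,   # assumption: what the number "should" be
)
print(round(expected_profit, 4))
```

Even though the edge is large when the bet is live, multiplying by a condition that almost never holds leaves an expected profit of only a couple of cents per unit staked under these assumptions.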

This is all to say: all models are wrong, but this one is among a special few that are only a little wrong. Nate Silver has gotten a lot of criticism over the years — especially after 2016 — but I will say this: his model is very solid.

1. There’s a subtle point here that in some situations you should expect to see momentum most of the time, just not in expectation. For instance, say that Biden is ahead by 10 points a month before the election, so Trump would need a dramatic change in the race to win. If the FiveThirtyEight model gives Biden a 90% chance, you should expect that probability to drift toward 100% most of the time (because most likely, nothing dramatic will happen). It’s just that some of the time, something big will happen and the probabilities will shift toward Trump by a lot. The property that a good model should satisfy is that the expected value of the probability at any future point is equal to the current probability. (This is called the martingale property.)
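The martingale property in this footnote can be checked with simple arithmetic. In the sketch below, the 90% starting point comes from the example; the 95% chance of a quiet period and the 92% forecast after it are hypothetical numbers chosen only to show how the pieces must balance.

```python
def expected_future_probability(scenarios):
    """scenarios: list of (chance_of_scenario, forecast_after_scenario) pairs."""
    return sum(chance * forecast for chance, forecast in scenarios)


current_forecast = 0.90
quiet_chance = 0.95      # hypothetical: chance that nothing dramatic happens
quiet_forecast = 0.92    # hypothetical: forecast drifts up if nothing happens

# Solve for the forecast after a shock that keeps the expected value at 0.90.
shock_forecast = (current_forecast - quiet_chance * quiet_forecast) / (1 - quiet_chance)

expected_forecast = expected_future_probability([
    (quiet_chance, quiet_forecast),
    (1 - quiet_chance, shock_forecast),
])
print(round(shock_forecast, 4), round(expected_forecast, 4))
```

With these inputs, the forecast usually drifts up a little, and the rare shock has to knock it down to roughly 52% for the expected future forecast to stay at 90%.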

2. The argument is basically that if you have high confidence about what the national result will be, then a polling miss in some states will need to be compensated by the opposite polling miss in other states, and these states will likely have pretty different demographics (as Washington and Mississippi do). But in practice, national polls are on average about as accurate as state polls. So while there is some reason to think that demographically distinct states with sparse polling could have opposite polling errors, I think the case for that is weak and can at most justify a slight negative correlation.

