Least squares regression isn’t arbitrary

In 11th grade I took my high school’s statistics class. When we learned about linear regression I raised my hand and asked why we were minimizing the sum of the squares of the residuals rather than, say, the sum of the absolute values. If my memory serves right, my teacher said that minimizing the sum of the absolute values would also be reasonable, but that absolute values are annoying to deal with so we square the residuals instead.

This has vaguely bothered me ever since: it seemed like linear regression — a tool applied extensively throughout the sciences — was based on an arbitrary choice. But in a conversation with my friend Mike a few months ago I learned that the choice is far from arbitrary.1

Recall the normal distribution. It is probably the most well-known probability distribution, and for good reason: it appears everywhere in the real world, from the heights of adult human males to annual rainfall totals. If you don’t know anything about how a quantity is distributed, the normal distribution is a reasonable guess.

The choice of minimizing the sum of the squares of residuals follows from the assumption that residuals off of a line of best fit are normally distributed. Specifically, they are assumed to be independent and normally distributed with the same standard deviation \sigma regardless of the value of the predictor.2

To see why this assumption makes us want to minimize the sum of the squares of residuals, suppose we have some dataset of predictors (x-values) and responders (y-values) and we want to find the “line of best fit.” The line of best fit is the line that best models the data, in the sense that the residuals of our data off of the line surprise us as little as possible.

We ask: under our model of how residuals are distributed, what is the probability of seeing a particular residual r? This is not a well-defined question: the probability that the residual is exactly r is zero. But we can say that for really small values of \epsilon, the probability that the residual lies in an interval of length \epsilon around r is roughly equal to \epsilon times the value of the normal distribution’s PDF at r — that is,

\frac{\epsilon}{\sigma \sqrt{2\pi}} e^{-\frac{r^2}{2\sigma^2}}.

(From now on we’ll ignore the \epsilon and think of the remaining expression as the “instantaneous likelihood” of seeing the residual value r. If this bothers you, feel free to redo the calculation with the epsilons included.)

This means that minimizing how surprised we are about our residuals amounts to choosing coefficients for our line that maximize the product over all data points of the expression above.3 That is, if we call the residuals r_1, \dots, r_n, we want to maximize

\left( \frac{\epsilon}{\sigma \sqrt{2\pi}} \right)^n e^{-\frac{\sum_{i = 1}^n r_i^2}{2\sigma^2}}.

This is equivalent to minimizing \sum_{i = 1}^n r_i^2, i.e. the sum of the squares of the residuals: the prefactor is a constant, and e^{-t} is a decreasing function of t. (Interestingly, this is the quantity you want to minimize regardless of the particular value of \sigma, since \sigma only scales the exponent.) There you have it — the theoretical justification for least-squares regression.
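If you want to check this equivalence numerically, here is a quick sketch in Python (numpy and scipy assumed available; the data is made up). Minimizing the sum of squared residuals and minimizing the negative log of the normal likelihood recover the same line:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Least squares: minimize the sum of squared residuals directly.
def sse(params):
    a, b = params
    return np.sum((y - (a * x + b)) ** 2)

# Maximum likelihood: minimize the negative log of the normal likelihood.
# (Any fixed sigma works; its value doesn't change where the minimum is.)
def neg_log_likelihood(params, sigma=1.0):
    a, b = params
    r = y - (a * x + b)
    return np.sum(r ** 2 / (2 * sigma ** 2) + np.log(sigma * np.sqrt(2 * np.pi)))

fit_sse = minimize(sse, x0=[0.0, 0.0]).x
fit_mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
print(fit_sse, fit_mle)  # the two fitted (slope, intercept) pairs agree
```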


Often, though, you might have a reason to believe that your residuals are not normally distributed, or that the standard deviation of the residuals does depend on the value of the predictor. If you followed the math I did above, you should be able to figure out what quantity you want to minimize instead of the sum of the squares of the residuals, whatever your model for how residuals are distributed! I’ve included a few examples, with answers in the footnotes.

Example 1: Suppose you model your residuals as being distributed as \frac{1}{2} e^{-|r|} (the \frac{1}{2} is there so the distribution integrates to 1). What function of the residuals do you want to minimize?4

Example 2: Suppose instead your residuals are distributed as \frac{1}{\pi(r^2 + 1)} (again, the \pi is just a normalizing factor). What do you want to minimize?5

Example 3: Suppose your residuals are normally distributed, but the standard deviation of the residual depends on the value of x, the predictor. Specifically, assume that the standard deviation is k \sqrt{x^2 + 1} for some k > 0. (I think this is pretty natural in some contexts, one of which I’ll talk about in the next post.) What do you want to minimize?6
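If you’d like to experiment, you can fit a line numerically under any residual model by minimizing the corresponding function of the residuals. Here’s a sketch (numpy and scipy assumed; the data is synthetic, generated with Laplace noise; fair warning, the loss functions below spoil the answers to Examples 1 and 2):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 200)
y = 3.0 * x - 2.0 + rng.laplace(scale=1.0, size=x.size)

def fit(loss):
    """Fit a line a*x + b by minimizing loss(residuals)."""
    x0 = np.polyfit(x, y, 1)  # start from the ordinary least-squares line
    obj = lambda p: loss(y - (p[0] * x + p[1]))
    return minimize(obj, x0, method="Nelder-Mead").x

line_l2 = fit(lambda r: np.sum(r ** 2))                  # normal model
line_l1 = fit(lambda r: np.sum(np.abs(r)))               # Example 1's model
line_cauchy = fit(lambda r: np.sum(np.log(r ** 2 + 1)))  # Example 2's model
print(line_l2, line_l1, line_cauchy)  # all near slope 3, intercept -2
```

Nelder-Mead is used because the absolute-value loss isn’t differentiable at zero residual.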


Want to see these techniques applied in practice? Check out my post on the predictive power of general election polls, where I use a residual model like the one in Example 3!


1. I think it’s not unlikely that my teacher knew a better answer but decided not to derail the class to have this discussion.

2. You may recall these as the assumptions you need to make when doing a test for the significance of the slope of a line of best fit, if you’ve taken AP statistics.

3. This is where we use the assumption that the residuals are independent; otherwise we couldn’t represent the probability of seeing all n residuals as the product of the probabilities of seeing each residual.

4. You want to maximize e^{-\sum_{i = 1}^n |r_i|}, which amounts to minimizing \sum_i |r_i| — precisely what I suggested as an alternative to minimizing \sum_i r_i^2 in the statistics class!

5. You want to maximize \prod_i \frac{1}{r_i^2 + 1} or equivalently, minimize \sum_i \log(r_i^2 + 1).

6. You want to maximize \prod_i e^{-\frac{r_i^2}{2k^2(x_i^2 + 1)}}, which amounts to minimizing \sum_i \frac{r_i^2}{x_i^2 + 1}. This is the same as doing a weighted least-squares regression, where the weight of the point (x_i, y_i) is \frac{1}{x_i^2 + 1}.
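The weighted least-squares claim in footnote 6 can be cross-checked numerically. A sketch (numpy and scipy assumed; the data is synthetic, with k = 0.5 picked arbitrarily). Note that np.polyfit’s w parameter multiplies each residual before squaring, so the weight \frac{1}{x_i^2 + 1} on the squared residual corresponds to w_i = \frac{1}{\sqrt{x_i^2 + 1}}:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
sigma = 0.5 * np.sqrt(x ** 2 + 1)  # residual spread grows with x (k = 0.5)
y = 1.5 * x + 4.0 + rng.normal(scale=sigma)

# Direct minimization of the weighted sum from footnote 6.
obj = lambda p: np.sum((y - (p[0] * x + p[1])) ** 2 / (x ** 2 + 1))
fit_direct = minimize(obj, x0=[0.0, 0.0]).x

# np.polyfit squares w_i * r_i, so pass w_i = 1/sqrt(x_i^2 + 1).
fit_polyfit = np.polyfit(x, y, 1, w=1.0 / np.sqrt(x ** 2 + 1))
print(fit_direct, fit_polyfit)  # the two fits agree
```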


7 thoughts on “Least squares regression isn’t arbitrary”

  1. Another interesting (but unrelated, I think) observation on the expression \sum (x - a_i)^2 for a data set \{a_1,\dots, a_n\} is the following.

    Suppose that we are given our data points \{a_i\} and we want to pick the value of x that minimizes \sum (x - a_i)^2. A dash of calculus tells us that the correct value of x to pick is the average of the data.

    What would happen if we instead tried to find x to minimize \sum |x - a_i|? (Take a moment to think about it – it’s a good problem!)

    The minimizing value of x is now the *median* of the data! (Or if the median is midway between two data points, any value in between them.)

    I’m not entirely certain what the correct interpretation of this is, but it at least convinces me that if you think the average is a reasonable statistic to look at, then the variance, i.e. \sum (\mu - a_i)^2 where \mu is the average, is reasonable too.
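This mean/median fact is easy to check numerically by scanning candidate values of x over a grid (a Python sketch with a made-up data set):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
grid = np.linspace(0.0, 12.0, 1201)  # candidate values of x, step 0.01

# Total squared and absolute distance from each candidate x to the data.
sq_loss = ((grid[:, None] - data) ** 2).sum(axis=1)
abs_loss = np.abs(grid[:, None] - data).sum(axis=1)

print(grid[sq_loss.argmin()], data.mean())       # squared loss is minimized at the mean
print(grid[abs_loss.argmin()], np.median(data))  # absolute loss, at the median
```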


    1. That’s a cool connection; thanks for the comment! More broadly, I guess there’s a mapping from minimization criteria to summary statistics. There might be more interesting stuff to say about this.

      (I edited your LaTeX formulas to make them render; I hope you don’t mind. For future reference, you need a space between the keyword “latex” and the next thing you type, even if it’s a backslash.)


      1. Thanks for editing the formulas – that’s useful to know.

        For completely unrelated reasons, I just stumbled into the fact that if you instead try to minimize \sum d(x,a_i) where d(x, a_i) is 0 when x = a_i and 1 when x \neq a_i then the minimizing value of x is the mode.
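The mode fact can be verified in a few lines (a Python sketch with made-up data):

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 10]
# 0-1 loss: count how many data points each candidate x fails to equal.
loss = {x: sum(1 for a in data if a != x) for x in set(data)}
best = min(loss, key=loss.get)
print(best, Counter(data).most_common(1)[0][0])  # both give the mode, 3
```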


      2. Haha, yep precisely. I met him at the SSC meetup in Boston and stumbled onto his blog afterwards. This chain of coincidences is a little uncanny…

