top of page
  • Writer's pictureGianluca Turcatel

OLS Formula Derivation

Updated: Dec 28, 2021

OLS is most famous algorithm that estimates the parameters of a linear regression model. OLS minimizes the following loss function:

In plain words, we seek to minimize the squared differences between the predictions (yi_hat) of the model and observed data (yi). When I saw formula (1) the first time, I thought that it didn't have a formal mathematical derivation and was obtained through mere statistical intuition (someone may call it common sense). Later I discovered, that (1) is derived by applying Maximum Likelihood Estimation (MLE). In this articles we will walk through its derivation, step by step.


As for every algorithm in Data Science, there are assumptions to be made:

  1. Data is normally distributed

  2. The variance of the error is constant across all values of the independent variables. This property of the data is named homoscedasticity. What we are saying is: the noise in the data comes from the same distribution.

  3. Linear relationship between the independent and dependent variables

  4. No auto-correlation in the independent and dependent variables. Autocorrelation is a situation where a signal is correlated with a delayed copy of itself.

  5. No (or limited) Multicollinearity. Multicollinearity is defined as the occurrence of high intercorrelations among two or more independent variables.

Because of the above assumption:

Where Xi is a (1 x k) vector, W a (k x 1) vector. We can combine the two equations into:

Each Yi is drawn from a normal distribution with mean XiW and variance sigma2. The probability of each data point yi given Xi and W is given by:

The goal is to find W such the probability of the data is maximized:

We can estimate W with MLE and because each data point is independent from each other:

Using the chain rule of probability we can expand the main term to:

Xi is independent from W:

Probability of each Xi is constant, we can remove P(Xi):

Working with multiplication can pose the serious issue that P can quickly become very small, potentially being approximated by the computer to zero. Thus, we log transform (9), which converts multiplications into summations. This transformation is valid because the logarithmic is a monotonic function: log-likelihood and likelihood will have maxima at the same value of W.

Let's plug in the normal probability distribution:

Log(a*b) = Log(a) + Log(b). Therefore:

The first time inside the square brackets is constant, so can be removed. For the second term, we recognize that Log(e^x) = x. Thus:

The term outside the summation is constant, so can be removed. In general, when performing optimization tasks we prefer minimizing the objective function: we drop the minus sign and replace argmax with argmin. Finally, to render the formula more interpretable we divide by n (average mean squared error). The final formula is:

which is equivalent to (1): mission accomplished! If you have ever wondered where (1) came from, now you know.

Follow me on Twitter and Facebook to stay updated.

208 views0 comments

Recent Posts

See All


bottom of page