Hello reader, welcome to my Know The Math series, a less formal series of blogs where I take you through the notoriously complex mathematics that forms the soul of machine learning algorithms. The idea for the series originated from my project - NIFTYBANK Index Time series analysis, Prediction, Deep Learning & Bayesian Hyperparameter Optimization - where readers also asked me to explain the underlying concepts.
Before we begin, I need you to conduct a ceremony:
- Stand up in front of a mirror and say out loud, “I am going to learn the Math behind Linear Regression modelling - getting the Intercept & Slope coefficients - today!”
- Take out a nice register & pen to go through the derivation with me.
Contents
- Assumptions for Linear Regression
- Understanding the Given Equations
- A Simple Derivation
- Diving Deeper into the Math
- References
Assumptions for Linear Regression
Before we proceed with the derivation of our Linear Regression coefficients, let us first refresh our memory and list the assumptions we make when we apply this algorithm (a symbolic restatement follows the list). Allow me to borrow these lines from the CFA curriculum:
- A linear relationship exists between the dependent & the independent variable
- The independent variable is uncorrelated with the residuals
- The expected value of the residual term is zero
- The variance of the residual term is constant for all observations
- The residual term is independently & normally distributed
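To make these concrete, here is a minimal symbolic sketch of the same five assumptions; the notation (y_i for the dependent variable, x_i for the independent variable, ε_i for the residual, σ² for its variance) is assumed here for illustration and is not from the original list:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i                            % 1. linearity
\operatorname{Cov}(x_i, \varepsilon_i) = 0                             % 2. regressor uncorrelated with residuals
\mathbb{E}[\varepsilon_i] = 0                                          % 3. zero-mean residuals
\operatorname{Var}(\varepsilon_i) = \sigma^2 \quad \text{for all } i   % 4. constant variance (homoskedasticity)
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)      % 5. independent & normally distributed
```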
Understanding the Given Equations
Let us assume that, for a given dataset, the all-knowing Lord Vishnu (from Hindu mythology) comes down and hands you the Linear Regression model. It looks somewhat like this:
The perfect Linear Regression Model
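The original figure is not reproduced here, but the "perfect" model it refers to is presumably the standard population regression line; a sketch, with β_0 for the true intercept, β_1 for the true slope and ε_i for the residual (notation assumed):

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n
```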
And you, an earthling, are going to fit a Linear Regression model on the same dataset in the hope that it comes very close to the one above. Our model is going to look somewhat like this:
Our attempt at Linear Regression Model
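Again as a sketch of what the figure presumably shows, our fitted model replaces the true coefficients with estimates (hats mark estimated quantities; notation assumed):

```latex
\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i, \qquad e_i = y_i - \hat{y}_i \quad \text{(the error for observation } i\text{)}
```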
Now, the question is: how do we get those values of the Intercept & Slope coefficients? Let’s get started.
A Simple Derivation
We know that our model will be at its best if we can minimize the difference between the predicted and the actual values of the dependent variable. Most commonly, this difference is formally represented as the Sum of Squared Errors (SSE), wherein each error is the difference between the predicted value and the actual value of the corresponding data point. We will call this expression our Cost Function. Here is what it looks like:
The Cost Function
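In symbols, a sketch consistent with the models above (writing the estimates simply as b_0 and b_1 from here on):

```latex
SSE(b_0, b_1) \;=\; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \;=\; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
```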
Borrowing from common wisdom, one can conclude that the lower the value of this Cost Function, the better our Linear Regression model will be. And here is how I want you to look at it:
Forget all the buzzwords & concepts like Machine Learning, Linear Regression, Prediction, Deep Learning and all earthly things for that matter. In this entire universe, there is only you and this Cost Function that you have to minimize. This is essentially just an optimization problem, so give it only the respect that such a problem deserves. Don’t let anyone scare you and make you feel otherwise!
We find the minimum (where our Cost Function attains its lowest value) by using derivatives, more specifically by setting the partial derivatives equal to zero.
Finding the Intercept Term
Finding our Intercept Term
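A sketch of the step the figure walks through: differentiate the SSE with respect to b_0, set the partial derivative to zero, and solve (x̄ and ȳ denote the sample means of x and y):

```latex
\frac{\partial \, SSE}{\partial b_0} = -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0
\;\Longrightarrow\; \sum_{i=1}^{n} y_i - n b_0 - b_1 \sum_{i=1}^{n} x_i = 0
\;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x}
```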
Finding the Slope Coefficient
Finding our Slope Coefficient
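Similarly, a sketch of the slope step: differentiate with respect to b_1, set the partial derivative to zero, and substitute b_0 = ȳ - b_1 x̄ from the previous result:

```latex
\frac{\partial \, SSE}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0
\;\Longrightarrow\; \sum_{i=1}^{n} x_i y_i - b_0 \sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 = 0
\;\Longrightarrow\; b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}
```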
We could have stopped at Equation 6, but as it happens, we can rearrange it into a better-known form! Here is how:
Reshaping our Slope Coefficient
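The better-known form the figure shows is presumably the covariance-over-variance expression; a sketch, using the identities Σ x_i y_i - n x̄ ȳ = Σ (x_i - x̄)(y_i - ȳ) and Σ x_i² - n x̄² = Σ (x_i - x̄)²:

```latex
b_1 \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \;=\; \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
```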
Diving Deeper into the Math
Well, here is the issue: setting the partial derivatives to zero only locates an extremum (a minimum or a maximum), and therein lies the problem! We don’t know whether what we have found is really a minimum or a maximum. To settle this, we’ll have to investigate further. KEYWORDS: Jacobian (J) & Hessian (H). Again, don’t be scared, we just need to find a few more derivatives. And the Jacobian? Well, we already have that!
Our Jacobian
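As a sketch, the Jacobian of our Cost Function is just the row vector of the first partial derivatives we already computed above:

```latex
J(b_0, b_1) = \begin{bmatrix} \dfrac{\partial \, SSE}{\partial b_0} & \dfrac{\partial \, SSE}{\partial b_1} \end{bmatrix}
= \begin{bmatrix} -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) & \; -2 \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) \end{bmatrix}
```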
Now, if we can show that our Hessian is positive definite (for our 2×2 Hessian, that its top-left entry and its determinant are both positive), then we can guarantee that at the expressions we found earlier (Eq 5 & 8) we have achieved our minimum.
Concluding evidence from the Hessian
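A sketch of what the figure presumably concludes from: the Hessian is the matrix of second partial derivatives of the SSE, and its determinant simplifies to a sum of squares:

```latex
H = \begin{bmatrix}
      \dfrac{\partial^2 SSE}{\partial b_0^2} & \dfrac{\partial^2 SSE}{\partial b_0 \, \partial b_1} \\[2mm]
      \dfrac{\partial^2 SSE}{\partial b_1 \, \partial b_0} & \dfrac{\partial^2 SSE}{\partial b_1^2}
    \end{bmatrix}
  = \begin{bmatrix}
      2n & 2 \sum_{i=1}^{n} x_i \\[1mm]
      2 \sum_{i=1}^{n} x_i & 2 \sum_{i=1}^{n} x_i^2
    \end{bmatrix},
\qquad
\det(H) = 4n \sum_{i=1}^{n} x_i^2 - 4 \left( \sum_{i=1}^{n} x_i \right)^2 = 4n \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\ge\; 0
```

Since the top-left entry 2n is positive, a strictly positive determinant makes H positive definite, which is exactly the guarantee we wanted.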
The expression above could also take the value 0, which would invalidate our conclusion of a minimum; however, that case only arises when all the x_i are the same constant. And that case would violate the very first assumption we made at the top! Hence we can safely ignore it and rejoice that we are correct.
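As a quick numerical sanity check (not part of the original post), here is a minimal Python sketch that computes b_0 and b_1 from the closed-form expressions we derived and compares them against NumPy's built-in least-squares fit; the synthetic dataset and the "true" coefficients 3.0 and 2.5 are assumed purely for illustration:

```python
import numpy as np

# Synthetic data, assumed only for illustration: y is roughly linear in x plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.5 * x + rng.normal(0, 1.0, size=100)

# Closed-form OLS coefficients from the derivation:
#   slope     b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)**2)
#   intercept b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Cross-check against NumPy's degree-1 least-squares polynomial fit.
b1_np, b0_np = np.polyfit(x, y, deg=1)

print(f"derived:    intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"np.polyfit: intercept = {b0_np:.4f}, slope = {b1_np:.4f}")
assert np.isclose(b0, b0_np) and np.isclose(b1, b1_np)
```

If the derivation is correct, the two pairs of coefficients agree to numerical precision, which is what the final assertion checks.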
We are now at least at the level from where we can touch the feet of Lord Vishnu. I hope I was able to help you out with the concepts mentioned! If not, please leave feedback in the comments on how I can improve, since I plan to continue the Know The Math series. I am also attaching some references for additional information.
References
- http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
- https://www.mathsisfun.com/calculus/maxima-minima.html
- http://home.iitk.ac.in/~shalab/econometrics/Chapter2-Econometrics-SimpleLinearRegressionAnalysis.pdf
कर्मेन्द्रियाणि संयम्य य आस्ते मनसा स्मरन् ।
इन्द्रियार्थान्विमूढात्मा मिथ्याचारः स उच्यते ॥
– Bhagavad Gita 3.6 ॥
(Translation: One who restrains the organs of action but whose mind keeps dwelling on the objects of the senses deludes himself and is called a hypocrite.)
(Read more about this Shloka from the Bhagavad Gita at http://sanskritslokas.com/)