
Simple Linear Regression for absolute beginners

Linear Regression is the simplest and most basic model in machine learning. It may seem dull compared to advanced machine learning models, yet it is still a widely used statistical learning method. The importance of having a good understanding of linear regression before studying more complex methods cannot be overstated.

Linear Regression is a linear model that assumes a linear relationship between the input variable ($X$) and the output variable ($Y$). Mathematically, this linear relationship can be represented as

$$Y \approx \beta_0 + \beta_1 X$$

In the above equation, $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope terms of the linear model. Once we estimate the values $\hat\beta_0$ and $\hat\beta_1$ using our training data, we can predict the output variable for new data by computing

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x \tag{Eq. 1}$$

where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$. The hat symbol $\hat{\ }$ denotes the estimated value of an unknown parameter or coefficient, or the predicted value of the response.
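To make Eq. 1 concrete, here is a minimal Python sketch of the prediction step; the intercept and slope values are made-up placeholders, not estimates from any real data.

```python
# A prediction from Eq. 1: y_hat = beta_0_hat + beta_1_hat * x
# The coefficient values below are invented purely for illustration.
beta_0_hat = 2.0   # assumed intercept estimate
beta_1_hat = 0.5   # assumed slope estimate

def predict(x):
    """Return the predicted response y_hat for a given input x."""
    return beta_0_hat + beta_1_hat * x

print(predict(4.0))  # 2.0 + 0.5 * 4.0 = 4.0
```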

Before we make predictions, we must find the values of $\beta_0$ and $\beta_1$. Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ represent $n$ observation pairs. The most common approach to estimating the coefficients is the ordinary least squares criterion, which minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation.

Let $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction for $Y$ based on the $i^{th}$ value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i^{th}$ residual, the difference between the $i^{th}$ actual value and the $i^{th}$ value predicted by our linear regression model. The residual sum of squares (RSS) is defined as

$$RSS = e^2_1 + e^2_2 + e^2_3 + \dots + e^2_n$$

or equivalently

$$RSS = (y_1 - \hat\beta_0 - \hat\beta_1 x_1)^2 + (y_2 - \hat\beta_0 - \hat\beta_1 x_2)^2 + \dots + (y_n - \hat\beta_0 - \hat\beta_1 x_n)^2$$

Our goal is to choose $\hat\beta_0$ and $\hat\beta_1$ so as to minimize the $RSS$. Using calculus, we can show that the minimizing values are

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \tag{Eq. 2}$$

$$\hat\beta_1 = \cfrac{\sum^n_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i - \bar{x})^2} \tag{Eq. 3}$$

where $\bar{y} \equiv \frac{1}{n}\sum^n_{i=1} y_i$ and $\bar{x} \equiv \frac{1}{n}\sum^n_{i=1} x_i$ are the sample means. In other words, Eq. 2 and Eq. 3 define the least squares coefficient estimates for simple linear regression.
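Eq. 2 and Eq. 3 translate almost directly into code. Below is a small NumPy sketch that computes the two estimates on a toy dataset; the `x` and `y` values are invented purely for illustration.

```python
import numpy as np

# Toy data, invented purely to illustrate Eq. 2 and Eq. 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Eq. 3: slope estimate
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Eq. 2: intercept estimate
beta_0_hat = y_bar - beta_1_hat * x_bar

print(beta_0_hat, beta_1_hat)  # least squares intercept and slope
```

The result can be cross-checked with `np.polyfit(x, y, 1)`, which performs the same least squares fit for a degree-1 polynomial.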

The cost function is the average squared error over the $n$ samples in the data. It can be written as

$$J(\theta) = \cfrac{1}{n} \sum^n_{i=1}(y_i - \hat{y}_i)^2$$

We can obtain the coefficients by minimizing the cost function (a short sketch of the cost computation follows the list below). This can be done via:

  • Closed-form solution: differentiating the function and equating the derivative to zero
  • Iterative solution:
    • first order: Gradient Descent $\bigg(\cfrac{\partial}{\partial\theta}J(\theta)\bigg)$
    • second order: Newton’s Method $\bigg(\cfrac{\partial^2}{\partial\theta^2}J(\theta)\bigg)$
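To make the cost function concrete, here is the sketch promised above: it evaluates $J(\theta)$ for two candidate coefficient pairs on the same kind of toy data as before, showing that a line closer to the data has a lower cost. The specific coefficient values are illustrative assumptions.

```python
import numpy as np

# Toy data, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def cost(beta_0, beta_1):
    """Mean squared error J for the line y_hat = beta_0 + beta_1 * x."""
    y_hat = beta_0 + beta_1 * x
    return np.mean((y - y_hat) ** 2)

# The cost shrinks as the candidate line gets closer to the data.
print(cost(0.0, 1.0))    # a poor guess, large cost
print(cost(0.14, 1.96))  # close to the least squares fit, small cost
```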

Let’s understand both solutions with a simple function, $J(\theta) = \theta^2$.

Our approach here is to differentiate the function and equate the derivative to zero:

$$\cfrac{d}{d\theta}\,\theta^2 = 2\theta = 0 \Rightarrow \theta = 0$$

For the function $\theta^2$, the minimum is at $\theta = 0$.
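If you would like to verify the calculus programmatically, the following SymPy sketch performs the same differentiation and solves for the critical point (this assumes SymPy is available; it is not needed elsewhere in the post).

```python
import sympy as sp

theta = sp.symbols('theta')
J = theta ** 2                        # the example cost function
dJ = sp.diff(J, theta)                # derivative: 2*theta
critical_points = sp.solve(sp.Eq(dJ, 0), theta)
print(dJ, critical_points)            # 2*theta [0]
```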

Gradient Descent is the most popular optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.

In this approach, we start at a random initial point, say $\theta^0 = 10$, and compute $\theta^1$:

$$\theta^1 = \theta^0 - \alpha \cfrac{\partial J}{\partial\theta} \bigg\vert_{\theta = \theta^0} \Rightarrow \theta^1 = \theta^0 - \alpha \cdot 2\theta^0$$

Here $\alpha$ is the learning rate. Taking $\alpha = 0.1$ and substituting,

$$\theta^1 = 10 - 0.1 \cdot 2(10) = 8$$

Similarly, we can calculate $\theta^2$:

$$\theta^2 = \theta^1 - \alpha \cdot 2\theta^1 = 8 - 0.1 \cdot 2 \cdot 8 = 6.4$$

For $\theta^3$:

$$\theta^3 = \theta^2 - \alpha \cdot 2\theta^2 = 6.4 - 0.1 \cdot 2 \cdot 6.4 = 5.12$$

As we can see, the value of $\theta$ decreases at every step. After a few iterations, we approach the minimum of the function, which is at $\theta = 0$.
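The hand calculation above is easy to automate. Here is a minimal Python loop that reproduces the same iterates for $J(\theta) = \theta^2$, starting from $\theta^0 = 10$ with $\alpha = 0.1$.

```python
# Gradient descent on J(theta) = theta^2, whose gradient is 2*theta.
theta = 10.0   # initial point theta^0
alpha = 0.1    # learning rate

for step in range(1, 6):
    theta = theta - alpha * 2 * theta   # theta^(k+1) = theta^k - alpha * 2 * theta^k
    print(step, theta)                  # 1 -> 8.0, 2 -> 6.4, 3 -> 5.12, ...
```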

Learning Rate
The higher the learning rate, the bigger the steps (we move faster, which means fewer iterations), and vice versa. However, if the learning rate is too high we may overshoot the minimum and oscillate around it, which takes more steps/time to converge.
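For $J(\theta) = \theta^2$ the update rule simplifies to $\theta \leftarrow (1 - 2\alpha)\,\theta$, which makes the effect of the learning rate easy to see. The sketch below compares a small, a large-but-still-convergent, and a too-large value of $\alpha$; the specific values are chosen only for illustration.

```python
def run_gradient_descent(alpha, theta=10.0, steps=10):
    """Run gradient descent on J(theta) = theta^2 and return the final theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # equivalent to theta *= (1 - 2 * alpha)
    return theta

print(run_gradient_descent(0.1))   # slow, steady shrink toward 0
print(run_gradient_descent(0.9))   # overshoots past 0 each step but still converges
print(run_gradient_descent(1.1))   # steps too large: theta diverges away from 0
```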

Once we have the estimates of the coefficients $\beta_0$ and $\beta_1$, we can substitute them into Eq. 1 and predict the target variable $y_i$ for any given $x_i$. Assume, then, that we have the coefficients and have predicted the $y_i$ values.

Looks very simple, doesn’t it?

While basic linear regression involves deriving coefficients, substituting them into a straight-line equation, and using it to predict a target variable, the process becomes more complex when more input variables are involved. Specifically, when we work with a single input variable (predictor), the model is known as Simple Linear Regression, whereas with multiple input variables we refer to it as Multiple Linear Regression.

My upcoming post will provide a comprehensive analysis of Multiple Linear Regression, including an exploration of the challenges that arise with an increased number of predictors and strategies for effectively managing them.