
Simple Linear Regression for absolute beginners

Linear Regression is the simplest and most basic model in machine learning. It may seem dull compared to advanced machine learning models, yet it is still a widely used statistical learning model. The importance of a good understanding of linear regression before studying more complex methods cannot be overstated.

Definition

Linear Regression is a linear model that assumes a linear relationship between the input variable ($X$) and the output variable ($Y$). Mathematically, this linear relationship can be represented as $$Y \approx \beta_0 + \beta_1X $$ In the above equation, $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope terms of the linear model. Once we estimate their values $\hat\beta_0$ and $\hat\beta_1$ using our training data, we can predict the output variable for new data by computing $$\tag{Eq.1} \hat{y} = \hat\beta_0 + \hat\beta_1 x$$ where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X=x$. The hat symbol ($\hat{\ }$) denotes an estimated value of an unknown parameter or coefficient, or a predicted value of the response.

Estimating the coefficients

Before we make predictions, we must estimate the values of $\beta_0$ and $\beta_1$. Let $$(x_1,y_1), (x_2,y_2), …, (x_n,y_n)$$ represent $n$ observation pairs. The most common approach to estimating the coefficients is the ordinary least squares criterion, which minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation.

Let $ \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction for $Y$ based on the $i^{th}$ value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i^{th}$ residual, which is the difference between the $i^{th}$ actual value and the $i^{th}$ value predicted by our linear regression model. The residual sum of squares (RSS) can be defined as $$RSS = e^2_1 + e^2_2 + e^2_3 + … + e^2_n$$ or, equivalently, $$RSS = (y_1 - \hat\beta_0 - \hat\beta_1x_1)^2 + (y_2 - \hat\beta_0 - \hat\beta_1x_2)^2 + … + (y_n - \hat\beta_0 - \hat\beta_1x_n)^2$$ Our goal here is to choose $\hat\beta_0$ and $\hat\beta_1$ to minimize the $RSS$ value. Using calculus, we can show that the minimizing values are $$\tag{Eq.2}\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x},$$

$$\tag{Eq.3}\hat\beta_1 = \cfrac{\sum^n_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i - \bar{x})^2}$$ where $\bar{y} \equiv \frac{1}{n}\sum^n_{i=1} y_i$ and $\bar{x} \equiv \frac{1}{n}\sum^n_{i=1} x_i$ are the sample means. In other words, Eq. 2 and Eq. 3 define the least squares coefficient estimates for simple linear regression.
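As a quick illustration, here is a minimal NumPy sketch of Eq. 2 and Eq. 3; the data and variable names are made up for the example:

```python
import numpy as np

# Hypothetical training data: n = 5 observation pairs, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()   # sample means

# Eq. 3: slope estimate
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Eq. 2: intercept estimate
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)   # least squares estimates of the intercept and slope
```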

Cost function

The cost function is the average squared error over the $n$ samples in the data. It can be written as: $$J(\theta) = \cfrac{1}{n} \sum^n_{i=1}(y_i - \hat{y}_i)^2$$ We can obtain the coefficients by minimizing the cost function. This can be done via:

  • Closed form solution: differentiate the function and set its derivative to zero
  • Iterative solution:
    • first order: Gradient Descent $\bigg( \cfrac{\partial}{\partial\theta}J(\theta)\bigg)$
    • second order: Newton’s Method $\bigg( \cfrac{\partial^2}{\partial\theta^2}J(\theta)\bigg)$
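Before walking through these two approaches, here is a minimal sketch of the cost function above for simple linear regression; the array arguments are placeholders:

```python
import numpy as np

def cost(beta_0: float, beta_1: float, x: np.ndarray, y: np.ndarray) -> float:
    """Average squared error J over the n samples for a candidate intercept/slope pair."""
    y_hat = beta_0 + beta_1 * x          # predictions from Eq. 1
    return float(np.mean((y - y_hat) ** 2))
```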

Let’s understand both solutions with a simple function, $J(\theta) = \theta^2$.

Closed form solution

Our approach here is to differentiate the function and set the derivative equal to zero: $$\cfrac{d}{d\theta}\theta^2 = 2\theta = 0 \Rightarrow \theta = 0$$ For the function $\theta^2$, the minimum is at $\theta = 0$.
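As a quick sanity check, the same calculation can be done symbolically; this is a small sketch using SymPy, purely for illustration:

```python
import sympy as sp

theta = sp.symbols('theta')
J = theta ** 2

# Differentiate J and solve dJ/dtheta = 0 for theta
critical_points = sp.solve(sp.diff(J, theta), theta)
print(critical_points)   # [0] -> the minimum of theta^2 is at theta = 0
```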

Gradient Descent

Gradient Descent is the most popular optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.

In this approach, we start at a random initial point, say $\theta^0 = 10$, and compute $\theta^1$. $$ \theta^1 = \theta^0 - \alpha \cfrac{\partial J}{\partial\theta} \bigg\vert _{\theta = \theta^0}$$ $$\Rightarrow \theta^1 = \theta^0 - \alpha * 2\theta^0$$ Here $\alpha$ is the learning rate. Taking $\alpha = 0.1$ and substituting, $$\Rightarrow \theta^1 = 10 - 0.1 * 2(10)$$ $$\Rightarrow \theta^1 = 8$$ Similarly, we can calculate $\theta^2$ as $\theta^2 = \theta^1 - \alpha * 2\theta^1 \Rightarrow \theta^2 = 8 - (0.1) * 2 * 8 = 6.4$

For $\theta^3$, $\theta^3 = \theta^2 - \alpha * 2\theta^2 \Rightarrow \theta^3 = 6.4 - (0.1) * 2 * 6.4 = 5.12$

As we can see, the $\theta$ value decreases at every step. After a few iterations, we approach the minimum of the function, which is at $\theta=0$.
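The same updates can be written as a short loop. Here is a minimal sketch of gradient descent for $J(\theta) = \theta^2$, using the starting point and learning rate from above:

```python
def gradient_descent(theta0: float = 10.0, alpha: float = 0.1, steps: int = 5) -> float:
    """Minimize J(theta) = theta**2 by repeatedly stepping against the gradient 2*theta."""
    theta = theta0
    for step in range(1, steps + 1):
        theta = theta - alpha * 2 * theta   # theta^{k+1} = theta^k - alpha * dJ/dtheta
        print(f"theta^{step} = {theta}")
    return theta

gradient_descent()   # prints 8.0, 6.4, 5.12, 4.096, 3.2768 -> heading toward 0
```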

Learning Rate

The higher the learning rate, the bigger the steps (we move faster, which means fewer iterations), and vice versa. However, with a learning rate that is too high we may overshoot the minimum and bounce back and forth across it, which takes more steps (and time) to converge.
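To see this effect concretely, here is a small sketch comparing a few (made-up) learning rates on the same $J(\theta) = \theta^2$ example:

```python
def run(alpha: float, theta: float = 10.0, steps: int = 10) -> float:
    """Apply `steps` gradient descent updates for J(theta) = theta**2."""
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

for alpha in (0.01, 0.1, 0.9, 1.1):
    print(alpha, run(alpha))
# A small alpha makes slow but steady progress; alpha close to 1 overshoots and
# oscillates around the minimum; alpha > 1 makes each step larger than the last,
# so the iterates diverge instead of converging.
```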

Predictions and more

Once we have the estimates $\hat\beta_0$ and $\hat\beta_1$ of the coefficients $\beta_0$ and $\beta_1$, we can substitute them into Eq. 1 and predict the target variable $y_i$ for any given $x_i$. Assume that we have the coefficients and have predicted the $y_i$ values.
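Putting this together, here is a minimal sketch of making predictions with Eq. 1, reusing the hypothetical data and the beta_0, beta_1 values computed in the least squares sketch above:

```python
import numpy as np

# Hypothetical fitted coefficients (e.g. from the least squares sketch earlier)
beta_0, beta_1 = 0.14, 1.96

def predict(x_new: np.ndarray) -> np.ndarray:
    """Eq. 1: y_hat = beta_0 + beta_1 * x."""
    return beta_0 + beta_1 * x_new

x_new = np.array([6.0, 7.0, 8.0])    # unseen inputs, made up for illustration
print(predict(x_new))                # predicted responses for the new inputs
```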

Looks very simple, doesn’t it?

While basic linear regression involves deriving coefficients, substituting them into a straight-line equation, and using this to predict a target variable, the process becomes more complex when more input variables are involved. Specifically, if we are working with a single input variable, it is known as Simple Linear Regression, whereas with multiple input variables, we refer to it as Multiple Linear Regression.

My upcoming post will provide a comprehensive analysis of Multiple Linear Regression, including an exploration of the challenges that arise with an increased number of input variables and strategies for effectively managing them.