Linear Regression

What?

Regression: predict a real-valued output

Regression is a technique that investigates the relationship between two or more variables: we look for a general pattern in the data and use it to predict values. Regression is a broad concept, so to keep things simple let's first discuss linear regression, which is also known as regression in one variable.

Why?

You may have a lot of data with a real-valued attribute, such as temperature observations, and wish to see which 'direction' the data is trending. A linear regression draws a line through the middle of your data with as little error as possible. This gives you a quick glimpse of the data and a way to predict a y value for any value of x.
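As a quick sketch of the idea, the following fits such a line through noisy data with NumPy's least-squares polynomial fit (the true slope and intercept here are made-up example values):

```python
import numpy as np

# Generate noisy data scattered around a known line (slope 2, intercept 1).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)

# Fit a degree-1 polynomial, i.e. a straight line, by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # close to the true slope 2 and intercept 1
```

With the fitted `slope` and `intercept` you can then predict y for any new x as `slope * x_new + intercept`.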

Visualization of a linear regression. Original author: Jake Vanderplas, Introduction to Scikit-Learn: Machine Learning with Python

Try it out

Let's say you have an input which, when fed into a function, gives you a certain output. With that function you can predict the output for any given input. Mathematically it can be defined as follows:

y = F(x)

Our task in regression is to predict the output accurately. For that purpose we first need an approximate function F( ), so that when we give that function an input it can predict the desired value for us. One thing that must be noted is that the more accurately the function is approximated, the more accurate our predictions will be.

I have used the word approximated because in real-world data no variable can be strictly said to follow a specific pattern; there can be noise, error, or many other ambiguities.

Okay, let's first understand the data we will be dealing with.

Format of data:

  • a variable X (input)
  • a variable y (output). We have various instances of the X's and their corresponding y's. Our aim is to find a function that best describes these patterns and can also help us predict the y's for new X's.

Notation:

Data points: the (x, y) instances in our data

n: the number of data points

Let's say we have n data points: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)

h(x): the predictor function

F′: the value output by the function h(x) when x is its input

Now the question of interest is: how can we approximate the function?

(You will find this pattern in most algorithms, so it is worth memorising.)

Let's divide our work into tasks for better understanding:

Task 1: Assume a hypothesis function which we want to approximate.

Task 2: Find how well this hypothesis function performs.

Task 3: Update the approximated function appropriately and iterate until either the dataset is exhausted or the function converges.

Task 1

In this section we assume our hypothesis function.

Let's say,

h(x) = theta0 + theta1*x

This is the equation of a straight line, which is another reason this method is called linear regression.
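The hypothesis above can be sketched directly in Python (the function name `h` follows the notation used here):

```python
def h(x, theta0, theta1):
    """Straight-line hypothesis: predict y for input x."""
    return theta0 + theta1 * x

# Example: with theta0 = 1 and theta1 = 2, h(3) = 1 + 2*3 = 7
print(h(3, 1, 2))
```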

Task 2

In this section we check how well the hypothesis function performs:

We can do this by taking a value of X and checking the error between the corresponding output that we know and the one we have predicted.

We use the mean squared error (halved, which simplifies the derivative later) to check the performance of our predictor function:

Error = J(theta0, theta1) = ((h(x1)-y1)^2 + (h(x2)-y2)^2 + (h(x3)-y3)^2 + … + (h(xn)-yn)^2)/(2*n)
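This cost can be sketched as a small Python function (the helper name `cost` is an assumption; the formula is the one above):

```python
def cost(theta0, theta1, xs, ys):
    """Halved mean squared error J(theta0, theta1) over the data points."""
    n = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# A perfect fit gives zero cost: with theta0=1, theta1=2, the line passes
# exactly through (1, 3) and (2, 5).
print(cost(1, 2, [1, 2], [3, 5]))  # 0.0
```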

Task 3

What we need to do is minimize this error function.

To minimize it, we first differentiate the function after substituting the actual form of h(x), and use the result to update the values of the parameters appropriately.

grad(theta0) = dJ/d(theta0) = (2*(theta0+theta1*x1-y1) + 2*(theta0+theta1*x2-y2) + 2*(theta0+theta1*x3-y3) + … + 2*(theta0+theta1*xn-yn))/(2*n)

grad(theta1) = dJ/d(theta1) = (2*(theta0+theta1*x1-y1)*x1 + 2*(theta0+theta1*x2-y2)*x2 + 2*(theta0+theta1*x3-y3)*x3 + … + 2*(theta0+theta1*xn-yn)*xn)/(2*n)
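The two partial derivatives can be sketched as follows (the helper name `gradients` is an assumption; note how the factors of 2 cancel against the 2*n in the denominator):

```python
def gradients(theta0, theta1, xs, ys):
    """Return (dJ/d(theta0), dJ/d(theta1)) for the halved-MSE cost."""
    n = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / n                            # dJ/d(theta0)
    grad1 = sum(e * x for e, x in zip(errors, xs)) / n # dJ/d(theta1)
    return grad0, grad1

# With theta0 = theta1 = 0 and points (1, 2), (2, 4): errors are -2 and -4,
# so grad0 = -3 and grad1 = (-2*1 + -4*2)/2 = -5.
print(gradients(0, 0, [1, 2], [2, 4]))
```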

We use these as the updates to our parameters theta0 and theta1, taking steps that we can scale with the help of a constant alpha, known as the step size.

So, our update function becomes

Repeat until convergence:

(theta0)new = (theta0)old - alpha*(grad(theta0))

(theta1)new = (theta1)old - alpha*(grad(theta1))

Here alpha is the learning rate, which controls how large a step we take at each update.

Dilemma while selecting the learning rate alpha:

  • if we choose a small learning rate, convergence is slow;
  • if we choose a large alpha, the updates overshoot the minimum. So we need to select our learning rate alpha carefully.

This is also known as the gradient descent algorithm.
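Putting the three tasks together, here is a minimal gradient-descent sketch for fitting the line (the function name, the default alpha, and the fixed iteration count are assumptions for illustration):

```python
def fit_line(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1*x by gradient descent on halved MSE."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        theta0 -= alpha * grad0  # (theta0)new = (theta0)old - alpha*grad(theta0)
        theta1 -= alpha * grad1  # (theta1)new = (theta1)old - alpha*grad(theta1)
    return theta0, theta1

# The points below lie exactly on y = 1 + 2x, so the fitted parameters
# should approach theta0 = 1 and theta1 = 2.
t0, t1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(t0, t1)
```

Try changing `alpha` to see the dilemma above in action: very small values need many more iterations, while values that are too large make the parameters diverge.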