EVERYTHING ABOUT LINEAR REGRESSION
PART 1: THEORY AND MATHEMATICAL INTUITION
In this article, I will walk you through a detailed explanation of regression theory, visualizations, evaluation metrics, and the hurdles one needs to clear to build a robust regression model. Its implementation will be discussed in my next article. After reading these two articles, believe me, you’ll understand almost every aspect of a regression problem.
Click the link to read the article “Everything about linear regression implementation (Python)” directly: https://medium.com/@saurabh62nagar/everything-about-linear-regression-df3ac4145c35
1. INTRODUCTION
As we know, there are three main machine learning categories: supervised, unsupervised and reinforcement learning. In supervised learning, models are trained on a labelled dataset. Linear regression falls under supervised machine learning; it allows you to predict future outcomes using historical data.
In supervised machine learning, we need to find the mapping function between the features (x) and the target/label (y), i.e. y = f(x). But the question arises: what should Y look like, and what values can it take?
Y can take either continuous or discrete values, and based on this, supervised learning is further divided into two types: regression and classification. If Y takes continuous values, it is a regression problem; if Y takes discrete values, it becomes a classification problem.
Example of a regression problem: estimating the amount of rainfall for a day. Example of a classification problem: estimating rainfall in categories such as low, medium and high.
In regression problems, y = f(x) relates the input (x) to the output (y). We, as ML engineers, need to find this mapping function f. The question is, what will be given to us? The answer is “both x and y”. Once f is found, we expose the same function to new input data (x') and predict Y for it using y' = f(x'), and then we can compare the actual Y with the predicted Y (y'). This is called model evaluation.
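To make this concrete, here is a minimal sketch of that workflow in Python, assuming scikit-learn is available; the numbers are made up purely for illustration.

```python
# Minimal supervised-learning workflow: learn f from historical (x, y),
# then predict y' = f(x') for new inputs. Data values are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # historical inputs (features)
y = np.array([30, 35, 42, 48, 55])        # historical outputs (labels)

model = LinearRegression()
model.fit(x, y)                           # find the mapping function f

x_new = np.array([[6]])                   # new, unseen input x'
y_pred = model.predict(x_new)             # y' = f(x')
print(y_pred)                             # compare with the actual y once it is known
```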
2. REGRESSION
Linear regression is a statistical model used to analyze and understand the strength and character of the relationship between two variables (input x and output y). This technique is used for forecasting, time-series modelling and finding cause-and-effect relationships. There are two tasks in performing linear regression:
- How closely are x and y related? The correlation coefficient gives a value between -1 and 1, indicating the strength of the relationship.
- Once the relationship between x and y is known, use it to predict future outcomes using y = mx + c (equivalently, y = a0 + a1x).
A regression line can represent a positive or a negative linear relationship. When the correlation is positive, the regression slope is also positive; when the correlation is negative, the regression slope is negative.
2.1 TYPES OF REGRESSION
- Simple Linear Regression (y=mx+c, one input and one output).
- Multiple Linear Regression (multiple inputs and one output).
- Polynomial Regression.
- Logistic Regression (used for classification problems).
The above are types of regression, not the techniques used in machine learning to perform regression.
2.2 MULTIPLE LINEAR REGRESSION
Multiple linear regression is a statistical technique that uses multiple explanatory variables to predict the outcome of a response variable, whereas simple linear regression uses only one explanatory variable.
Y = m1x1 + m2x2 + m3x3 + … + mnxn + c
where Y is the target (dependent variable), x1…xn are the independent variables, m1…mn are the slopes (coefficients) and c is the intercept.
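Below is a hedged sketch of multiple linear regression in Python (assuming scikit-learn); the feature values are invented for illustration.

```python
# Multiple linear regression: one slope per independent variable plus an intercept.
# The numbers below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 20, 3],
              [2, 18, 5],
              [3, 22, 4],
              [4, 25, 6],
              [5, 30, 8]])                # columns: x1, x2, x3
y = np.array([10, 14, 19, 25, 32])        # target Y

model = LinearRegression().fit(X, y)
print(model.coef_)                        # m1, m2, m3
print(model.intercept_)                   # c
```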
2.3 POLYNOMIAL REGRESSION
Polynomial regression is used to fit a linear model to non-linear data points by creating new features from the powers of the original features and treating each power as a separate linear term. This is needed because on non-linear data a straight regression line will not fit well.
Example with a quadratic feature: y = m1x1 + m2x2² + c (non-linear in x2).
Substitute x2' = x2², and the equation becomes y = m1x1 + m2x2' + c, which is linear in the new feature x2'.
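As a rough illustration (assuming scikit-learn), the snippet below creates the powered feature automatically and then fits an ordinary linear model on it; the data is synthetic.

```python
# Polynomial regression sketch: generate quadratic data, add x^2 as a new feature,
# and fit a model that is still linear in its parameters.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 2 * x.ravel() ** 2 + 1 + rng.normal(0, 1, 30)   # non-linear (quadratic) data

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict([[1.5]]))
```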
3. NEED FOR LINEAR REGRESSION
To understand this, let’s take an example. Suppose you want to estimate the salary of an employee based on his/her experience, using the company’s employee history data. Here, both salary and experience are continuous variables: experience is the independent variable (x) and salary is the dependent variable (y). By checking the relationship between x and y, we can estimate an employee’s future salary.
4. CORRELATION AND REGRESSION
KEY SIMILARITIES
- Both quantify the direction and strength of the relationship between two numeric variables.
- When the correlation (r) is negative, the regression slope (m) will be negative.
- When the correlation is positive, the regression slope will be positive.
- The squared correlation (r² or R²) has a special meaning in simple linear regression: it represents the proportion of variation in Y explained by X.
KEY DIFFERENCES
- Regression attempts to establish how X causes Y to change and the results of the analysis will change if X and Y are swapped. With correlation, the X and Y variables are interchangeable.
- Regression assumes X is fixed with no error, such as a dose amount or temperature setting. With correlation, X and Y are typically both random variables, such as height and weight or blood pressure and heart rate.
- Correlation is a single statistic, whereas regression produces an entire equation.
5. LINEAR REGRESSION ASSUMPTIONS
Regression is a parametric approach. “Parametric” means it makes assumptions about the data and is therefore restrictive in nature: it fails to give good results on datasets that do not follow its assumptions. Therefore, to build a robust regression model, it is important to validate its assumptions.
First, let’s see what the assumptions are, and then we will look at how to validate them.
- There should be a linear and additive relationship between the independent (x) and dependent (y) variables. Linear means the change in Y due to a one-unit change in X is constant regardless of the value of X. Additive means the effect of one independent variable on Y is independent of the other variables.
- There should be no correlation among the residual (error) terms. Correlation among residuals is termed autocorrelation.
- Independent variables should not be correlated with each other. Correlation among independent variables is called multicollinearity.
- The errors (residuals) must have constant variance; this is called homoscedasticity. Non-constant variance in the errors is termed heteroscedasticity.
- Error terms must be normally distributed.
SOLUTION: To overcome the problem of non-linearity, you can apply a square-root, log or Box-Cox transformation to the predictors (x). To handle heteroscedasticity, perform a log or square-root transformation on the response variable (y).
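For the curious, here is a rough sketch of how some of these assumptions are commonly checked in Python, assuming statsmodels and scipy are installed; the data is randomly generated for illustration.

```python
# Quick assumption checks on the residuals of a fitted OLS model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

print(durbin_watson(residuals))   # close to 2 suggests no autocorrelation
print(stats.shapiro(residuals))   # large p-value suggests roughly normal residuals
# For homoscedasticity, plot residuals vs fitted values and look for constant spread;
# for multicollinearity, inspect the correlation matrix or variance inflation factors.
```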
6. REGRESSION ACCURACY METRICS
(A) R-SQUARE : R-square tells how much of the variation in the response is explained by the independent variables. It is the most common metric to judge the performance of regression models. Its value lies between 0 and 1, i.e. 0% to 100%.
Example: if the R-square value is 16%, the model explains only 16% of the variation in the response, leaving the rest unexplained. “The higher the R-square, the better the model.”
DISADVANTAGE OF R-SQUARE : It assumes that every independent variable (x) in the model explains variation in the dependent variable (y); therefore it stays the same or increases with the addition of more predictors, even if they are irrelevant.
(B) ADJUSTED R-SQUARE : One should always check Adjusted R-square, as its value increases only if the added independent variable improves the explanation of variation in the dependent variable. If the model doesn’t improve, the Adjusted R-square value decreases.
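A common formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. Here is a small sketch computing both metrics, assuming scikit-learn; the values are illustrative.

```python
# Compute R-square with scikit-learn and Adjusted R-square with the standard formula.
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

y_true = [10, 14, 19, 25, 32]        # made-up actuals
y_pred = [11, 13, 20, 24, 33]        # made-up predictions from a 2-feature model
print(r2_score(y_true, y_pred), adjusted_r2(y_true, y_pred, n_features=2))
```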
(C) COST FUNCTION : This is based on the ordinary least squares (OLS) method. While doing linear regression, we can draw many different lines for different values of the slope and intercept. The main question is which of those lines actually represents the right relationship between X and Y, i.e. “which is the best-fit line?”. OLS says the line with the minimum total sum of squared differences is the best-fit line. To find it, we can use the Mean Squared Error (MSE) as the criterion; for linear regression, this MSE is nothing but the cost function.
MSE is the mean of the squared differences between the predicted and actual values; the output is a single number representing the cost. So the line with the minimum cost function (MSE) represents the relationship between X and Y in the best possible manner. Once we have the slope and intercept of the line that gives the least error, we can use that line to predict Y.
Like MSE, we also have Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). MSE is often preferred over MAE because squaring amplifies the error terms, so even a few large errors have a strong impact on the metric.
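The snippet below is a small sketch of how these four metrics can be computed with NumPy; the actual and predicted values are made up.

```python
# MSE, RMSE, MAE and MAPE on illustrative values.
import numpy as np

y_true = np.array([10.0, 14.0, 19.0, 25.0, 32.0])
y_pred = np.array([11.0, 13.0, 20.0, 24.0, 33.0])

errors = y_true - y_pred
mse  = np.mean(errors ** 2)                     # squares amplify large errors
rmse = np.sqrt(mse)                             # same units as y
mae  = np.mean(np.abs(errors))                  # average absolute error
mape = np.mean(np.abs(errors / y_true)) * 100   # error as a percentage of actuals
print(mse, rmse, mae, mape)
```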
I would encourage you to also explore terms such as TSS, RSS and residuals (errors).
7. GRADIENT DESCENT
Gradient descent is a method of updating the m and c values to minimize the cost function. A regression model starts from randomly selected values and uses gradient descent to update them automatically so that the cost function decreases. Therefore we can say that “gradient descent is an optimization algorithm that tweaks the parameters (m and c) iteratively to bring the cost function as close as possible to its minimum value”.
The update rule is m := m − α·∂J/∂m and c := c − α·∂J/∂c, where J is the cost function and α is the learning rate. Using this rule, gradient descent updates m and c iteratively until the cost function reaches (or gets very close to) its minimum.
Your model stops learning when the gradient (slope of the cost function) is zero or near zero.
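To make the idea concrete, here is a minimal from-scratch sketch of gradient descent for simple linear regression; the learning rate, iteration count and data are illustrative choices, not recommendations.

```python
# Gradient descent for y = m*x + c, minimizing MSE.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])   # roughly y = 2x + 1 with noise

m, c = 0.0, 0.0        # start from arbitrary values
lr = 0.01              # learning rate (alpha)

for _ in range(5000):
    y_pred = m * x + c
    error = y_pred - y
    dm = (2 / len(x)) * np.sum(error * x)   # dJ/dm
    dc = (2 / len(x)) * np.sum(error)       # dJ/dc
    m -= lr * dm                            # m := m - alpha * dJ/dm
    c -= lr * dc                            # c := c - alpha * dJ/dc

print(m, c)            # approaches the best-fit slope and intercept
```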
8. EVALUATING COEFFICIENTS
In regression analysis, p-values and coefficients together indicate which relationships in the model are statistically significant. Coefficients describe the mathematical relationship between each independent variable and the dependent variable, while the p-values for the coefficients indicate whether or not these relationships are significant.
P < 0.05 : reject the null hypothesis (significant; retain the variable)
P > 0.05 : fail to reject the null hypothesis (not significant; consider dropping the variable)
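One common way to see coefficients and their p-values side by side is an OLS fit in statsmodels; the sketch below uses randomly generated data for illustration.

```python
# Inspect coefficients and p-values of an OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 5 + 2 * X[:, 0] + rng.normal(size=100)   # only the first feature truly matters

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)    # intercept and slopes
print(model.pvalues)   # p-value per coefficient; large values flag weak evidence
```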
9. CHALLENGES IN PREDICTION
(A) UNDERFITTING : If model training is poor, i.e. the model is not able to fit the data properly, then the model’s learning will be poor. In this situation your model will not work well, not even on the training dataset. This is called the underfitting problem, and accuracy will be low on both the train and the test dataset.
To handle this problem, re-clean and pre-process your data, and re-train your model.
(B) OVERFITTING : When you train your model too strictly, i.e. you try to fit the model through each and every point of the data, the model will work best on the training dataset but will fail on the test/new dataset. So accuracy on test predictions will be very low compared to the training predictions. This problem arises when the model is too complex for the data, and multicollinearity among predictors often makes it worse.
So the question is: how do we overcome these situations?
Regularization techniques solve the overfitting problem. We can use the Ridge and Lasso regression algorithms to handle it; we’ll discuss these during the implementation part in the next article.
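As a preview, here is a brief sketch of Ridge and Lasso with scikit-learn; the alpha values are illustrative, not recommendations.

```python
# Regularized regression: alpha controls the strength of the penalty.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero
print(ridge.coef_)
print(lasso.coef_)
```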
MULTICOLLINEARITY : Multicollinearity occurs when two or more independent variables that are correlated among themselves try to explain the response variable. Let me give you an example:
“A crime scene and an Investigating officer”
Suppose an investigating officer is investigating a crime scene and there are 3 eyewitnesses: two of them are telling the same story and one is telling a different story about the crime.
Question : Should the officer interview all 3 of them for further investigation, or only 2 of them, i.e. 1 of the 2 witnesses with the same story plus the other person?
Answer : The officer should consider only 2 witnesses (one from the pair telling the same story, plus the one with the different story) to get a clear understanding of the crime scene while saving energy and time.
Therefore, once you have the dataset:
- Calculate the correlation of each independent variable with the dependent variable, and the correlations of the independent variables among themselves.
- Correlation should be zero or low among the independent variables and high between the independent and dependent variables.
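A quick way to run this check is a correlation matrix, for example with pandas; the column names and values below are made up for illustration.

```python
# Correlation of predictors with the target (want high) and with each other (want low).
import pandas as pd

df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5],
    "age":        [22, 24, 25, 28, 30],   # likely correlated with experience
    "salary":     [30, 35, 42, 48, 55],   # target
})

corr = df.corr()
print(corr["salary"])                    # correlation of each variable with the target
print(corr.loc["experience", "age"])     # correlation among predictors
```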
With this, the theory part of regression finishes. If you liked this particular article, please do hit the clap button. Don’t forget to read PART 2 of this article to complete your practical understanding of LINEAR REGRESSION. Click on the link: https://medium.com/@saurabh62nagar/everything-about-linear-regression-df3ac4145c35
THANK YOU !