Linear regression is undoubtedly an under-appreciated model. While it is the starting point for many data scientists building regression models, since it serves as a good baseline, more often than not it is quickly put on the back burner and forgotten. Yet despite having almost no hyperparameters to tune, even if we count regularized linear models as the same model, there are still many ways to improve it until it ranks among the best.
Why should you spend time developing a linear model?
Linear regression and, by extension, all Generalized Linear Models (GLMs) are parametric discriminative learning algorithms. They belong to a class of supervised models whose prediction is based on the probability of the target variable given the training examples: P(Y|X). The complexity of the model remains the same even as the number of training examples increases, unlike non-parametric models such as Random Forest and Support Vector Machines. This makes the model easy to interpret and fast to train. In addition, figuring out the feature importance is as straightforward as calculating the t-statistic of each coefficient.
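For instance, a quick way to look at those t-statistics is with the statsmodels package (a library outside Scikit-learn); a minimal sketch, using made-up data in place of your own feature matrix X and target y:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: replace with your own feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

# Add an intercept column and fit ordinary least squares
ols = sm.OLS(y, sm.add_constant(X)).fit()

# t-statistics and p-values hint at how important each coefficient is
print(ols.tvalues)
print(ols.pvalues)
```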
Linear regression summarizes the relationship between the features and the target as a linear one even if it is better explained by a non-linear relationship. This oversimplification often leads to sub-par performance from under-fitting. The goal in such a situation is to explore different strategies for increasing the complexity of the model so that it can capture the signal.
How to overcome underfitting?
Extending Linear Model with Polynomial Features
As you can see from Figure 1 above, simply fitting a plain linear model results in high bias: it fails to capture the shape of the true underlying function indicated by the red curve. In order to successfully model the data with linear regression, one option is to extend the model with Polynomial Features in Scikit-learn, creating new features that are polynomial combinations of the existing ones.
With a 4th degree polynomial extension of linear regression, the model at the center of Figure 1 is able to capture the non-linear relationship between the features and the target. However, as the model grows in complexity with each higher degree, its variance increases sharply. Using grid search to find the optimal degree and evaluating the training and test scores helps you land on a model that explains the signal without chasing the noise.
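A rough sketch of that workflow, assuming a feature matrix X and target y are already loaded, treating the degree as a tunable parameter in a pipeline:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Pipeline: expand the features, then fit a plain linear regression
model = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("linreg", LinearRegression()),
])

# Search over the polynomial degree; cross-validation scores guard against
# picking a degree that merely fits the noise
grid = GridSearchCV(model, param_grid={"poly__degree": range(1, 7)}, cv=5)
grid.fit(X, y)  # X, y: your training data
print(grid.best_params_, grid.best_score_)
```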
Transforming Continuous Features
An alternative to polynomial extension is to apply a logarithmic, square root, or exponential transformation to selected individual features without increasing the number of predictors. This is a more targeted approach that replaces an existing feature whose relationship with the target is non-linear. For example, a person's age is not directly proportional to their weight: the average weight difference between a 7-year-old and an 8-year-old is larger than the difference between a 70-year-old and a 71-year-old. This can be done in Scikit-learn with grid search inside a pipeline, using Column Transformer and Function Transformer.
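A minimal sketch of that idea, assuming a pandas DataFrame X_train with a hypothetical age column and other numeric columns that are passed through unchanged:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Apply a transformation to the 'age' column only; keep the rest as-is
preprocess = ColumnTransformer(
    [("age", FunctionTransformer(np.log1p), ["age"])],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess), ("linreg", LinearRegression())])

# Let grid search decide which transformation of age works best
# (None keeps the column unchanged)
param_grid = {"prep__age__func": [np.log1p, np.sqrt, None]}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)  # X_train: DataFrame with an 'age' column
```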
Transforming Categorical Features
Another option for dealing with a non-linear relationship between a continuous feature and the target is to discretize the feature into interval groups. Continuing with the example of predicting weight from age, intelligently splitting age into groups of peers whose weights are similar could improve the prediction results. Just like Function Transformer, pairing K Bins Discretizer with Column Transformer in Scikit-learn fits seamlessly into a preprocessing pipeline. It is worth experimenting with different binning strategies to see which configuration works best.
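A minimal sketch under the same assumptions (a DataFrame with a hypothetical age column), letting grid search pick the number of bins and the binning strategy:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Bin the 'age' column into interval groups, one-hot encoded by default
preprocess = ColumnTransformer(
    [("age_bins", KBinsDiscretizer(encode="onehot"), ["age"])],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess), ("linreg", LinearRegression())])

# Try different bin counts and strategies to see which captures the signal
param_grid = {
    "prep__age_bins__n_bins": [4, 6, 8, 10],
    "prep__age_bins__strategy": ["uniform", "quantile", "kmeans"],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)
```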
There are a few ways to handle categorical data so that machine learning algorithms can pick up the relevant signal.
When the order of the categories is informative and useful in narrowing down what the target could be, try preserving this order by transforming the feature with Ordinal Encoder in Scikit-learn. This is the case, for instance, when analyzing the default risk of loan applicants given information on their education background. All possible categories should be listed and ranked before being mapped to numerical values. This initial step prevents the model from breaking at test time when it encounters a category it has not seen in training.
As for other categorical features, the most common way to handle them is to transform each category into a separate binary feature: 1 if it applies to the training example and 0 otherwise. This is done in Scikit-learn through the One Hot Encoder. This method of encoding works best when a predictor has only a few categories, e.g. gender or marital status.
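A combined sketch of both encoders inside a Column Transformer; the column names and the ranking of education levels below are purely hypothetical:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ranked categories for an ordered feature
education_levels = ["high school", "bachelor", "master", "doctorate"]

preprocess = ColumnTransformer(
    [
        # Preserve the natural order of education levels
        ("education", OrdinalEncoder(categories=[education_levels]), ["education"]),
        # Expand low-cardinality features into binary columns
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender", "marital_status"]),
    ],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess), ("linreg", LinearRegression())])
model.fit(X_train, y_train)  # X_train: DataFrame containing those columns
```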
While One Hot Encoder can be used to transform a feature with many categories, it is unlikely to produce good predictions for a linear model. A better alternative to one-hot encoding is the Target Encoder or one of the other smart encoders. These are available in a separate Python package called Category Encoders, which works well alongside Scikit-learn, including inside a preprocessing pipeline. These sophisticated encoders learn the relationship of each category with the target and quantify it as a numerical value. Smart encoders are truly indispensable and are used in many winning models of data science competitions.
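A minimal sketch using the Category Encoders package, with a hypothetical high-cardinality city column:

```python
from category_encoders import TargetEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# TargetEncoder replaces each category of 'city' with a smoothed estimate
# of the target mean for that category, learned during fit
model = Pipeline([
    ("encode", TargetEncoder(cols=["city"])),
    ("linreg", LinearRegression()),
])

model.fit(X_train, y_train)  # the encoder sees y_train to learn the mapping
preds = model.predict(X_test)
```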
Pro Tip:
It is indeed possible to include different smart encoders as parameters in grid search. Simply give the encoder slot in the Column Transformer a name and a placeholder transformer, any of the smart encoders in this case, in the list of transformers. Finally, add that named step with a list of candidate encoders to the dictionary of parameters for grid search to process, just as you normally would.
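A sketch of that trick under the same assumptions, where the Target Encoder in the named slot is just the placeholder that grid search swaps out:

```python
from category_encoders import CatBoostEncoder, JamesSteinEncoder, TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Named slot 'smart' holds a placeholder encoder to be swapped by grid search
preprocess = ColumnTransformer(
    [("smart", TargetEncoder(), ["city"])],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess), ("linreg", LinearRegression())])

# The whole transformer in the 'smart' slot becomes a tunable parameter
param_grid = {
    "prep__smart": [TargetEncoder(), CatBoostEncoder(), JamesSteinEncoder()],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```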
Regularizing Model with Ridge and Lasso
Sometimes the dataset you are working with contains many features, hundreds, thousands, or even hundreds of thousands of attributes, and sometimes it has more columns of features than rows of observations. Many of these predictors could be correlated, which means multicollinearity exists in the data. This can have a negative impact on the predictions of a linear model. While you could try to find and eliminate the correlated features through a correlation plot or VIF, that may not be feasible when there is a very large number of predictors.
Lasso regression is well suited to deal with such predicaments. Its L1 regularization tends to select one of the correlated features and drop the rest of the correlated attributes. This not only makes the model more compact but also reduces its variance and complexity. It is extremely important to standardize all features before fitting any regularized model, as features on larger numerical scales would otherwise dominate the penalty term.
While lasso regression is great for dropping predictors from the model, it may be too extreme a measure for a model that is only slightly overfitted. This is where ridge regression comes into play. As you generate new features to undo the effects of bias in a simple linear model, you will eventually run into the "curse of dimensionality": data points become increasingly scattered and sparse in higher-dimensional spaces. Linear and distance-based models are especially prone to these effects, and adding L2 regularization to the overfitted model takes care of the problem by shrinking each predictor's effect on the target variable.
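A minimal sketch of both regularized models, with standardization baked into the pipeline and the regularization strength chosen by cross-validation (X_train and y_train are placeholders):

```python
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Lasso (L1): can zero out redundant, correlated predictors entirely
lasso = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5)),
])

# Ridge (L2): gently shrinks every coefficient instead of dropping features
ridge = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0])),
])

lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)

# Count how many predictors lasso kept
n_kept = (lasso.named_steps["lasso"].coef_ != 0).sum()
print(f"Lasso kept {n_kept} of {X_train.shape[1]} features")
```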
Fitting Model with Huber Loss
Oftentimes, the loss function in linear regression is not the right choice for the data at hand. Unlike many other models, linear regression makes some strict assumptions about the data-generating process; in particular, the residuals have to be normally distributed, which means outliers should be rare or nonexistent in the dataset. While removing outliers is part of data preprocessing, deleting data without sufficient justification, for example when an outlier is not the result of an error in data collection or recording, is not standard practice and should not be done.
Ordinary least squares, linear regression's loss function, amplifies the loss from outliers more than that from other data points, which may cause the model to overcompensate during training. A more appropriate Scikit-learn model for data containing some outliers is the Huber Regressor. Huber loss is a hybrid loss function that applies mean absolute error (MAE) to outliers and values far from the prediction, and mean squared error (MSE) to data points closer to the prediction. There is a tunable hyperparameter, delta (exposed as epsilon in Scikit-learn), that determines which loss function is used for a given residual size.
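A minimal sketch comparing the two on training data assumed to contain a few outliers; note that Scikit-learn exposes the switching threshold as epsilon:

```python
from sklearn.linear_model import HuberRegressor, LinearRegression

# Plain least squares: outliers pull the fitted line toward them
ols = LinearRegression().fit(X_train, y_train)

# Huber loss: squared error for small residuals, absolute error beyond
# the threshold epsilon (the "delta" of the Huber loss literature)
huber = HuberRegressor(epsilon=1.35).fit(X_train, y_train)

print("OLS R^2:  ", ols.score(X_test, y_test))
print("Huber R^2:", huber.score(X_test, y_test))
```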
Fitting with Other GLMs
You probably already know that when a problem involves predicting a binary class, one of the models best suited for the task is logistic regression. But are you aware that logistic regression, just like linear regression, belongs to the family of Generalized Linear Models (GLMs), and that there are other GLMs besides these two available in Scikit-learn for you to use?
The selection of an appropriate GLM depends on the range of all possible values of the target. For example, if you are trying to predict a count, such as the ridership of a transportation system on a given day, you might want to test how well the Poisson Regressor stacks up against Linear Regression in Scikit-learn, since the underlying data-generating process is likely to be Poisson distributed. Similarly, if you want to predict the average public transport commute time for a given period, the Gamma Regressor may be more suitable for the task.
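A minimal sketch of that comparison, with hypothetical count and duration targets (y_counts, y_times); note that the GLMs' built-in score is D² rather than R²:

```python
from sklearn.linear_model import GammaRegressor, LinearRegression, PoissonRegressor

# Counts per day, e.g. daily ridership: a Poisson assumption is natural
poisson = PoissonRegressor(alpha=1e-3).fit(X_train, y_counts)
baseline = LinearRegression().fit(X_train, y_counts)

# PoissonRegressor.score returns deviance explained (D^2),
# LinearRegression.score returns R^2, so compare with care
print("Poisson D^2:", poisson.score(X_test, y_counts_test))
print("Linear  R^2:", baseline.score(X_test, y_counts_test))

# Strictly positive, right-skewed targets such as commute times
gamma = GammaRegressor().fit(X_train, y_times)
```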
In addition to the Poisson and Gamma Regressors, Scikit-learn provides the Tweedie Regressor, which you can think of as a generic class of GLM. Depending on the hyperparameter power, the model assumes a different distribution of the target variable: a power of 0 corresponds to the normal distribution, 1 to Poisson, values between 1 and 2 to compound Poisson-gamma, 2 to gamma, and 3 to inverse Gaussian.
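And a short sketch of the Tweedie Regressor itself, with the power-to-distribution mapping spelled out in comments:

```python
from sklearn.linear_model import TweedieRegressor

# power=0        -> normal distribution
# power=1        -> Poisson
# 1 < power < 2  -> compound Poisson-gamma (e.g. mostly-zero, positive totals)
# power=2        -> gamma
# power=3        -> inverse Gaussian
model = TweedieRegressor(power=1.5, alpha=0.1, link="log")
model.fit(X_train, y_train)  # y must be non-negative for this power
```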
Unlike Linear Regression in Scikit-learn, the default scorer for GLMs is the percentage of deviance explained, D². While R² and D² coincide when the target distribution is normal, as in linear regression, the calculations differ for other GLMs. It is best to keep this in mind when comparing the score of another regression model, such as random forest or SVM, to that of a GLM.
Summary
Although linear regression is a relatively simple and inflexible model, there are many techniques and tools available in Scikit-learn to help it perform better. This may involve smart encoding of categorical features, replacing the squared (L2) loss with Huber loss, or any combination of the tricks we explored in this blog post. As data scientists, we owe it to ourselves to experiment with the different tools in our toolbox to come up with a solution that is simple enough yet able to generalize well.
Thanks for reading!