Linear regression implementation

Since linear regression is a trivial model, it is relatively easy to implement it from scratches and maybe in the future I’ll implement a full version on this page.

Many libraries enabling a user to build and train a linear regression model exist. In the last years I feel like scikit-learn and pytorch are the most widely used libraries in machine learning.

Reading data

For this example we are using house prices as a function of inhabitable surface and number of rooms. Data is stored in a csv file, to parse it into a python data structure we use pandas. This is a preliminary step for any approach and while some libraries may offer custom way to parse data I find that this is just better. Delegating parsing to a second library follows the single-responsibility principle. This is at least true for datasets saved in common formats like csv or tsv or similar. Sometimes we will deal with custom formats like Pytorch’s pt files: in that case it is obviously better (or sometimes necessary) to take care of data loading with the right library.

import pandas as pd

We read data from a csv file and cast it into a pandas.DataFrame

df = pd.read_csv('./data/house_pricing.csv')
df.head()

	sqf	rooms	price
0	2104	3	399900
1	1600	3	329900
2	2400	3	369000
3	1416	2	232000
4	3000	4	539900

This dataset has two feature columns (sqf and rooms) and a label column (price)

Let’s assign the features $X$ and the labels $y$ to two different variables

xy = df.values.T
X = xy[:-1].T
y = xy[-1]

Where the features $X$ are

X[:5]

array([[2104,    3],
       [1600,    3],
       [2400,    3],
       [1416,    2],
       [3000,    4]])

and their labels $y$

y[:5].reshape(-1, 1)

array([[399900],
       [329900],
       [369000],
       [232000],
       [539900]])

scikit-learn

Linear regression in scikit-learn is as easy as one line of code. To keep this first example as easy as possible, I’m not going to split the data in training and dev sets. I’m just fitting the model to the whole dataset. In a real scenario, there should be a preliminary step of dataset splitting.

Single feature

In order for the first example to be as simple as possible and plottable, for now we drop the rooms column from the features and we are only left with the sqf column. This means that in this first example we are exploring linear dependency between the inhabitable surface and the price of a house.

X_simple = X[:, 0]
X_simple

array([2104, 1600, 2400, 1416, 3000, 1985, 1534, 1427, 1380, 1494, 1940,
       2000, 1890, 4478, 1268, 2300, 1320, 1236, 2609, 3031, 1767, 1888,
       1604, 1962, 3890, 1100, 1458, 2526, 2200, 2637, 1839, 1000, 2040,
       3137, 1811, 1437, 1239, 2132, 4215, 2162, 1664, 2238, 2567, 1200,
        852, 1852, 1203])

Since the fit() function that we are using later wants a 2D-vector of shape $(m, n)$ and we only have one feature, we need to reshape the array in the form $(m, 1)$. On the other hand $y$ can either be a 2D or 1D array.

X_simple = X_simple.reshape(-1, 1)
X_simple[:5]

array([[2104],
       [1600],
       [2400],
       [1416],
       [3000]])

Building a linear regression model with scikit-learn requires the LinearRegression class

from sklearn.linear_model import LinearRegression

Now we can build the model buy instantiating the LinearRegression class

linreg = LinearRegression()

the linreg variable contains a linear regression object that allow the computation of the model, but we didn’t feed the data to it. Data is fed to the .fit() method

linreg = linreg.fit(X_simple, y)

The parameters and bias of the model are returned with

linreg.coef_, linreg.intercept_

(array([134.52528772]), 71270.49244872917)

svg

Multiple Features

We can now introduce the dataset split step that we oversaw in the previous example. In scikit-learn splitting the dataset in train and test set is taken care of for us through a function. The proportion of the split can be configured through its arguments.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We can now fit the all ($n=2$) features of the training set $X^t$

linreg = LinearRegression().fit(X_train, y_train)

Since this time $X^t \in \mathbb{R}^{m \times 2}$, we have 2 weight parameters and 1 bias parameter

linreg.coef_, linreg.intercept_

(array([ 112.70999509, 1927.44419957]), 118980.9276634129)

Parameters fitted on the training set can be used to produce prediction from the test set features

y_pred = linreg.predict(X_test)
y_pred

array([356619.09445432, 264072.81419883, 478334.51487332, 420751.08166304,
       347827.71483696, 233618.3669574 , 345348.09494488, 633333.50669408,
       260353.38436072, 293151.99293319, 337785.15099058, 323921.82159397,
       280303.05349243, 366988.41400301, 285600.42326187, 335429.61537657])

Predictions can be now compared to the labels of the test set

from sklearn.metrics import explained_variance_score
explained_variance_score(y_test, y_pred)

0.8286812907113245

Pytorch

Whereas scikit-learn is a high-level library, Pytorch is has a much lower-level approach. Many of the things that in scikit-learn happen under the hoods, in Pytorch need to be done manually.

The steps for training a model in Pytorch as defined in Pytorch documentation are

Load Dataset
Make Dataset Iterable
Create Model Class
Instantiate Model Class
Instantiate Loss Class
Instantiate Optimizer Class
Train Model

The main entry point of the framework is the torch module

import torch

The first noticeable Pytorch feature is that it works using a proprietary data-structure, called a tensor. The underlying mathematical concept of tensor is beyond the scope of this article but can be consulted at the Wikipedia tensor entry. In Pytorch, a tensor is (citing the pytorch tensor documentation) a multi-dimensional matrix containing elements of a single data type.

X_tensor = torch.tensor(X, dtype=torch.float32)
X_tensor[:5]

tensor([[2.1040e+03, 3.0000e+00],
        [1.6000e+03, 3.0000e+00],
        [2.4000e+03, 3.0000e+00],
        [1.4160e+03, 2.0000e+00],
        [3.0000e+03, 4.0000e+00]])

As you can notice we had to specify dtype=np.float32. This is because the underlying implementation of forward and backward propagation used by Pytorch under the hood would not work with the int type.

Furthermore, $y$ tensor would be 1D but this would not comply with requirements of Pytorch methods used below, so we transform it into a column vector with the unsqueeze(-1) method. This is equivalent to calling .reshape(-1, 1) on a numpy.array

y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(-1)
y_tensor[:5]

tensor([[399900.],
        [329900.],
        [369000.],
        [232000.],
        [539900.]])

Since data is in very different scales we need to first normalize it. Here we use standardization, which rescales data to have mean $\mu=0$ and standard deviation $\sigma=1$

\[X_\text{std} = \frac{X - \mu}{\sigma}\]

X_tensor_norm = (X_tensor - X_tensor.mean()) / torch.sqrt(X_tensor.var())
X_tensor_norm[:5]

tensor([[ 0.9590, -0.8692],
        [ 0.5204, -0.8692],
        [ 1.2166, -0.8692],
        [ 0.3603, -0.8701],
        [ 1.7387, -0.8684]])

A linear regression model can be built using the Linear class from the nn module, which initializes bias and weights automatically. Its constructor takes as input the number of columns of the input ($n_X$) and of the output ($n_y$)

model = torch.nn.Linear(2, 1)

Training the model will require some hyperparameters that we will define in advance for convenience:

epochs is the number of times the model will see all of our training samples;
alpha is the learning rate, which defines how big are the steps takes in updating the parameters;
loss_func is the loss function $\mathcal{L}$ used at training time. In this case we are using the MSELoss, which measures the mean squared error (squared L2 norm) between each element in the input $x$ and target $y$;
optim is the optimization algorithm used. In this case we are using SGD (Stochastic Gradient Descent). SGD requires us to select the correct $\alpha$. Up to this point we haven’t seen that other optimization algorithm exist that automatically adapt $\alpha$ to data (e.g. ADAM), so we are sticking with standard SGD.

epochs = 10
alpha = 0.01
loss_func = torch.nn.MSELoss()
optim = torch.optim.SGD(model.parameters(), lr=alpha)

Now we need to manually run over the epochs and trigger the update of the parameters that will ultimately produce a fitted model

from torch.autograd import Variable

for epoch in range(epochs):
    inputs = Variable(X_tensor_norm)
    labels = Variable(y_tensor)
    # Clear gradient buffers because from previous epoch
    optim.zero_grad()
    # get output from the model, given the inputs
    outputs = model(inputs)
    # get loss for the predicted output
    loss = loss_func(outputs, labels)
    # get gradients w.r.t to parameters
    loss.backward()
    # update parameters
    optim.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

epoch 0, loss 131183058944.0
epoch 1, loss 117986451456.0
epoch 2, loss 106161905664.0
epoch 3, loss 95566716928.0
epoch 4, loss 86073114624.0
epoch 5, loss 77566492672.0
epoch 6, loss 69944254464.0
epoch 7, loss 63114424320.0
epoch 8, loss 56994611200.0
epoch 9, loss 51510988800.0

model(inputs)[:5]

tensor([[150347.1562],
        [124711.3125],
        [165403.1250],
        [115392.8125],
        [195881.3750]], grad_fn=<SliceBackward>)