Linear regression and methods for its recovery

(Image source: xkcd)



Linear regression is one of the basic algorithms in many areas related to data analysis, and the reason is obvious: it is a simple and easy-to-understand algorithm, which has contributed to its widespread use for many decades, if not centuries. The idea is that we assume a linear dependence of one variable on a set of other variables and then try to recover that dependence.



This article, however, is not about applying linear regression to practical problems. Instead, it looks at interesting implementation details of distributed recovery algorithms that we ran into while writing the machine learning module for Apache Ignite. A little basic math, some machine learning fundamentals, and distributed computing will help you figure out how to recover a linear regression even when the data is distributed across thousands of nodes.



What are we talking about?



We are faced with the task of recovering a linear dependence. As input, we are given a set of vectors of presumably independent variables, each of which is associated with a certain value of the dependent variable. These data can be represented as two matrices:





$$
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & \dots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \dots & a_{2n} \\
\dots \\
a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn}
\end{pmatrix}
\qquad
\begin{pmatrix}
b_{1} \\ b_{2} \\ \dots \\ b_{m}
\end{pmatrix}
$$







Now, since the dependence is assumed to exist and to be linear, we write our assumption as a matrix product (to simplify the notation here and below, the free term of the equation is assumed to be hidden behind $x_n$, and the last column of the matrix $A$ consists of ones):





$$
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & \dots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \dots & a_{2n} \\
\dots \\
a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn}
\end{pmatrix}
\times
\begin{pmatrix}
x_{1} \\ x_{2} \\ \dots \\ x_{n}
\end{pmatrix}
=
\begin{pmatrix}
b_{1} \\ b_{2} \\ \dots \\ b_{m}
\end{pmatrix}
$$
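To make the convention above concrete, here is a minimal NumPy sketch with a hypothetical data set, showing how the free term is folded into $A$ by appending a column of ones:

```python
import numpy as np

# Hypothetical observations: m = 4 samples, n - 1 = 2 independent variables.
features = np.array([[1.0, 2.0],
                     [2.0, 0.5],
                     [3.0, 1.5],
                     [4.0, 3.0]])
b = np.array([3.1, 2.9, 5.2, 7.8])

# Append a column of ones so the free term is "hidden" behind x_n.
A = np.hstack([features, np.ones((features.shape[0], 1))])
print(A.shape)  # (4, 3): the last column consists of ones
```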







The matrix equation above looks very much like a system of linear equations, right? It does, but most likely such a system has no solution. The reason is the noise present in almost any real data. Another reason may be the absence of a linear relationship as such, which can be fought by introducing additional variables that depend nonlinearly on the original ones. Consider the following example:

(Image source: Wikipedia)



This is a simple linear regression example showing the dependence of one variable (along the $y$ axis) on another variable (along the $x$ axis). For the system of linear equations corresponding to this example to have a solution, all points would have to lie exactly on one straight line. But they do not, precisely because of noise (or because the assumption of a linear dependence was wrong). Thus, to recover a linear dependence from real data, one usually has to introduce one more assumption: the input data contain noise, and this noise has a normal distribution. Assumptions about other noise distributions can be made, but in the vast majority of cases it is the normal distribution that is used, and it is the one discussed below.



Maximum likelihood method



So, we have assumed normally distributed random noise. What do we do in such a situation? For this case, mathematics offers the widely used maximum likelihood method. In short, its essence lies in choosing a likelihood function and then maximizing it.



Let us return to recovering a linear dependence from data with normal noise. Note that the assumed linear relationship is the mathematical expectation $\mu$ of the underlying normal distribution. The probability that $\mu$ takes a particular value, given the observed values $x$, is written as follows:





$$
P(\mu \mid X, \sigma^2) = \prod_{x \in X} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$







Now we substitute the variables we need in place of $\mu$ and $X$:





$$
P(x \mid A, B, \sigma^2) = \prod_{i \in [1, m]} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(b_i - a_i x)^2}{2\sigma^2}}, \quad a_i \in A, \; b_i \in B
$$







It remains only to find the vector $x$ at which this probability is maximal. To maximize such a function, it is convenient to take its logarithm first (the logarithm reaches its maximum at the same point as the function itself):





$$
f(x) = \ln\left( \prod_{i \in [1, m]} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(b_i - a_i x)^2}{2\sigma^2}} \right)
= \sum_{i \in [1, m]} \ln \frac{1}{\sigma\sqrt{2\pi}} - \sum_{i \in [1, m]} \frac{(b_i - a_i x)^2}{2\sigma^2}
$$







The first sum does not depend on $x$, so maximizing this expression boils down to minimizing the following function:





$$
f(x) = \sum_{i \in [1, m]} (b_i - a_i x)^2
$$







This, by the way, is called the least squares method. Often, all the above reasoning is omitted and the method is simply used as is.
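As a minimal sketch, the function being minimized is just the sum of squared residuals ($A$, $b$, and $x$ here are arbitrary NumPy arrays of compatible shapes):

```python
import numpy as np

def least_squares_objective(A: np.ndarray, b: np.ndarray, x: np.ndarray) -> float:
    """f(x) = sum_i (b_i - a_i x)^2, the sum of squared residuals."""
    residuals = b - A @ x
    return float(residuals @ residuals)
```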



QR decomposition



The minimum of the above function can be found at the point where its gradient equals zero. The gradient is written as follows:





$$
\nabla f(x) = \sum_{i \in [1, m]} 2 a_i (a_i x - b_i) = 0
$$







QR decomposition is a matrix method for solving the minimization problem arising in the least squares method. To use it, we rewrite the equation in matrix form:





$$
A^T A x = A^T b
$$
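For comparison, here is a minimal NumPy sketch that solves these normal equations head-on (adequate only for small, well-conditioned problems, and not the QR-based approach described below):

```python
import numpy as np

def solve_normal_equations(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Solve A^T A x = A^T b directly."""
    return np.linalg.solve(A.T @ A, A.T @ b)
```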







So, we decompose the matrix $A$ into the matrices $Q$ and $R$ and perform a series of transformations (the QR decomposition algorithm itself is not covered here, only its use for the task at hand):





$$
\begin{align}
& (QR)^T (QR) x = (QR)^T b; \\
& R^T Q^T Q R x = R^T Q^T b.
\end{align}
$$







The matrix $Q$ is orthogonal, which allows us to get rid of the product $Q^T Q$:





$$
\begin{align}
& R^T R x = R^T Q^T b; \\
& (R^T)^{-1} R^T R x = (R^T)^{-1} R^T Q^T b; \\
& R x = Q^T b.
\end{align}
$$







And if we denote $Q^T b$ by $z$, we get $R x = z$. Given that $R$ is an upper triangular matrix, this looks as follows:





$$
\begin{pmatrix}
r_{11} & r_{12} & r_{13} & r_{14} & \dots & r_{1n} \\
0 & r_{22} & r_{23} & r_{24} & \dots & r_{2n} \\
0 & 0 & r_{33} & r_{34} & \dots & r_{3n} \\
0 & 0 & 0 & r_{44} & \dots & r_{4n} \\
\dots \\
0 & 0 & 0 & 0 & \dots & r_{nn}
\end{pmatrix}
\times
\begin{pmatrix}
x_1 \\ x_2 \\ x_3 \\ x_4 \\ \dots \\ x_n
\end{pmatrix}
=
\begin{pmatrix}
z_1 \\ z_2 \\ z_3 \\ z_4 \\ \dots \\ z_n
\end{pmatrix}
$$







This system can be solved by back substitution: the element $x_n$ is found as $z_n / r_{nn}$, the previous element $x_{n-1}$ as $(z_{n-1} - r_{n-1,n} x_n) / r_{n-1,n-1}$, and so on.
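Putting the pieces together, here is a minimal NumPy/SciPy sketch of this approach (QR factorization followed by a triangular back-substitution solve):

```python
import numpy as np
from scipy.linalg import solve_triangular

def solve_least_squares_qr(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Solve min ||Ax - b||^2 via the reduced QR decomposition of A."""
    Q, R = np.linalg.qr(A)         # A = QR, Q has orthonormal columns, R is upper triangular
    z = Q.T @ b                    # z = Q^T b
    return solve_triangular(R, z)  # back substitution for R x = z
```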



It is worth noting that the complexity of the resulting algorithm, due to the QR decomposition, is $O(2mn^2)$. Moreover, even though matrix multiplication parallelizes well, it is not possible to write an efficient distributed version of this algorithm.



Gradient descent



When speaking about minimizing a function, it is always worth recalling the method of (stochastic) gradient descent. It is a simple and effective minimization method based on iteratively computing the gradient of the function at a point and then taking a step in the direction opposite to the gradient. Each such step moves the solution closer to a minimum. The gradient looks the same as before:







$$
\nabla f(x) = \sum_{i \in [1, m]} 2 a_i (a_i x - b_i)
$$









This method also parallelizes and distributes well thanks to the linearity of the gradient operator. Note that in the formula above, the terms under the sum sign are independent. In other words, we can compute the gradient for all indices $i$ from $1$ to $k$ and, in parallel, compute the gradient for indices from $k+1$ to $m$, and then add the results. The sum will be the same as if we had computed the gradient for the indices from $1$ to $m$ at once. Thus, if the data is split across several partitions, the gradient can be computed independently on each partition, and the results of these computations can then be summed to obtain the final result:







$$
\nabla f(x) = \sum_{i \in P_1} 2 a_i (a_i x - b_i) + \sum_{i \in P_2} 2 a_i (a_i x - b_i) + \dots + \sum_{i \in P_k} 2 a_i (a_i x - b_i)
$$









In terms of implementation, this fits the MapReduce paradigm: at each gradient descent step, a task for computing the gradient is sent to each data node, the computed gradients are then collected, and the result of their summation is used to update the solution.
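To make the scheme concrete, here is a minimal single-process sketch of it, where a list of partitions stands in for the data nodes (a real implementation would run the map step remotely; the learning rate and step count are illustrative):

```python
import numpy as np

def partial_gradient(A_part: np.ndarray, b_part: np.ndarray, x: np.ndarray) -> np.ndarray:
    """'Map' step: gradient of the squared error over a single data partition."""
    return 2.0 * A_part.T @ (A_part @ x - b_part)

def gradient_descent(partitions, n: int, lr: float = 1e-3, steps: int = 1000) -> np.ndarray:
    """'Reduce' step inside the loop: sum the partial gradients, then step against the sum."""
    x = np.zeros(n)
    for _ in range(steps):
        grad = sum(partial_gradient(A_p, b_p, x) for A_p, b_p in partitions)
        x -= lr * grad
    return x
```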



Despite its simplicity of implementation and its fit with the MapReduce paradigm, gradient descent has its drawbacks. In particular, the number of steps required to reach convergence is significantly larger than with other, more specialized methods.



LSQR



LSQR is another method for solving the problem, suitable both for recovering linear regression and for solving systems of linear equations. Its main feature is that it combines the advantages of matrix methods with an iterative approach. Implementations can be found both in the SciPy library and in MATLAB. The method itself is not described here (see the article LSQR: An algorithm for sparse linear equations and sparse least squares). Instead, we show how LSQR can be adapted to execution in a distributed environment.
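For reference, here is a minimal sketch using the SciPy implementation mentioned above (the data here is hypothetical and the tolerances are illustrative):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# A hypothetical problem: A is m x n with a column of ones for the free term.
rng = np.random.default_rng(0)
A = np.hstack([rng.random((100, 2)), np.ones((100, 1))])
b = A @ np.array([1.5, -0.7, 2.0]) + 0.1 * rng.standard_normal(100)

x = lsqr(A, b, atol=1e-8, btol=1e-8)[0]  # the first element of the result is the solution vector
print(x)  # the last coefficient is the recovered free term
```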



The LSQR method is based on the bidiagonalization procedure. This is an iterative procedure, each iteration of which consists of the following steps:

(Image: the steps of one bidiagonalization iteration)



But since the matrix $A$ is partitioned horizontally, each iteration can be represented as two MapReduce steps. This makes it possible to minimize data transfers during each iteration (only vectors whose length equals the number of unknowns are transferred):



(Image: one bidiagonalization iteration expressed as two MapReduce steps)
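As a rough single-process illustration of those two steps, assuming the row blocks of $A$ stand in for data nodes (only the matrix-vector products are sketched, not the full LSQR recurrence):

```python
import numpy as np

def distributed_matvec(row_blocks, v: np.ndarray):
    """'Map' step 1: each node multiplies its horizontal block of A by v.

    Only v (length n) is sent to the nodes; the resulting slices of A v stay local."""
    return [A_block @ v for A_block in row_blocks]

def distributed_rmatvec(row_blocks, u_blocks) -> np.ndarray:
    """'Map' step 2: each node computes A_block^T u_block; the 'reduce' sums the length-n vectors."""
    return sum(A_block.T @ u_block for A_block, u_block in zip(row_blocks, u_blocks))
```

A full distributed LSQR would plug these two products into the bidiagonalization recurrence, so only vectors of length $n$ (the vector $v$ and the partial $A^T u$ sums) cross node boundaries on each iteration.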



This approach is used in the implementation of linear regression in Apache Ignite ML.



Conclusion



There are many algorithms for recovering linear regression, but not all of them are applicable in all conditions. QR decomposition is great for an exact solution on small data sets. Gradient descent is simple to implement and quickly finds an approximate solution. LSQR combines the best properties of the two: it can be distributed, it converges faster than gradient descent, and, unlike QR decomposition, it allows the algorithm to be stopped early to obtain an approximate solution.


