Synopsis on “Machine Learning”. Mathematical analysis. Gradient descent





Recall the mathematical analysis



Function Continuity and Derivative



Let E subseteq mathbbR , a Is the limit point of the set E (those. a inE, forall varepsilon>0 space space|(a varepsilon,a+ varepsilon) capE|= infty ), f colonE to mathbbR .



Definition 1 (Cauchy function limit):



Function f colonE to mathbbR committed to A at x seeking to a , if a





 forall varepsilon>0 space space exists delta>0 space space forallx inE space space(0<|xa|< delta Rightarrow|f(x)A|< varepsilon).







Designation:  lim limitsE nix toaf(x)=A .



Definition 2:



  1. Interval ab called set ] a, b [\ space: = \ {x \ in \ mathbb {R} | a <x <b \} ;
  2. Point Interval x in mathbbR is called the neighborhood of this point.
  3. A punctured neighborhood of a point is a neighborhood of a point from which this point itself is excluded.


Designation:



  1. V(x) or U(x) - neighborhood of a point x ;
  2.  overset circU(x) - punctured neighborhood of a point x ;
  3. UE(x):=E capU(x), overset circUE(x):=E cap overset circU(x)


Definition 3 (function limit through neighborhoods):









 lim limitsE nix toaf(x)=A:= forallVR(A) space exists overset circUE(a) space space(f( overset circUE(a)) subsetVR(A)).







Definitions 1 and 3 are equivalent.



Definition 4 (continuity of a function at a point):



  1. f colonE to mathbbR continuous in a inE:=





    = forallV(f(a)) space space existsUE(a) space space(f(UE(a)) subsetV(f(a)));





  2. f colonE to mathbbR continuous in a inE:=





     forall varepsilon>0 space space exists delta>0 space space forallx inE space space(|xa|< delta Rightarrow|f(x)f(a)|< varepsilon).







Definitions 3 and 4 show that

( f colonE to mathbbR continuous in a inE where a - limit point E )  Leftrightarrow

 Leftrightarrow( lim limitsE nix toaf(x)=f(a)).



Definition 5:



Function f colonE to mathbbR called continuous on the set E if it is continuous at each point of the set E .



Definition 6:



  1. Function f colonE to mathbbR defined on the set E subset mathbbR is called differentiable at the point a inE limiting for the set E if there is such a linear with respect to the increment xa argument function A cdot(xa) [function differential f at the point a ] that increment f(x)f(a) the functions f represented as





    f(x)f(a)=A cdot(xa)+o(xa) quadfor spacex toa, spacex inE.





  2. Value





    f(a)= lim limitsE nix toa fracf(x)f(a)xa







    called derivative function f at the point a .


Also





f(x)= lim substackh to0x+h,x inE fracf(x+h)f(x)h.









Definition 7:



  1. Dot x0 inE subset mathbbR is called the point of local maximum (minimum) , and the value of the function in it is called the local maximum (minimum) of the function f colonE to mathbbR , if a  existsUE(x0) :





     forallx inUE(x0) space spacef(x) leqf(x0)(respectively,f(x) geqf(x0)).





  2. The points of local maximum and minimum are called points of local extremum , and the values ​​of the function in them are called local extrema of the function .
  3. Dot x0 inE extremum function f colonE to mathbbR called an internal extremum point if x0 is the limit point as for the set E _- = \ {x \ in E | x <x_0 \} , and for the set E _ + = \ {x \ in E | x> x_0 \} .


Lemma 1 (Fermat):



If the function f colonE to mathbbR differentiable at the point of internal extremum x0 inE , then its derivative at this point is zero: f(x0)=0 .



Proposition 1 (Roll's theorem):

If the function f colon[a,b] to mathbbR continuous on a segment [a,b] differentiable in the interval ]a,b[ and f(a)=f(b) then there is a point  xi in]a,b[ such that f( xi)=0 .



Theorem 1 (Lagrange finite increment theorem):



If the function f colon[a,b] to mathbbR continuous on a segment [a,b] and differentiable in the interval ]a,b[ then there is a point  xi in]a,b[ such that





f(b)f(a)=f( xi)(ba).







Corollary 1 (a sign of monotonicity of a function):

If at any point of a certain interval the derivative of the function is non-negative (positive), then the function does not decrease (increases) on this interval.



Corollary 2 (criterion for the constancy of a function):

Continuous on a cut [a,b] a function is not constant if and only if its derivative is zero at any point in the interval [a,b] (or at least the interval ]a,b[ )



Partial derivative of a function of many variables



Across  mathbbRm denote the set:







\ mathbb {R} ^ m = \ underbrace {\ mathbb {R} \ times \ mathbb {R} \ times \ cdots \ times \ mathbb {R}} _ m = \ {(\ omega_1, \ omega_2, ... , \ omega_m), \ space \ omega_i \ in \ mathbb {R} \ space \ forall i \ in \ overline {1, m} \}.









Definition 8:



Function f colonE to mathbbR defined on the set E subset mathbbRm is called differentiable at the point x inE limiting for the set E , if a





f(x+h)f(x)=L(x)h+ alpha(x;h), qquad(1)





Where L(x) colon mathbbRm to mathbbR - linear with respect to h function [function differential f at the point x (reference df(x) or f(x) )], but  alpha(x;h)=o(h) at h to0,x+h inE .



Relation (1) can be rewritten as follows:





f(x+h)f(x)=f(x)h+ alpha(x;h)





or





 bigtriangleupf(x;h)=df(x)h+ alpha(x;h).







If we go to the coordinate record of the point x=(x1,...,xm) , vector h=(h1,...,hm) and linear function L(x)h=a1(x)h1+...+am(x)hm , then equality (1) looks like this





f(x1+h1,...,xm+hm)f(x1,...,xm)==a1(x)h1+...+am(x)hm+o(h) quadfor space spaceh to0, qquad(2)





Where a1(x),...,am(x) - associated with point x real numbers. You need to find these numbers.



We denote





hi=hiei=0 cdote1+...+0 cdotei1+hi cdotei+0 cdotei+1+...+0 cdotem,





Where \ {e_1, ..., e_m \} - basis in  mathbbRm .



At h=hi from (2) we obtain





f(x1,...,xi1,xi+hi,xi+1,...,xm)f(x1,...,xi,...,xm)==ai(x)hi+o(hi) quadfor space spacehi to0. qquad(3)









From (3) we obtain





ai(x)= limhi to0 fracf(x1,...,xi1,xi+hi,xi+1,..,xm)f(x1,...,xi,...,xm)hi. qquad(4)







Definition 9:

The limit (4) is called the partial derivative of the function f(x) at the point x=(x1,...,xm) by variable xi . It is designated:





 frac partialf partialxi(x), quad partialif(x), quadfxi(x).









Example 1:





f(u,v)=u3+v2 sinu, partial1f(u,v)= frac partialf partialu(u,v)=3u2+v2 cosu, partial2f(u,v)= frac partialf partialv(u,v)=2v sinu.













Gradient descent



Let f colon mathbbRn to mathbbR where \ mathbb {R} ^ n = \ underbrace {\ mathbb {R} \ times \ mathbb {R} \ times \ cdots \ times \ mathbb {R}} _ n = \ {(\ theta_1, \ theta_2, ... , \ theta_n), \ space \ theta_i \ in \ mathbb {R} \ space \ forall i \ in \ overline {1, n} \} .



Definition 10:



Gradient Function f colon mathbbRn to mathbbR called a vector, i whose element is equal to  frac partialf partial thetai :





 bigtriangledown thetaf= left( beginarrayc frac partialf partial theta1 frac partialf partial theta2 vdots frac partialf partial thetan endarray right), quad theta=( theta1, theta2,..., thetan).







Gradient is the direction in which the function increases most rapidly. This means that the direction in which it decreases most rapidly is the direction opposite to the gradient, i.e.  bigtriangledown thetaf .



The aim of the gradient descent method is to find the extremum (minimum) point of the function.



Denote by  theta(t) function parameter vector in step t . Parameter update vector in step t :





u(t)= eta bigtriangledown thetaf( theta(t1)), quad theta(t)= theta(t1)+u(t).







In the formula above, the parameter  eta Is the learning speed that controls the size of the step that we take in the direction of the gradient slope. In particular, two opposing problems may arise:











Example:

Consider the example of the gradient descent method in the simplest case ( n=1 ) I.e f colon mathbbR to mathbbR .

Let f(x)=x2, quad theta(0)=3, quad eta=1 . Then:





 frac partialf partialx(x)=2x quad Rightarrow quad bigtriangledownf theta(x)=2x; theta(1)= theta(0)1 cdotf theta( theta(0))=36=3; theta(2)= theta(1)1 cdotf theta( theta(1))=3+6=3= theta(0).





In the case when  eta=1 , the situation turns out, as in the third image of the picture above. We constantly jump over the extreme point.

Let  eta=0.8 . Then:





 theta(1)= theta(0)0.8 timesf theta( theta(0))=30.8 times6=34.8=1.8; theta(2)= theta(1)0.8 timesf theta( theta(1))=1.8+0.8 times3.6=1.8+2.88=1.08; theta(3)= theta(2)0.8 timesf theta( theta(2))=1.080.8 times2.16=1.081.728=0.648; theta(4)= theta(3)0.8 timesf theta( theta(3))=0.648+0.8 times1.296=0.648+1.0368=0.3888; theta(5)= theta(4)0.8 timesf theta( theta(4))=0.38880.8 times0.7776=0.3888.62208=0.23328; theta(6)= theta(5)0.8 timesf theta( theta(5))=0.23328+0.8 times0.46656=0.23328+0.373248==0.139968.





It is seen that iteratively we are approaching the point of extremum.

Let  eta=0.5 . Then:





 theta(1)= theta(0)0.5 timesf theta( theta(0))=30.5 times6=33=0; theta(2)= theta(1)0.5 timesf theta( theta(1))=00.5 times0=0.





The extremum point was found in 1 step.



Bibliography:








All Articles