Synopsis of “Machine Learning”. Mathematical statistics. The maximum likelihood method





Let us recall some definitions from mathematical statistics.



Let a probability space $(\Omega, \Sigma, P)$ be given.



Definition 1:



A random variable $\xi = \xi(\omega)$ taking values in a set $S$ with a $\sigma$-algebra of subsets $\Phi$ is any $(\Sigma, \Phi)$-measurable function $\xi \colon \Omega \to S$, i.e. $\forall A \subseteq S,\ A \in \Phi$ the condition $\xi^{-1}(A) = \{\omega \in \Omega \colon \xi(\omega) \in A\} \in \Sigma$ is satisfied.



Definition 2:



The sample space is the space of all possible values of an observation or sample, together with a $\sigma$-algebra of measurable subsets of this space.

Notation: $(B, \mathscr{B})$.



Random variables $\xi, \eta, \ldots \colon \Omega \to B$ defined on the probability space $(\Omega, \Sigma, P)$ generate probability measures $P_\xi\{C\} = P\{\xi \in C\},\ P_\eta\{C\} = P\{\eta \in C\}, \ldots$ on the space $(B, \mathscr{B})$. On a sample space one defines not a single probability measure, but a finite or infinite family of probability measures.



In problems of mathematical statistics, a family of probability measures $\{P_\theta,\ \theta \in \Theta\}$ defined on the sample space is known, and one has to determine from the sample which probability measure of this family corresponds to the sample.



Definition 3:



A statistical model is the collection consisting of a sample space and a family of probability measures defined on it.



Notation: $(B, \mathscr{B}, \mathscr{P})$, where $\mathscr{P} = \{P_\theta,\ \theta \in \Theta\}$.



Let $B = \mathbb{R}^n$, and let $(\mathbb{R}^n, \mathscr{B})$ be the sample space.



A sample $X = (x_1, \ldots, x_n)$ can be viewed as a collection of $n$ real numbers. We assign to each element of the sample a probability equal to $\frac{1}{n}$.



Let





$$I_x(B) = \begin{cases} 1, & x \in B \\ 0, & x \notin B \end{cases}$$







Definition 4:



The empirical distribution constructed from the sample $X$ is the probability measure $P_n$:





$$P_n(B) = \frac{1}{n} \sum_{k=1}^{n} I_{x_k}(B)$$







That is, $P_n(B)$ is the ratio of the number of sample elements that belong to $B$ to the total number of sample elements: $P_n(B) = \frac{\nu_n(B)}{n}$, where $\nu_n(B) = \sum\limits_{k=1}^{n} I(x_k \in B)$, $B \in \mathscr{B}$.
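As a small illustration, here is a minimal Python sketch of this definition; the function name and the sample values are made up for the example:

```python
import numpy as np

def empirical_measure(sample, B):
    """Empirical measure P_n(B): the fraction of sample points lying in B.

    For simplicity, B is given as a half-open interval [left, right).
    """
    left, right = B
    indicators = (sample >= left) & (sample < right)  # I(x_k in B) for each x_k
    return indicators.mean()                          # (1/n) * sum of indicators

# Hypothetical sample of n = 8 observations.
X = np.array([0.2, 1.5, 0.7, 2.3, 0.9, 1.1, 3.0, 0.4])
print(empirical_measure(X, (0.0, 1.0)))  # 4 of 8 points fall in [0, 1) -> 0.5
```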



Definition 5:



The sample moment of order $k$ is defined as





$$\hat{m}_k = \hat{m}_k(X) = \frac{1}{n} \sum_{j=1}^{n} x_j^k$$





$\hat{m}_1 = \overline{X} = \frac{1}{n} \sum\limits_{j=1}^{n} x_j$ is the sample mean.



Definition 6:



The sample central moment of order $k$ is defined by the equality





$$\hat{m}_k^{(0)} = \hat{m}_k^{(0)}(X) = \frac{1}{n} \sum_{j=1}^{n} (x_j - \overline{X})^k$$





$S^2 = S^2(X) = \hat{m}_2^{(0)} = \frac{1}{n} \sum\limits_{j=1}^{n} (x_j - \overline{X})^2$ is the sample variance.
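In Python these sample characteristics are one-liners; the sketch below (the sample values are arbitrary) computes $\hat{m}_k$ and $\hat{m}_k^{(0)}$ exactly as in the definitions above:

```python
import numpy as np

def sample_moment(sample, k):
    """Sample moment of order k: (1/n) * sum of x_j^k."""
    return np.mean(sample ** k)

def sample_central_moment(sample, k):
    """Sample central moment of order k: (1/n) * sum of (x_j - mean)^k."""
    return np.mean((sample - sample.mean()) ** k)

X = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
print(sample_moment(X, 1))          # sample mean: 2.4
print(sample_central_moment(X, 2))  # sample variance S^2: 1.04
print(np.var(X))                    # np.var uses the same 1/n convention: 1.04
```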



Many machine learning tasks amount to learning how to select, from the available data, a parameter $\theta$ that best describes this data. In mathematical statistics, the maximum likelihood method is often used to solve a similar problem.



In real life, errors often follow a normal distribution. As a partial justification, we state the central limit theorem.



Theorem 1 (CLT):



If the random variables $\xi_1, \ldots, \xi_n$ are independent and identically distributed with mathematical expectation $M(\xi_i) = a$ and variance $D(\xi_i) = \sigma^2 \in (0, +\infty)$ $\forall i \in \overline{1, n}$, then





$$\lim\limits_{n \to \infty} P\left\{\frac{\xi_1 + \xi_2 + \ldots + \xi_n - na}{\sigma \sqrt{n}} \leq x\right\} = F(x) = \frac{1}{\sqrt{2\pi}} \int\limits_{-\infty}^{x} e^{-u^2/2}\, du.$$
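A quick way to see the theorem at work is simulation. The sketch below (the distribution, $n$, and the number of trials are chosen arbitrarily) standardizes sums of i.i.d. uniform random variables and compares the empirical probability with the standard normal CDF $F(x)$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

n, trials = 50, 100_000
a, sigma = 0.5, sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)

xi = rng.uniform(0.0, 1.0, size=(trials, n))
z = (xi.sum(axis=1) - n * a) / (sigma * sqrt(n))  # standardized sums

x = 1.0
empirical = (z <= x).mean()       # P{(xi_1 + ... + xi_n - na) / (sigma sqrt(n)) <= x}
F = 0.5 * (1 + erf(x / sqrt(2)))  # standard normal CDF F(x)
print(empirical, F)               # both close to 0.8413
```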







Below we formulate the maximum likelihood method and illustrate how it works on the example of the family of normal distributions.



Maximum likelihood method



Suppose that for the statistical model $(B, \mathscr{B}, \mathscr{P} = \{P_\theta,\ \theta \in \Theta\})$ two conditions are satisfied: the family $\mathscr{P}$ is dominated by some $\sigma$-finite measure $\mu$ on $(B, \mathscr{B})$, and each measure $P_\theta$ has a density $f_\theta(x) = \frac{dP_\theta}{d\mu}(x)$ with respect to $\mu$.





Definition 7:



The maximum likelihood estimate (MLE) $\hat{\theta}$ of the parameter $\theta$, constructed from the empirical distribution $P_n$ corresponding to the sample $X = (x_1, \ldots, x_n)$, is the value $\theta \in \Theta$ at which $\max\limits_{\theta \in \Theta} \int \ln f_\theta(x)\, P_n(dx) = \max\limits_{\theta \in \Theta} \frac{1}{n} \sum\limits_{i=1}^{n} \ln f_\theta(x_i)$ is attained.



Definition 8:



The function $\Lambda_\theta(X) = \prod\limits_{i=1}^{n} f_\theta(x_i)$, considered as a function of $\theta$, is called the likelihood function, and the function $L(X, \theta) = \sum\limits_{i=1}^{n} \ln f_\theta(x_i)$ the log-likelihood function.



These functions attain their maxima at the same values of $\theta$, since $\ln x$ is a monotonically increasing function.
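A practical reason to prefer $L(X, \theta)$ over $\Lambda_\theta(X)$ is numerical: a product of many densities underflows to zero in floating point, while the sum of logarithms stays well-scaled. A minimal sketch (standard normal density, synthetic data) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=2000)  # synthetic sample

# Standard normal density evaluated at each sample point.
f = np.exp(-X ** 2 / 2) / np.sqrt(2 * np.pi)

print(np.prod(f))         # likelihood: underflows to 0.0 for large n
print(np.sum(np.log(f)))  # log-likelihood: a finite, usable number
```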



Example:



$\mathscr{P} = \{N(a, \sigma^2) \mid a \in \mathbb{R},\ \sigma \in (0, +\infty)\}$ is the family of normal distributions with densities $\phi_{a, \sigma^2}(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\{-\frac{1}{2\sigma^2}(x - a)^2\}$. For the sample $X = (x_1, \ldots, x_n)$:





$$\Lambda_{a, \sigma}(X) = \frac{1}{(2\pi)^{\frac{n}{2}} \sigma^n} \exp\left\{-\frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2\right\};$$









$$L(X, (a, \sigma)) = -\frac{n}{2} \ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2;$$









$$\frac{\partial L}{\partial a} = \frac{1}{\sigma^2} \sum\limits_{i=1}^{n} (x_i - a), \quad \frac{\partial L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum\limits_{i=1}^{n} (x_i - a)^2;$$









$$\frac{\partial L}{\partial a} = 0 \quad \Rightarrow \quad \sum\limits_{i=1}^{n} x_i - na = 0 \quad \Rightarrow \quad \hat{a} = \frac{1}{n} \sum\limits_{i=1}^{n} x_i = \overline{X};$$









$$\frac{\partial L}{\partial \sigma} = 0 \quad \Rightarrow \quad \frac{n}{\sigma} = \frac{1}{\sigma^3} \sum\limits_{i=1}^{n} (x_i - a)^2 \quad \Rightarrow \quad \hat{\sigma} = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \overline{X})^2} = \sqrt{S^2}.$$





Thus we have obtained estimates of the mathematical expectation and the variance.
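As a sanity check, the closed-form estimates $\hat{a} = \overline{X}$ and $\hat{\sigma} = \sqrt{S^2}$ can be compared with a direct numerical maximization of the log-likelihood. Here is a sketch using scipy.optimize on synthetic data; all names and values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, scale=2.0, size=1000)  # synthetic N(3, 4) sample

def neg_log_likelihood(params):
    """-L(X, (a, sigma)) from the formula above; minimized instead of maximized."""
    a, sigma = params
    n = len(X)
    return (n / 2 * np.log(2 * np.pi) + n * np.log(sigma)
            + np.sum((X - a) ** 2) / (2 * sigma ** 2))

res = minimize(neg_log_likelihood, x0=[0.0, 1.0],
               bounds=[(None, None), (1e-6, None)])

print(res.x)                         # numerical MLE (a_hat, sigma_hat)
print(X.mean(), np.sqrt(np.var(X)))  # closed form: sample mean and sqrt(S^2)
```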



If we look closely at the formula





$$L(X, (a, \sigma)) = -\frac{n}{2} \ln 2\pi - n \ln \sigma - \frac{1}{2\sigma^2} \sum\limits_{i=1}^{n} (x_i - a)^2$$





we can conclude that the function $L(X, (a, \sigma))$ attains its maximum value when $\sum\limits_{i=1}^{n} (x_i - a)^2$ is minimal. In machine learning problems, the least squares method is often used, in which the sum of squared deviations of the predicted values from the true ones is minimized.
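To make this connection concrete, here is a minimal least-squares sketch: fitting a line $y \approx wx + b$ by minimizing the sum of squared deviations, which under the Gaussian noise model above is exactly maximum likelihood for the line's parameters (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)  # noisy line

# Least squares: minimize sum of (y_i - (w * x_i + b))^2 over w and b.
A = np.column_stack([x, np.ones_like(x)])  # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(w, b)  # estimates close to the true slope 2.0 and intercept 1.0
```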


