LECTURE NOTES
November 20, 2002
ANNOUNCEMENT
This class will have a final exam, on Wednesday December 11 from 2pm to 4pm in APM 4882. Exam questions will be similar to assignment questions, but easier.
MIXTURE MODELS
Suppose we have data from a mixture of two Gaussians. The probability density of generating x is
g(x) = pi N(x, theta_1) + (1 - pi) N(x, theta_2)
The model has five parameters, pi and the two means and variances.
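As a quick illustration (not part of the original notes), here is a minimal Python sketch of this two-component density; the function names and example parameter values are made up for this sketch.

import math

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, pi_1, mu1, sigma1, mu2, sigma2):
    # g(x) = pi N(x, theta_1) + (1 - pi) N(x, theta_2)
    return pi_1 * gaussian_pdf(x, mu1, sigma1) + (1 - pi_1) * gaussian_pdf(x, mu2, sigma2)

# Example: weight 0.3 on N(0, 1) and weight 0.7 on N(4, 1.5^2).
print(mixture_pdf(1.0, 0.3, 0.0, 1.0, 4.0, 1.5))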
Suppose we want to do MLE. The log likelihood is
l(theta | X) = SUM_i log [ pi N(x_i, theta_1) + (1 - pi) N(x_i, theta_2) ]
Maximizing this is difficult because of the + inside the logarithm.
Also, the global maximum of this likelihood is degenerate: set one component's mean equal to a single observation and let its variance shrink to zero. The density of that point then grows without bound, so the total likelihood is unbounded.
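To see the problem concretely, here is the standard calculation written out (added in LaTeX; not in the original notes): set mu_1 = x_1 and let sigma_1 shrink toward zero.

\[
\ell(\theta \mid X) \;\ge\; \log\big[\pi\, N(x_1; \mu_1, \sigma_1^2)\big] + \sum_{i \ge 2} \log\big[(1-\pi)\, N(x_i; \mu_2, \sigma_2^2)\big].
\]
\[
\text{With } \mu_1 = x_1:\qquad \pi\, N(x_1; x_1, \sigma_1^2) \;=\; \frac{\pi}{\sqrt{2\pi}\,\sigma_1} \;\to\; \infty \quad \text{as } \sigma_1 \to 0,
\]
while the remaining terms do not involve sigma_1, so the log-likelihood is unbounded above.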
EM is a general iterative procedure for finding a local optimum for this type of hard MLE problem.
LEARNING WITH CENSORED DATA
Often, we want to learn from data with parts of observations missing (censored data). These situations give rise to "chicken and egg" problems.
Example: waiting for the AP&M elevator, where the waiting time has an exponential distribution, which has the memoryless (history-independence) property: Pr(T > t + r | T > t) = Pr(T > r)
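For completeness, the memoryless property can be checked in one line using the exponential survival function Pr(T > t) = e^{-t/mu}, where mu is the mean waiting time:

\[
\Pr(T > t + r \mid T > t) \;=\; \frac{\Pr(T > t + r)}{\Pr(T > t)} \;=\; \frac{e^{-(t+r)/\mu}}{e^{-t/\mu}} \;=\; e^{-r/\mu} \;=\; \Pr(T > r).
\]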
What is the expected time to wait, called mu?
Data: wait 7 min, success
wait 12 min, give up (censored data)
wait 8 min, success
wait 5 min, give up
Guess mu = 8. By memorylessness, the expected total wait for a censored observation is the observed time plus mu, so fill in the missing data as 7, 20, 8, 13. The new estimate of mu is the mean of these values, 48/4 = 12. Repeat until the estimate stops changing.
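Here is a minimal Python sketch of this iteration (an illustration written for these notes, not code from the lecture); the variable names, iteration cap, and convergence tolerance are arbitrary choices.

# EM for exponential waiting times with censored observations.
successes = [7.0, 8.0]   # fully observed waits
censored = [12.0, 5.0]   # waits at which we gave up

mu = 8.0  # initial guess
for _ in range(100):
    # E step: by memorylessness, the expected total wait for a censored
    # observation is the observed waiting time plus the current estimate of mu.
    imputed = [t + mu for t in censored]
    # M step: the MLE of the exponential mean is the sample mean of the filled-in data.
    new_mu = (sum(successes) + sum(imputed)) / (len(successes) + len(censored))
    if abs(new_mu - mu) < 1e-9:
        break
    mu = new_mu

print(mu)  # the first iteration gives 12, as in the notes

Solving the fixed-point equation mu = 8 + mu/2 directly gives the limit 16, which matches the standard censored-data MLE for the exponential (total waiting time 32 divided by the 2 completed waits).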
E step: compute the expected value of, or a probability distribution over, the missing information.
M step: compute the MLE of the model parameters, given the expected values (or distribution) for the missing data.
Theorem: Under certain conditions, this process converges to a model theta whose likelihood L(theta; data) is a local maximum.
THE EM FRAMEWORK
The EM algorithm is useful in two situations. One situation is where some training data is genuinely unobserved, i.e. censored. The elevator waiting time scenario is an example of this situation.
The other situation is where maximizing the original, incomplete likelihood L(theta | X) directly is too difficult. The general mixture modeling scenario, with components numbered i = 1 to i = M, is an example of this situation.
Suppose we have observed data X generated by a pdf with parameter theta. To estimate theta, we want to maximize the log-likelihood l(theta; X) = log p(X; theta).
Suppose there is additional data called Z also generated with parameter theta. Names for the Z data include latent, missing, and hidden.
Let the "complete" data be T = (X,Z) with log-likelihood l_0(theta; X, Z).
In the Gaussian mixture case, z_i can be a 0/1 variable that reveals whether x_i was generated by component 1 (z_i = 0, parameters theta_1) or component 2 (z_i = 1, parameters theta_2).
DERIVING THE EM ALGORITHM
By the definition of conditional probability, p(Z | X, theta') = p(Z, X | theta') / p(X | theta').
So p(X | theta') = p(Z, X | theta') / p(Z | X, theta').
Changing to log-likelihoods, l(theta'; X) = l_0(theta'; Z, X) - l_1(theta'; Z | fixed X)
Now we take the expectation of each side over Z, where the distribution of Z is a function of a given X and theta (different from theta'). On the left we have
E[ l(theta'; X) ] = E[ log p(X; theta') ] = log p(X; theta'), where Z ~ f(X, theta); the expectation has no effect because l(theta'; X) does not depend on Z.
On the right we have
E[ l_0(theta'; Z, X) ] - E[ l_1(theta'; Z | fixed X) ] where Z ~ f(X,theta)
Call this expression Q(theta', theta) - R(theta', theta).
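For reference, the identity just derived can be written compactly as follows (a restatement added here, not in the original notes):

\[
\log p(X; \theta') \;=\; \underbrace{\mathbb{E}_{Z \sim p(Z \mid X, \theta)}\!\big[\log p(X, Z; \theta')\big]}_{Q(\theta', \theta)} \;-\; \underbrace{\mathbb{E}_{Z \sim p(Z \mid X, \theta)}\!\big[\log p(Z \mid X; \theta')\big]}_{R(\theta', \theta)}.
\]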
We have a lemma that says we can maximize this expression just by maximizing Q(theta', theta).
Lemma: If Q(theta', theta) > Q(theta,theta) then Q(theta', theta) - R(theta', theta) > Q(theta, theta) - R(theta, theta).
In words, this lemma says that if the expected complete log-likelihood is increased, then the incomplete log-likelihood is also increased.
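The standard justification, sketched here since the lecture states the lemma without proof, is that R(theta', theta) <= R(theta, theta) for every theta', by Gibbs' inequality (relative entropy is non-negative):

\[
R(\theta, \theta) - R(\theta', \theta) \;=\; \mathbb{E}_{Z \sim p(Z \mid X, \theta)}\!\left[\log \frac{p(Z \mid X; \theta)}{p(Z \mid X; \theta')}\right] \;=\; \mathrm{KL}\big(p(Z \mid X; \theta) \,\|\, p(Z \mid X; \theta')\big) \;\ge\; 0.
\]

So if Q(theta', theta) > Q(theta, theta), then Q(theta', theta) - R(theta', theta) >= Q(theta', theta) - R(theta, theta) > Q(theta, theta) - R(theta, theta).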
MAXIMIZING Q
The E step is to evaluate Q(theta', thetaj) symbolically as a function of theta', for a given thetaj, the current parameter estimate at iteration j.
Based on the E step, the M step is to find the value for theta' that maximizes Q(theta', thetaj).
What is Q(theta', thetaj)? It is E [ l_0(theta'; T) ] where the expectation averages over alternative values for the missing data Z, whose distribution depends on the known data X and the parameter thetaj.
Often, a major simplification is possible in the E step. The simplification is to bring the expectation inside the l_0 function, so it applies just to Z:
E [ l_0(theta'; X, Z) ] = l_0(theta'; X, E[Z])
where the distribution of Z is a function of thetaj and also of X.
SIMPLIFYING THE EM STEPS
In the E step we evaluate Q(theta', thetaj) = E[ log p(X, Z | theta') | X, thetaj ] where the expectation is an integral (or sum) over Z.
Often we can simplify this by finding a single special z value such that the integral over Z is the same as evaluating the integrand for this special z value.
The obvious choice for the special z value is the expectation of Z. We want to use z = E[ Z | X, thetaj ] as an imputed value for Z, instead of averaging over all possible values of Z. In this case Q(theta', thetaj) = log p(X, E[ Z | X, thetaj ] | theta').
In general, the expectation of a function is the function of the expectation if and only if the function is linear. Fortunately, many log-likelihood functions are linear.
A further simplification: if a missing variable Zi is binary-valued (0 or 1), then p(Zi = 1) = E[ Zi ].
In the M step we have a function of theta to maximize. Usually we do this by computing the derivative and solving for where the derivative equals zero. Sometimes the derivative is a sum of terms where each term involves only some of the parameters in theta. Then we can set each term to zero and solve for each group of parameters separately.
GAUSSIAN MIXTURE EXAMPLE
l_0(theta'; T) = log p(X, Z; theta') = SUM_{i s.t. Zi = 0} f(xi, theta'_1) + SUM_{i s.t. Zi = 1} f(xi, theta'_2)
= SUM_i (1 - Zi) f(xi, theta'_1) + Zi f(xi, theta'_2)
where f(xi, theta'_1) = log[ pi' N(xi, theta'_1) ] and f(xi, theta'_2) = log[ (1 - pi') N(xi, theta'_2) ] are the log-probabilities of generating xi from component 1 and component 2 respectively.
Now we want to take the average value of this, for Z following a distribution that depends on X and thetaj. Note that f(xi, theta'_1) and f(xi, theta'_2) do not depend on Z. So we take them outside the expectation and get
SUM_i f(xi, theta'_1) (1 - E[Zi]) + f(xi, theta'_2) E[Zi]
For fixed X and thetaj, we can compute E[Zi] using Bayes' rule:
E[Zi] = P(Zi = 1 | X, thetaj) = P(Zi = 1 | xi, thetaj)
= p(Zi = 1, xi | thetaj) / [ p(Zi = 1, xi | thetaj) + p(Zi = 0, xi | thetaj) ]
= (1 - pi_j) N(xi, theta_{2,j}) / [ pi_j N(xi, theta_{1,j}) + (1 - pi_j) N(xi, theta_{2,j}) ]
where pi_j, theta_{1,j}, and theta_{2,j} are the parameter estimates at iteration j.
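To tie the pieces together, here is a compact Python sketch of EM for the two-component Gaussian mixture (an illustration written for these notes, not code from the lecture). The E step computes E[Zi] exactly as above; the M step uses the standard weighted-MLE updates for pi, the means, and the variances, which come from setting the derivatives of Q to zero; the initialization and the fixed iteration count are arbitrary choices.

import math
import random

def normal_pdf(x, mu, var):
    # Density of N(mu, var) at x.
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, n_iter=200):
    # Crude initialization: put the two means near the lower and upper quartiles.
    xs_sorted = sorted(xs)
    n = len(xs)
    pi = 0.5
    mu1, mu2 = xs_sorted[n // 4], xs_sorted[3 * n // 4]
    var1 = var2 = max(1e-6, (max(xs) - min(xs)) ** 2 / 12.0)

    for _ in range(n_iter):
        # E step: E[Zi] = P(xi came from component 2 | xi, current parameters).
        ez = []
        for x in xs:
            p1 = pi * normal_pdf(x, mu1, var1)        # Zi = 0, component 1
            p2 = (1 - pi) * normal_pdf(x, mu2, var2)  # Zi = 1, component 2
            ez.append(p2 / (p1 + p2))

        # M step: weighted MLEs; each parameter group is solved separately.
        n2 = sum(ez)
        n1 = n - n2
        pi = n1 / n
        mu1 = sum((1 - e) * x for e, x in zip(ez, xs)) / n1
        mu2 = sum(e * x for e, x in zip(ez, xs)) / n2
        var1 = max(1e-6, sum((1 - e) * (x - mu1) ** 2 for e, x in zip(ez, xs)) / n1)
        var2 = max(1e-6, sum(e * (x - mu2) ** 2 for e, x in zip(ez, xs)) / n2)

    return pi, mu1, var1, mu2, var2

# Example: data drawn from 0.4 * N(0, 1) + 0.6 * N(5, 1).
random.seed(0)
data = [random.gauss(0, 1) if random.random() < 0.4 else random.gauss(5, 1) for _ in range(500)]
print(em_two_gaussians(data))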