Original poster: hanszhu

[Download] 50 Lecture Notes [rewards for promotion]

11
罗马朗 posted on 2005-9-1 00:05:00
Too expensive, heh, I can't afford it.

12
zhouye posted on 2005-9-1 11:49:00
Too bad I have no money.
I have a house,
facing the sea, with spring flowers blossoming.

13
math2008 posted on 2005-9-5 08:01:00

These are all math books.

14
hanszhu posted on 2005-9-5 23:34:00

[Download]

LECTURE NOTES, November 20, 2002


ANNOUNCEMENT

This class will have a final exam on Wednesday, December 11, from 2pm to 4pm in APM 4882. Exam questions will be similar to assignment questions, but easier.


MIXTURE MODELS

Suppose we have data from a mixture of two Gaussians. The probability of generating x is
g(x) = pi N(x; theta_1) + (1 - pi) N(x; theta_2)
The model has five parameters: pi and the two means and variances.

Suppose we want to do MLE. The log-likelihood is

l(theta | X) = SUM_i log [ pi N(x_i; theta_1) + (1 - pi) N(x_i; theta_2) ]
Maximizing this is difficult because of the + inside the logarithm.
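As a concrete illustration, here is a minimal sketch (not from the notes) of evaluating this incomplete log-likelihood; the function and parameter names are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, pi_, mu1, sigma1, mu2, sigma2):
    # l(theta | X) = sum_i log[ pi N(x_i; theta_1) + (1 - pi) N(x_i; theta_2) ]
    density = pi_ * norm.pdf(x, mu1, sigma1) + (1 - pi_) * norm.pdf(x, mu2, sigma2)
    return np.sum(np.log(density))

Because the + sits inside the log, the five parameters are coupled, and setting derivatives to zero gives no closed-form solution.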

Also, the exact maximum gives parameter estimates that are unwanted: set one mean equal to a single observation and let its variance shrink to zero. Then the likelihood contribution of that point grows without bound, so the total likelihood is unbounded as well.

EM is a general iterative procedure for finding a local optimum for this type of hard MLE problem.


LEARNING WITH CENSORED DATA

Often, we want to learn from data with parts of observations missing (censored data). These situations give rise to "chicken and egg" problems.

Example: waiting for the AP&M elevator, where the waiting time has an exponential distribution, which has the Markov property of history-independence: Pr(T > t+r | T>t) = Pr(T > r)
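For concreteness, the memorylessness property follows directly from the exponential survival function (a standard calculation, not spelled out in the notes):

\Pr(T > t) = e^{-t/\mu}
\quad\Longrightarrow\quad
\Pr(T > t+r \mid T > t) = \frac{e^{-(t+r)/\mu}}{e^{-t/\mu}} = e^{-r/\mu} = \Pr(T > r).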

What is the expected waiting time, call it mu?

Data: wait 7 min, success
wait 12 min, give up (censored data)
wait 8 min, success
wait 5 min, give up

Guess mu = 8. Fill in the missing data using memorylessness (a wait censored at time t has expected total length t + mu), giving 7, 20, 8, 13. Now the new estimate of mu is the mean, which is 48/4 = 12. Repeat.

E step: compute the expected value, or a probability distribution, for the missing information.
M step: compute the MLE of the model parameters given the expected values of the missing data.
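Here is a minimal sketch of this fill-in-and-average loop for the elevator data; the variable names and the fixed iteration count are my own assumptions.

observed = [7.0, 8.0]    # waits that ended with the elevator arriving
censored = [12.0, 5.0]   # waits where we gave up after this many minutes

mu = 8.0                 # initial guess
for _ in range(50):
    # E step: by memorylessness, a wait censored at t has expected total length t + mu
    filled = observed + [t + mu for t in censored]
    # M step: the MLE of an exponential mean from complete data is the sample mean
    mu = sum(filled) / len(filled)

print(mu)

The first iteration reproduces (7 + 20 + 8 + 13)/4 = 12 as above; continuing, the estimate converges to 16, which is also the directly computed MLE for this censored data (total observed time 32 divided by 2 completed waits).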

Theorem: Under certain conditions, this process converges to a model with locally maximal likelihood L(data | theta).


THE EM FRAMEWORK

The EM algorithm is useful in two situations. One situation is where some training data is genuinely unobserved, i.e. censored. The elevator waiting time scenario above is an example of this situation.

The other situation is where maximizing the original incomplete likelihood L(theta | X) is too difficult. Consider the general mixture modeling scenario where we have components numbered i = 1 to i = M.

Suppose we have observed data X generated by a pdf with parameter theta. To estimate theta, we want to maximize the log-likelihood l(theta; X) = log p(X; theta).

Suppose there is additional data called Z also generated with parameter theta. Names for the Z data include latent, missing, and hidden.

Let the "complete" data be T = (X,Z) with log-likelihood l_0(theta; X, Z).

In the Gaussian mixture case, z_i can be a 0/1 variable that reveals whether x_i was generated by theta_1 or theta_2.
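Written out for one observation, under the convention (assumed here, and used again in the example below) that z_i = 0 selects the first component, the complete-data likelihood and log-likelihood are:

p(x_i, z_i \mid \theta) = \left[\pi\, N(x_i;\theta_1)\right]^{1-z_i}\left[(1-\pi)\, N(x_i;\theta_2)\right]^{z_i},

l_0(\theta; X, Z) = \sum_i \left\{ (1-z_i)\log\!\left[\pi\, N(x_i;\theta_1)\right] + z_i \log\!\left[(1-\pi)\, N(x_i;\theta_2)\right] \right\}.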


DERIVING THE EM ALGORITHM

By the definition of conditional probability, p(Z | X, theta') = p(Z, X | theta') / p(X | theta').

So p(X | theta') = p(Z, X | theta') / p(Z | X, theta').

Changing to log-likelihoods, l(theta'; X) = l_0(theta'; Z, X) - l_1(theta'; Z | fixed X)

Now we take the expectation of each side over Z, where the distribution of Z is a function of a given X and theta (different from theta'). On the left we have

E[ l(theta'; X) ] = E[ log p(X; theta') ] = log p(X; theta'), since the left side does not depend on Z (the expectation is over Z ~ f(X, theta)).
On the right we have
E[ l_0(theta'; Z, X) ] - E[ l_1(theta'; Z | fixed X) ] where Z ~ f(X,theta)
Call this expression Q(theta', theta) - R(theta', theta).

We have a lemma that says we can increase this expression just by increasing Q(theta', theta).

Lemma: If Q(theta', theta) > Q(theta,theta) then Q(theta', theta) - R(theta', theta) > Q(theta, theta) - R(theta, theta).

In words, this lemma says that if the expected complete log-likelihood is increased, then the incomplete log-likelihood is also increased.
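The step behind the lemma (not spelled out here) is that R can only decrease: by Jensen's inequality applied to the concave log,

R(\theta',\theta) - R(\theta,\theta)
 = \mathbb{E}_{Z \sim p(Z\mid X,\theta)}\!\left[\log \frac{p(Z\mid X,\theta')}{p(Z\mid X,\theta)}\right]
 \le \log \mathbb{E}_{Z \sim p(Z\mid X,\theta)}\!\left[\frac{p(Z\mid X,\theta')}{p(Z\mid X,\theta)}\right]
 = \log 1 = 0.

So R(theta', theta) <= R(theta, theta), and any theta' that increases Q also increases Q - R, i.e. the incomplete log-likelihood.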


MAXIMIZING Q

The E step is to evaluate Q(theta', thetaj) symbolically as a function of theta', for a given thetaj.

Based on the E step, the M step is to find the value for theta' that maximizes Q(theta', thetaj).

What is Q(theta', thetaj)? It is E [ l_0(theta'; T) ] where the expectation averages over alternative values for the missing data Z, whose distribution depends on the known data X and the parameter thetaj.

Often, a major simplification is possible in the E step. The simplification is to bring the expectation inside the l_0 function, so it applies just to Z:

E [ l_0(theta'; X, Z) ] = l_0(theta'; X, E[Z])
where the distribution of Z is a function of thetaj and also of X.


SIMPLIFYING THE EM STEPS

In the E step we evaluate Q(theta', thetaj) = E[ log p(X, Z | theta') | X, thetaj ] where the expectation is an integral over Z.

Often we can simplify this by finding a single special z value such that the integral over Z is the same as evaluating the integrand for this special z value.

The obvious choice for the special z value is the expectation of Z. We want to use z = E[ Z | X, thetaj ] as an imputed value for Z, instead of averaging over all possible values of Z. In this case Q(theta', thetaj) = log p(X, E[ Z | X, thetaj ] | theta').

In general, the expectation of a function equals the function of the expectation when the function is linear. Fortunately, many complete-data log-likelihoods are linear in the missing variables Z.

A further simplification: if a missing variable Yi is binary (0/1) valued, then p(Yi = 1) = E[ Yi ].

In the M step we have a function of theta to maximize. Usually we do this by computing the derivative and solving for where the derivative equals zero. Sometimes the derivative is a sum of terms where each term involves only part of the theta parameters. Then we can solve separately for where each term equals zero.
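For the two-Gaussian mixture treated below, this decoupling gives the familiar closed-form updates (standard results stated here for reference, not derived in the notes); writing gamma_i = E[Z_i | x_i, theta_j] for the responsibilities computed in the E step,

\pi' = \frac{1}{n}\sum_i (1-\gamma_i), \qquad
\mu_1' = \frac{\sum_i (1-\gamma_i)\,x_i}{\sum_i (1-\gamma_i)}, \qquad
{\sigma_1'}^2 = \frac{\sum_i (1-\gamma_i)\,(x_i-\mu_1')^2}{\sum_i (1-\gamma_i)},

and symmetrically for the second component with weights gamma_i.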


GAUSSIAN MIXTURE EXAMPLE

l_0(theta'; T) = log p(X, Z; theta') = SUM_{i s.t. Zi = 0} f(xi, theta'_1) + SUM_{i s.t. Zi = 1} f(xi, theta'_2)
= SUM_i [ (1 - Zi) f(xi, theta'_1) + Zi f(xi, theta'_2) ]
where f(xi, theta'_k) denotes the log density of xi under the component with parameters theta'_k.

Now we want to take the average value of this, for Z following a distribution that depends on X and thetaj. Note that f(xi,theta'1) and f(xi, theta'2) do not depend on Z. So we take them outside the expectation and get

SUM_i [ f(xi, theta'_1) (1 - E[Zi]) + f(xi, theta'_2) E[Zi] ]
For fixed X and thetaj, we can compute E[Zi] using Bayes' rule:
E[Zi] = P(Zi = 1 | X, thetaj) = P(Zi = 1 | xi, thetaj)
= P(Zi = 1 and xi | thetaj) / ( P(Zi = 1 and xi | thetaj) + P(Zi = 0 and xi | thetaj) )
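Putting the E step above and the M-step updates from the previous section together, here is a minimal sketch of one EM iteration for the two-component mixture; the function and variable names are illustrative assumptions, not from the notes.

import numpy as np
from scipy.stats import norm

def em_step(x, pi_, mu1, s1, mu2, s2):
    # E step: responsibilities E[Zi] = P(Zi = 1 | xi, thetaj) by Bayes' rule
    p1 = pi_ * norm.pdf(x, mu1, s1)        # P(Zi = 0 and xi | thetaj)
    p2 = (1 - pi_) * norm.pdf(x, mu2, s2)  # P(Zi = 1 and xi | thetaj)
    g = p2 / (p1 + p2)
    # M step: weighted mixing proportion, means, and standard deviations
    pi_new = np.mean(1 - g)
    mu1_new = np.sum((1 - g) * x) / np.sum(1 - g)
    mu2_new = np.sum(g * x) / np.sum(g)
    s1_new = np.sqrt(np.sum((1 - g) * (x - mu1_new) ** 2) / np.sum(1 - g))
    s2_new = np.sqrt(np.sum(g * (x - mu2_new) ** 2) / np.sum(g))
    return pi_new, mu1_new, s1_new, mu2_new, s2_new

Iterating em_step from a non-degenerate starting point increases the incomplete log-likelihood at every step, subject to the zero-variance pathology mentioned earlier.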

[This post was last edited by the author on 2006-4-27 14:06:50]

15
hanszhu posted on 2005-9-5 23:39:00

The Matrix Reference Manual


Copyright © 1998-2005 Mike Brookes, Imperial College, London, UK. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

To cite this manual use: Brookes, M., "The Matrix Reference Manual", [online] http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html, 2005


16
hanszhu posted on 2005-9-9 11:11:00

[Download]

MIT OpenCourseWare

» Political Science

» Quantitative Research Methods: Multivariate, Spring 2004

[This post was last edited by the author on 2006-4-27 14:28:26]

17
Trevor posted on 2005-9-19 07:51:00
Good Job!

18
ReneeBK posted on 2005-9-27 08:14:00
The stronger your math, the better you will do in economics and econometrics!

19
zhouzuyu posted on 2005-9-27 08:28:00

OP, it's too expensive and I can't afford it. Can't you lower the price a little?

If nobody can afford it, you still won't get any money, and without money you can't buy more of the good stuff on this forum!

I will remember to love, you taught me how!

20
libingfu posted on 2005-9-27 12:06:00
Quoting zhouzuyu's post of 2005-9-27 8:28:48:

OP, it's too expensive and I can't afford it. Can't you lower the price a little?

If nobody can afford it, you still won't get any money, and without money you can't buy more of the good stuff on this forum!

That's really true!
