Original poster: 蓝色

[General statistics question] How can I compute the Chow test statistic

http://www.stata.com/support/faqs/stat/chow.html

How can I compute the Chow test statistic?

Title:  Computing the Chow statistic
Author: William Gould, StataCorp
Date:   January 1999; minor revisions July 2005

You can include the dummy variables in a regression of the full model and then use the test command on those dummies. Alternatively, you can run each of the models separately, write down the appropriate numbers, and calculate the statistic by hand; Stata's distribution functions will then give you the appropriate p-values.
Here is a longer answer:

Let’s start with the Chow test to which many refer. Consider the model

 y = a + b*x1 + c*x2 + u 
and say that we have two groups of data. We could estimate that model on the two groups separately:
 y = a1 + b1*x1 + c1*x2 + u        for group == 1
 y = a2 + b2*x1 + c2*x2 + u        for group == 2
and we could estimate a single, pooled regression
 y = a + b*x1 + c*x2 + u for both groups 
In the last regression, we are asserting that a1==a2, b1==b2, and c1==c2. The formula for the “Chow test” of this constraint is
         ess_c - (ess_1 + ess_2)
         -----------------------
                    k
        ----------------------------
              ess_1 + ess_2
             -----------------
             N_1 + N_2 - 2*k
and this is the formula to which people refer. ess_1 and ess_2 are the error sums of squares from the separate regressions, ess_c is the error sum of squares from the pooled (constrained) regression, k is the number of estimated parameters (k=3 in our case), and N_1 and N_2 are the number of observations in the two groups.

The resulting test statistic is distributed F(k, N_1+N_2-2*k).

Let’s try this. I have created small datasets:

 clear
 set obs 100
 set seed 1234
 generate x1 = uniform()
 generate x2 = uniform()
 generate y = 4*x1 - 2*x2 + 2*invnormal(uniform())
 generate group = 1
 save one, replace

 clear
 set obs 80
 generate x1 = uniform()
 generate x2 = uniform()
 generate y = -2*x1 + 3*x2 + 8*invnormal(uniform())
 generate group = 2
 save two, replace

 use one, clear
 append using two
 save combined, replace
The models are different in the two groups, the residual variances are different, and so are the numbers of observations. With this dataset, I can carry out the Chow test. First, I run the separate regressions:
. regress y x1 x2 if group==1

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =   36.10
       Model |  328.686307     2  164.343154           Prob > F      =  0.0000
    Residual |  441.589627    97  4.55247038           R-squared     =  0.4267
-------------+------------------------------           Adj R-squared =  0.4149
       Total |  770.275934    99  7.78056499           Root MSE      =  2.1337

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   5.121087    .728493     7.03   0.000      3.67523    6.566944
          x2 |  -3.227026   .7388209    -4.37   0.000    -4.693381   -1.760671
       _cons |  -.1725655   .5698273    -0.30   0.763    -1.303515    .9583839
------------------------------------------------------------------------------

. regress y x1 x2 if group==2

      Source |       SS       df       MS              Number of obs =      80
-------------+------------------------------           F(  2,    77) =    5.02
       Model |   544.11726     2   272.05863           Prob > F      =  0.0089
    Residual |  4169.24211    77  54.1460014           R-squared     =  0.1154
-------------+------------------------------           Adj R-squared =  0.0925
       Total |  4713.35937    79  59.6627768           Root MSE      =  7.3584

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   -1.21464     2.9578    -0.41   0.682    -7.104372    4.675092
          x2 |    8.49714   2.688249     3.16   0.002     3.144152    13.85013
       _cons |    -2.2591    1.91076    -1.18   0.241     -6.06391    1.545709
------------------------------------------------------------------------------
and then I run the combined regression:
. regress y x1 x2

      Source |       SS       df       MS              Number of obs =     180
-------------+------------------------------           F(  2,   177) =    2.93
       Model |  176.150454     2  88.0752272           Prob > F      =  0.0559
    Residual |  5316.21341   177   30.035104           R-squared     =  0.0321
-------------+------------------------------           Adj R-squared =  0.0211
       Total |  5492.36386   179   30.683597           Root MSE      =  5.4804

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   2.692373    1.41842     1.90   0.059    -.1068176    5.491563
          x2 |   2.061004   1.370448     1.50   0.134    -.6435156    4.765524
       _cons |  -1.380331   1.017322    -1.36   0.177    -3.387973      .62731
------------------------------------------------------------------------------
For the Chow test,
         ess_c - (ess_1 + ess_2)
         -----------------------
                    k
        ----------------------------
              ess_1 + ess_2
             -----------------
             N_1 + N_2 - 2*k
here are the relevant numbers copied from the output above:
 ess_c = 5316.21341      (from combined regression)
 ess_1 = 441.589627      (from group==1 regression)
 ess_2 = 4169.24211      (from group==2 regression)
 k     = 3               (we estimate 3 parameters)
 N_1   = 100             (from group==1 regression)
 N_2   = 80              (from group==2 regression)
So, plugging in, we get
 5316.21341 - (441.589627 + 4169.24211)       705.38167
 ---------------------------------------      ---------
                    3                             3
 ---------------------------------------  =  -----------
        441.589627 + 4169.24211              4610.8317
        ------------------------             ---------
             100 + 80 - 2*3                     174

                                              235.12722
                                          =  -----------
                                              26.499033

                                          =  8.8730491
The Chow test is F(k,N_1+N_2-2*k) = F(3,174), so our test statistic is F(3,174) = 8.8730491.
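Rather than copying numbers by hand, the same arithmetic can be scripted from the results regress leaves behind (e(rss), e(N), and e(df_m)). A minimal sketch under the one/two/combined setup created above; Ftail() then returns the p-value directly:

 use combined, clear
 quietly regress y x1 x2 if group==1
 scalar ess_1 = e(rss)                    // error sum of squares, group 1
 scalar N_1   = e(N)
 quietly regress y x1 x2 if group==2
 scalar ess_2 = e(rss)                    // error sum of squares, group 2
 scalar N_2   = e(N)
 quietly regress y x1 x2                  // pooled (constrained) regression
 scalar ess_c = e(rss)
 scalar k     = e(df_m) + 1               // regressors plus the intercept
 scalar chowF = ((ess_c - (ess_1+ess_2))/k) / ((ess_1+ess_2)/(N_1+N_2-2*k))
 display "Chow F(" k ", " N_1+N_2-2*k ") = " chowF ///
         "   p-value = " Ftail(k, N_1+N_2-2*k, chowF)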

Now, I will do the same problem by running one regression and using test to test certain coefficients equal to zero. What I want to do is estimate the model

 y = a3 + b3*x1 + c3*x2 + a3'*g2 + b3'*g2*x1 + c3'*g2*x2 + u 
where g2=1 if group==2 and g2=0 otherwise. I can do this by typing
 . generate g2 = (group==2)
 . generate g2x1 = g2*x1
 . generate g2x2 = g2*x2
 . regress y x1 x2 g2 g2x1 g2x2
Think about the predictions from this model. The model says
 y = a3 + b3*x1 + c3*x2 + u                           when g2==0
 y = (a3+a3') + (b3+b3')*x1 + (c3+c3')*x2 + u         when g2==1
Thus the model is equivalent to estimating the separate models
 y = a1 + b1*x1 + c1*x2 + u        for group == 1
 y = a2 + b2*x1 + c2*x2 + u        for group == 2
the relationship being
 a1 = a3          a2 = a3 + a3'
 b1 = b3          b2 = b3 + b3'
 c1 = c3          c2 = c3 + c3'
Some of you may be concerned that in the pooled model (the one estimating a3, b3, etc.), we are constraining the var(u) to be the same for each group, whereas, in the separate-equation model, we estimate different variances for group 1 and group 2. This does not matter, because the model is fully interacted. That is probably not convincing, but what should be convincing is that I am about to obtain the same F(3,174) = 8.87 answer and, in my concocted data, I have different variances in each group.

So, here is the result of the alternative approach: testing the interaction coefficients against 0 in a pooled specification:

 . generate g2 = (group==2)
 . generate g2x1 = g2*x1
 . generate g2x2 = g2*x2

 . regress y x1 x2 g2 g2x1 g2x2

      Source |       SS       df       MS              Number of obs =     180
-------------+------------------------------           F(  5,   174) =    6.65
       Model |  881.532123     5  176.306425           Prob > F      =  0.0000
    Residual |  4610.83174   174   26.499033           R-squared     =  0.1605
-------------+------------------------------           Adj R-squared =  0.1364
       Total |  5492.36386   179   30.683597           Root MSE      =  5.1477

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   5.121087   1.757587     2.91   0.004     1.652152    8.590021
          x2 |  -3.227026   1.782504    -1.81   0.072    -6.745139    .2910877
          g2 |  -2.086535   1.917507    -1.09   0.278    -5.871102    1.698032
        g2x1 |  -6.335727   2.714897    -2.33   0.021     -11.6941   -.9773583
        g2x2 |   11.72417    2.59115     4.52   0.000     6.610035     16.8383
       _cons |  -.1725655   1.374785    -0.13   0.900    -2.885966    2.540835
------------------------------------------------------------------------------

 . test g2 g2x1 g2x2

 ( 1)  g2 = 0
 ( 2)  g2x1 = 0
 ( 3)  g2x2 = 0

       F(  3,   174) =    8.87
            Prob > F =    0.0000
Same answer.

This definition of the “Chow test” is equivalent to pooling the data, estimating the fully interacted model, and then testing the group 2 coefficients against 0.

That is why I said, “Chow Test is a term I have heard used by economists in the context of testing a set of regression coefficients being equal to 0.”

Admittedly, that leaves a lot unsaid.

The issue of the variance of u being equal in the two groups is subtle, but I do not want that to get in the way of understanding that the Chow test is equivalent to the “pool the data, interact, and test” procedure. They are equivalent.

Concerning variances, the Chow test itself is testing against a pooled, uninteracted model and so has buried in it an assumption of equal variances. It is really a test that the coefficients are equal and variance(u) in the groups are equal. It is, however, a weak test of the equality of variances because that assumption manifests itself only in how the pooled coefficient estimates are manufactured. Since the Chow test and the “pool the data, interact, and test” procedure are the same, the same is true of both procedures.

Your second concern might be that in the “pool the data, interact, and test” procedure there is an extra assumption of equality of variances because everything comes from the pooled model. As shown, that is not true. It is not true because the model is fully interacted and so the assumption of equal variances never makes a difference in the calculation of the coefficients.
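If you want an informal look at whether equal residual variances are plausible in your own data, one option is to compare the residual spread across groups after the fully interacted regression. A minimal sketch, reusing the combined data and g2 variables above; resid_u is just a new variable name, and robvar reports Levene-type tests of equal variances:

 quietly regress y x1 x2 g2 g2x1 g2x2
 predict double resid_u, residuals        // residuals from the fully interacted model
 robvar resid_u, by(group)                // W0/W50/W10 tests of equal variances across groups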


2
蓝色, posted 2008-1-2 18:05:00

http://www.stata.com/support/faqs/stat/chow2.html

How can I do a Chow test with the robust variance estimates, that is, after estimating with regress, vce(robust)?

Title:  Chow and Wald tests
Author: William Gould, StataCorp
Date:   July 1999; minor revision August 2007


First, see the FAQ How can I compute a Chow test statistic?. The point of that FAQ is that you can do Chow tests using Stata’s test command and, in fact, Chow tests are what the test command reports.

Well, that’s not exactly right. test uses the estimated variance–covariance matrix of the estimators, and test performs Wald tests,

 W = (Rb - r)'(RVR')^(-1)(Rb - r)

where V is the estimated variance–covariance matrix of the estimators.

For linear regression with the conventionally estimated V, the Wald test is the Chow test and vice versa.
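To see the two line up numerically, the quadratic form above can be evaluated by hand from e(b) and e(V). A minimal sketch, assuming the pooled interacted regression from the previous FAQ (regress y x1 x2 g2 g2x1 g2x2) has just been fit, so the coefficients are ordered x1 x2 g2 g2x1 g2x2 _cons:

 matrix b = e(b)'                                        // 6 x 1 coefficient vector
 matrix V = e(V)                                         // conventional OLS variance estimate
 matrix R = (0,0,1,0,0,0 \ 0,0,0,1,0,0 \ 0,0,0,0,1,0)    // selects g2, g2x1, g2x2; r = 0
 matrix W = (R*b)' * invsym(R*V*R') * (R*b)
 scalar q = rowsof(R)                                    // number of restrictions
 display "F(" q ", " e(df_r) ") = " W[1,1]/q             // with this V, test reports W/q as an F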

You might say that you are performing a Chow test, but I say that you are performing a Wald test. That distinction is important, because the Wald test generalizes to different variance estimates of V, whereas the Chow test does not. After regress, vce(robust), for instance, test uses the V matrix estimated by the robust method because that is what regress, vce(robust) left behind.

Thus the short answer is that you estimate your model using regress, vce(robust) and then use Stata’s test command. You then call the result a Wald test.
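A minimal sketch of that short answer, reusing the grouped data and the g2 interaction variables built in the previous FAQ:

 regress y x1 x2 g2 g2x1 g2x2, vce(robust)    // robust variance estimates
 test g2 g2x1 g2x2                            // Wald test that all group-2 shifts are zero

Drop vce(robust) and the same two commands reproduce the conventional Chow F; test simply uses whatever V the estimation command left behind.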

If you are bothered that a Wald test produces F rather than chi-squared statistics, also see the FAQ Why does test sometimes produce chi-squared and other times F statistics?


3
蓝色, posted 2008-1-2 18:06:00

http://www.stata.com/support/faqs/stat/chow3.html

Can you explain Chow tests?

Title:  Chow tests
Author: William Gould, StataCorp
Date:   January 2002; updated August 2005

Privately I was asked yet another question on Chow tests. The question started out “Is a Chow test the correct test to determine whether data can be pooled together?” and went on from there. Like many on the list, I am tired of seeing and answering questions on Chow tests. I do not blame the questioner for asking; I blame their teachers for confusing them with what is, these days, unnecessary jargon.

In the past, I have always given in and cast my answer in Chow-test terms. In this reply, I try a different approach and, I think, the result is more useful.

This reply concerns linear regression (though the technique is really more general than that), and I gloss over the detail of pooling the residuals and whether the residual variances are really the same. For the last, I think I can be forgiven.

Here is what I wrote:

Is a Chow test the correct test to determine whether data can be pooled together?
A Chow test is simply a test of whether the coefficients estimated over one group of the data are equal to the coefficients estimated over another, and you would be better off to forget the word Chow and remember that definition.

History:   In the days when statistical packages were not as sophisticated as they are now, testing whether coefficients were equal was not so easy. You had to write your own program, typically in FORTRAN. Chow showed a way you could perform the calculation from statistics that were commonly reported, and that would produce the same result as if you had performed the Wald test directly.

What does it mean “whether data can be pooled together”? Do you often meet nonprofessionals who say to you, “I was wondering whether the data could be pooled?” Forget that phrase, too: it is another piece of jargon for testing whether the behavior is the same, as measured by whether the coefficients are the same.

Let’s pretend that you have some model and two or more groups of data. Your model predicts something about the behavior within the group based on certain characteristics that vary within the group. Under the assumption that each group's behavior is unique, you have

 y_1 = X_1*b_1 + u_1        (equation for group 1)
 y_2 = X_2*b_2 + u_2        (equation for group 2)
and so on. Now, you want to test whether the behavior for one group is the same as for another, which means you want to test
 b_1 = b_2 = ... 
How do you do that? Testing coefficients across separately estimated models is difficult to impossible, depending on things we need not go into right now. A trick is to “pool” the data to convert the multiple equations into one giant equation:
 y = d1*(X_1*b1 + u1) + d2*(X_2*b2 + u2) + ... 
where y is the set of all outcomes (y_1, y_2, ...), and d1 is a variable that is 1 when the data are for group 1 and 0 otherwise, d2 is 1 when the data are for group 2 and 0 otherwise, ....

Notice that from the above I can retrieve the original equations. Setting d1=1 and d2=d3=...=0, I get the equation for group 1; setting d1=0 and d2=1 and d3=...=0, I get the equation for group 2; and so on.

Now, let’s start with

 y = d1*(X_1*b1 + u1) + d2*(X_2*b2 + u2) + ... 
and rewrite it by a little algebraic manipulation:
 y = d1*(X_1*b1 + u1) + d2*(X_2*b2 + u2) + ...
   = d1*X_1*b1 + d1*u1 + d2*X_2*b2 + d2*u2 + ...
   = d1*X_1*b1 + d2*X_2*b2 + ... + d1*u1 + d2*u2 + ...
   = X_1*d1*b1 + X_2*d2*b2 + ... + d1*u1 + d2*u2 + ...
   = (X_1*d1)*b1 + (X_2*d2)*b2 + ... + d1*u1 + d2*u2 + ...
By stacking the data, I can get back estimates of b1, b2, ...

I include not X_1 in my model, but X_1*d1 (a set of variables equal to X_1 when group is 1 and 0 otherwise); I include not X_2 in my model, but X_2*d2 (a set of variables equal to X_2 when group is 2 and 0 otherwise); and so on.

Let’s use the auto dataset and pretend that I have two groups.

 . sysuse auto, clear
 . generate group1 = rep78==3
 . generate group2 = group1==0
I could fit the models separately:
. regress price mpg weight if group1==1

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  2,    27) =   16.20
       Model |   196545318     2  98272658.8           Prob > F      =  0.0000
    Residual |   163826398    27  6067644.36           R-squared     =  0.5454
-------------+------------------------------           Adj R-squared =  0.5117
       Total |   360371715    29  12426610.9           Root MSE      =  2463.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   13.14912   184.5661     0.07   0.944    -365.5492    391.8474
      weight |   3.517687   1.015855     3.46   0.002     1.433324     5.60205
       _cons |  -5431.147   6599.898    -0.82   0.418    -18973.02    8110.725
------------------------------------------------------------------------------

. regress price mpg weight if group2==1

      Source |       SS       df       MS              Number of obs =      44
-------------+------------------------------           F(  2,    41) =    5.16
       Model |  54562909.6     2  27281454.8           Prob > F      =  0.0100
    Residual |   216614915    41  5283290.61           R-squared     =  0.2012
-------------+------------------------------           Adj R-squared =  0.1622
       Total |   271177825    43  6306461.04           Root MSE      =  2298.5

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -170.5474    93.3656    -1.83   0.075     -359.103     18.0083
      weight |   .0527381   .8064713     0.07   0.948    -1.575964     1.68144
       _cons |   9685.028   4190.693     2.31   0.026     1221.752     18148.3
------------------------------------------------------------------------------
I could fit the combined model:
. generate mpg1=mpg*group1
. generate weight1=weight*group1
. generate mpg2=mpg*group2
. generate weight2=weight*group2

. regress price group1 mpg1 weight1 group2 mpg2 weight2, noconstant

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  6,    68) =   91.38
       Model |  3.0674e+09     6   511232168           Prob > F      =  0.0000
    Residual |   380441313    68  5594725.19           R-squared     =  0.8897
-------------+------------------------------           Adj R-squared =  0.8799
       Total |  3.4478e+09    74  46592355.7           Root MSE      =  2365.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      group1 |  -5431.147   6337.479    -0.86   0.394    -18077.39    7215.096
        mpg1 |   13.14912   177.2275     0.07   0.941    -340.5029    366.8012
     weight1 |   3.517687   .9754638     3.61   0.001     1.571179    5.464194
      group2 |   9685.028   4312.439     2.25   0.028      1079.69    18290.37
        mpg2 |  -170.5474   96.07802    -1.78   0.080    -362.2681    21.17334
     weight2 |   .0527381   .8299005     0.06   0.950    -1.603303    1.708779
------------------------------------------------------------------------------
What is this noconstant option? We must remember that when we fit the separate models, each has its own intercept. There was an intercept in X_1, X_2, and so on. What I have done above is literally translate
 y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2 
and so included the variables group1 and group2 (variables equal to 1 for their respective groups) and told Stata to omit the overall intercept.

I do not recommend you fit the model the way I have just illustrated because of numerical concerns—we'll get to that later. Fit the models separately or jointly, and you will get the same estimates for b_1 and b_2.

Now we can test whether the coefficients are the same for the two groups:

 . test _b[mpg1]=_b[mpg2], notest

 ( 1)  mpg1 - mpg2 = 0

 . test _b[weight1]=_b[weight2], accum

 ( 1)  mpg1 - mpg2 = 0
 ( 2)  weight1 - weight2 = 0

       F(  2,    68) =    5.61
            Prob > F =    0.0056
That is the Chow test. Something was omitted: the intercept. If we really wanted to test whether the two groups were the same, we would test
 . test _b[mpg1]=_b[mpg2]

 ( 1)  mpg1 - mpg2 = 0

 . test _b[weight1]=_b[weight2], accum

 ( 1)  mpg1 - mpg2 = 0
 ( 2)  weight1 - weight2 = 0

 . test _b[group1]=_b[group2], accum

 ( 1)  mpg1 - mpg2 = 0
 ( 2)  weight1 - weight2 = 0
 ( 3)  group1 - group2 = 0

       F(  3,    68) =    4.07
            Prob > F =    0.0102
Using this approach, however, we are not tied down by what the "Chow test" can test. We can formulate any hypothesis we want. We might think that weight works the same way in both groups but that mpg works differently, and each group has its own intercept. Then, we could test
 . test _b[mpg1]=_b[mpg2]

 ( 1)  mpg1 - mpg2 = 0

       F(  1,    68) =    0.83
            Prob > F =    0.3654
by itself. If we had more variables, we could test any subset of variables.
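For instance, with the noconstant parameterization above, testing only whether the two intercepts agree is a one-line subset test (a small sketch; output not shown):

 . test _b[group1]=_b[group2]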

Is “pooling the data” justified? Of course it is: we just established that pooling the data is just another way of fitting separate models and that fitting separate models is certainly justified—we got the same coefficients. That’s why I told you to forget the phrase about whether pooling the data is justified. People who ask that don’t really mean to ask what they are saying: they mean to ask whether the coefficients are the same. In that case, they should say that. Pooling is always justified, and it corresponds to nothing more than the mathematical trick of writing separate equations,

 y_1 = X_1*b_1 + u_1        (equation for group 1)
 y_2 = X_2*b_2 + u_2        (equation for group 2)
as one equation
 y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2 
There are many ways I can write the above equation, and I want to write it a little differently because of numerical concerns. Starting with
 y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2 
let’s do some algebra to obtain
 y = X*b1 + d2*X_2*(b2-b1) + d1*u1 + d2*u2
where X = (X_1, X_2). In this formulation, I measure not b1 and b2, but b1 and (b2−b1). This is numerically more stable, and I can still test that b2==b1 by testing whether (b2−b1)=0. Let’s fit this model
. regress price mpg weight mpg2 weight2 group2

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  5,    68) =    9.10
       Model |   254624083     5  50924816.7           Prob > F      =  0.0000
    Residual |   380441313    68  5594725.19           R-squared     =  0.4009
-------------+------------------------------           Adj R-squared =  0.3569
       Total |   635065396    73  8699525.97           Root MSE      =  2365.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   13.14912   177.2275     0.07   0.941    -340.5029    366.8012
      weight |   3.517687   .9754638     3.61   0.001     1.571179    5.464194
        mpg2 |  -183.6965   201.5951    -0.91   0.365    -585.9733    218.5803
     weight2 |  -3.464949   1.280728    -2.71   0.009    -6.020602   -.9092956
      group2 |   15116.17   7665.557     1.97   0.053    -180.2075    30412.56
       _cons |  -5431.147   6337.479    -0.86   0.394    -18077.39    7215.096
------------------------------------------------------------------------------
and, if I want to test whether the coefficients are the same, I can do
 . test _b[mpg2]=0

 ( 1)  mpg2 = 0

 . test _b[weight2]=0, accum

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0

       F(  2,    68) =    5.61
            Prob > F =    0.0056
and that gives the same answer yet again. If I want to test whether *ALL* the coefficients are the same (including the intercept) I can use
 . test _b[mpg2]=0, notest

 ( 1)  mpg2 = 0

 . test _b[weight2]=0, accum notest

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0

 . test _b[group2]=0, accum

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0
 ( 3)  group2 = 0

       F(  3,    68) =    4.07
            Prob > F =    0.0102
Just as before, I can test any subset.

Using this difference formulation, if I had three groups, I would start with

 y = (X_1*d1)*b1 + (X_2*d2)*b2 + (X_3*d3)*b3 + d1*u1 + d2*u2 + d3*u3 
and rewrite it as
 y = X*b1 + (X_2*d2)*(b2-b1) + (X_3*d3)*(b3-b1) + d1*u1 + d2*u2 + d3*u3 
Let’s create the group variables and fit this model:
. sysuse auto, clear
. generate group1=rep78==3
. generate group2=rep78==4
. generate group3=(group1+group2)==0
. generate mpg1=mpg*group1
. generate weight1=weight*group1
. generate mpg2=mpg*group2
. generate weight2=weight*group2
. generate mpg3=mpg*group3
. generate weight3=weight*group3

. regress price mpg weight mpg2 weight2 group2 ///
>         mpg3 weight3 group3

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  8,    65) =    5.80
       Model |   264415585     8  33051948.1           Prob > F      =  0.0000
    Residual |   370649811    65  5702304.78           R-squared     =  0.4164
-------------+------------------------------           Adj R-squared =  0.3445
       Total |   635065396    73  8699525.97           Root MSE      =  2387.9

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   13.14912   178.9234     0.07   0.942    -344.1855    370.4837
      weight |   3.517687   .9847976     3.57   0.001      1.55091    5.484463
        mpg2 |   130.5261   336.6547     0.39   0.699    -541.8198     802.872
     weight2 |   -2.18337   1.837314    -1.19   0.239     -5.85274       1.486
      group2 |   4560.193   12222.22     0.37   0.710    -19849.27    28969.66
        mpg3 |  -194.1974   216.3459    -0.90   0.373      -626.27    237.8752
     weight3 |  -3.160952    1.73308    -1.82   0.073    -6.622152    .3002481
      group3 |   14556.66   9167.998     1.59   0.117    -3753.101    32866.41
       _cons |  -5431.147    6398.12    -0.85   0.399    -18209.07    7346.781
------------------------------------------------------------------------------
If I want to test whether the three groups were the same in the Wald-test sense, I can use
 . test (_b[mpg2]=0) (_b[weight2]=0) (_b[group2]=0) /*
 >   */ (_b[mpg3]=0) (_b[weight3]=0) (_b[group3]=0)

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0
 ( 3)  group2 = 0
 ( 4)  mpg3 = 0
 ( 5)  weight3 = 0
 ( 6)  group3 = 0

       F(  6,    65) =    2.28
            Prob > F =    0.0463
which I could more easily type as
 . testparm mpg2 weight2 group2 mpg3 weight3 group3

 ( 1)  mpg2 = 0
 ( 2)  weight2 = 0
 ( 3)  group2 = 0
 ( 4)  mpg3 = 0
 ( 5)  weight3 = 0
 ( 6)  group3 = 0

       F(  6,    65) =    2.28
            Prob > F =    0.0463
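For reference, in current Stata the same interacted model and joint test can be set up with factor-variable notation instead of hand-built dummies and products. A hedged sketch (grp is a new variable coded 1, 2, 3 to mirror the rep78-based groups defined above); it should reproduce the F(6, 65) test just shown:

 sysuse auto, clear
 generate byte grp = cond(rep78==3, 1, cond(rep78==4, 2, 3))   // same three groups as above
 regress price i.grp##c.(mpg weight)          // group intercept shifts plus full slope interactions
 testparm i.grp i.grp#c.mpg i.grp#c.weight    // joint Wald test that the three groups share one equation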


4
蓝色, posted 2008-1-2 18:07:00

http://www.stata.com/support/faqs/stat/awreg.html

How can I pool data (and perform Chow tests) in linear regression without constraining the residual variances to be equal?

Title:  Pooling data and performing Chow tests in linear regression
Author: William Gould, StataCorp
Date:   December 1999; updated August 2005

  1. Pooling data and constraining residual variance
  2. Illustration
  3. Pooling data without constraining residual variance
  4. Illustration
  5. The (lack of) importance of not constraining the variance
  6. Another way to fit the variance-unconstrained model
  7. Appendix: do-file and log providing results reported above
    7.1   do-file
    7.2   log

1. Pooling data and constraining residual variance

Consider the linear regression model

y = β0 + β1x1 + β2x2 + u,   u ~ N(0, σ²)

and let us pretend that we have two groups of data, group=1 and group=2. We could have more groups; everything said below generalizes to more than two groups.

We could estimate the models separately by typing

 . regress y x1 x2 if group==1

and

 . regress y x1 x2 if group==2
or we could pool the data and estimate a single model, one way being
 . gen g2 = (group==2)
 . gen g2x1 = g2*x1
 . gen g2x2 = g2*x2
 . regress y x1 x2 g2 g2x1 g2x2
The difference between these two approaches is that we are constraining the variance of the residual to be the same in the two groups when we pool the data. When we estimated separately, we estimated
group 1: y = β01 + β11x1 + β21x2 + u1,   u1 ~ N(0, σ1²)
and
group 2: y = β02 + β12x1 + β22x2 + u2,   u2 ~ N(0, σ2²)

When we pooled the data, we estimated

y = β01 + β11x1 + β21x2 + (β02-β01)g2 + (β12-β11)g2x1 + (β22-β21)g2x2 + u,   u ~ N(0, σ²)

If we evaluate this equation for the groups separately, we obtain

y = β01 + β11x1 + β21x2 + u,   u ~ N(0, σ²)   for group=1
and
y = β02 + β12x1 + β22x2 + u,   u ~ N(0, σ²)   for group=2

The difference is that we have now constrained the variance of u for group=1 to be the same as the variance of u for group=2.

If you perform this experiment with real data, you will observe the following:

  1. You will obtain the same values for the coefficients either way.

  2. You will obtain different standard errors and therefore different test statistics and confidence intervals.

If u is known to have the same variance in the two groups, the standard errors obtained from the pooled regression are better—they are more efficient. If the variances really are different, however, then the standard errors obtained from the pooled regression are wrong.


2. Illustration (See the do-file and the log with the results in section 7)

I have created a dataset (containing made-up data) on y, x1, and x2. The dataset has 74 observations for group=1 and another 71 observations for group=2. Using these data, I can run the regressions separately by typing

 [1] . regress y x1 x2 if group==1
 [2] . regress y x1 x2 if group==2
or I can run the pooled model by typing
     . gen g2 = (group==2)
     . gen g2x1 = g2*x1
     . gen g2x2 = g2*x2
 [3] . regress y x1 x2 g2 g2x1 g2x2
I did that in Stata and let me summarize the results. When I typed command [1], I obtained the following results (standard errors in parentheses):
 y = -8.650993 + 1.21329*x1 + -.8809939*x2 + u,        Var(u) = 15.8912
    (22.73703)   (.54459)     (.405401)
and when I ran command [2], I obtained
 y = 4.646994 + .9307004*x1 + .8812369*x2 + u,         Var(u) = 7.56852
    (11.1593)   (.236696)    (.1997562)
When I ran command [3], I obtained
 y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
    (17.92853)   (.42942)     (.3196656)

     13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u,     Var(u) = 12.5312
    (25.74446)     (.6123452)       (.459958)
The intercept and coefficients on x1 and x2 in [3] are the same as in [1], but the standard errors are different. Also, if I sum the appropriate coefficients in [3], I obtain the same results as [2]:
 Intercept:  13.29779 + -8.650993  = 4.646797      ([2] has 4.646994)
        x1:  -.2825893 + 1.21329   = .9307004      ([2] has .9307004)
        x2:  1.762231 + -.8809939  = .8812371      ([2] has .8812369)
The coefficients are the same, estimated either way. (The fact that the coefficients in [3] are a little off from those in [2] is just because I did not write down enough digits.)

The standard errors for the coefficients are different.

I also wrote down the estimated Var(u), what is reported as RMSE in Stata’s regression output. In standard deviation terms, u has s.d. 15.891 in group=1, 7.5685 in group=2, and if we constrain these two very different numbers to be the same, the pooled s.d. is 12.531.


3. Pooling data without constraining residual variance

We can pool the data and estimate an equation without constraining the residual variances of the groups to be the same. Previously we typed
 . gen g2 = (group==2)
 . gen g2x1 = g2*x1
 . gen g2x2 = g2*x2
 . regress y x1 x2 g2 g2x1 g2x2
and we start exactly the same way. To that, we add
     . predict r, resid
     . sum r if group==1
     . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
     . sum r if group==2
     . replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
 [4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]
In the above, the constant 3 appears twice because three coefficients are estimated in each group (an intercept, a coefficient for x1, and a coefficient for x2). If a different number of coefficients were being estimated, that number would change.

In any case, this will reproduce exactly the standard errors reported by estimating the two models separately. The advantage is that we can now test equality of coefficients between the two equations. For instance, we can now read right off the pooled regression results whether the effect of x1 is the same in groups 1 and 2 (answer: is _b[g2x1]==0?, because _b[x1] is the effect in group 1 and _b[x1]+_b[g2x1] is the effect in group 2, so the difference is _b[g2x1]). And, using test, we can test other constraints as well.

For instance, if you wanted to prove to yourself that the results of [4] are the same as typing regress y x1 x2 if group==2, you could type

 . test x1 + g2x1 == 0          (reproduces test of x1 for group==2)

and

 . test x2 + g2x2 == 0          (reproduces test of x2 for group==2)
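Relatedly, lincom reads the group-2 point estimates (and, given how the weights were built, standard errors that should match those of [2]) straight off the pooled fit; a small sketch:

 . lincom x1 + g2x1          (group-2 coefficient on x1: _b[x1] + _b[g2x1])
 . lincom x2 + g2x2          (group-2 coefficient on x2)
 . lincom _cons + g2         (group-2 intercept)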


4. Illustration

Using the made-up data, I did exactly that. To recap, first I estimated separate regressions:

 [1] . regress y x1 x2 if group==1
 [2] . regress y x1 x2 if group==2
and then I ran the variance-constrained regression,
     . gen g2 = (group==2)
     . gen g2x1 = g2*x1
     . gen g2x2 = g2*x2
 [3] . regress y x1 x2 g2 g2x1 g2x2
and then I ran the variance-unconstrained regression,
     . predict r, resid
     . sum r if group==1
     . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
     . sum r if group==2
     . replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
 [4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]
Just to remind you, here is what commands [1] and [2] reported:
 y = -8.650993 + 1.21329*x1 + -.8809939*x2 + u,        Var(u) = 15.8912
    (22.73703)   (.54459)     (.405401)

 y = 4.646994 + .9307004*x1 + .8812369*x2 + u,         Var(u) = 7.56852
    (11.1593)   (.236696)    (.1997562)
Here is what command [4] reported:
 y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
    (22.73703)   (.54459)     (.405401)

     13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u
    (25.3269)      (.6050657)       (.451943)
Those results are the same as [1] and [2]. (Pay no attention to the RMSE reported by regress at this last step; the reported RMSE is the standard deviation of neither of the two groups but is instead a weighted average; see the FAQ on this if you care. If you want to know the standard deviations of the respective residuals, look back at the output from the summarize statements typed when producing the weighting variable.)

Technical Note:   Note that in creating the weights, we typed
 . sum r if group==1
 . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
and similarly for group 2. The 3 that appears in the finite-sample normalization factor (r(N)-1)/(r(N)-3) appears because there are three coefficients per group being estimated. If our model had fewer or more coefficients, that number would change. In fact, the finite-sample normalization factor changes results very little. In real work, I would have ignored it and just typed
 . sum r if group==1
 . gen w = r(Var) if group==1
unless the number of observations in one of the groups was very small. The normalization factor was included here so that [4] would produce the same results as [1] and [2].
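One way to avoid hard-coding the 3 at all is to take each group's residual variance straight from the separate regressions, where the divisor N minus the number of coefficients is applied automatically. A hedged sketch equivalent to the construction above (w2 is a fresh weight-variable name):

 quietly regress y x1 x2 if group==1
 gen w2 = e(rmse)^2 if group==1             // e(rmse)^2 = RSS_1/(N_1 - 3)
 quietly regress y x1 x2 if group==2
 replace w2 = e(rmse)^2 if group==2         // e(rmse)^2 = RSS_2/(N_2 - 3)
 regress y x1 x2 g2 g2x1 g2x2 [aw=1/w2]     // same results as [4]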

5. The (lack of) importance of not constraining the variance

Does it matter whether we constrain the variance? Here, it does not matter much. For instance, if after
 [4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w] 
we test whether group 2 is the same as group 1, we obtain
 . test g2x1 g2x2 g2

 ( 1)  g2x1 = 0.0
 ( 2)  g2x2 = 0.0
 ( 3)  g2 = 0.0

       F(  3,   139) =  307.50
            Prob > F =   0.0000
If instead we had constrained the variances to be the same, estimating the model using
 [3] . regress y x1 x2 g2 g2x1 g2x2 
and then repeated the test, the reported F-statistic would be 300.81.

If there were more groups, and the variance differences were great among the groups, this could become more important.


6. Another way to fit the variance-unconstrained model

Stata’s xtgls, panels(het) command (see help xtgls) fits exactly the model we have been describing, the only difference being that it does not make all the finite-sample adjustments and so its standard errors are just a little different from those produced by the method just described. (To be clear, xtgls, panels(het) does not make the adjustment described in the technical note above and it does not make the finite-sample adjustments regress itself makes, which is to say, variances are invariably normalized by N, the number of observations, rather than N-k, observations minus number of estimated coefficients.)

Anyway, to estimate xtgls, panels(het), you pool the data just as always,

 . gen g2 = (group==2)
 . gen g2x1 = g2*x1
 . gen g2x2 = g2*x2
and then type
 [5] . xtgls y x1 x2 g2 g2x1 g2x2, panels(het) i(group) 
to estimate the model. The result of doing that with my fictional data is
 y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
    (22.27137)   (.53344)     (.397099)

     13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u
    (24.80488)     (.5925734)       (.442610)
These are the same coefficients we have always seen.

The standard errors produced by xtgls, panels(het) here are about 2% smaller than those produced by [4] and in general will be a little smaller because xtgls, panels(het) is an asymptotically based estimator. The two estimators are asymptotically equivalent, however, and in fact quickly become identical. The only caution I would advise is not to use xtgls, panels(het) if the number of degrees of freedom (observations minus number of coefficients) is below 25 in any of the groups. Then, the weighted OLS approach [4] is better (and you should make the finite-sample adjustment described in the above technical note).
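One way to see roughly where that 2% comes from: the weighted OLS route divides each group's residual variance by N_g - 3 while xtgls, panels(het) divides by N_g, so the standard errors should differ by about sqrt(N_g/(N_g - 3)). With the group sizes here, sqrt(74/71) ≈ 1.021 and sqrt(71/68) ≈ 1.022, about 2% in each group.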


7. Appendix: do-file and log providing results reported above

7.1 do-file

The following do-file, named uncv.do, was used. Up until the line reading “BEGINNING OF DEMONSTRATION”, the do-file is concerned with constructing the artificial dataset for the demonstration: uncv.do

7.2 log

The do-file shown in 7.1 produced the following output: uncv.log

5
kkwei, posted 2008-1-3 00:13:00
Thanks for all your hard work, moderator... this is really good material...


6
bookbug, posted 2008-1-3 10:54:00
These articles are really good. A while back my younger sister asked me about the Chow test. I had never used it myself, so I looked up the explanations above, and they turned out to be quite easy to follow.


7
三木, posted 2008-1-5 22:12:00

Thanks for your hard work, moderator!

I was struggling with exactly this a while ago. I have read through these articles and they are excellent.

Thank you!

Always be decent to others.


8
histidine, posted 2008-10-1 04:25:00

Thank you, moderator!

I just used this thread as a reference for my homework. It is written very clearly and is very useful! Thanks a million~


9
kakajin, posted 2009-3-8 23:33:00
Is the Chow test just an F-test?


10
sungmoo, posted 2009-4-30 02:33:00
Quoting kakajin's post of 2009-3-8 23:33: Is the Chow test just an F-test?

 
reg y x1-xn if g==1
scalar r1 = e(rss)                 // residual sum of squares, group 1
scalar n1 = e(N)
reg y x1-xn if g==2
scalar r2 = e(rss)                 // residual sum of squares, group 2
scalar n2 = e(N)
reg y x1-xn if g==1 | g==2
scalar r = e(rss)                  // residual sum of squares, pooled
scalar k = e(df_m) + 1             // number of estimated parameters, including the intercept
* the Chow statistic is distributed F(k, n1+n2-2*k), so the p-value is
di 1 - F(k, n1+n2-2*k, (r-r1-r2)*(n1+n2-2*k)/((r1+r2)*k))
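For reference, the last line can also be written with the upper-tail function directly:

di Ftail(k, n1+n2-2*k, (r-r1-r2)*(n1+n2-2*k)/((r1+r2)*k))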

[This post was last edited by the author on 2009-4-30 02:39]

