An event count is the realization of a nonnegative integer-valued random variable (Cameron and Trivedi 1998). Examples are the number of car accidents per month, thunderstorms per year, and wildfires per year. Applying the ordinary least squares (OLS) method to event count data results in biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed various nonlinear models that are based on the Poisson distribution and the negative binomial distribution.
1.1 Count Data Regression Models
The left-hand side (LHS) of the equation has event count data. Independent variables are, as in OLS, located on the right-hand side (RHS). These RHS variables may be interval, ratio, or binary (dummy). Table 1 below compares OLS with the categorical dependent variable regression models (CDVMs) according to the level of measurement of the dependent variable.
Table 1. Comparison between OLS and CDVMs
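| Model | Dependent (LHS) | Method | Independent (RHS) |
| OLS: Ordinary least squares | Interval or ratio scale | Moment-based method | A linear function of interval/ratio or binary independent variables |
| CDVM: Binary response | Binary (0 or 1) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| CDVM: Ordinal response | Ordinal (1st, 2nd, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| CDVM: Nominal response | Nominal (A, B, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| CDVM: Event count data | Count (0, 1, 2, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |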
The Poisson regression model (PRM) and the negative binomial regression model (NBRM) are the basic models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated negative binomial regression model (ZINB) is used when there are many zero counts. Other count models have been developed to handle censored, truncated, or sample-selected count data. This document, however, focuses on the PRM, NBRM, ZIP, and ZINB.
1.2 Poisson Models versus Negative Binomial Models
The Poisson probability distribution,

$$P(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots,$$

has the same mean and variance (equidispersion), Var(y)=E(y)=mu. As the mean of a Poisson distribution increases, the probability of zeros decreases and the distribution approximates a normal distribution (Figure 1). The Poisson distribution also has the strong assumption that events are independent. Thus, this distribution does not fit well if mu differs across observations (heterogeneity) (Long 1997).
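The equidispersion property and the shrinking share of zeros can be checked numerically. The following is a minimal Python sketch, not part of the original analysis, that evaluates the standard Poisson probability function exp(-mu)*mu^y/y! for the four means used in Figure 1. The function names and the truncation point `y_max` are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y given mean mu: exp(-mu) * mu**y / y!."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def poisson_moments(mu, y_max=100):
    """Mean and variance obtained by summing the pmf over 0..y_max.

    y_max is an assumed truncation point; the omitted tail is negligible
    for the small means used here.
    """
    probs = [poisson_pmf(y, mu) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Means taken from Figure 1.
for mu in (0.5, 1, 2, 5):
    mean, var = poisson_moments(mu)
    print(f"mu={mu}: P(y=0)={poisson_pmf(0, mu):.4f}, mean={mean:.4f}, var={var:.4f}")
```

For each mean the computed variance should match the mean, and P(y=0) should fall as mu grows.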
The Poisson regression model (PRM) incorporates observed heterogeneity into the Poisson distribution function, Var(y|x)=E(y|x)=mu=exp(xb). As mu increases, the conditional variance of y increases, the proportion of predicted zeros decreases, and the distribution around the expected value becomes approximately normal (Long 1997). The conditional mean of the errors is zero, but the variance of the errors is a function of the independent variables, Var(y|x)=exp(xb), so the errors are heteroscedastic. In practice, the PRM rarely fits because of overdispersion (Long 1997; Maddala 1983).
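To make the mean structure mu=exp(xb) concrete, here is a small Python sketch of the conditional mean and of the Poisson log-likelihood that maximum likelihood estimation would maximize. The data rows and coefficient values are invented for illustration only. They are not estimates from this document's data, and the function names are assumptions.

```python
import math

def linear_predictor(x, b):
    """xb: inner product of one observation's regressors and the coefficients."""
    return sum(xj * bj for xj, bj in zip(x, b))

def prm_mean(x, b):
    """Conditional mean (and variance) of the PRM, mu = exp(xb)."""
    return math.exp(linear_predictor(x, b))

def prm_loglik(ys, xs, b):
    """Poisson log-likelihood: sum over i of -mu_i + y_i * xb_i - ln(y_i!)."""
    total = 0.0
    for y, x in zip(ys, xs):
        xb = linear_predictor(x, b)
        total += -math.exp(xb) + y * xb - math.lgamma(y + 1)
    return total

# Hypothetical rows: (intercept, emps, strict) and accident counts.
xs = [(1.0, 50.0, 1.0), (1.0, 80.0, 0.0), (1.0, 20.0, 1.0)]
ys = [1, 4, 0]
b = (0.4, 0.01, -0.5)  # made-up coefficients, for illustration only

for x in xs:
    print(x, "mu =", round(prm_mean(x, b), 4))
print("log-likelihood =", round(prm_loglik(ys, xs, b), 4))
```

Because the variance equals exp(xb), observations with a larger linear predictor also have a larger conditional variance, which is the heteroscedasticity described above.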
Figure 1. Poisson Probability Distribution with Means of .5, 1, 2, and 5
The negative binomial probability distribution is

$$P(y \mid x) = \frac{\Gamma(y+v)}{y!\,\Gamma(v)} \left(\frac{v}{v+\mu}\right)^{v} \left(\frac{\mu}{v+\mu}\right)^{y},$$

where 1/v=alpha determines the degree of dispersion and Gamma is the gamma function. As the dispersion parameter alpha increases, the variance of the negative binomial distribution also increases, Var(y|x)=mu(1+mu/v).
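The variance formula Var(y|x)=mu(1+mu/v) can be verified numerically. The sketch below assumes the common mean-dispersion parameterization of the negative binomial distribution (mean mu, v=1/alpha) and evaluates it on the log scale with `math.lgamma`. The function names and the truncation point `y_max` are illustrative assumptions.

```python
import math

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability of count y with mean mu and alpha = 1/v > 0."""
    v = 1.0 / alpha
    log_p = (math.lgamma(y + v) - math.lgamma(y + 1) - math.lgamma(v)
             + v * math.log(v / (v + mu))
             + y * math.log(mu / (v + mu)))
    return math.exp(log_p)

def negbin_moments(mu, alpha, y_max=2000):
    """Mean and variance by direct summation over 0..y_max (assumed cutoff)."""
    probs = [negbin_pmf(y, mu, alpha) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mu = 2.0
for alpha in (0.01, 0.5, 1.0, 5.0):  # dispersion values from Figure 2
    mean, var = negbin_moments(mu, alpha)
    print(f"alpha={alpha}: mean={mean:.4f}, var={var:.4f}, "
          f"mu*(1+alpha*mu)={mu * (1 + alpha * mu):.4f}")
```

The mean should stay at mu for every alpha, while the summed variance tracks mu(1+alpha*mu).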
The negative binomial regression model (NBRM) incorporates observed and unobserved heterogeneity into the conditional mean, mu=exp(xb+e) (Long 1997). Thus, the conditional variance of y becomes larger than its conditional mean, E(y|x)=mu, which remains unchanged. Figure 2 illustrates how the probabilities of both small and large counts increase in the negative binomial distribution as the conditional variance of y increases, given mu=2.
Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5
The PRM and NBRM, however, have the same mean structure. If alpha=0, the NBRM reduces to the PRM (Cameron and Trivedi 1998; Long 1997).
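The reduction of the NBRM to the PRM can be illustrated numerically: as alpha shrinks toward zero, the negative binomial probabilities approach the Poisson probabilities with the same mean. The sketch below is only an illustration. It assumes the mean-dispersion form of the negative binomial with v=1/alpha, and the function names and the range `y_max` are hypothetical choices.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability, computed on the log scale for stability."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability with mean mu and alpha = 1/v > 0."""
    v = 1.0 / alpha
    return math.exp(math.lgamma(y + v) - math.lgamma(y + 1) - math.lgamma(v)
                    + v * math.log(v / (v + mu))
                    + y * math.log(mu / (v + mu)))

def max_pmf_gap(mu, alpha, y_max=50):
    """Largest absolute difference between the two pmfs over counts 0..y_max."""
    return max(abs(negbin_pmf(y, mu, alpha) - poisson_pmf(y, mu))
               for y in range(y_max + 1))

for alpha in (1.0, 0.1, 0.01, 0.001):
    print(f"alpha={alpha}: max |NB - Poisson| = {max_pmf_gap(2.0, alpha):.6f}")
```

The gap should shrink steadily as alpha decreases.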
1.3 Overdispersion
When Var(y|x) > E(y|x), we are said to have overdispersion. Estimates of a PRM for overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron and Trivedi 1998; Long 1997). The likelihood ratio test for overdispersion examines the null hypothesis of alpha=0. The LR statistic follows the Chi-squared distribution with one degree of freedom. If the null hypothesis is rejected, NBRM is preferred to PRM.
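As a rough sketch of the likelihood ratio test, the statistic is twice the difference between the NBRM and PRM log-likelihoods, referred to a chi-squared distribution with one degree of freedom. The log-likelihood values below are hypothetical placeholders, not results from this document. The tail probability uses the identity that a chi-squared(1) upper tail at x equals erfc(sqrt(x/2)).

```python
import math

def lr_overdispersion_test(loglik_prm, loglik_nbrm):
    """LR test of H0: alpha = 0 (PRM) against the NBRM.

    Returns the LR statistic and its chi-squared(1) upper-tail p-value.
    """
    lr = max(2.0 * (loglik_nbrm - loglik_prm), 0.0)  # guard against rounding below zero
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

# Hypothetical log-likelihoods from fitted PRM and NBRM models.
lr, p = lr_overdispersion_test(loglik_prm=-102.0, loglik_nbrm=-100.0)
print(f"LR = {lr:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("Reject alpha = 0: the NBRM is preferred to the PRM.")
else:
    print("Cannot reject alpha = 0: the PRM is not rejected.")
```

In practice the two log-likelihoods would come from the fitted models reported by the statistical package.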
Zero-inflated models handle overdispersion by changing the mean structure to explicitly model the production of zero counts (Long 1997). These models assume two latent groups. One is the always-zero group and the other is not-always-zero or sometime-zero group (Long 1997). Thus, zero counts come from the former group and some of the latter group with a certain probability.
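The two-group idea can be written as a mixture. In the sketch below, `psi` denotes the probability of belonging to the always-zero group. This symbol and the function names are assumptions introduced here for illustration. A zero arises either from the always-zero group or from the Poisson process of the other group.

```python
import math

def zip_pmf(y, mu, psi):
    """Zero-inflated Poisson probability.

    psi: assumed probability of the always-zero group.
    mu:  Poisson mean of the not-always-zero group.
    """
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        return psi + (1.0 - psi) * poisson
    return (1.0 - psi) * poisson

def zip_moments(mu, psi, y_max=100):
    """Mean and variance by direct summation over 0..y_max (assumed cutoff)."""
    probs = [zip_pmf(y, mu, psi) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mu, psi = 2.0, 0.3  # illustrative values only
mean, var = zip_moments(mu, psi)
print(f"P(y=0): ZIP={zip_pmf(0, mu, psi):.4f}, Poisson={math.exp(-mu):.4f}")
print(f"ZIP mean={mean:.4f}, variance={var:.4f}")
```

With psi above zero the variance exceeds the mean, which is how the mixture accommodates overdispersion driven by excess zeros.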
The likelihood ratio test of the null hypothesis alpha=0 compares the ZIP and ZINB, which are nested. The PRM and ZIP, and likewise the NBRM and ZINB, cannot be compared by this likelihood ratio test because they are not nested. Vuong's statistic compares these non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96, the PRM or NBRM is preferred (Long 1997).
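A minimal sketch of the non-nested comparison follows. It assumes that per-observation log-likelihoods are available from both fitted models, and it computes V as sqrt(N) times the mean of the log-likelihood differences divided by their standard deviation. The divisor N in the standard deviation is an assumption of this sketch, the input values are hypothetical, and the function names are illustrative.

```python
import math

def vuong_statistic(loglik_zero_inflated, loglik_standard):
    """V = sqrt(N) * mean(m) / sd(m), where m_i is the difference between the
    zero-inflated and standard models' log-likelihoods for observation i.

    Uses the N-divisor standard deviation (an assumption of this sketch).
    """
    m = [a - b for a, b in zip(loglik_zero_inflated, loglik_standard)]
    n = len(m)
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((mi - mean_m) ** 2 for mi in m) / n)
    return math.sqrt(n) * mean_m / sd_m

def vuong_decision(v, critical=1.96):
    """Apply the decision rule described in the text."""
    if v > critical:
        return "zero-inflated model (ZIP or ZINB) favored"
    if v < -critical:
        return "standard model (PRM or NBRM) preferred"
    return "neither model preferred"

# Hypothetical per-observation log-likelihoods for four observations.
ll_zip = [-0.9, -1.1, -1.4, -0.8]
ll_prm = [-1.6, -1.5, -2.3, -1.4]
v = vuong_statistic(ll_zip, ll_prm)
print(f"V = {v:.3f}: {vuong_decision(v)}")
```

The function is undefined when all differences are identical, since the standard deviation is then zero.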
1.4 Estimation in SAS, STATA, and LIMDEP
The SAS GENMOD procedure estimates Poisson and negative binomial regression models. STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data models. LIMDEP has the Poisson$ and Negbin$ commands to estimate various count data models, including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and commands for count data regression models.
Table 2. Comparison of the Procedures and Commands for Count Data Models
Model | SAS 9.1 | STATA 9.0 SE | LIMDEP 8.0 |
Poisson Regression (PRM) | GENMOD | .poisson | Poisson$ |
Negative Binomial Regression (NBRM) | GENMOD | .nbreg | Negbin$ |
Zero-inflated Poisson (ZIP) | - | .zip | Poisson; Zip; Rh2$ |
Zero-inflated Negative Binomial (ZINB) | - | .zinb | Negbin; Zip; Rh2$ |
Zero-truncated Poisson (ZTP) | - | .ztp | Poisson; Truncation$ |
Zero-truncated Negative Binomial (ZTNB) | - | .ztnb | Negbin; Truncation$ |
The example here examines how waste quotas (emps) and the strictness of policy implementation (strict) affect the frequency of waste spill accidents of plants (accident).
1.5 Long and Freese's SPost Module
STATA users may take advantage of user-written modules such as SPost, written by J. Scott Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of various CDVMs, including event count data models. See 2.3 for examples of major SPost commands.
In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long’s Web site at
http://www.indiana.edu/~jslsoc/spost_install.htm.
. net from http://www.indiana.edu/~jslsoc/stata/
. net install spost9_ado, replace
. net get spost9_do, replace