Clyde offers the following approach:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(stkcd year x)
1 2000 .
1 2001 1
1 2002 2
1 2003 .
1 2004 3
1 2005 4
1 2006 .
1 2007 5
1 2008 6
2 2000 .
2 2001 1
2 2002 2
2 2003 3
2 2004 3
2 2005 4
2 2006 .
2 2007 5
2 2008 6
3 2000 .
3 2001 1
3 2002 2
3 2003 .
3 2004 3
3 2005 4
3 2006 4
3 2007 5
3 2008 .
end
* Number the runs within each stkcd: a new run starts whenever the
* missingness of x flips or the year jumps by more than 1
by stkcd (year), sort: ///
    gen run = sum(missing(x) != missing(x[_n-1]) | year > year[_n-1]+1)
* Length of each run = number of observations in it
by stkcd run (year), sort: gen run_length = _N
* Longest run of non-missing x within each stkcd
by stkcd: egen longest_non_missing_run_length = max(cond(!missing(x), run_length, .))
* Keep only the stkcd's with at least 5 consecutive non-missing years
keep if longest_non_missing_run_length >= 5
In Stata terminology, every stkcd in your example has at least 5 consecutive years of observations. I assume what you mean is 5 consecutive years of observations with non-missing values of x. I notice that your example data have no gaps in the year variable, but the code above does not rely on that: it will work correctly even if there are some gaps.
The logic is that, with the observations sorted by year within stkcd, a spell of consecutive non-missing observations begins when x is not missing but x[_n-1] is. Here we actually start by counting runs of consecutive missing or consecutive non-missing observations; a jump of more than one year also starts a new run. Each run then has a length. Next we calculate, for each stkcd, the length of the longest run of non-missing values. Finally, we keep those stkcd's whose longest run has a length of at least 5.
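To make the run logic and the gap handling concrete, here is a minimal sketch on hypothetical toy data (one stkcd, with the year 2003 absent from the data altogether). The data values are invented for illustration only; the two by-prefixed commands are the same ones used above, and list simply displays the intermediate variables.

```stata
* Hypothetical toy data (not from the original post): year 2003 is absent
clear
input float(stkcd year x)
1 2000 1
1 2001 2
1 2002 3
1 2004 4
1 2005 5
1 2006 .
end

* Same run counter as above: a new run starts when the missingness of x
* flips or when the year jumps by more than 1
by stkcd (year), sort: ///
    gen run = sum(missing(x) != missing(x[_n-1]) | year > year[_n-1]+1)

* Number of observations in each run
by stkcd run (year), sort: gen run_length = _N

* Inspect the intermediate variables, one block per run
list stkcd year x run run_length, sepby(run)
```

With these toy values, 2000-2002 should form one run of length 3, 2004-2005 a second run of length 2 (the missing year 2003 breaks the spell), and the missing x in 2006 a third run of length 1. So although this stkcd has five non-missing values of x, its longest run of consecutive non-missing years is 3, and it would be dropped by the keep command.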