|
[quote]ywh19860616 posted on 2013-8-17 10:50 Source:
https://www.stata.com/support/faqs/data-management/duplicate-observations/
Case 1: Identifying duplicates based on a subset of variables
You wish to create a new variable named dup, based on the variables name, age, and sex, with these values:
dup = 0 record is unique
dup = 1 record is duplicate, first occurrence
dup = 2 record is duplicate, second occurrence
dup = 3 record is duplicate, third occurrence
and so on for further occurrences.
. sort name age sex
. quietly by name age sex: gen dup = cond(_N==1,0,_n)
Note the capitalization of _N and _n. (Stata interprets _N to mean the total number of observations in the by-group and _n to be the observation number within the by-group.)
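To make the roles of _N and _n concrete, here is a small sketch on a toy dataset. The six records are made up purely for illustration and are not from the FAQ. Note that clear discards whatever data is currently in memory, so try this in a fresh session.

```stata
* Hypothetical toy data, invented for illustration only
clear
input str10 name age str1 sex
"Ann" 30 "F"
"Bob" 25 "M"
"Ann" 30 "F"
"Cal" 41 "M"
"Ann" 30 "F"
"Bob" 26 "M"
end

* Same two commands as in the FAQ
sort name age sex
quietly by name age sex: gen dup = cond(_N==1, 0, _n)

* Show the result, with a separator line between by-groups
list, sepby(name age sex)
```

The three Ann/30/F records form one by-group with _N equal to 3, so they get dup = 1, 2, 3. The two Bob records differ in age, so each is its own group with _N equal to 1 and gets dup = 0, as does Cal.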
Having created the new variable dup, you could then
. tabulate dup
to see a report of the duplicate count.
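Once dup exists, a common next step is to inspect the duplicated records and, if appropriate, keep only the first occurrence of each. The following is a minimal sketch that assumes dup was created as shown above; whether dropping is the right action depends on your data.

```stata
* Assumes dup has already been generated from name, age, and sex

* Inspect every record that belongs to a group with duplicates
list name age sex dup if dup > 0

* Keep unique records (dup == 0) and first occurrences (dup == 1)
drop if dup > 1
```

Because drop is destructive, it may be worth saving the dataset under a new name before running it.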
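To base the duplicate count solely on name (run drop dup first if dup already exists, because generate will not overwrite an existing variable), type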
. sort name
. quietly by name: gen dup = cond(_N==1,0,_n)
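As an alternative to the by-group approach, Stata also ships a built-in duplicates command that covers similar ground. This sketch is not part of the quoted FAQ case; the variable name ndup is hypothetical. Note that duplicates tag counts differently from dup: it stores the number of other copies of each record, so a group of three identical records gets 2 on every row.

```stata
* Summarize how many records are unique or duplicated on these variables
duplicates report name age sex

* ndup (hypothetical name) = number of surplus copies; 0 means unique
duplicates tag name age sex, generate(ndup)

* Keep one record per name/age/sex combination;
* force is required when a variable list is given
duplicates drop name age sex, force
```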
|