After all these years working in data management, I still did not have a clear, systematic understanding of what data cleaning actually covers.
What is data cleaning? In this book, we define data cleaning to include:
• Making sure that the raw data values were accurately entered into a computer readable file.
• Checking that character variables contain only valid values.
• Checking that numeric values are within predetermined ranges.
• Checking if there are missing values for variables where complete data is necessary.
• Checking for and eliminating duplicate data entries.
• Checking for uniqueness of certain values, such as patient IDs.
• Checking for invalid date values.
• Checking that an ID number is present in each of "n" files.
• Verifying that more complex multi-file rules have been followed.
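As one illustration of the checks listed above, here is a minimal sketch of the uniqueness check on patient IDs. The data set work.sample and its values are hypothetical and only there to make the example self-contained; in practice the same PROC FREQ step would presumably be pointed at clean.patients.

```sas
/* Hypothetical sample data, invented for this sketch */
data work.sample;
   input Patno $ Gender $;
   datalines;
001 M
002 F
002 F
003 X
;
run;

/* Count rows per Patno and keep only the IDs that occur more than once */
proc freq data=work.sample noprint;
   tables Patno / out=work.dup_ids(keep=Patno Count where=(Count > 1));
run;

title "Patient IDs that appear more than once";
proc print data=work.dup_ids noobs;
run;
```

PROC FREQ is used here rather than PROC SORT with NODUPKEY because it reports the duplicated IDs without deleting anything, which seems safer for a first look at the data.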
2. Using PROC FREQ and _CHARACTER_ to list the values of character variables
title "Frequency Counts for Selected Character Variables";
proc freq data=clean.patients(drop=Patno);
   tables _character_ / nocum nopercent;
run;

title "Listing of invalid patient numbers and data values";
data _null_;
   set clean.patients;
   file print; ***send output to the output window;
   ***check Gender;
   if Gender not in ('F' 'M' ' ') then put Patno= Gender=;
   ***check Dx;
   if verify(trim(Dx),'0123456789') and not missing(Dx)
      then put Patno= Dx=;
   /***********************************************
   SAS 9 alternative:
   if notdigit(trim(Dx)) and not missing(Dx)
      then put Patno= Dx=;
   ************************************************/
   ***check AE;
   if AE not in ('0' '1' ' ') then put Patno= AE=;
run;
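A different way to get a quick overview of valid versus invalid codes is to group the values with a user-defined format and let PROC FREQ count the groups. The sketch below is an alternative to the DATA _NULL_ step above, not a replacement for it; the format name $gender and its labels are made up for this example, and it assumes Gender in clean.patients is a one-character variable as in the code above.

```sas
/* Hypothetical format: map each Gender value to a category */
proc format;
   value $gender 'F', 'M' = 'Valid'
                 ' '      = 'Missing'
                 other    = 'Invalid';
run;

title "Gender values grouped as Valid, Missing or Invalid";
proc freq data=clean.patients;
   /* MISSING keeps blank values in the table instead of dropping them */
   tables Gender / nocum nopercent missing;
   format Gender $gender.;
run;
```

This gives counts per category only. To see which patients carry the invalid values, the DATA _NULL_ listing with put Patno= is still needed.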
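3. Fixing a display problem in the SAS output
I use the English edition of SAS 9.2, and the table borders in the output window show up as garbled characters (for example a run of 傻傻傻). Editing SASV9.CFG fixes this.
Find the cfg file for your language version, open it in a text editor such as Notepad, and change the section below: wrap the line marked in red in /* */ to comment it out, and remove the comment markers from the line marked in blue.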
/* This is the OEM character set */
/* -FORMCHAR "衬诼棵糯懒?=|-/\<>*" */
/* This is the ANSI character set (for SAS Monospace font and ANSI Sasfont) */
-FORMCHAR "們剠唶垑妺?=|-/\<>*"
/* This is the ANSI character set */
/* -FORMCHAR "|----|+|---+=|-/\<>*" */
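The red and blue highlighting does not survive in plain text. Since only the SAS Monospace FORMCHAR line is active in the excerpt, the change presumably comments out that line and enables the last one, which draws borders with plain ASCII characters. As a sketch under that assumption, the same setting can be tried for a single session with the FORMCHAR= system option before touching the configuration file:

```sas
/* Assumed equivalent of enabling the plain ASCII line in SASV9.CFG: */
/* draw table borders with |, - and + for the current session only  */
options formchar="|----|+|---+=|-/\<>*";

/* Any procedure with boxed output can be used to check the result */
title "FORMCHAR check";
proc freq data=clean.patients(drop=Patno);
   tables _character_ / nocum nopercent;
run;
```

If the borders look right with this option, making the same change in SASV9.CFG applies it to every new session.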
4. The first time I ran into a log message like the one below:
NOTE 49-169: The meaning of an identifier after a quoted string may change in a future SAS
release. Inserting white space between a quoted string and the succeeding
identifier is recommended.
Chinese wording of the same note:
NOTE 49-169: 加引号的字符串后的标识符的含义可能在将来的 SAS 版本中更改。建议在加引号的字符串和标识符之间插入空格。
I resolved it by reading up on the note and checking my code.
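The post does not say which statement caused the note, so the following is only a hypothetical reproduction. It assumes the usual trigger: a closing quote immediately followed by a letter, which SAS may read as a literal suffix (such as d, t, x or n) rather than as the start of the next keyword.

```sas
data _null_;
   Gender = 'F';

   /* No space between the closing quote and THEN:        */
   /* this is the pattern expected to produce NOTE 49-169 */
   if Gender = 'F'then put 'female (no space)';

   /* Same statement with white space after the quoted string */
   if Gender = 'F' then put 'female (with space)';
run;
```

When the note appears on a line that looks correct, an unbalanced quote earlier in the program is another common cause worth checking, because it shifts where SAS thinks each string begins and ends.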

