Original poster: arpanet

Data Preparation for Data Mining Using SAS

CHAPTER
1 INTRODUCTION 1
1.1 The Data Mining Process 1
1.2 Methodologies of Data Mining 1
1.3 The Mining View 3
1.4 The Scoring View 4
1.5 Notes on Data Mining Software 4
CHAPTER
2 TASKS AND DATA FLOW 7
2.1 Data Mining Tasks 7
2.2 Data Mining Competencies 9
2.3 The Data Flow 10
2.4 Types of Variables 11
2.5 The Mining View and the Scoring View 12
2.6 Steps of Data Preparation 13
CHAPTER
3 REVIEW OF DATA MINING MODELING TECHNIQUES 15
3.1 Introduction 15
3.2 Regression Models 15
3.2.1 Linear Regression 16
3.2.2 Logistic Regression 18
3.3 Decision Trees 21
3.4 Neural Networks 22
3.5 Cluster Analysis 25
3.6 Association Rules 26
3.7 Time Series Analysis 26
3.8 Support Vector Machines 26
CHAPTER
4 SAS MACROS: A QUICK START 29
4.1 Introduction: Why Macros? 29
4.2 The Basics: The Macro and Its Variables 30
4.3 Doing Calculations 32
4.4 Programming Logic 33
4.5 Working with Strings 35
4.6 Macros That Call Other Macros 36
4.7 Common Macro Patterns and Caveats 37
4.7.1 Generating a List of Macro Variables 37
4.7.2 Double Coding 39
4.7.3 Using Local Variables 39
4.7.4 From a DATA Step to Macro Variables 40
4.8 Where to Go From Here 41
CHAPTER
5 DATA ACQUISITION AND INTEGRATION 43
5.1 Introduction 43
5.2 Sources of Data 43
5.2.1 Operational Systems 43
5.2.2 Data Warehouses and Data Marts 44
5.2.3 OLAP Applications 44
5.2.4 Surveys 44
5.2.5 Household and Demographic Databases 45
5.3 Variable Types 45
5.3.1 Nominal Variables 45
5.3.2 Ordinal Variables 46
5.3.3 Real Measures 47
5.4 Data Rollup 47
5.5 Rollup with Sums, Averages, and Counts 54
5.6 Calculation of the Mode 55
5.7 Data Integration 56
5.7.1 Merging 57
5.7.2 Concatenation 59
CHAPTER
6 INTEGRITY CHECKS 63
6.1 Introduction 63
6.2 Comparing Datasets 66
6.3 Dataset Schema Checks 66
6.3.1 Dataset Variables 66
6.3.2 Variable Types 69
6.4 Nominal Variables 70
6.4.1 Testing the Presence of All Categories 70
6.4.2 Testing the Similarity of Ratios 73
6.5 Continuous Variables 76
6.5.1 Comparing Measures from Two Datasets 77
6.5.2 Comparing the Means, Standard Deviations, and Variance 78
6.5.3 The Confidence-Level Calculations Assumptions 80
6.5.4 Comparison of Other Measures 81
CHAPTER
7 EXPLORATORY DATA ANALYSIS 83
7.1 Introduction 83
7.2 Common EDA Procedures 83
7.3 Univariate Statistics 84
7.4 Variable Distribution 86
7.5 Detection of Outliers 86
7.5.1 Identification of Outliers Using Ranges 88
7.5.2 Identification of Outliers Using Model Fitting 91
7.5.3 Identification of Outliers Using Clustering 94
7.5.4 Notes on Outliers 96
7.6 Testing Normality 96
7.7 Cross-tabulation 97
7.8 Investigating Data Structures 97
CHAPTER
8 SAMPLING AND PARTITIONING 99
8.1 Introduction 99
8.2 Contents of Samples 100
8.3 Random Sampling 101
8.3.1 Constraints on Sample Size 101
8.3.2 SAS Implementation 101
8.4 Balanced Sampling 104
8.4.1 Constraints on Sample Size 105
8.4.2 SAS Implementation 106
8.5 Minimum Sample Size 110
8.5.1 Continuous and Binary Variables 110
8.5.2 Sample Size for a Nominal Variable 112
8.6 Checking Validity of Sample 113
CHAPTER
9 DATA TRANSFORMATIONS 115
9.1 Raw and Analytical Variables 115
9.2 Scope of Data Transformations 116
9.3 Creation of New Variables 119
9.3.1 Renaming Variables 120
9.3.2 Automatic Generation of Simple Analytical Variables 124
9.4 Mapping of Nominal Variables 126
9.5 Normalization of Continuous Variables 130
9.6 Changing the Variable Distribution 131
9.6.1 Rank Transformations 131
9.6.2 Box–Cox Transformations 133
9.6.3 Spreading the Histogram 138
CHAPTER
10 BINNING AND REDUCTION OF CARDINALITY 141
10.1 Introduction 141
10.2 Cardinality Reduction 142
10.2.1 The Main Questions 142
10.2.2 Structured Grouping Methods 144
10.2.3 Splitting a Dataset 144
10.2.4 The Main Algorithm 145
10.2.5 Reduction of Cardinality Using Gini Measure 147
10.2.6 Limitations and Modifications 156
10.3 Binning of Continuous Variables 157
10.3.1 Equal-Width Binning 157
10.3.2 Equal-Height Binning 160
10.3.3 Optimal Binning 164
CHAPTER
11 TREATMENT OF MISSING VALUES 171
11.1 Introduction 171
11.2 Simple Replacement 174
11.2.1 Nominal Variables 174
11.2.2 Continuous and Ordinal Variables 176
11.3 Imputing Missing Values 179
11.3.1 Basic Issues in Multiple Imputation 179
11.3.2 Patterns of Missingness 180
11.4 Imputation Methods and Strategy 181
11.5 SAS Macros for Multiple Imputation 185
11.5.1 Extracting the Pattern of Missing Values 185
11.5.2 Reordering Variables 190
11.5.3 Checking Missing Pattern Status 194
11.5.4 Imputing to a Monotone Missing Pattern 197
11.5.5 Imputing Continuous Variables 198
11.5.6 Combining Imputed Values of Continuous Variables 200
11.5.7 Imputing Nominal and Ordinal Variables 203
11.5.8 Combining Imputed Values of Ordinal and Nominal Variables 203
11.6 Predicting Missing Values 204
CHAPTER
12 PREDICTIVE POWER AND VARIABLE REDUCTION I 207
12.1 Introduction 207
12.2 Metrics of Predictive Power 208
12.3 Methods of Variable Reduction 209
12.4 Variable Reduction: Before or During Modeling 210
CHAPTER
13 ANALYSIS OF NOMINAL AND ORDINAL VARIABLES 211
13.1 Introduction 211
13.2 Contingency Tables 211
13.3 Notation and Definitions 212
13.4 Contingency Tables for Binary Variables 214
13.4.1 Difference in Proportion 215
13.4.2 The Odds Ratio 218
13.4.3 The Pearson Statistic 221
13.4.4 The Likelihood Ratio Statistic 224
13.5 Contingency Tables for Multicategory Variables 225
13.6 Analysis of Ordinal Variables 227
13.7 Implementation Scenarios 231
CHAPTER
14 ANALYSIS OF CONTINUOUS VARIABLES 233
14.1 Introduction 233
14.2 When Is Binning Necessary? 233
14.3 Measures of Association 234
14.3.1 Notation 234
14.3.2 The F-Test 236
14.3.3 Gini and Entropy Variances 236
14.4 Correlation Coefficients 239
CHAPTER
15 PRINCIPAL COMPONENT ANALYSIS 247
15.1 Introduction 247
15.2 Mathematical Formulations 248
15.3 Implementing and Using PCA 249
15.4 Comments on Using PCA 254
15.4.1 Number of Principal Components 254
15.4.2 Success of PCA 254
15.4.3 Nominal Variables 256
15.4.4 Dataset Size and Performance 256
CHAPTER
16 FACTOR ANALYSIS 257
16.1 Introduction 257
16.1.1 Basic Model 257
16.1.2 Factor Rotation 259
16.1.3 Estimation Methods 259
16.1.4 Variable Standardization 259
16.1.5 Illustrative Example 259
16.2 Relationship Between PCA and FA 263
16.3 Implementation of Factor Analysis 263
16.3.1 Obtaining the Factors 264
16.3.2 Factor Scores 265
CHAPTER
17 PREDICTIVE POWER AND VARIABLE REDUCTION II 267
17.1 Introduction 267
17.2 Data with Binary Dependent Variables 267
17.2.1 Notation 267
17.2.2 Nominal Independent Variables 268
17.2.3 Numeric Nominal Independent Variables 273
17.2.4 Ordinal Independent Variables 273
17.2.5 Continuous Independent Variables 274
17.3 Data with Continuous Dependent Variables 275
17.3.1 Nominal Independent Variables 275
17.3.2 Ordinal Independent Variables 275
17.3.3 Continuous Independent Variables 275
17.4 Variable Reduction Strategies 275
CHAPTER
18 PUTTING IT ALL TOGETHER 279
18.1 Introduction 279
18.2 The Process of Data Preparation 279
18.3 Case Study: The Bookstore 281
18.3.1 The Business Problem 281
18.3.2 Project Tasks 282
18.3.3 The Data Preparation Code 283
APPENDIX
LISTING OF SAS MACROS 297
A.1 Copyright and Software License 297
A.2 Dependencies between Macros 298
A.3 Data Acquisition and Integration 299
A.3.1 Macro TBRollup() 299
A.3.2 Macro ABRollup() 301
A.3.3 Macro VarMode() 303
A.3.4 Macro MergeDS() 304
A.3.5 Macro ContcatDS() 304
A.4 Integrity Checks 304
A.4.1 Macro SchCompare() 304
A.4.2 Macro CatCompare() 306
A.4.3 Macro ChiSample() 307
A.4.4 Macro VarUnivar1() 308
A.4.5 Macro CVLimits() 309
A.4.6 Macro CompareTwo() 309
A.5 Exploratory Data Analysis 310
A.5.1 Macro Extremes1() 310
A.5.2 Macro Extremes2() 311
A.5.3 Macro RobRegOL() 312
A.5.4 Macro ClustOL() 312
A.6 Sampling and Partitioning 313
A.6.1 Macro RandomSample() 313
A.6.2 Macro R2Samples() 313
A.6.3 Macro B2Samples() 315
A.7 Data Transformations 318
A.7.1 Macro NorList() 318
A.7.2 Macro NorVars() 319
A.7.3 Macro AutoInter() 320
A.7.4 Macro CalcCats() 321
A.7.5 Macro MappCats() 322
A.7.6 Macro CalcLL() 323
A.7.7 Macro BoxCox() 324
A.8 Binning and Reduction of Cardinality 325
A.8.1 Macro GRedCats() 325
A.8.2 Macro GSplit() 329
A.8.3 Macro AppCatRed() 331
A.8.4 Macro BinEqW() 332
A.8.5 Macro BinEqW2() 332
A.8.6 Macro BinEqW3() 333
A.8.7 Macro BinEqH() 334
A.8.8 Macro GBinBDV() 336
A.8.9 Macro AppBins() 340
A.9 Treatment of Missing Values 341
A.9.1 Macro ModeCat() 341
A.9.2 Macro SubCat() 342
A.9.3 Macro SubCont() 342
A.9.4 Macro MissPatt() 344
A.9.5 Macro ReMissPat() 347
A.9.6 Macro CheckMono() 349
A.9.7 Macro MakeMono() 350
A.9.8 Macro ImpReg() 351
A.9.9 Macro AvgImp() 351
A.9.10 Macro NORDImp() 352
A.10 Analysis of Nominal and Ordinal Variables 352
A.10.1 Macro ContinMat() 352
A.10.2 Macro PropDiff() 353
A.10.3 Macro OddsRatio() 354
A.10.4 Macro PearChi() 355
A.10.5 Macro LikeRatio() 355
A.10.6 Macro ContPear() 356
A.10.7 Macro ContSpear() 356
A.10.8 Macro ContnAna() 357
A.11 Analysis of Continuous Variables 358
A.11.1 Macro ContGrF() 358
A.11.2 Macro VarCorr() 359
A.12 Principal Component Analysis 360
A.12.1 Macro PrinComp1() 360
A.12.2 Macro PrinComp2() 360
A.13 Factor Analysis 362
A.13.1 Macro Factor() 362
A.13.2 Macro FactScore() 362
A.13.3 Macro FactRen() 363
A.14 Predictive Power and Variable Reduction II 363
A.14.1 Macro GiniCatBDV() 363
A.14.2 Macro EntCatBDV() 364
A.14.3 Macro PearSpear() 366
A.14.4 Macro PowerCatBDV() 367
A.14.5 Macro PowerOrdBDV() 368
A.14.6 Macro PowerCatNBDV() 370
A.15 Other Macros 372
A.15.1 ListToCol() 372
Bibliography 373
Index 375
About the Author 393

Keywords: Data Preparation, Data Mining, SAS

Data Preparation for Data mining using sas.pdf

2.66 MB

Cost: 3 forum coins


2
xmzhermione posted on 2011-4-6 21:13:53
Many thanks to the OP for sharing so generously.

3
junmeili posted on 2011-4-7 09:26:18
Some pages are missing in the middle. Could the OP fill them in? Forum coins are not easy to earn.

4
milanblood posted on 2011-9-11 16:54:25
Downloaded it; I'll see how useful it is.

5
gaotao0727 posted on 2012-11-22 10:52:21
Downloaded it. I hope it's a good book.

6
gaotao0727 posted on 2012-11-29 14:20:40
OP, do you have the code and data from the book? Could you share them? Thanks in advance!

7
davil2000 posted on 2012-12-23 10:00:44
Refaat, Mamdouh 2006

8
giuhn posted on 2013-12-11 13:04:22
Thank you very much! This book seems to be cited in many places in Credit Risk Scorecards: Development and Implementation Using SAS.

9
ゞ_。大噗 posted on 2015-11-10 13:09:10
Thanks to the OP for sharing.

10
baiyaoqian posted on 2023-9-13 14:35:59
Thanks to the OP for sharing. I'll download it and have a read.
