大概在1999年,SIG(CRISP-DM Special Interest Group)组织开发并提炼出CRISP-DM,同时在Mercedes-Benz和OHRA(保险领域)企业进行了大规模数据挖掘项目的实际试用。SIG还将CRISP-DM和商业数据挖掘工具集成起来。SIG组织目前在伦敦、纽约、布鲁塞尔已经发展到200多个成员。2000年,CRISP-DM 1.0版正式推出,应该说CRISP-DM是实际项目的经验总结和理论抽象。 CRISP-DM强调,DM不单是数据的组织或者呈现,也不仅是数据分析和统计建模,而是一个从理解业务需求、寻求解决方案到接受实践检验的完整过程。
CRISP-DM的六个阶段
CRISP-DM过程描述CRISP-DM 模型为一个KDD工程提供了一个完整的过程描述。一个数据挖掘项目的生命周期包含六个阶段。这六个阶段的顺序是不固定的,我们经常需要前后调整这些阶段。这依赖每个阶段或是阶段中特定任务的产出物是否是下一个阶段必须的输入。上图中箭头指出了最重要的和依赖度高的阶段关系。
上图的外圈象征数据挖掘自身的循环本质――在一个解决方案发布之后一个数据挖掘的过程才可以继续。在这个过程中得到的知识可以触发新的,经常是更聚焦的商业问题。后续的过程可以从前一个过程得到益处。
业务理解(Business Understanding)
最初的阶段集中在理解项目目标和从业务的角度理解需求,同时将这个知识转化为数据挖掘问题的定义和完成目标的初步计划。
数据理解(Data Understanding)
数据理解阶段从初始的数据收集开始,通过一些活动的处理,目的是熟悉数据,识别数据的质量问题,首次发现数据的内部属性,或是探测引起兴趣的子集去形成隐含信息的假设。
数据准备(Data Preparation)
数据准备阶段包括从未处理数据中构造最终数据集的所有活动。这些数据将是模型工具的输入值。这个阶段的任务有个能执行多次,没有任何规定的顺序。任务包括表、记录和属性的选择,以及为模型工具转换和清洗数据。
建模(Modeling)
在这个阶段,可以选择和应用不同的模型技术,模型参数被调整到最佳的数值。一般,有些技术可以解决一类相同的数据挖掘问题。有些技术在数据形成上有特殊要求,因此需要经常跳回到数据准备阶段。
评估(Evaluation)
到项目的这个阶段,你已经从数据分析的角度建立了一个高质量显示的模型。在开始最后部署模型之前,重要的事情是彻底地评估模型,检查构造模型的步骤,确保模型可以完成业务目标。这个阶段的关键目的是确定是否有重要业务问题没有被充分的考虑。在这个阶段结束后,一个数据挖掘结果使用的决定必须达成。
部署(Deployment)
通常,模型的创建不是项目的结束。模型的作用是从数据中找到知识,获得的知识需要便于用户使用的方式重新组织和展现。根据需求,这个阶段可以产生简单的报告,或是实现一个比较复杂的、可重复的数据挖掘过程。在很多案例中,这个阶段是由客户而不是数据分析人员承担部署的工作。
The life cycle of a data mining project consists of six phases. The sequence of the phases is not strict. Moving back and forth between different phases is always required. It depends on the outcome of each phase which phase, or which particular task of a phase, that has to be performed next. The arrows indicate the most important and frequent dependencies between phases.
The outer circle in the figure symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions. Subsequent data mining processes will benefit from the experiences of previous ones.
Below follows a brief outline of the phases:
Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly uate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.