What This Book Is About
This book is about what to do with data to get the most out of it. There is a lot more to that
statement than first meets the eye.
Much information is available today about data warehouses, data mining, KDD, OLTP,
OLAP, and a whole alphabet soup of other acronyms that describe techniques and
methods of storing, accessing, visualizing, and using data. There are books and
magazines about building models for making predictions of all types—fraud, marketing,
new customers, consumer demand, economic statistics, stock movement, option prices,
weather, sociological behavior, traffic demand, resource needs, and many more.
Industry professionals almost universally agree that, whether the goal is to use these
techniques or to make these predictions, one of the most important parts of any such project,
and one of the most time-consuming and difficult, is data preparation. Unfortunately, data preparation
has been much like the weather—as the old aphorism has it, “Everyone talks about it, but
no one does anything about it.” This book takes a detailed look at the problems in
preparing data, the solutions to those problems, and how to apply the solutions to get the
most out of the data, whatever you want to use it for. It explains what can be done about
data preparation, exactly how to do it, and what it achieves, and it puts a powerful kit of
tools directly in your hands for doing it.
How important is adequate data preparation? Once the right problem to solve has been found, data
preparation is often the key to solving it. It can easily be the difference between
success and failure, between useable insights and incomprehensible murk, between
worthwhile predictions and useless guesses.
For instance, in one case data carefully prepared for warehousing proved useless for
modeling. The preparation for warehousing had destroyed the useable information content
for the needed mining project. Preparing the data for mining, rather than warehousing,
produced a 550% improvement in model accuracy. In another case, a commercial baker
achieved a bottom-line improvement approaching $1 million by using data prepared with the
techniques described in this book instead of previous approaches.