xiv Preface
mining and knowledge discovery problems discussed throughout this monograph.
It pays extra attention to the reasons that lead to formulate some of these problems
as optimization problems since one always needs to keep control on the size (i.e.,
for size minimization) of the extracted new rules or when one tries to gain a deeper
understanding of the system of interest by issuing a small number of new queries
(i.e., for query minimization).
The second and third chapters present some sophisticated branch-and-bound
algorithms for extracting a pattern (in the form of a compact Boolean function)
from collections of observations grouped into two disjoint classes. The fourth chapter
presents some fast heuristics for the same problem.
The fifth chapter studies the problem of guided learning. That is, now the analyst
has the option to decide the composition of the observation to send to an expert or
“oracle” for the determination of its class membership. Apparently, the goal now is
to gain a good understanding of the system of interest by issuing a small number of
inquiries of the previous type.
A related problem is studied in the sixth chapter. Now it is assumed that the
analyst has two sets of examples (observations) and a Boolean function that is
inferred from these examples. Furthermore, it is assumed that the analyst has a new
example that invalidates this Boolean function. Thus, the problem is how to modify
the Boolean function such that it satisfies all the requirements of the available examples
plus the new example. This is known as the incremental learning problem.
Chapter 7 presents an intriguing duality relationship which exists between
Boolean functions expressed in CNF (conjunctive normal form) and DNF (disjunctive
normal form), which are inferred from examples. This dual relationship could
be used in solving large-scale inference problems, in addition to other algorithmic
advantages.
The chapter that follows describes a graph theoretic approach for decomposing
large-scale data mining problems. This approach is based on the construction of a
special graph, called the rejectability graph, from two collections of data. Then certain
characteristics of this graph, such as its minimum clique cover, can lead to some
intuitive and very powerful decomposition strategies.
Part II (“Application Issues”) begins with Chapter 9. This chapter presents an
intriguing problem related to any model (and not only those based on logic methods)
inferred from grouped observations. This is the problem of the reliability of the
model and it is associated with both the number of the training data (sampled observations
grouped into two disjoint classes) and also the nature of these data. It is
argued that many model inference methods today may derive models that cannot
guarantee the reliability of their predictions/classifications. This chapter prepares the
basic arguments for studying a potentially very critical type of Boolean functions
known as monotone Boolean functions.
The problems of inferring a monotone Boolean function from inquiries to an
expert (“oracle”), along with some key mathematical properties and some application
issues are discussed in Chapters 10 and 11. Although this type of Boolean functions
has been known in the literature for some time, it was the author of this book along
with some of his key research associates who made some intriguing contributions
Preface xv
to this part of the literature in recent years. Furthermore, Chapter 11 describes some
key problems in assessing the effectiveness of data mining and knowledge discovery
models (and not only for those which are based on logic). These issues are referred
to as the “three major illusions” in evaluating the accuracy of such models. There it
is shown that many models which are considered as highly successful, in reality may
even be totally useless when one studies their accuracy in depth.
Chapter 12 presents how some of the previous methods for inferring a Boolean
function from observations can be used (after some modifications) to extract what is
known in the literature as association rules. Traditional methods suffer the problem
of extracting an overwhelming number of association rules and they are doing so in
exponential time. The new methods discussed in this chapter are based on some fast
(of polynomial time) heuristics that can derive a compact set of association rules.
Chapter 13 presents some new methods for analyzing and categorizing text documents.
Since theWeb has made possible the availability of immense textual (and not
only) information easily accessible to anyone with access to it, such methods are
expected to attract even more interest in the immediate future.
Chapters 14, 15, and 16 discuss some real-life case studies. Chapter 14 discusses
the analysis of some real-life EMG (electromyography) signals for predicting muscle
fatigue. The same chapter also presents a comparative study which indicates that the
proposed logic-based methods are superior to some of the traditional methods used
for this kind of analysis.
Chapter 15 presents some real-life data gathered from the analysis of cases suspected
of breast cancer. Next these data are transformed into equivalent binary data
and then some diagnostic rules (in the form of compact Boolean functions) are
extracted by using the methods discussed in earlier chapters. These rules are next
presented in the form of IF-THEN logical expressions (diagnostic rules).
Chapter 16 presents a combination of some of the proposed logic methods with
fuzzy logic. This is done in order to objectively capture fuzzy data that may play a
key role in many data mining and knowledge discovery applications. The proposed
new method is demonstrated in characterizing breast lesions in digital mammography
as lobular or microlobular. Such information is highly significant in analyzing
medical data for breast cancer diagnosis.
The last chapter presents some concluding remarks. Furthermore, it presents
twelve different areas that are most likely to experience high interest for future
research efforts in the field of data mining and knowledge discovery.
All the above chapters make clear that methods based on mathematical logic
already play an important role in data mining and knowledge discovery. Furthermore,
such methods are almost guaranteed to play an even more important role in the near
future as such problems increase both in complexity and in size.