Diagnosing breast cancer with the kNN algorithm
Routine breast cancer screening allows the disease to be diagnosed and treated before it causes noticeable symptoms. The process of early detection involves examining the breast tissue for abnormal lumps or masses. If a lump is found, a fine-needle aspiration biopsy is performed, which utilizes a hollow needle to extract a small sample of cells from the mass. A clinician then examines the cells under a microscope to determine whether the mass is likely to be malignant or benign.

If machine learning could automate the identification of cancerous cells, it would provide considerable benefit to the health system. Automated processes are likely to improve the efficiency of the detection process, allowing physicians to spend less time diagnosing and more time treating the disease. An automated screening system might also provide greater detection accuracy by removing the inherently subjective human component from the process.
We will investigate the utility of machine learning for detecting cancer by applying the kNN algorithm to measurements of biopsied cells from women with abnormal breast masses.
Step 1 – collecting data
We will utilize the "Breast Cancer Wisconsin Diagnostic" dataset from the UCI Machine Learning Repository, which is available at http://archive.ics.uci.edu/ml. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of a fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital image.
The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements. The diagnosis is coded as M to indicate malignant or B to indicate benign.
The 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei. These include:
Radius
Texture
Perimeter
Area
Smoothness
Compactness
Concavity
Concave points
Symmetry
Fractal dimension
Based on their names, all of the features seem to relate to the shape and size of the cell nuclei. Unless you are an oncologist, you are unlikely to know how each relates to benign or malignant masses. These patterns will be revealed as we continue in the machine learning process.
Step 2 – exploring and preparing the data
Let's explore the data and see if we can shine some light on the relationships. At the same time, we will prepare the data for use with the kNN learning method.
- Tip
- If you plan on following along, download the wisc_bc_data.csv file from the Packt website and save it to your R working directory. The dataset was modified very slightly for this book. In particular, a header line was added and the rows of data were randomly ordered.
We begin by importing the CSV file and saving it to the wbcd data frame:
- > wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
- Tip
- Regardless of the machine learning method, ID variables should always be excluded. Neglecting to do so can lead to erroneous findings because the ID can be used to uniquely "predict" each example. Therefore, a model that includes an identifier will most likely suffer from overfitting, and is not likely to generalize well to other data.
The following command removes the id feature, which is in the first column:
- > wbcd <- wbcd[-1]
The next variable, diagnosis, is of particular interest, as it is the outcome we hope to predict. The table() function shows how many masses fall into each class:
- > table(wbcd$diagnosis)
Many R machine learning classifiers require the target feature to be coded as a factor, so we will recode diagnosis and give its values more informative labels:
- > wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
The prop.table() output shows the percentage of masses in each class:
- > round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
The remaining 30 features are all numeric. Let's take a closer look at three of them:
- > summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
Looking at the features side-by-side, do you notice anything problematic about the values? Recall that the distance calculation for kNN is heavily dependent upon the measurement scale of the input features. As smoothness_mean ranges from 0.05 to 0.16, while area_mean ranges from 143.5 to 2501.0, the impact of area is going to be much larger than smoothness in the distance calculation. This could potentially cause problems for our classifier, so let's apply normalization to rescale the features to a standard range of values.
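To see the problem concretely, consider a small worked example; the two patients and their measurement values here are hypothetical, chosen only to illustrate the scale mismatch:
- > patient_a <- c(0.05, 500)  # (smoothness_mean, area_mean) for a hypothetical patient
- > patient_b <- c(0.16, 520)
- > sqrt(sum((patient_a - patient_b)^2))  # Euclidean distance between the two
The result is almost exactly 20. The smoothness difference of 0.11, although it spans nearly the entire observed range of that feature, contributes virtually nothing to the distance compared with the area difference of 20.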
Transformation – normalizing numeric data
To rescale the features, we can define a normalize() function in R. This function takes a vector x of numeric values and, for each value, subtracts the minimum of x and divides by the range of x:
- > normalize <- function(x) {
- +   return((x - min(x)) / (max(x) - min(x)))
- + }
After executing this code, let's test the function on a couple of vectors:
- > normalize(c(1, 2, 3, 4, 5))
- > normalize(c(10, 20, 30, 40, 50))
Both calls return the identical result (0.00, 0.25, 0.50, 0.75, 1.00); although the values in the second vector are ten times larger, after normalization they are exactly the same.
We can now apply the normalize() function to the numeric features in our data frame. Rather than normalizing each of the 30 numeric variables individually, we will use one of R's functions to automate the process.
The lapply() function of R takes a list and applies a function to each element of the list. As a data frame is a list of equal-length vectors, we can use lapply() to apply normalize() to each feature in the data frame. The final step is to convert the list returned by lapply() to a data frame using the as.data.frame() function. The full process looks like this:
- > wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
To confirm that the transformation was applied correctly, let's look at one variable's summary statistics:
- > summary(wbcd_n$area_mean)
As expected, the area_mean variable, which originally ranged from 143.5 to 2501.0, now ranges from 0 to 1.
Data preparation – creating training and test datasets
Although we are ultimately interested in how the learner performs on patients it has not seen before, we do not have new patient data on hand. We can simulate this scenario by dividing our data into two portions: a training dataset that will be used to build the kNN model and a test dataset that will be used to estimate the predictive accuracy of the model. We will use the first 469 records for the training dataset and the remaining 100 to simulate new patients.
Using the data extraction methods presented in Chapter 2, Managing and Understanding Data, we will split the wbcd_n data frame into the wbcd_train and wbcd_test data frames:
- > wbcd_train <- wbcd_n[1:469, ]
- > wbcd_test <- wbcd_n[470:569, ]
When we constructed our normalized training and test data, we excluded the target variable, diagnosis. For training the kNN model, we will need to store these class labels in factor vectors, split between the training and test datasets:
- > wbcd_train_labels <- wbcd[1:469, 1]
- > wbcd_test_labels <- wbcd[470:569, 1]
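Because the rows of the dataset were randomized before splitting, each subset should contain a roughly representative share of benign and malignant masses. As a quick sanity check (an extra step, not required by the workflow), we can compare the class proportions:
- > prop.table(table(wbcd_train_labels))
- > prop.table(table(wbcd_test_labels))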
Step 3 – training a model on the data
Equipped with our training data and labels vector, we are now ready to classify our unknown records. For the kNN algorithm, the training phase actually involves no model building—the process of training a lazy learner like kNN simply involves storing the input data in a structured format.
To classify our test instances, we will use a kNN implementation from the class package, which provides a set of basic R functions for classification. If this package is not already installed on your system, you can install it by typing:
- > install.packages("class")
The knn() function in the class package provides a standard, classic implementation of the kNN algorithm. For each instance in the test data, the function identifies the k nearest neighbors using Euclidean distance, where k is a user-specified number. The test instance is classified by taking a "vote" among these k neighbors; specifically, it is assigned the class of the majority of them. A tie vote is broken at random.
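For intuition, the heart of this voting procedure can be sketched in a few lines of base R. This is a simplified illustration of the logic just described, not the class package's actual implementation, and the function name knn_one is made up for this example:
- > knn_one <- function(train, labels, test_row, k) {
- +   # Euclidean distance from test_row to every training row
- +   dists <- sqrt(rowSums(sweep(train, 2, test_row)^2))
- +   # labels of the k closest training instances
- +   votes <- labels[order(dists)[1:k]]
- +   tally <- table(votes)
- +   # majority class (possibly more than one if tied)
- +   winners <- names(tally)[tally == max(tally)]
- +   # a tie vote is broken at random
- +   sample(winners, 1)
- + }
A call such as knn_one(wbcd_train, wbcd_train_labels, as.numeric(wbcd_test[1, ]), 21) should generally agree with the class package's knn() for the first test record.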
We already have nearly everything that we need to apply the kNN algorithm to this data. We split our data into training and test datasets, each with exactly the same numeric features. The labels for the training data are stored in a separate factor vector. The only remaining parameter is k, which specifies the number of neighbors to include in the vote.
As our training data includes 469 instances, we might try k = 21, an odd number roughly equal to the square root of 469 (about 21.7). Using an odd number reduces the chance of ending with a tie vote.

Now we can use the knn() function to classify the test data:
- > wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
Step 4 – evaluating model performance
The next step of the process is to evaluate how well the predicted classes in the wbcd_test_pred vector match up with the known values in the wbcd_test_labels vector. To do this, we can use the CrossTable() function in the gmodels package, which was introduced in Chapter 2, Managing and Understanding Data. If you haven't done so already, please install this package using the command install.packages("gmodels").
- > CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
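The cross table gives a detailed breakdown of agreement and disagreement between the predicted and actual labels. If you also want a single overall accuracy figure, a quick comparison of the two vectors works; this one-liner is a convenience added here, not part of the original workflow:
- > mean(wbcd_test_pred == wbcd_test_labels)  # proportion of test records classified correctly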
Step 5 – improving model performance
We will attempt two simple variations on our previous classifier. First, we will employ an alternative method for rescaling our numeric features. Second, we will try several different values for k.
To standardize the features, we can use R's built-in scale() function, which by default rescales values using the z-score standardization and can be applied directly to a data frame:
- > wbcd_z <- as.data.frame(scale(wbcd[-1]))
- > summary(wbcd_z$area_mean)
The mean of a z-score standardized variable should always be zero, and the range should be fairly compact; values greater than 3 or less than -3 indicate extremely rare values.
As we did before, we need to divide the z-score transformed data into training and test sets, classify the test instances using knn(), and compare the predicted labels to the actual labels with CrossTable():
- > wbcd_train <- wbcd_z[1:469, ]
- > wbcd_test <- wbcd_z[470:569, ]
- > wbcd_train_labels <- wbcd[1:469, 1]
- > wbcd_test_labels <- wbcd[470:569, 1]
- > wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
- > CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
Testing alternative values of k
We may be able to do even better by examining performance across various values of k. Using the normalized training and test datasets, the same 100 records can be classified with several different k values, comparing the number of false negatives and false positives produced at each setting.
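One way to run such an experiment is to loop over candidate values of k, repeating the classification and tabulating the results each time. A minimal sketch, assuming wbcd_train and wbcd_test have been recreated from the normalized wbcd_n split shown earlier; the particular k values tried here are illustrative:
- > for (k_val in c(1, 5, 11, 15, 21, 27)) {
- +   pred <- knn(train = wbcd_train, test = wbcd_test,
- +               cl = wbcd_train_labels, k = k_val)
- +   cat("k =", k_val, "\n")
- +   print(table(wbcd_test_labels, pred))  # rows: actual, columns: predicted
- + }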
Although the classifier was never perfect, the 1NN approach was able to avoid some of the false negatives, at the expense of adding false positives. It is important to keep in mind, however, that it would be unwise to tailor our approach too closely to the test data; after all, a new set of 100 patient records would likely differ somewhat from those used to measure our performance here.
Reference:
- Lantz, Brett. Machine Learning with R. Packt Publishing, October 25, 2013. Print ISBN-13: 978-1-78216-214-8. Web ISBN-13: 978-1-78216-215-5.