Logistic regression predicts the probability of the outcome being true. In this exercise, we will implement logistic regression and apply it to two different data sets. The file ex2data1.txt contains the dataset for the first part of the exercise, and ex2data2.txt contains the data we will use in the second part. To learn the basics of logistic regression in R, read this post.

In the first part of this exercise, we will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admission decision. Our task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams. The first two columns contain the exam scores and the third column contains the label.

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.

(Figure: scatter plot of the two exam scores, with admitted and not-admitted applicants marked differently.)

The model is based on the sigmoid function:

sigmoid(z) = 1 / (1 + exp(-z))

For large positive values of z, the sigmoid should be close to 1, while for large negative values it should be close to 0. Evaluating sigmoid(0) should give exactly 0.5. We can visualize the sigmoid function graphically.

(Figure: plot of the sigmoid curve.)

The cost function for logistic regression is

J(theta) = (1/m) * sum_i [ -y_i * log(h_i) - (1 - y_i) * log(1 - h_i) ],  where h_i = sigmoid(x_i . theta),

and its gradient with respect to theta_j is

dJ/dtheta_j = (1/m) * sum_i (h_i - y_i) * x_ij

We initialize the fitting parameters to zeros and write one function that calculates the cost and another that calculates the gradient. For the initial theta parameters, which are all zeros, every prediction is sigmoid(0) = 0.5, so the cost is log(2) ≈ 0.693. We could use gradient descent to find the optimal theta values, but using an optimization library converges more quickly.
So, let’s use optim, R’s general-purpose optimization function, to get the required theta values and the associated cost.

Now, let’s plot the decision boundary. Only 2 points are required to define a line, so let’s choose two endpoints.

(Figure: training data with the fitted linear decision boundary.)

After learning the parameters, you can use the model to predict whether a particular student will be admitted. For example, for a student with an Exam 1 score of 45 and an Exam 2 score of 85, we can compute the probability of admission. Next, let’s calculate the model accuracy, using a threshold of 0.5.

In the second part of the exercise, we will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected.

(Figure: scatter plot of the two microchip test scores, with accepted and rejected chips marked differently.)

The figure shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression is only able to find a linear decision boundary.
Load the data and display first 6 observations
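A minimal sketch of the loading step: the exercise files are comma-separated with no header row, so read.table with sep = "," works. Since ex2data1.txt isn’t bundled here, the snippet parses a few made-up rows from a string as a stand-in; the commented line shows the call against the real file. The column names are our own labels, not part of the file.

```r
# Real call (file not bundled here):
# data <- read.table("ex2data1.txt", sep = ",",
#                    col.names = c("exam1", "exam2", "admitted"))

# Stand-in: illustrative rows in the same format (not actual dataset values)
sample_txt <- "34.6,78.0,0
30.3,43.9,0
35.8,72.9,0
60.2,86.3,1
79.0,75.3,1
61.1,96.5,1"
data <- read.table(text = sample_txt, sep = ",",
                   col.names = c("exam1", "exam2", "admitted"))
head(data)   # display the first 6 observations
```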
Visualizing the data
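A sketch of the scatter plot using base graphics, with a small illustrative data frame standing in for the loaded dataset (the values are made up):

```r
# Illustrative stand-in for the loaded dataset
data <- data.frame(exam1    = c(34.6, 30.3, 60.2, 79.0, 45.1, 61.1),
                   exam2    = c(78.0, 43.9, 86.3, 75.3, 56.3, 96.5),
                   admitted = c(0, 0, 1, 1, 0, 1))

# Admitted applicants as filled circles, the rest as crosses
plot(data$exam1, data$exam2,
     pch = ifelse(data$admitted == 1, 19, 4),
     col = ifelse(data$admitted == 1, "blue", "red"),
     xlab = "Exam 1 score", ylab = "Exam 2 score")
legend("bottomright", legend = c("Admitted", "Not admitted"),
       pch = c(19, 4), col = c("blue", "red"))
```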
Let’s code the sigmoid function so that we can call it in the rest of our programs.
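One way to write it, using R’s vectorized arithmetic so the same function works on scalars, vectors, and matrices:

```r
# Logistic (sigmoid) function; vectorized, so it also works element-wise
# on vectors and matrices
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)   # 0.5
```

Because exp() is vectorized, no loop is needed when we later apply this to X %*% theta.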
Let’s check!
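To check, evaluate the function at a few points and draw the curve over a range (sigmoid is redefined here so the snippet runs on its own):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)            # exactly 0.5
sigmoid(c(-10, 10))   # close to 0 and close to 1

# Plot the S-shaped curve over [-10, 10]
curve(sigmoid(x), from = -10, to = 10, xlab = "z", ylab = "sigmoid(z)")
```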
Cost function and gradient
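A sketch of the two functions, written with the parameter vector as the first argument to match optim’s calling convention. This assumes X already carries an intercept column of ones; the toy data at the end is illustrative, not the exercise dataset.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# J(theta) = (1/m) * sum(-y*log(h) - (1-y)*log(1-h)), h = sigmoid(X %*% theta)
costFunction <- function(theta, X, y) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  sum(-y * log(h) - (1 - y) * log(1 - h)) / m
}

# Gradient: (1/m) * t(X) %*% (h - y)
gradFunction <- function(theta, X, y) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  as.vector(t(X) %*% (h - y)) / m
}

# With theta = 0 every prediction is 0.5, so the cost is log(2) ~ 0.693
X <- cbind(1, c(1, 2, 3, 4), c(2, 1, 4, 3))  # toy data, intercept column first
y <- c(0, 0, 1, 1)
costFunction(rep(0, 3), X, y)
```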
Add ones for the intercept term:
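The design matrix needs a leading column of ones so that the first element of theta acts as the intercept. A sketch, with the data rebuilt inline (illustrative values) so the snippet runs on its own:

```r
# Illustrative stand-in for the loaded dataset
data <- data.frame(exam1    = c(34.6, 60.2, 79.0, 45.1),
                   exam2    = c(78.0, 86.3, 75.3, 56.3),
                   admitted = c(0, 1, 1, 0))

X <- cbind(1, as.matrix(data[, c("exam1", "exam2")]))  # prepend a column of 1s
y <- data$admitted
initial_theta <- rep(0, ncol(X))   # initialize fitting parameters to zeros
```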
General-purpose Optimization in lieu of Gradient Descent
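A sketch of the optim call, using method = "BFGS" with the analytic gradient. The cost and gradient functions and a small non-separable toy dataset are redefined inline so the snippet stands alone; on the real data, X and y would come from the earlier steps.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
costFunction <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  sum(-y * log(h) - (1 - y) * log(1 - h)) / length(y)
}
gradFunction <- function(theta, X, y) {
  as.vector(t(X) %*% (sigmoid(X %*% theta) - y)) / length(y)
}

# Toy, non-separable data so the optimum is finite
X <- cbind(1, c(1, 2, 3, 4, 5, 6))
y <- c(0, 0, 1, 0, 1, 1)

res <- optim(par = rep(0, ncol(X)), fn = costFunction, gr = gradFunction,
             X = X, y = y, method = "BFGS")
res$par    # fitted theta
res$value  # cost at the optimum
```

Extra arguments after gr (here X and y) are passed through by optim to both fn and gr.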
Decision Boundary
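The boundary is the line where theta[1] + theta[2]*x1 + theta[3]*x2 = 0, i.e. x2 = -(theta[1] + theta[2]*x1) / theta[3]. A sketch with illustrative parameter values standing in for whatever optim returned:

```r
theta <- c(-25.16, 0.206, 0.201)   # illustrative fitted values, not real output

# Two endpoints on the Exam 1 axis are enough to draw the line
x1 <- c(30, 100)
x2 <- -(theta[1] + theta[2] * x1) / theta[3]

plot(x1, x2, type = "l", col = "blue",
     xlab = "Exam 1 score", ylab = "Exam 2 score",
     main = "Decision boundary")
```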
Evaluating logistic regression
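For the student with scores 45 and 85, the predicted probability is the sigmoid of the score vector (with a leading 1 for the intercept) dotted with theta. The theta below is an illustrative stand-in for the fitted parameters:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
theta <- c(-25.16, 0.206, 0.201)   # illustrative fitted values

prob <- sigmoid(sum(c(1, 45, 85) * theta))
prob   # estimated admission probability for Exam 1 = 45, Exam 2 = 85
```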
Model accuracy
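Accuracy is the fraction of training examples the model labels correctly at the 0.5 threshold. A self-contained sketch on toy data with an illustrative theta (on this toy set the classifier is exactly right, so the accuracy is 1):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Toy data and parameters so the snippet runs on its own
X <- cbind(1, c(1, 2, 3, 4))
y <- c(0, 0, 1, 1)
theta <- c(-2.5, 1)                # illustrative fitted values

p <- sigmoid(X %*% theta) >= 0.5   # predict 1 when probability >= 0.5
mean(p == y)                       # fraction of correct predictions
```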
Regularized logistic regression
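The regularized versions add a penalty (lambda / (2m)) * sum of theta_j^2 to the cost and (lambda / m) * theta_j to the gradient, for every parameter except the intercept. A sketch under those standard definitions:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Regularized cost: the usual J(theta) plus (lambda/(2m)) * sum(theta_j^2),
# summed over j >= 2 (the intercept is not penalized)
costReg <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  base <- sum(-y * log(h) - (1 - y) * log(1 - h)) / m
  base + (lambda / (2 * m)) * sum(theta[-1]^2)
}

# Regularized gradient: add (lambda/m) * theta_j to all but the intercept term
gradReg <- function(theta, X, y, lambda) {
  m <- length(y)
  g <- as.vector(t(X) %*% (sigmoid(X %*% theta) - y)) / m
  g[-1] <- g[-1] + (lambda / m) * theta[-1]
  g
}
```

These drop into the same optim call as before, with lambda passed as an extra argument (lambda = 1 is a common starting point).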