This article was originally posted here, by Mubashir Qasim.

In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite for machine learning is data analysis, not math. One of the main reasons for making this statement is that data scientists spend an inordinate amount of time on data analysis. The traditional claim is that data scientists “spend 80% of their time on data preparation.” While I think that statement is essentially correct, a more precise version is that you’ll spend 80% of your time getting data, cleaning data, aggregating data, reshaping data, and exploring data using exploratory data analysis and data visualization. (From this point forward, I’ll use the term “data analysis” as a shorthand for getting data, reshaping it, exploring it, and visualizing it.)

Ultimately, the importance of data analysis applies not only to data science generally, but to machine learning specifically. The fact is, if you want to build a machine learning model, you’ll spend huge amounts of time just doing data analysis as a precursor to that process. Moreover, you’ll use data analysis to explore the results of your model after you’ve applied an ML algorithm. Additionally, in industry, you’ll need to rely heavily on data visualization techniques to present the results after you’ve finalized them. This is one of the practical details of working as a data scientist that many courses and teachers never tell you about: creating presentations to communicate your results will take large amounts of your time, and to create those presentations, you should rely heavily on data visualization to communicate the model results visually.

Data analysis and data visualization are critical at almost every part of the machine learning workflow. So, to get started with ML (and to eventually master it), you need to be able to apply visualization and analysis.
In this post, I’ll show you some of the basic data analysis and visualization techniques you’ll need to know to build a machine learning model.

One note before we get started: the problem we’ll work through is just linear regression, and we’ll be using an easy-to-use, off-the-shelf dataset. It’s a “toy problem,” and that’s intentional. Whenever you try to learn a new skill, it is extremely helpful to isolate the different details of that skill. The skill I really want you to focus on here is data visualization (as it applies to machine learning). We’ll also perform a little bit of data manipulation, but it will be in service of analyzing and visualizing the data; we won’t be doing any data manipulation to “clean” the data. So keep in mind that we’re working on a very simplified problem, and that I’m removing or limiting several other parts of the ML workflow so we can focus strictly on preliminary visualization and analysis for machine learning.

The first step of almost any analysis or model-building effort is getting the data. For this particular analysis, we’ll use a relatively “off the shelf” dataset that’s available in R within the MASS package. The Boston dataset contains data on median house prices for houses in the Boston area. The variable that we’ll try to predict is the medv variable (median house price). The dataset has roughly a dozen other predictors that we’ll be investigating and using in our model.

As I already mentioned, the example we’ll be working through is a bit of a “toy” example, and as such, we’re working with a dataset that’s relatively easy to use. What I mean is that I’ve chosen this dataset because it’s easy to obtain and it doesn’t require much data cleaning. However, keep in mind that in a typical business or industry setting, you’ll probably need to get your data from a database using SQL, or possibly from a spreadsheet or other file.
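Before visualizing anything, it is worth confirming that the data loads and looks as expected. Here is a quick sketch in R (assuming the MASS package is installed; this inspection step is my own illustration, not part of the original post):

```r
# Load the MASS package, which ships the Boston dataset
library(MASS)

# Basic shape of the data: 506 rows and 14 columns
# (13 predictors plus the medv target)
dim(Boston)

# Summary statistics for the target variable, medv
summary(Boston$medv)
```

Running `dim(Boston)` and `summary(Boston$medv)` gives you a quick sanity check on the number of observations and the range of the target before any plotting.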
Moreover, it’s very common for data to be “messy.” The data may have lots of missing values, variable names and class names that need to be changed, or other details that need to be altered. Again, I’m intentionally leaving “data cleaning” out of this blog post for the sake of simplicity. Just keep in mind that in many cases, you’ll have some data cleaning to do.

After getting the dataset, the next step in the model-building workflow is almost always data visualization. Specifically, we’ll perform exploratory data analysis on the data to accomplish several tasks:

1. View data distributions
2. Identify skewed predictors
3. Identify outliers

Let’s begin our data exploration by visualizing the data distributions of our variables. We can start by visualizing the distribution of our target variable, medv. To do this, we’ll first use a basic histogram.

I strongly believe that the histogram is one of the “core visualization techniques” that every data scientist should master. If you want to be a great data scientist, and if you ultimately want to build machine learning models, then mastering the histogram is one of your “first steps.” By “master,” I mean that you should be able to write this code “with your eyes closed.” A good data scientist should be able to write the code to create a histogram (or scatterplot, or line chart, etc.) from scratch, without any reference material and without copying and pasting. You should be able to write it from memory almost as fast as you can type. One of the reasons I believe the histogram is so important is that we use it frequently in this sort of exploratory data analysis. When we’re performing an analysis or building a model, it is extremely common to examine the distribution of a variable. Because it’s so common, you should know this technique cold.

Below is the code to create a histogram of our target variable, medv. If you don’t really understand how this code works, I’d highly recommend that you read my blog post about how to create a histogram with ggplot2. That post explains how the histogram code works, step by step.

We’ll also create a density plot of medv. The density plot is essentially a variation of the histogram, and the code to create one is essentially identical to the histogram code, except that the second line changes from geom_histogram() to stat_density(). Speaking in terms of ggplot2 syntax, we’re replacing the histogram geom with a statistical transformation.
############################
# VISUALIZE TARGET VARIABLE
############################

library(MASS)      # provides the Boston dataset
library(ggplot2)

#~~~~~~~~~~~
# histogram
#~~~~~~~~~~~
ggplot(data = Boston, aes(x = medv)) +
  geom_histogram()

#~~~~~~~~~~~~~~
# density plot
#~~~~~~~~~~~~~~
ggplot(data = Boston, aes(x = medv)) +
  stat_density()
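The task list above also mentions identifying skewed predictors and outliers. One way to sketch both (my own illustration, not code from the original post) is to reshape the Boston data into long form with base R’s stack() and draw small-multiple histograms for every variable, plus a boxplot of medv to flag outliers:

```r
library(MASS)      # Boston dataset
library(ggplot2)

# Reshape from wide to long form: one row per (variable, value)
# pair. All Boston columns are numeric, so base R's stack()
# works without any extra packages.
boston_long <- stack(Boston)

# Small-multiple histograms: skewed predictors show up as
# lopsided distributions (e.g. long right tails).
p_hist <- ggplot(boston_long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind, scales = "free")

# A boxplot of the target variable flags outliers as points
# plotted beyond the whiskers.
p_box <- ggplot(Boston, aes(x = "", y = medv)) +
  geom_boxplot()
```

Printing p_hist or p_box renders the plots. Faceting with free scales is a judgment call here: it makes each variable’s shape readable, at the cost of making panels harder to compare directly.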