Original poster: oliyiyi

Essentials of Machine Learning Algorithms (with Python and R Codes)







oliyiyi posted on 2017-9-12 15:24:39



Google’s self-driving cars and robots get a lot of press, but the company’s real future is in machine learning, the technology that enables computers to get smarter and more personal.

– Eric Schmidt (Google Chairman)

We are probably living in the most defining period of human history: the period in which computing moved from large mainframes to PCs to the cloud. But what makes it defining is not what has happened so far, but what is coming our way in the years ahead.

What makes this period exciting for someone like me is the democratization of tools and techniques that followed the boost in computing. Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars per hour. But getting here wasn’t easy! I had my dark days and nights.

Who can benefit the most from this guide?

What I am giving out today is probably the most valuable guide I have ever created.

The idea behind creating this guide is to simplify the journey of aspiring data scientists and machine learning enthusiasts across the world. Through this guide, I will enable you to work on machine learning problems and gain from experience. I am providing a high-level understanding of various machine learning algorithms, along with R and Python code to run them. This should be sufficient to get your hands dirty.

I have deliberately skipped the statistics behind these techniques, as you don’t need to understand them at the start. So, if you are looking for a statistical understanding of these algorithms, you should look elsewhere. But if you are looking to equip yourself to start building machine learning projects, you are in for a treat.

Broadly, there are 3 types of Machine Learning Algorithms:

1. Supervised Learning

How it works: These algorithms work with a target / outcome variable (or dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
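As a minimal sketch of learning a function from labeled examples, the snippet below uses a 1-nearest-neighbour classifier (KNN with k = 1) written in plain Python. The data, the labels, and the helper name `predict_1nn` are all hypothetical and chosen only for illustration.

```python
# Hypothetical labeled training data: (height_cm, weight_kg) -> label.
# The numbers are made up purely for illustration.
train_features = [(150, 45), (155, 50), (160, 52), (175, 80), (180, 85), (185, 90)]
train_labels = ["small", "small", "small", "large", "large", "large"]


def predict_1nn(features, labels, query):
    """Return the label of the training point closest to `query`."""
    def squared_distance(point):
        return sum((p - q) ** 2 for p, q in zip(point, query))

    nearest_index = min(range(len(features)), key=lambda i: squared_distance(features[i]))
    return labels[nearest_index]


# Accuracy on the training data: the fraction of training points whose
# predicted label matches the known target.
train_predictions = [predict_1nn(train_features, train_labels, x) for x in train_features]
train_accuracy = sum(p == t for p, t in zip(train_predictions, train_labels)) / len(train_labels)

# Predict the label of a new, unseen input.
new_prediction = predict_1nn(train_features, train_labels, (178, 84))
print(train_accuracy, new_prediction)
```

A 1-nearest-neighbour model always scores perfectly on its own training data when no two training points coincide, so in practice accuracy is usually also checked on data held out from training.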

2. Unsupervised Learning

How it works: In these algorithms, we do not have any target or outcome variable to predict / estimate. They are used to cluster a population into different groups, which is widely applied to segment customers into groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.
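To make the idea of grouping without a target variable concrete, here is a small sketch of K-means on one-dimensional data in plain Python. The customer spend figures, the starting centroids, and the function name `k_means_1d` are assumptions made for this example.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical monthly spend of six customers; no target variable is given.
monthly_spend = [10, 12, 11, 95, 100, 105]
centroids, assignments = k_means_1d(monthly_spend, centroids=[0, 50])
print(centroids, assignments)
```

The algorithm only sees the spend values; the two customer segments emerge from the data rather than from labels.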

3. Reinforcement Learning

How it works: Using these algorithms, the machine is trained to make specific decisions. The machine is exposed to an environment where it trains itself continually using trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process.
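The trial-and-error loop can be sketched with a deliberately simplified setting: a single-state problem with two possible actions (a "two-armed bandit") and an epsilon-greedy agent. This is an illustrative stand-in rather than a full Markov Decision Process, and the reward probabilities, `epsilon`, and the step count are assumptions.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical environment: two actions with unknown success probabilities.
true_success_probability = [0.2, 0.8]

estimated_value = [0.0, 0.0]  # the agent's running reward estimate per action
pull_count = [0, 0]
epsilon = 0.1  # fraction of steps spent exploring at random

for step in range(5000):
    if random.random() < epsilon:
        action = random.randrange(2)  # explore: try a random action
    else:
        action = max(range(2), key=lambda a: estimated_value[a])  # exploit
    reward = 1.0 if random.random() < true_success_probability[action] else 0.0
    # Incremental mean: nudge the estimate toward the observed reward.
    pull_count[action] += 1
    estimated_value[action] += (reward - estimated_value[action]) / pull_count[action]

best_action = max(range(2), key=lambda a: estimated_value[a])
print(estimated_value, best_action)
```

The agent is never told which action is better; it works that out from the rewards it observes while occasionally exploring.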

List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

  • Linear Regression
  • Logistic Regression
  • Decision Tree
  • SVM
  • Naive Bayes
  • KNN
  • K-Means
  • Random Forest
  • Dimensionality Reduction Algorithms
  • Gradient Boosting algorithms
    • GBM
    • XGBoost
    • LightGBM
    • CatBoost

oliyiyi posted on 2017-9-12 15:25:34
1. Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation Y = a * X + b.

The best way to understand linear regression is to relive an experience from childhood. Let us say you ask a child in fifth grade to arrange the people in their class in increasing order of weight, without asking them their weights! What do you think the child will do? They would likely look at (visually analyze) the height and build of each person and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated with weight by a relationship that looks like the equation above.

In this equation:

  • Y – Dependent Variable
  • a – Slope
  • X – Independent variable
  • b – Intercept

The coefficients a and b are derived by minimizing the sum of squared distances between the data points and the regression line.
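For a single independent variable, the slope and intercept that minimize the sum of squared errors have a closed form. The sketch below computes them in plain Python; the function names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing sum((y - (a*x + b))**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b


def sum_squared_error(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data that lies close to y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
print(a, b, sum_squared_error(xs, ys, a, b))
```

Any other choice of a and b, including the "true" line y = 2x + 1, gives a sum of squared errors at least as large as the fitted one on this sample.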

In this example, we have identified the best-fit line with the linear equation y = 0.2811x + 13.9. Using this equation, we can find the weight of a person if we know their height.
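Applying the fitted equation is then a single multiplication and addition. The sketch below uses the coefficients quoted in the text; the units (height in cm, weight in kg) and the function name `predict_weight` are assumptions, since the text does not state them.

```python
# Coefficients of the best-fit line quoted in the text: y = 0.2811x + 13.9
SLOPE = 0.2811
INTERCEPT = 13.9


def predict_weight(height):
    """Estimate weight from height using the fitted line."""
    return SLOPE * height + INTERCEPT


# Assumption: height is measured in cm and weight in kg.
for height in (150, 165, 180):
    print(height, round(predict_weight(height), 2))
```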

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, and Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1) independent variables. When finding the best-fit line, you can also fit a polynomial or a curve instead of a straight line; this is known as polynomial or curvilinear regression.
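To illustrate the difference between a straight-line fit and a polynomial fit, the sketch below uses NumPy's `polyfit` on data generated from an exact quadratic. The data and the helper name `sum_squared_error` are assumptions for this example.

```python
import numpy as np

# Hypothetical data generated from an exact quadratic: y = 0.5x^2 - 2x + 3.
x = np.arange(-3.0, 4.0)
y = 0.5 * x**2 - 2.0 * x + 3.0

# A degree-1 fit is an ordinary straight line; degree 2 adds the x^2 term.
line_coefficients = np.polyfit(x, y, deg=1)
quadratic_coefficients = np.polyfit(x, y, deg=2)


def sum_squared_error(coefficients):
    """Sum of squared residuals of the polynomial with these coefficients."""
    residuals = y - np.polyval(coefficients, x)
    return float(np.sum(residuals**2))


print(quadratic_coefficients)  # highest power first
print(sum_squared_error(line_coefficients), sum_squared_error(quadratic_coefficients))
```

A polynomial fit is still linear in its coefficients, so the same least-squares machinery applies; only the set of input columns (x, x^2, ...) changes.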

Python Code

  # Import library
  # Import other necessary libraries such as pandas, numpy...
  from sklearn import linear_model

  # Load train and test datasets
  # Identify feature and response variable(s); values must be numeric NumPy arrays.
  # The right-hand names below are placeholders for your own data.
  x_train = input_variables_values_training_datasets
  y_train = target_variables_values_training_datasets
  x_test = input_variables_values_test_datasets

  # Create linear regression object
  linear = linear_model.LinearRegression()

  # Train the model using the training set and check the score (R^2)
  linear.fit(x_train, y_train)
  print('Score: \n', linear.score(x_train, y_train))

  # Equation coefficient and intercept
  print('Coefficient: \n', linear.coef_)
  print('Intercept: \n', linear.intercept_)

  # Predict output
  predicted = linear.predict(x_test)
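The template above leaves the data as placeholders. As one way to try it end to end, here is a self-contained variant that generates synthetic data; the data-generating equation, the noise level, and the train/test split are assumptions made only so the example can run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 5 plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=200)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

linear = LinearRegression()
linear.fit(x_train, y_train)

# score() returns R^2; comparing train and test values hints at overfitting.
print('Train R^2:', linear.score(x_train, y_train))
print('Test R^2:', linear.score(x_test, y_test))
print('Coefficient:', linear.coef_)
print('Intercept:', linear.intercept_)

predicted = linear.predict(x_test)
```

The fitted coefficients should land close to the values used to generate the data, which is a quick sanity check that the pipeline is wired correctly.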

R Code

  # Load train and test datasets
  # Identify feature and response variable(s); values must be numeric
  # (data frames or matrices). The right-hand names are placeholders.
  x_train <- input_variables_values_training_datasets
  y_train <- target_variables_values_training_datasets
  x_test <- input_variables_values_test_datasets
  x <- data.frame(x_train, y_train = y_train)
  # Train the model using the training set and check the summary
  linear <- lm(y_train ~ ., data = x)
  summary(linear)
  # Predict output
  predicted <- predict(linear, newdata = as.data.frame(x_test))



oliyiyi posted on 2017-9-12 15:28:26
2. Logistic Regression

Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).

Again, let us try and understand this through a simple example.

Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it or you don’t. Now imagine that you are given a wide range of puzzles / quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth grade history question, the probability of getting the answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome is modeled as a linear combination of the predictor variables.

odds = p / (1 - p) = probability of event occurring / probability of event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
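The relationship between probability, odds, and log odds can be checked numerically. In the sketch below the function names and the coefficient values `b0`, `b1`, and `x1` are hypothetical; only the formulas come from the text.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a log-odds value back to a probability (the logistic function)."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients b0, b1 and a single predictor value X1.
b0, b1, x1 = -1.5, 0.8, 3.0
log_odds = b0 + b1 * x1  # linear combination of the predictors
probability = inverse_logit(log_odds)
print(log_odds, probability, odds(probability))
```

Because `inverse_logit` undoes `logit`, any real-valued linear combination of predictors maps to a probability strictly between 0 and 1.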

Above, p is the probability of presence of the characteristic of interest. The method chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).
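One simple way to see likelihood maximization in action is plain gradient ascent on the log-likelihood of a one-predictor model. This is only a sketch: libraries typically use more sophisticated solvers, and the sample data, learning rate, and iteration count below are assumptions.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def log_likelihood(xs, ys, b0, b1):
    """Sum of ln P(y | x) over the sample for the model logit(p) = b0 + b1*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total


# Hypothetical sample: hours studied (x) and whether the puzzle was solved (y).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = 0.0, 0.0
initial_ll = log_likelihood(xs, ys, b0, b1)
learning_rate = 0.05
for _ in range(5000):
    # Gradient of the log-likelihood: sum of (y - p) and sum of (y - p) * x.
    errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
    b0 += learning_rate * sum(errors)
    b1 += learning_rate * sum(e * x for e, x in zip(errors, xs))

final_ll = log_likelihood(xs, ys, b0, b1)
print(b0, b1, initial_ll, final_ll)
```

The log-likelihood rises from its starting value as the parameters are updated, and the positive slope reflects that more hours go with a higher chance of solving the puzzle in this made-up sample.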

Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.
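A rough way to see the connection to a step function: inverting the log odds gives the S-shaped logistic curve, which behaves like a smoothed step. The sketch below compares the two; the `steepness` parameter and the function names are assumptions added for illustration.

```python
import math


def step(z):
    """Hard threshold: 1 for positive input, otherwise 0."""
    return 1.0 if z > 0 else 0.0


def logistic(z, steepness=1.0):
    """Smooth S-shaped curve; a larger `steepness` makes it closer to a step."""
    return 1 / (1 + math.exp(-steepness * z))


for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, step(z), round(logistic(z), 3), round(logistic(z, steepness=10), 3))
```

Unlike the hard step, the logistic curve is differentiable everywhere, which is what makes it practical to fit with likelihood-based methods.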

Python Code

  # Import library
  from sklearn.linear_model import LogisticRegression

  # Assumes you have X (predictors) and y (target) for the training data set
  # and x_test (predictors) for the test data set

  # Create logistic regression object
  model = LogisticRegression()

  # Train the model using the training set and check the score (mean accuracy)
  model.fit(X, y)
  print('Score: \n', model.score(X, y))

  # Equation coefficient and intercept
  print('Coefficient: \n', model.coef_)
  print('Intercept: \n', model.intercept_)

  # Predict output (class labels)
  predicted = model.predict(x_test)
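For a version that runs without your own data, the sketch below generates a synthetic binary data set and also shows how to get probabilities instead of class labels. The data-generating coefficients, sample size, and split are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption for illustration): the true
# log odds are 2*x1 - 1*x2, and labels are drawn from that probability.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_probability = 1 / (1 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))
y = (rng.uniform(size=500) < true_probability).astype(int)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(x_train, y_train)

print('Train accuracy:', model.score(x_train, y_train))
print('Test accuracy:', model.score(x_test, y_test))

predicted = model.predict(x_test)                  # hard 0/1 class labels
probabilities = model.predict_proba(x_test)[:, 1]  # estimated P(y = 1)
```

Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the fitted coefficients are shrunk slightly compared with an unregularized maximum-likelihood fit.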

R Code

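  # x_train, y_train and x_test are placeholders for your own data
  x <- data.frame(x_train, y_train = y_train)
  # Train the model using the training set and check the summary
  logistic <- glm(y_train ~ ., data = x, family = 'binomial')
  summary(logistic)
  # Predict output; type = 'response' returns probabilities instead of log odds
  predicted <- predict(logistic, newdata = as.data.frame(x_test), type = 'response')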


There are many different steps that could be tried in order to improve the model:

