Original poster: oliyiyi

R vs Python: head to head data analysis

oliyiyi posted on 2017-2-25 08:49:56

There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit.

We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head to head matchup started!

Read in a csv file

R

nba <- read.csv("nba_2013.csv")

Python

import pandas
nba = pandas.read_csv("nba_2013.csv")

The above code loads the csv file nba_2013.csv, which contains data on NBA players from the 2013-2014 season, into the variable nba in both languages. The only real difference is that in Python, we need to import the pandas library to get access to dataframes. Dataframes are available in both R and Python, and are two-dimensional tables, similar to matrices, where each column can be of a different datatype. At the end of this step, the csv file has been loaded by both languages into a dataframe.
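As a minimal sketch of what "each column can be of a different datatype" means, the following builds a small dataframe from an in-memory CSV string. The three rows are made-up sample values (only the column names player, pos and age come from the real file), and sample_csv, sample and is_numeric are hypothetical names used just for illustration.

```python
import io
import pandas

# Made-up sample rows for illustration; only the column names match nba_2013.csv.
sample_csv = io.StringIO(
    "player,pos,age\n"
    "Player A,SF,23\n"
    "Player B,C,20\n"
    "Player C,PF,27\n"
)
sample = pandas.read_csv(sample_csv)

# Each column carries its own dtype: two text columns and one integer column.
print(sample.dtypes)

# Record which columns pandas treats as numeric.
is_numeric = {
    name: pandas.api.types.is_numeric_dtype(dtype)
    for name, dtype in sample.dtypes.items()
}
print(is_numeric)
```

Only age is reported as numeric here, which is the same distinction that matters later when taking column means.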

Find the number of players

R

dim(nba)
[1] 481 31

Python

nba.shape
(481, 31)

This prints out the number of players and the number of columns in each. We have 481 rows, or players, and 31 columns containing data on the players.
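On the Python side, shape is a plain (rows, columns) tuple, so it can be unpacked directly. A small sketch, assuming a made-up three-row dataframe named toy in place of the real nba data:

```python
import pandas

# Hypothetical stand-in for the nba dataframe: 3 rows (players), 2 columns.
toy = pandas.DataFrame({"player": ["A", "B", "C"], "age": [23, 20, 27]})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy))             # len() of a dataframe is its row count
```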

Look at the first row of the data

R

head(nba, 1)

      player pos age bref_team_id
1 Quincy Acy  SF  23          TOT
[output truncated]

Python

nba.head(1)

      player pos age bref_team_id
0 Quincy Acy  SF  23          TOT
[output truncated]

This is pretty much identical. Both print out the first row of the data, and the syntax is very similar. Python is more object-oriented here, and head is a method on the dataframe object, and R has a separate head function. This is a common theme you’ll see as you start to do analysis with these languages, where Python is more object-oriented, and R is more functional.
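To illustrate the method-versus-function contrast, here is a short sketch on a made-up dataframe (toy and youngest are hypothetical names). Because operations are methods on the dataframe, steps chain left to right; a rough R counterpart, shown only as a comment, nests function calls instead.

```python
import pandas

# Made-up data for illustration.
toy = pandas.DataFrame({"player": ["A", "B", "C"], "age": [23, 20, 27]})

# Methods live on the object, so steps chain left to right.
# A rough R counterpart nests function calls: head(toy[order(toy$age), ], 1)
youngest = toy.sort_values("age").head(1)
print(youngest)
```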

Find the average of each statistic

Let’s find the average value for each statistic. The columns, as you can see, have names like fg (field goals made), and ast (assists). These are the season statistics for the player. If you want a fuller explanation of all the stats, look here.

R

sapply(nba, mean, na.rm=TRUE)

player NA
pos NA
age 26.5093555093555
bref_team_id NA
[output truncated]

Python

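nba.mean(numeric_only=True)

age             26.509356
g               53.253638
gs              25.571726
[output truncated]

There are some major differences in approach here. In both, we’re applying a function across the dataframe columns. In Python, the mean method on dataframes finds the mean of each column. Passing numeric_only=True restricts it to the numeric columns; older pandas versions skipped string columns by default, but pandas 2.0 and later raise an error on them unless this argument is given.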

In R, taking the mean of string values just results in NA (not available). However, we do need to ignore NA values when we take the mean, which is why we pass na.rm=TRUE into the mean function. If we don’t, we end up with NA for the mean of columns like x3p., the three point percentage column. Some players didn’t take three point shots, so their percentage is missing. The .mean() method in Python already ignores these missing values by default.
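The missing-value behaviour described above can be sketched on a tiny made-up dataframe. The column name x3p_pct and all values are hypothetical stand-ins for the three point percentage column; skipna and numeric_only are existing parameters of the pandas mean method.

```python
import pandas

# Hypothetical three point percentages; None marks a player with no attempts.
toy = pandas.DataFrame({
    "player": ["A", "B", "C"],
    "x3p_pct": [0.25, None, 0.75],
})

default_mean = toy["x3p_pct"].mean()             # skips missing values by default
strict_mean = toy["x3p_pct"].mean(skipna=False)  # NaN, like R's mean() without na.rm=TRUE
column_means = toy.mean(numeric_only=True)       # leaves out the string column
print(default_mean, strict_mean)
print(column_means)
```

Setting skipna=False reproduces what R does when na.rm is left at its default, which makes the difference in defaults between the two languages easy to see.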

Make pairwise scatterplots

One common way to explore a dataset is to see how different columns correlate with each other. We’ll compare the ast, fg, and trb columns.

R

library(GGally)
ggpairs(nba[,c("ast", "fg", "trb")])

Python

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(nba[["ast", "fg", "trb"]])
plt.show()

We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).
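As a numeric companion to the scatterplots, pairwise correlations can also be computed directly with the pandas corr method. This is a sketch on made-up values (only the column names ast, fg and trb come from the dataset), not part of the original walkthrough.

```python
import pandas

# Made-up values for illustration; only the column names come from the dataset.
toy = pandas.DataFrame({
    "ast": [1, 2, 3, 4],
    "fg":  [2, 4, 6, 8],
    "trb": [4, 3, 2, 1],
})

# Pairwise Pearson correlations between the three columns.
corr = toy[["ast", "fg", "trb"]].corr()
print(corr)
```

In this toy data fg rises exactly with ast and trb falls exactly as ast rises, so the matrix shows the two extremes a pair plot would reveal visually.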




fengyg posted on 2017-2-25 11:06:13
kankan
