人大经济论坛 › 论坛 › 计量经济学与统计论坛五区 › 计量经济学与统计软件 › LATEX论坛 › Python vs R: 4 Implementations of Same Machine Learn ...

CDA数据分析研究院

商业数据分析与大数据领航教育品牌



经管云课堂

经管/金融/财会/社科/名师公开课



学术培训

Stata 空间计量 SSCI Python

贵宾：通行论坛特权+数据库权限
+案例库+下载特权 VIP：论坛特权+更多下载次数
+ccerdata数据库+更高阅读权限+……

12 3 4 5 下一页

发帖

楼主: oliyiyi

4470 49

Python vs R: 4 Implementations of Same Machine Learning Technique [推广有奖]

24 个论坛币

回复本帖可获得 3 个论坛币奖励! 每人限 3 次(中奖概率 10%)

1关注
184
粉丝

版主

泰斗

还不是VIP/贵宾

TA的文库 其他...

计量文库

威望: 7 级
论坛币: 271951 个
通用积分: 31269.3519
学术水平: 1435 点
热心指数: 1554 点
信用等级: 1345 点
经验: 383775 点
帖子: 9598
精华: 66
在线时间: 5468 小时
注册时间: 2007-5-21
最后登录: 2024-4-18

楼主

oliyiyi 发表于 2017-2-25 08:54:49 |只看作者 |坛友微信交流群|倒序 |AI写论文

是否 +2 论坛币

k人参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群

赵安豆老师微信：zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

立即领取

感谢您参与论坛问题回答

经管之家送您两个论坛币！

+2 论坛币

本帖隐藏的内容

Actually, this is about two R versions (standard and improved), a Python version, and a Perl version of a new machine learning technique recently published here. We asked for help to translate the original Perl script to Python and R, and finally decided to work with Naveenkumar Ramaraju, who is currently pursuing a master's in Data Science at Indiana University. So the Python and R versions are from him.

We believe that this code comparison and translation will be very valuable to anyone learning Python or R with the purpose of applying it to data science and machine learning.

[color=rgb(255, 255, 255) !important]

The code

The source code is easy to read and has deliberately made longer than needed to provide enough details, avoid complicated iterations, and facilitate maintenance.The main output file is hdt-out2.txt. The input data set is HDT-data3.txt. You need to read this article (see section 4 after clicking, it has been updated) to check out what the code is trying to accomplish. In short, it is an algorithm to classify blog posts as popular or not based on extracted features (mostly, keywords in the title.)

The code has been written in Perl, R and Python. Perl and Python run faster than R. Click on the relevant link below to access the source code, available as a text file. The code, originally written in Perl, was translated to Python and R by Naveenkumar Ramaraju.

For those learning Python or R, this is a great opportunity.

Note regarding the R implementation

Required library: hash (R doesn't have inbuilt hash or dictionary without imports.) You can use any one of below script files.

Standard version is the literal translation of the Perl code with same variable names to the maximum extent possible.
Improved version uses functions, more data frames and more R-like approach to reduce code running time (~30 % faster) and less lines of code. Variable names would vary from Perl. Output file would have comma(,) as delimiter between IDs.

Instructions to run: Place the R file and HDT-data3.txt (input file) in root folder of R environment. Execute the '.R' file in R studio or using command line script: > Rscript HDT_improved.R R is known to be slow in text parsing. We can optimize further if all inputs are within double quotes or no quotes at all by using data frames.