楼主: oliyiyi
2809 25

R for Excel Users [推广有奖]

版主

泰斗

0%

还不是VIP/贵宾

-

TA的文库  其他...

计量文库

威望
7
论坛币
271951 个
通用积分
31269.3519
学术水平
1435 点
热心指数
1554 点
信用等级
1345 点
经验
383775 点
帖子
9598
精华
66
在线时间
5468 小时
注册时间
2007-5-21
最后登录
2024-4-18

初级学术勋章 初级热心勋章 初级信用勋章 中级信用勋章 中级学术勋章 中级热心勋章 高级热心勋章 高级学术勋章 高级信用勋章 特级热心勋章 特级学术勋章 特级信用勋章

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

本帖隐藏的内容

R for Excel Usersby Gordon Shotwell | February 14, 2017

About Gordon: Gordon Shotwell does law and data science things. This post originally appeared on his blog.


Introduction

Like most people, I first learned to work with numbers through an Excel spreadsheet. After graduating with an undergraduate philosophy degree, I somehow convinced a medical device marketing firm to give me a job writing Excel reports on the orthopedic biomaterials market. When I first started, I remember not knowing how to do anything, but after a few months I became fairly proficient with the tool, and was able to build all sorts of useful models. When you think about it, this is an amazing feature of Excel. Every day, all over the world, people open up a spreadsheet to do some data entry and then, bit by bit, learn to do increasingly complex analytical tasks. Excel is a master at teaching people how to use Excel.

R is not like that. I learned to use R as a side project during law school, and it felt a bit like training with an abusive kung-fu master in the mountains of rural China.

I couldn't get R to do anything. It wouldn't read in files, draw a plot or multiply two numbers together. All I could do was generate mystifying errors and get mocked on Stack Overflow for asking redundant questions. This was all made more frustrating by the fact that I could accomplish all of these things in Excel without much difficulty.

This is the basic pain of learning to program. Programming languages are designed to be general in their application and to allow you to accomplish a huge variety of complex tasks with the same basic set of tools. The cost of this generality is a steep learning curve. When you start learning to do basic tasks in R, you are also learning how to do complex things down the road. As you learn more and more, the marginal cost of complex analyses goes down. Excel is the opposite, and is very easy at the beginning, but the marginal cost goes up with the complexity of the problem. If you were to graph this it might look like this:

At the beginning, when you are trying to accomplish simple things like balancing a budget or entering some data by hand, R is definitely harder to learn than Excel. However, as the task gets more complex, it becomes easier to accomplish in R than Excel, because the core structures of Excel are designed for relatively simple use cases and are not the best for more complex problems. This isn't to say that you can't solve a lot of complex problems with Excel, it's just that the tool won't make it easy for you.

For a lot of us, the pain of learning to program feels like the pain of failure. When the program gives you an incomprehensible error message it feels like it's telling you that you're stupid and lack programming aptitude. But after programming for a while, you learn that nobody really understands those errors, and everybody feels like an imposter when their program fails. The pain you feel is not the pain of failure, it's just the pain of learning.


Why is learning new things so hard?!

The difficulty of learning a new tool is caused by two obstacles:


Obstacle #1: The tool is different from what you know

When you know how to use something you have this vast amount of basic vocabulary about that tool. I haven't used Excel seriously for six years, but I can still remember all of its hot-keys, formula names, and menu structures. When you're learning a new tool you don't know any of this stuff, and that automatically makes it more difficult. Additionally, you might know where to look to find help on the old tool, or how to Google questions in such a way that you find useful answers. You don't know any of these things about the new tool, which is painful.


Obstacle #2: The mental model underlying the tool is different from your current mental model

The way the new tool wants you to think about the problem is different from the way you are used to thinking about the problem. For instance, if you are used to putting your analysis in a rectangular grid, then moving to a tool which is designed around procedural commands is going to be difficult.

In my opinion obstacle #2 is by far the larger barrier for Excel users. Most of the people who learn R have some basis in programming. The mental models underlying languages like Matlab or Python, as well as statistical packages like SPSS and SAS, have a lot in common with R, and there are many resources available for translating the bits which don't make sense. Excel makes you think about analytical problems in a very different way, and there aren't very many resources for translating the two paradigms.


Four Fundamental Differences Between R and Excel1) Text-based analysis

Excel is based on the physical spreadsheet, or accountant's ledger. This was a large piece of paper with rows and columns. Records were stored in the first column on the left, calculations on those records were stored in the boxes to the right, and the sum of those calculations was totaled at the bottom. I would call this a referential model of computation which has a few qualities:

  • The data and computation are usually stored in the same place
  • Data is identified by its location on the grid. Usually you don't name a data range in Excel, but instead refer to it by its location, for instance with $A1:C$36
  • The calculations are usually the same shape as the data. In other words if you want to multiply 20 numbers stored in cells A1:An by 2, you will need 20 calculations: =A1 * 2, =A2 * 2, ...., =An * 2.

Text based data analysis is different:

  • Data and computation are separate. You have one file which stores the data and another file which stores the commands which tell the program how to manipulate that data. This leads to a procedural kind of model in which the raw data is fed through a set of instructions and the output pops out the other side.
  • Data is generally referenced by name. Instead of having a dataset which lives in the range of $A1:C$36you name the data set when you read it in, and refer to it by that name whenever you want to do something with it. You can do this with Excel by naming ranges of cells, but most people don't do this.

2) Data structures

Excel has only one basic data structure: the cell. Cells are extremely flexible in that they can store numeric, character, logical or formula information. The cost of this flexibility is unpredictability. For instance you can store the character "6" in a cell when you mean to store the number 6.

The basic R data structure is a vector. You can think of a vector like a column in an Excel spreadsheet with the limitation that all the data in that vector must be of the same type. If it is a character vector, every element must be a character; if it is a logical vector, every element must be TRUE or FALSE; if it's numeric you can trust that every element is a number. There's no such constraint in Excel: you might have a column which has a bunch of numbers, but then some explanatory test intermingled with the numbers. This isn't allowed in R.


3) Iteration

Iteration is one of the most powerful features of programming languages and is a big adjustment for Excel users. Iteration is just getting the computer to do the same thing over and over again for some period of time. Maybe you want to draw the same graph based on fifty different data sets, or read and filter a lot of data tables. In a programming language like R you write a script which works for all of the cases which you want to apply it to, and then tell the computer to do the application.

Excel analysts typically do a lot of this iteration themselves. For instance if an Excel analyst wanted to combine ten different .xls files into one big file, they would probably open each one individually, copy the data, and paste it into a master spreadsheet. The analyst is effectively taking the place of a for loop by doing one thing over and over again until a condition is met.


4) Simplification through abstraction

Another major difference is that programming encourages you to simplify your analysis by abstracting common functions from that analysis. In the example above you might find that you have to read in the same type of files over and over again and check that they have the right number of rows. R allows you to write a function which does this:

read_and_check <- function(file){  out <- read.csv(file)  if(nrow(out) == 0) {    stop("There's no data in this file!")  } else {    out  }}

All this function does is read in a .csv file and then check to see if it has more than zero rows. If it doesn't, it returns an error. Otherwise it returns the file (which is called "out"). This is a powerful approach because it helps you save time and reduce errors. For instance, if you want to check if the file has more than 23 rows, you only have to change the condition in one place rather than in several spreadsheets.

There's really no analog for these kinds of functions in an Excel-based workflow, and when most analysts get to this point they just start writing VBA code to do some of this work.


Example: Joining two tables together

I thought I'd illustrate these principles by working through the example of joining two tables together in Excel and R. Let's say that we had two data tables, one with some information about cars and another with the colour of those cars, and we want to join the two of them together. For the purpose of this exercise, we're going to assume that the number of cylinders in a car determines its colour.

library(dplyr)library(knitr)cars <- mtcarscolours <- data_frame(  cyl = unique(cars$cyl),  colour = c("Blue", "Green", "Eggplant"))kable(cars[1:10, ]) #kable is just for displaying the table




二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

关键词:Users EXCEL exce User xcel philosophy marketing remember medical numbers

缺少币币的网友请访问有奖回帖集合
https://bbs.pinggu.org/thread-3990750-1-1.html
沙发
richardgu26 发表于 2017-2-15 08:07:49 |只看作者 |坛友微信交流群

使用道具

藤椅
陈信研究员 发表于 2017-2-15 08:16:44 |只看作者 |坛友微信交流群
谢谢分享,看看呢
已有 1 人评分论坛币 收起 理由
oliyiyi + 10 精彩帖子

总评分: 论坛币 + 10   查看全部评分

使用道具

板凳
yeh77 发表于 2017-2-15 08:26:10 |只看作者 |坛友微信交流群
谢谢分享
已有 1 人评分经验 收起 理由
oliyiyi + 5 精彩帖子

总评分: 经验 + 5   查看全部评分

使用道具

报纸
luhaoyu 发表于 2017-2-15 08:31:03 |只看作者 |坛友微信交流群
好材料
已有 1 人评分经验 收起 理由
oliyiyi + 5 精彩帖子

总评分: 经验 + 5   查看全部评分

使用道具

地板
Lost杰瑞 发表于 2017-2-15 08:32:28 |只看作者 |坛友微信交流群
谢谢分享

使用道具

7
kmjiwa 发表于 2017-2-15 08:34:50 |只看作者 |坛友微信交流群
Good book.

使用道具

8
nieqiang110 学生认证  发表于 2017-2-15 08:56:38 |只看作者 |坛友微信交流群
R for Excel Users

使用道具

9
jerker 发表于 2017-2-15 09:04:16 |只看作者 |坛友微信交流群
谢谢分享

使用道具

10
jcyang 发表于 2017-2-15 09:06:37 |只看作者 |坛友微信交流群
看看,学习

使用道具

您需要登录后才可以回帖 登录 | 我要注册

本版微信群
加好友,备注jltj
拉您入交流群

京ICP备16021002-2号 京B2-20170662号 京公网安备 11010802022788号 论坛法律顾问:王进律师 知识产权保护声明   免责及隐私声明

GMT+8, 2024-4-26 16:57