Original poster: snowyapple

[Q&A] Three beginner-level code questions (sticking points from a Coursera course)

#1 (original post)

snowyapple posted on 2014-5-19 18:19:00

Bounty: 100 forum coins

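I spent several hours on this assignment, but my skills are just too limited, so I'm asking for help here. Anyone who knows some R should find these easy to solve. The required data is here:

specdata.rar (2.17 MB)
Question 1:
Write a function named 'pollutantmean' that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors (in specdata). The function 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'. Ignore any missing values coded as NA. The function prototype is as follows: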
pollutantmean <- function(directory, pollutant, id = 1:332) {
      ## 'directory' is a character vector of length 1 indicating
      ## the location of the CSV files

      ## 'pollutant' is a character vector of length 1 indicating
      ## the name of the pollutant for which we will calculate the
      ## mean; either "sulfate" or "nitrate".

      ## 'id' is an integer vector indicating the monitor ID numbers
      ## to be used

      ## Return the mean of the pollutant across all monitors list
      ## in the 'id' vector (ignoring NA values)
}

Reference output:
pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706
pollutantmean("specdata", "nitrate", 23)
## [1] 1.281

Here is what I wrote:
pollutantmean <- function(directory,pollutant,id=1:332){
  files_list <- dir(directory, full.names=T)
  data <- data.frame()
  for (i in 1:332){
    data <- rbind(data,read.csv(files_list[i]))
  }
  data_subset <- subset(data, data$ID<=max(id)&data$ID<=max(id)&data$ID>=min(id))
  if(pollutant=="sulfate"){
    result<-mean(data_subset$sulfate, na.rm=T)
  }
  if(pollutant=="nitrate"){
    result<-mean(data_subset$nitrate, na.rm=T)
  }
  return (result)
}

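My results are all correct, but:
1. The output seems to have too many digits, and the computation may take so long that the Coursera grader automatically rejected my answer. I would appreciate detailed suggestions on how to fix this.

Question 2:
Write a function with the following prototype: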
complete <- function(directory, id = 1:332) {
      ## 'directory' is a character vector of length 1 indicating
      ## the location of the CSV files

      ## 'id' is an integer vector indicating the monitor ID numbers
      ## to be used

      ## Return a data frame of the form:
      ## id nobs
      ## 1  117
      ## 2  1041
      ## ...
      ## where 'id' is the monitor ID number and 'nobs' is the
      ## number of complete cases
}

Example output:
complete("specdata", 30:25)
## id nobs
## 1 30  932
## 2 29  711
## 3 28  475
## 4 27  338
## 5 26  586
## 6 25  463

What I wrote:
complete<- function(directory,id=1:332){
  files_list <- dir(directory, full.names=T)
  data <- data.frame()
  for (i in 1:332){
    data <- rbind(data,read.csv(files_list[i]))
  }
  filecom<-vector()
  for (i in id){
    data_subset<-subset(data,data$ID==i)
    data2<-data_subset[,2:3]
    cc<-sum(complete.cases(data2))
    filecom<-rbind(filecom,c(i,cc))
  }
  colnames(filecom)<-c("id","nobs")
  return (filecom)
}

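1. The results are all correct, but:
> class(complete("specdata", 30:25))
[1] "matrix"
What I want is:
> class(complete("specdata", 30:25))
[1] "data.frame"
2. Again, the computation is very slow, and the output has quite a few digits. I would appreciate specific advice.

Question 3:
Prototype: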
corr <- function(directory, threshold = 0) {
      ## 'directory' is a character vector of length 1 indicating
      ## the location of the CSV files

      ## 'threshold' is a numeric vector of length 1 indicating the
      ## number of completely observed observations (on all
      ## variables) required to compute the correlation between
      ## nitrate and sulfate; the default is 0

      ## Return a numeric vector of correlations
}

This one is adapted from a friend's code:
corr <- function(directory,threshold=0){
  filenames <- list.files("specdata", full.names=TRUE)
  n <-length(filenames)
  cr <- numeric()

  for (i in 1:332) {
    dat <- data.frame(lapply (filenames[i], read.csv))

    datcomplete <- subset(dat, dat$sulfate != "NA" & dat$nitrate != "NA")
    check <- length(datcomplete$ID)
    if (check >= threshold & check>0) {
      cal <- cor(datcomplete$sulfate,datcomplete$nitrate)
      cr <- c(cr, cal)
    }
  }
  return(cr)
}

cr <- corr("specdata", 150)
head(cr)
## [1] -0.01896 -0.14051 -0.04390 -0.06816 -0.12351 -0.07589
summary(cr)
## Min. 1st Qu.  Median Mean 3rd Qu. Max.
## -0.2110 -0.0500  0.0946  0.1250  0.2680  0.7630
cr <- corr("specdata", 400)
head(cr)
## [1] -0.01896 -0.04390 -0.06816 -0.07589  0.76313 -0.15783
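But the results are not quite right: head matches, but the later values are slightly off... I would appreciate detailed feedback.

Taking this course with no programming experience at all is really painful, so I hope an expert can help! If there is a more concise way to write this, please just tell me directly; my approach may not have been a good one to begin with.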

#2

snowyapple posted on 2014-5-19 18:21:22

In case it's hard to read, let me upload it again.

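#3

windblood posted on 2014-5-19 18:56:02

About the slowness: why loop over read.csv so many times? Aren't you just reading one table? Doing it that way copies the data 332 times, so of course the mean comes out the same in the end.
Just using data<-read.csv(directory) should be fine.

If you want a data.frame, you can use as.data.frame.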
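
The as.data.frame suggestion can be illustrated with a small sketch. It builds a matrix with rbind the same way the loop in Question 2 does, then converts it. The numbers are toy values, not real specdata counts.

```r
# Build a small matrix row by row, as the loop in the question does
# (toy counts, not results from specdata).
filecom <- vector()
for (i in c(30, 29, 28)) {
  filecom <- rbind(filecom, c(i, i * 10))
}
colnames(filecom) <- c("id", "nobs")

# Convert the matrix to a data frame; the column names are kept.
result <- as.data.frame(filecom)
is.data.frame(result)  # TRUE
names(result)          # "id" "nobs"
```

In the question's function, this would amount to returning as.data.frame(filecom) instead of filecom.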

#4

dazekey posted on 2015-4-19 16:18:17

OP, your approach to Question 1 is far too inefficient.
Following the requirements, I've revised it as follows:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- dir(directory, full.names = TRUE)
  data <- data.frame()
  # Read only the requested monitors; looping over 'id' itself also
  # handles non-contiguous vectors such as c(2, 4, 8)
  for (i in id) {
    data <- rbind(data, read.csv(files_list[i]))
  }
  if (pollutant == "sulfate") {
    result <- mean(data$sulfate, na.rm = TRUE)
  }
  if (pollutant == "nitrate") {
    result <- mean(data$nitrate, na.rm = TRUE)
  }
  return(result)
}

#5

xush4 posted on 2015-4-21 00:56:37

I don't think Question 1 is an efficiency problem. The biggest issue in Question 1 is reading the file names: how to produce 001, 002, ... You can use this code:
filenumber<- sprintf("%03d", i)
filename<-paste( c(directory,"/",filenumber,".csv"), collapse="" )
datasinglefile<- read.csv(filename)
Giving any more would probably violate the honor code.
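
To make the zero-padding idea concrete, here is a minimal sketch. The helper name monitor_path is hypothetical, and it assumes the files are named 001.csv, 002.csv, and so on inside the directory, as described above.

```r
# Hypothetical helper: map monitor IDs to paths such as "specdata/001.csv".
# sprintf("%03d", ...) pads each ID with leading zeros to three digits,
# and both sprintf and file.path are vectorised over 'id'.
monitor_path <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

monitor_path("specdata", c(1, 23, 332))
# [1] "specdata/001.csv" "specdata/023.csv" "specdata/332.csv"
```

Building the path from the ID means the result does not depend on the order in which dir() or list.files() happens to return the files.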

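For Question 2, as.data.frame is indeed all you need, but you should still add the snippet from Question 1. Your results are correct only because the instructor did not deliberately set a case with one- or two-digit file names.

Question 3 is really hard; I tried several times without getting the right result...
My idea is to use na.omit to drop the rows containing NA in every file, which improves efficiency considerably. I think that's the same as your approach. The small discrepancy is probably still down to how the files numbered below 100 are read...

To sum up, I think your problem is still that the file names are wrong; the rest is probably fine. Because of the honor code, I can't give more code...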
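
For readers following the na.omit idea for Question 3, a rough sketch is given below. The function name corr_sketch is hypothetical. It also assumes that a monitor qualifies only when its number of complete rows is strictly greater than the threshold. That differs from the >= comparison in the question's code and may be one source of the small discrepancy, though this is not confirmed in the thread.

```r
# Sketch only: one correlation per monitor file that has enough complete rows.
corr_sketch <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  cr <- numeric(0)
  for (f in files) {
    # Read one file at a time and drop every row that contains an NA
    dat <- na.omit(read.csv(f))
    # Assumption: strictly more complete rows than 'threshold' are required
    if (nrow(dat) > threshold) {
      cr <- c(cr, cor(dat$sulfate, dat$nitrate))
    }
  }
  cr
}
```

Note that this uses the directory argument rather than a hard-coded "specdata", so it also works when the data lives elsewhere.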

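#6

xush4 posted on 2015-4-21 00:59:10

xush4 posted on 2015-4-21 00:56
I don't think Question 1 is an efficiency problem. The biggest issue in Question 1 is reading the file names: how to produce 001, 002, ... You can use this code:
...

Once the file-name issue is solved, feel free to keep asking if there are further problems. I've already worked my way up to the Capstone project, so I should be able to answer them all.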

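#7

xush4 posted on 2015-4-21 01:15:07

I found another problem in Question 1: you never actually use the pollutant variable... You only considered the nitrate case, and there are other pollutants...
Let me post a bit of code after all:
filenumber<- sprintf("%03d", i)
filename<-paste( c(directory,"/",filenumber,".csv"), collapse="" )
datasinglefile<- read.csv(filename)
colsinglefileV <- datasinglefile[,pollutant] # selects the pollutant column directly
Here pollutant is the pollutant you specified.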
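
The column-selection trick can be tried on a toy data frame. The values below are made up for illustration and are not from specdata.

```r
# Toy data frame with the same two pollutant columns (made-up values).
toy <- data.frame(sulfate = c(1.5, NA, 2.5), nitrate = c(0.4, 0.6, NA))
pollutant <- "nitrate"

# Both forms pick the column whose name is stored in 'pollutant'.
mean(toy[, pollutant], na.rm = TRUE)   # 0.5
mean(toy[[pollutant]], na.rm = TRUE)   # 0.5

# By contrast, toy$pollutant looks for a column literally named
# "pollutant" and returns NULL here.
is.null(toy$pollutant)                 # TRUE
```

This is why a single indexed expression can replace the separate if branches for "sulfate" and "nitrate".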

#8

xush4 posted on 2015-4-21 01:21:34

Keep it up, OP. Once you finish this one, the later hospital assignment is even more ***

#9

maybeone123 posted on 2015-10-23 22:23:00

pollutantmean <- function(directory, pollutant, id = 1:332) {
      k <- pollutant
      w <- list.files(directory, full.names = TRUE)
      f <- w[id]
      nc <- length(f)
      da <- data.frame()
      for(i in 1 : nc) {
            da <- rbind(da, read.csv(f[i]))
      }
      mean(da[,k], na.rm = TRUE)
}
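
On the speed complaint from the original post: growing a data frame with rbind inside a loop copies it on every iteration. A possible alternative, sketched below, is to pull just the needed column from each file and combine the values once. The name pollutantmean_sketch is hypothetical, and the code assumes files named 001.csv to 332.csv as discussed earlier in the thread.

```r
# Sketch: compute the mean without growing a data frame in a loop.
pollutantmean_sketch <- function(directory, pollutant, id = 1:332) {
  # Assumption: monitor files are named with zero-padded IDs, e.g. 001.csv
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Extract only the requested column from each file, then flatten
  values <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
  mean(values, na.rm = TRUE)
}
```

As for the number of digits, how many are displayed depends on print settings such as options(digits = 4); the returned value itself is unchanged.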

#10

maybeone123 posted on 2015-10-23 23:45:48

complete <- function(directory, id = 1:332){
      z <- list.files(directory, full.names = TRUE)
      x <- z[id]
      no <- numeric(length(x))
      for(i in 1 : length(x)){
            o <- read.csv(x[i])
            pp <- sum(complete.cases(o))
            no[i] <- pp
      }
      coml <- data.frame(id, no)
      colnames(coml) <- c("id", "nobs")
      coml
}
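
The same counting can also be written without an explicit loop. The sketch below uses vapply and is only one possible variant. The name complete_sketch is hypothetical, and it again assumes zero-padded file names such as 001.csv.

```r
# Sketch: count complete rows per monitor and return a data frame directly.
complete_sketch <- function(directory, id = 1:332) {
  # Assumption: monitor files are named with zero-padded IDs, e.g. 001.csv
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # One integer per file: the number of rows with no NA in any column
  nobs <- vapply(paths, function(p) sum(complete.cases(read.csv(p))),
                 integer(1))
  # Rows come out in the order given by 'id', e.g. 30:25 stays descending
  data.frame(id = id, nobs = unname(nobs))
}
```

Because the result is built with data.frame, class() on the return value reports "data.frame" without a separate conversion step.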
