The data format looks like this:
E00548:177:HKH53CCXY:4:2101:31629:73229 ATGCGTACCACA TACCAGCAGTTC 163 chr21 5013083 0 138M = 5013132 187
E00548:177:HKH53CCXY:4:2214:23957:16516 TACCAGCAGTTC ATGCGTACCACA 99 chr21 5013083 0 138M = 5013132 187
E00548:177:HKH53CCXY:4:2105:27702:23073 AGAGTTCACGGA TCACCGGTGATA 163 chr21 5021181 0 138M = 5021195 152
E00548:177:HKH53CCXY:4:2113:1428:65810 TCACCGGTGATA AGAGTTCACGGA 99 chr21 5021181 0 138M = 5021195 152
E00548:177:HKH53CCXY:4:1223:12246:18239 GACAGGTCATAC CTCTCCTATAGC 163 chr21 5055301 0 138M = 5055323 160
E00548:177:HKH53CCXY:4:1223:12246:18239 GACAGGTCATAC CTCTCCTATAGC 163 chr21 5055301 0 138M = 5055323 160
E00548:177:HKH53CCXY:4:2221:20872:34307 CTCTCCTATAGC GACAGGTCATAC 99 chr21 5055301 0 138M = 5055323 160
E00548:177:HKH53CCXY:4:2221:21836:35203 CTCTCCTATAGC GACAGGTCATAC 99 chr21 5055301 0 138M = 5055323 160
E00548:177:HKH53CCXY:4:1102:10094:23970 CAAGCAACCGAT TGATACCGGACA 163 chr21 5063635 53 138M = 5063652 155
E00548:177:HKH53CCXY:4:1120:5081:43062 TGATACCGGACA CAAGCAACCGAT 99 chr21 5063635 53 138M = 5063652 155
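There are 10 rows in total. I want to filter them so that, within each consecutive pair of rows, the fourth column contains exactly the two values 163 and 99 (one of each); any pair that does not is dropped. The script I wrote is as follows: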
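# Read the whitespace-delimited file; column 4 holds the flag (163 or 99)
data<-read.table("duplex.txt",header=F,fill=TRUE)
res=data.frame()
m=c(163,99)
col=nrow(data)   # number of rows, despite the variable name
i=1
# Walk the rows two at a time; keep a pair only when one flag is 163 and the other is 99
while(i<col){
  if((data[i,4] %in% m) & (data[i+1,4] %in% m) & (data[i,4]!=data[i+1,4])){
    res=rbind(res,data[i:(i+1),])
    i=i+2
  }else{
    i=i+2
  }
}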
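On just these 10 rows it finishes quickly and the result meets the condition, but when I run it on the original data it runs for a very, very long time. The original data is only 72 MB, a little over 330,000 rows. Why is it so slow? Am I stuck in an infinite loop, or is there something wrong with the script itself?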
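For comparison, here is a minimal vectorized sketch of the same pair filter. The loop above always advances `i` by 2, so it should terminate on its own; a probable cause of the slowdown is that `res` is grown with `rbind` on every iteration, and each call copies the whole accumulated data frame. The function name `filter_pairs` and the toy data frame are hypothetical, made up only for illustration, and the sketch assumes the rows are already arranged in consecutive pairs (1-2, 3-4, ...) with the flag in column 4.

```r
# Sketch only: vectorized version of the pair filter.
# Assumes rows are already grouped as consecutive pairs and that
# column 4 holds the flag. `filter_pairs` is a hypothetical helper name.
filter_pairs <- function(data, flags = c(163, 99)) {
  n <- nrow(data) - nrow(data) %% 2           # ignore a trailing unpaired row
  if (n < 2) return(data[0, , drop = FALSE])  # nothing to pair up
  first  <- data[seq(1, n, by = 2), 4]        # flags of rows 1, 3, 5, ...
  second <- data[seq(2, n, by = 2), 4]        # flags of rows 2, 4, 6, ...
  # A pair is kept when both flags are allowed and they differ,
  # i.e. one is 163 and the other is 99.
  keep <- first %in% flags & second %in% flags & first != second
  # Expand the per-pair decision back to per-row indices.
  data[which(rep(keep, each = 2)), , drop = FALSE]
}

# Toy data with the same flag pattern as the 10 example rows.
toy <- data.frame(
  V1 = paste0("read", 1:10),
  V2 = "A",
  V3 = "B",
  V4 = c(163, 99, 163, 99, 163, 163, 99, 99, 163, 99)
)
res <- filter_pairs(toy)
nrow(res)  # 6: pairs 1-2, 3-4 and 9-10 are kept
```

On the real file this would be called as `filter_pairs(data)` after the `read.table` step. Because the comparison is done on whole columns at once and the result is built with a single subsetting step, nothing is copied repeatedly.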

