| Chr | Start | End | S1 | S2 | S3 | S4 |
| chr1 | 610908 | 610908 | 92.4 | 95.4 | 96.7 | 100 |
| chr1 | 610916 | 610916 | 94.7 | 96.9 | 97.5 | 100 |
| chr1 | 610932 | 610932 | 36.7 | 40 | 73.9 | 60 |
| chr1 | 610963 | 610963 | 85.4 | 80 | 75.6 | 60 |
| chr1 | 629882 | 629882 | 4 | 3.8 | 3.2 | 3.8 |
| chr1 | 630017 | 630017 | 0 | 0 | 0 | 0 |
| chr2 | 631860 | 631860 | 0.6 | 14.3 | 0.6 | 1.5 |
| chr2 | 631933 | 631933 | 0.6 | 0 | 0.3 | 0.8 |
| chr2 | 631969 | 631969 | 0.6 | 0 | 0.6 | 1.2 |
| chr2 | 631979 | 631979 | 0 | 0 | 0.3 | 1.2 |
| chr2 | 631996 | 631996 | 0 | 0 | 0.3 | 0.4 |
| chr2 | 632011 | 632011 | 0 | 0 | 0 | 0.8 |
| chr2 | 632023 | 632023 | 7.9 | 0 | 5.6 | 8 |
| chr3 | 634024 | 634024 | NA | 0 | 0 | 0 |
| chr3 | 634028 | 634028 | NA | 0 | 0 | 0 |
| chr3 | 634047 | 634047 | NA | 0 | 0 | 0 |
| chr3 | 727034 | 727034 | 96.6 | 93.3 | 100 | 100 |
| chr3 | 727048 | 727048 | 100 | 100 | 100 | 100 |
| chr3 | 727061 | 727061 | 89.7 | 100 | 100 | 100 |
| chr3 | 727099 | 727099 | 100 | 100 | 100 | 100 |
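Among rows that share the same value in the first column (Chr), the summed distance between Start positions must not exceed 150, and the region must contain at least 3 sites. For example, the data above gives 3 regions: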
| Chr | Start | End | S1 | S2 | S3 | S4 |
| chr1 | 610908 | 610908 | 92.4 | 95.4 | 96.7 | 100 |
| chr1 | 610916 | 610916 | 94.7 | 96.9 | 97.5 | 100 |
| chr1 | 610932 | 610932 | 36.7 | 40 | 73.9 | 60 |
| chr1 | 610963 | 610963 | 85.4 | 80 | 75.6 | 60 |
| chr2 | 631933 | 631933 | 0.6 | 0 | 0.3 | 0.8 |
| chr2 | 631969 | 631969 | 0.6 | 0 | 0.6 | 1.2 |
| chr2 | 631979 | 631979 | 0 | 0 | 0.3 | 1.2 |
| chr2 | 631996 | 631996 | 0 | 0 | 0.3 | 0.4 |
| chr3 | 727048 | 727048 | 100 | 100 | 100 | 100 |
| chr3 | 727061 | 727061 | 89.7 | 100 | 100 | 100 |
| chr3 | 727099 | 727099 | 100 | 100 | 100 | 100 |
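One possible reading of the region rule can be sketched in base R. This is only a sketch under an assumption: "distance" is taken to mean the gap between consecutive Start values on the same chromosome, so a new region starts whenever that gap exceeds 150. If the total span of a region is what must stay within 150, the grouping step would need to change. The names `sites`, `gap`, `region` and `n_sites` are made up for illustration, and only the chr1 rows from the input are used.

```r
# chr1 rows copied from the input table; sample columns omitted for brevity
sites <- data.frame(
  Chr   = rep("chr1", 6),
  Start = c(610908, 610916, 610932, 610963, 629882, 630017)
)

# Sort so that consecutive rows are neighbours on the same chromosome
sites <- sites[order(sites$Chr, sites$Start), ]

# Gap to the previous site within each chromosome (0 for the first site)
gap <- ave(sites$Start, sites$Chr, FUN = function(x) c(0, diff(x)))

# Assumption: a gap larger than 150 opens a new region
sites$region <- ave(as.numeric(gap > 150), sites$Chr, FUN = cumsum)

# Count the sites per region and keep regions with at least 3 sites
sites$n_sites <- ave(sites$Start, sites$Chr, sites$region, FUN = length)
regions <- sites[sites$n_sites >= 3, ]
print(regions)
```

On these six rows the first four sites form one region, while 629882 and 630017 form a region of only two sites and are dropped.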
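In addition, each region must satisfy the following condition:
For every row, the maximum of columns 4-7 must not exceed 2, and the mean must not exceed 1.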
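The per-row condition could be checked roughly like this. It is a minimal sketch that assumes NA values are simply ignored when taking the maximum and the mean, which the question does not state. The names `rows`, `samples`, `row_max`, `row_mean` and `kept` are illustrative; the six rows are copied from the chr2 part of the input table.

```r
# Six chr2 rows copied from the input table
rows <- data.frame(
  Chr   = "chr2",
  Start = c(631860, 631933, 631969, 631979, 631996, 632023),
  S1    = c(0.6, 0.6, 0.6, 0, 0, 7.9),
  S2    = c(14.3, 0, 0, 0, 0, 0),
  S3    = c(0.6, 0.3, 0.6, 0.3, 0.3, 5.6),
  S4    = c(1.5, 0.8, 1.2, 1.2, 0.4, 8)
)

# Per-row maximum and mean over the sample columns (assumption: NA is ignored)
samples <- as.matrix(rows[, c("S1", "S2", "S3", "S4")])
rows$row_max  <- apply(samples, 1, max, na.rm = TRUE)
rows$row_mean <- rowMeans(samples, na.rm = TRUE)

# Keep rows whose maximum is at most 2 and whose mean is at most 1
kept <- rows[rows$row_max <= 2 & rows$row_mean <= 1, ]
print(kept)
```

Here 631860 (maximum 14.3) and 632023 (maximum 8) fail the check, and the other four rows pass.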
The final result is therefore:
| Chr | Start | End | S1 | S2 | S3 | S4 |
| chr2 | 631933 | 631933 | 0.6 | 0 | 0.3 | 0.8 |
| chr2 | 631969 | 631969 | 0.6 | 0 | 0.6 | 1.2 |
| chr2 | 631979 | 631979 | 0 | 0 | 0.3 | 1.2 |
| chr2 | 631996 | 631996 | 0 | 0 | 0.3 | 0.4 |

| Chr | Start | End | Length | site_number | average |
| chr2 | 631933 | 631996 | 64 | 4 | 0.39375 |
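The summary row can be derived from the four filtered rows as sketched below. The formulas are my reading of the table, not something the question spells out: Length is taken as last End minus first Start plus 1, and average as the mean over all S1-S4 values in the region. The names `final`, `vals` and `summary_row` are illustrative.

```r
# The four rows of the final result, copied from the table above
final <- data.frame(
  Chr   = "chr2",
  Start = c(631933, 631969, 631979, 631996),
  End   = c(631933, 631969, 631979, 631996),
  S1    = c(0.6, 0.6, 0, 0),
  S2    = c(0, 0, 0, 0),
  S3    = c(0.3, 0.6, 0.3, 0.3),
  S4    = c(0.8, 1.2, 1.2, 0.4)
)

vals <- as.matrix(final[, c("S1", "S2", "S3", "S4")])

# One summary row for the region
summary_row <- data.frame(
  Chr         = final$Chr[1],
  Start       = min(final$Start),
  End         = max(final$End),
  Length      = max(final$End) - min(final$Start) + 1,  # 631996 - 631933 + 1 = 64
  site_number = nrow(final),                            # 4 sites
  average     = mean(vals, na.rm = TRUE)                # 6.3 / 16 = 0.39375
)
print(summary_row)
```

With several regions, the same computation would have to be repeated per region, for example after splitting the data by a region identifier.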
For the first output file, my idea is to use tidyverse:
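- library(tidyverse)
- # Read the comma-separated input; columns 4-7 hold the sample values S1-S4
- mydata <- read.table("test.txt", header = TRUE, sep = ",")
- # Per-row mean and maximum of S1-S4, ignoring NA
- data1 <- data.frame(mydata, average = round(rowMeans(mydata[, 4:7], na.rm = TRUE), 2))
- row_max <- apply(data1[, 4:7], 1, max, na.rm = TRUE)
- data2 <- cbind(data1, row_max)
- data3 <- data2 %>%
-   group_by(Chr) %>%
-   arrange(Start, .by_group = TRUE) %>%
-   # Start a new region whenever the gap to the previous site exceeds 150
-   mutate(diff = c(0, diff(Start)),
-          diff_flag = cumsum(diff > 150)) %>%
-   group_by(Chr, diff_flag) %>%
-   # Number of sites in each region
-   mutate(num = n()) %>%
-   # Keep regions with at least 3 sites and rows with mean <= 1 and max <= 2
-   filter(num >= 3 & average <= 1 & row_max <= 2) %>%
-   ungroup() %>%
-   # Keep Chr, Start, End and S1-S4
-   select(1:7)
- # Write the filtered rows, not the unfiltered data2
- write.csv(data3, "filter_1.csv", row.names = FALSE)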

