OP: ReneeBK

[Use Case] Predicting Geographic Population using Genome Variants and K-Means


Introduction

Over the last few years, we have seen a rapid reduction in the cost and time of genome sequencing. Understanding the variation between genome sequences has the potential to help us identify people who are predisposed to common diseases, solve rare diseases, and enable clinicians to personalize prescriptions and dosages to the individual.

In this three-part blog, we provide a primer on genome sequencing and its potential. We focus on genome variant analysis – that is, the differences between genome sequences – and how it can be accelerated with Apache Spark and ADAM (a scalable API and CLI for genome processing) on Databricks Community Edition. Finally, we execute a k-means clustering algorithm on genomic variant data and build a model that predicts an individual's geographic population of origin based on those variants.

This post focuses on predicting geographic population using genome variants and k-means. You can also review the refresher Genome Sequencing in a Nutshell, or more details behind Parallelizing Genome Variant Analysis.

Attachment: Predicting Geographic Population using Genome Variants and K-Means.pdf (431.76 KB)



Keywords: Population, Geographic, Predicting, K-means

#2 · ReneeBK · 2017-5-28 03:20:33
1) Launch Special Cluster

Your cluster needs to run Scala 2.10 and Spark 1.6.1 (Hadoop 2). When launching the cluster, ensure that the following configurations are set:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.bdgenomics.adam.serialization.ADAMKryoRegistrator

The library org.bdgenomics.utils.misc.Hadoop (utils-misc_2.10-0.2.4.jar) must be installed/attached to the cluster prior to installing/attaching the org.bdgenomics.adam.core (adam-core_2.10-0.19.0) library. For these exact versions, search for org.bdgenomics.utils:utils-misc_2.10:0.2.4 and org.bdgenomics.adam:adam-core_2.10:0.19.0.

You will also need the lightning-viz python client. Get it here and install/attach it to your cluster before you start: https://pypi.python.org/pypi/lightning-python

#3 · ReneeBK · 2017-5-28 03:21:23
2) Import Big Data Genomics Libraries

import org.bdgenomics.adam.converters
import org.bdgenomics.formats.avro
import org.bdgenomics.formats.avro._
import org.bdgenomics.adam.models.VariantContext
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.variation._
import org.bdgenomics.adam.rdd.ADAMContext

#4 · ReneeBK · 2017-5-28 03:22:02
3) Set file paths

Note: the datasets for this notebook do not need to be downloaded, as they are already available on DBFS at the file paths below. The sources of these files are listed below.

The original source for the VCF data is the 1000 Genomes Project. All of the data for the 1000 Genomes Project is also publicly available via AWS S3.
The abbreviated sample of chromosome 6 data used in this analysis can be downloaded (full path on S3). You can also create your own subsample using tabix. We are using a subsample VCF for tutorial purposes; ultimately, one could run the same code on the whole-chromosome VCF and filter the ADAM records by position.
The entire chromosome 6 data that was sampled from can be found on FTP or S3.
The panel is available via FTP and S3.

// Set file locations
val vcf_path = "/databricks-datasets/samples/adam/samples/6-sample.vcf"
val tmp_path = "/tmp/adam/6-sample.adam"
val panel_path = "/databricks-datasets/samples/adam/panel/integrated_call_samples_v3.20130502.ALL.panel"

#5 · ReneeBK · 2017-5-28 03:22:41
4) Load VCF data and convert/save into ADAM Parquet format

Parquet files are smaller than VCF files. More importantly, Parquet files are designed to be distributed across a cluster (VCF files are not).

Output:
gts: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.Genotype] = MapPartitionsRDD[2111] at flatMap at ADAMContext.scala:800
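Only the notebook output survives in this post; the code that produced it is not shown. A minimal sketch of the load-and-save step, assuming ADAM 0.19's loadGenotypes and adamParquetSave implicits:

// Sketch (assumed, not from this post): load the VCF as an RDD[Genotype]
// via ADAM's implicits, then persist it once in Parquet format.
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Genotype

val gts: RDD[Genotype] = sc.loadGenotypes(vcf_path)  // matches the output type above
gts.adamParquetSave(tmp_path)                        // ADAM 0.19 Parquet save (assumed)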


#6 · ReneeBK · 2017-5-28 03:23:02
5) Prepare to read the ADAM data into RDDs

VCF data contains sample IDs but not population codes. Although we are running an unsupervised algorithm in this analysis, we still need the response variables in order to filter our samples and to estimate our prediction error. We can get the population code for each sample from the panel file.

Output:
panel: org.apache.spark.sql.DataFrame = [sample: string, pop: string, super_pop: string, gender: string]
res6: Long = 2504
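Again, only the output is shown. A sketch of reading the tab-separated panel file into the DataFrame above, assuming the spark-csv package (com.databricks:spark-csv) is attached to the cluster:

// Sketch (assumed): read the tab-delimited panel file with spark-csv.
val panel = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .load(panel_path)

panel.count()  // 2504 samples, matching res6 above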


#7 · ReneeBK · 2017-5-28 03:23:46
6) Read some of the ADAM data into RDDs to begin parallel processing of genotypes

Parquet files enable predicate pushdown, so it is efficient to apply our panel filter as we load the data from the ADAM Parquet files, loading only the people in our three populations. (The bPanel broadcast used here is sketched after the code.)

val popFilteredGts : RDD[Genotype] = sc.loadGenotypes(tmp_path).filter(genotype => {bPanel.value.contains(genotype.getSampleId)})
popFilteredGts.count
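The bPanel broadcast variable is never defined in this excerpt. A minimal sketch of how one might build it from the panel DataFrame; the three population codes (GBR, ASW, CHB) are illustrative placeholders, not taken from this thread:

// Sketch (assumed): collect a sampleId -> population lookup for three
// example populations and broadcast it so each task can filter cheaply.
val populations = Set("GBR", "ASW", "CHB")  // example population codes

val filterPanel: Map[String, String] = panel
  .select("sample", "pop")
  .rdd
  .map(row => row.getString(0) -> row.getString(1))
  .filter { case (_, pop) => populations.contains(pop) }
  .collect()
  .toMap

val bPanel = sc.broadcast(filterPanel)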


#8 · ReneeBK · 2017-5-28 03:24:11
7) Explore the data

We know our data comes from three populations, but let's do some exploratory analysis to see what locations our data covers. In this case, we know our data is only from chromosome 6. The entire chromosome 6 is around 170 million bp (base pairs) long, but our data, for now, is only a subset of that.

// Check our data locations. This is the first time we run an action on the
// popFilteredGts RDD, so it will take some time; cache the RDD first if you
// plan to reuse it.
val startRDD = popFilteredGts.map(genotype => genotype.getVariant.getStart)
val minstart = startRDD.reduce((a, b) => if (a < b) a else b)
val maxstart = startRDD.reduce((a, b) => if (a > b) a else b)

#9 · ReneeBK · 2017-5-28 03:24:56
8) Clean the data (two filters)

Filter 1 – If we are missing some data for a variant, or if the variant is triallelic, we want to remove it from our training data. The variants in the VCF don't come with unique identifiers, so we will construct identifiers we can hash, in order to filter the variants efficiently. (The completeness filter itself, which produces the completeGts RDD used in the next post, is sketched after the code.)

import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro._

// create a unique ID for each variant: a combination of chromosome (contig),
// start position, and end position
def variantId(g: Genotype): String = {
  val name = g.getVariant.getContig.getContigName
  val start = g.getVariant.getStart
  val end = g.getVariant.getEnd
  s"$name:$start:$end"
}
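The filter that produces completeGts is missing from this excerpt. A sketch under the assumption that a genotype is clean when it carries no NoCall (missing) or OtherAlt (multi-allelic) alleles, and that a variant is kept only when every sample has a clean genotype for it; the GenotypeAllele enum names are assumed from the bdg-formats version ADAM 0.19 uses:

// Sketch (assumed): Filter 1 - drop genotypes with missing or multi-allelic
// calls, then keep only variants observed cleanly in every sample.
def missing(g: Genotype): Boolean =
  g.getAlleles.asScala.exists(a => a == GenotypeAllele.NoCall || a == GenotypeAllele.OtherAlt)

val sampleCount = popFilteredGts.map(_.getSampleId).distinct().count()

val completeVariantIds = popFilteredGts
  .filter(g => !missing(g))
  .map(g => (variantId(g).hashCode, 1L))
  .reduceByKey(_ + _)
  .filter { case (_, n) => n == sampleCount }
  .keys
  .collect()
  .toSet

val bCompleteIds = sc.broadcast(completeVariantIds)
val completeGts = popFilteredGts.filter(g => bCompleteIds.value.contains(variantId(g).hashCode))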


#10 · ReneeBK · 2017-5-28 03:25:37
Filter 2 – Filter for variants with allele frequency (here, the total alternate-allele count across samples) > 30

We will represent each genotype for each sample as a double:

0.0 -> no copies of the alternate allele
1.0 -> 1 copy of the alternate allele and 1 copy of the reference allele
2.0 -> 2 copies of the alternate allele

// count the alternate alleles in a genotype
def altAsDouble(g: Genotype): Double = g.getAlleles.asScala.count(_ == GenotypeAllele.Alt)

val varToData = completeGts.map { g => (variantId(g).hashCode, altAsDouble(g)) }
// varToData: org.apache.spark.rdd.RDD[(Int, Double)] = MapPartitionsRDD[2482] at map at <console>:78

val variantFrequencies = varToData.reduceByKey((x, y) => x + y)
// variantFrequencies: org.apache.spark.rdd.RDD[(Int, Double)] = ShuffledRDD[2483] at reduceByKey at <console>:79

// this is a filter of variant ids to keep
val freqFilter = variantFrequencies.filter { case (k, v) => v > 30.0 }.keys
// freqFilter: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2485] at keys at <console>:82

freqFilter.count()
// res11: Long = 805
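The thread ends here, before the k-means step the title promises. A sketch of how one might finish, assuming Spark 1.6 MLlib's KMeans and the bPanel lookup from the earlier sketch: build one alternate-allele dosage vector per sample over the kept variants, then cluster with k = 3 for the three populations. The vector-assembly details are assumptions, not from the original.

// Sketch (assumed): assemble one feature vector per sample and run k-means.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val bKeptVariants = sc.broadcast(freqFilter.collect().toSet)

// (sampleId, Vector): sort by variant id hash so every sample's vector
// lines up on the same variant order
val sampleVectors = completeGts
  .filter(g => bKeptVariants.value.contains(variantId(g).hashCode))
  .map(g => (g.getSampleId.toString, (variantId(g).hashCode, altAsDouble(g))))
  .groupByKey()
  .mapValues(gs => Vectors.dense(gs.toArray.sortBy(_._1).map(_._2)))
  .cache()

val model = KMeans.train(sampleVectors.values, 3, 20)  // k = 3, 20 iterations (assumed)

// compare each sample's predicted cluster with its known population code
sampleVectors.map { case (sample, v) =>
  (sample, bPanel.value(sample), model.predict(v))
}.take(10).foreach(println)

Because Filter 1 kept only variants that are complete across all samples, every vector has the same length and variant ordering; tabulating predicted cluster against population code then gives the prediction error.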

