Over the last few years, we have seen a rapid reduction in the cost and time of genome sequencing. Understanding the variations between genome sequences has the potential to help us identify people predisposed to common diseases, solve rare diseases, and enable clinicians to personalize prescriptions and dosages to the individual.
In this three-part blog, we will provide a primer on genome sequencing and its potential. We will focus on genome variant analysis – that is, the differences between genome sequences – and how it can be accelerated with Apache Spark and ADAM (a scalable API and CLI for genome processing) on Databricks Community Edition. Finally, we will execute a k-means clustering algorithm on genomic variant data and build a model that predicts an individual's geographic population of origin based on those variants.
This post will focus on predicting geographic population using genome variants and k-means. You can also review the refresher, Genome Sequencing in a Nutshell, or the details behind Parallelizing Genome Variant Analysis.
Before you start, install and attach the following libraries to your cluster:
The org.bdgenomics.utils.misc Hadoop library (utils-misc_2.10-0.2.4.jar). For this exact version, search for org.bdgenomics.utils:utils-misc_2.10:0.2.4.
The org.bdgenomics.adam.core library (adam-core_2.10-0.19.0.jar). For this exact version, search for org.bdgenomics.adam:adam-core_2.10:0.19.0.
The lightning-viz Python client, available here: https://pypi.python.org/pypi/lightning-python.
Note: the datasets for this notebook do not need to be downloaded, as they are already available on DBFS at the file paths below. The sources of these files can be found in the bullet points below.
The original source for the VCF data is the 1000 Genomes Project. All of the data for the 1000 Genomes Project is also publicly available via AWS S3.
The abbreviated sample of chromosome 6 data used in this analysis can be downloaded (full path on S3). You can also create your own subsample using tabix. We are using a subsampled VCF for tutorial purposes; ultimately, one could run the same code on the whole-chromosome VCF and filter the ADAM records by position.
The entire chromosome 6 dataset that was sampled from can be found on FTP or S3.
The panel file is available via FTP and S3.
>
// Set file locations
val vcf_path = "/databricks-datasets/samples/adam/samples/6-sample.vcf"
val tmp_path = "/tmp/adam/6-sample.adam"
val panel_path = "/databricks-datasets/samples/adam/panel/integrated_call_samples_v3.20130502.ALL.panel"
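The genotypes will later be read back from tmp_path as ADAM Parquet files, so the VCF first needs to be converted into ADAM's Parquet format. Below is a minimal sketch of that conversion, assuming the ADAM 0.19 API (loadGenotypes returning an RDD[Genotype], and the adamParquetSave method supplied by ADAMContext's implicits); verify the method names against the version attached to your cluster.
>
// A sketch of the VCF -> ADAM Parquet conversion (API names per ADAM 0.19;
// verify against the library version attached to your cluster)
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Genotype

val gts: RDD[Genotype] = sc.loadGenotypes(vcf_path)
gts.adamParquetSave(tmp_path)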
VCF data contains sample IDs, but not population codes. Although we are using an unsupervised algorithm in this analysis, we still need the response variable in order to filter our samples and to estimate our prediction error. We can get the population code for each sample from the panel file.
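The panel-loading step itself isn't shown above, so here is a minimal sketch. It assumes the standard tab-separated 1000 Genomes panel layout (sample ID in the first column, population code in the second) and uses three illustrative population codes; bPanel is the broadcast map that the loading step below relies on.
>
// A sketch (not the original notebook code): build a sampleID -> population
// map from the panel file. The three population codes are illustrative;
// substitute the ones you are modeling.
val populations = Set("GBR", "ASW", "CHB")

val panel: Map[String, String] = sc.textFile(panel_path)
  .map(line => {
    val tokens = line.split("\t")
    (tokens(0), tokens(1)) // (sampleID, population code)
  })
  .filter { case (_, pop) => populations.contains(pop) } // also drops the header row
  .collect()
  .toMap

// Broadcast the small map so every executor can filter genotypes locally
val bPanel = sc.broadcast(panel)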
6) Read some of the ADAM data into RDDs to begin parallel processing of genotypes.
Parquet files enable predicate pushdown, so it will be efficient to apply our panel filter when we load the data from the ADAM Parquet files, loading only the data for the people in our 3 populations.
>
// Keep only genotypes whose sample ID appears in our panel, and cache the
// RDD since we will run multiple actions over it below
val popFilteredGts: RDD[Genotype] = sc.loadGenotypes(tmp_path)
  .filter(genotype => bPanel.value.contains(genotype.getSampleId))
  .cache()
We know our data comes from 3 populations, but let's do some exploratory analysis to see which genomic locations our data covers. In this case, we know our data is only from chromosome 6. The entire chromosome 6 is around 170 million bp (base pairs) long, but our data, for now, is only a subset of that.
>
// Check the range of positions in our data. This is the first time we run an
// action on the popFilteredGts RDD, so it will take some time, but the RDD
// will then be cached in memory for subsequent actions.
val startRDD = popFilteredGts.map(genotype => genotype.getVariant.getStart)
val minstart = startRDD.reduce((a, b) => if (a < b) a else b)
val maxstart = startRDD.reduce((a, b) => if (a > b) a else b)
Filter 1 -- If we are missing data for a variant, or if the variant is triallelic, we want to remove it from our training data.
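As a hedged sketch, these two conditions can be expressed as predicates over a genotype's alleles. The GenotypeAllele constant names below follow the bdg-formats release bundled with ADAM 0.19; verify them against your version.
>
// Sketch of the Filter 1 predicates (GenotypeAllele constants per the
// bdg-formats release used by ADAM 0.19; verify against your version)
import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro.{Genotype, GenotypeAllele}

// A no-call allele means the genotype is missing data at that site
def hasMissingData(g: Genotype): Boolean =
  g.getAlleles.asScala.contains(GenotypeAllele.NoCall)

// An allele that is neither the reference nor the primary alternate
// indicates a multi-allelic (e.g. triallelic) site
def isMultiallelic(g: Genotype): Boolean =
  g.getAlleles.asScala.contains(GenotypeAllele.OtherAlt)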
The variants in the VCF don't come with unique identifiers, so we will construct IDs that we can hash in order to filter the variants efficiently.
>
import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro._

// Create a unique ID for each variant: a combination of chromosome, start, and end position
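// A minimal sketch of one such ID (accessor names follow the bdg-formats Avro
// schema shipped with ADAM 0.19): a "contig:start:end" string we can hash
def variantId(g: Genotype): String = {
  val name = g.getVariant.getContig.getContigName
  val start = g.getVariant.getStart
  val end = g.getVariant.getEnd
  s"$name:$start:$end"
}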