Over the last few years, we have seen a rapid reduction in the cost and time of genome sequencing. Understanding the variations between genome sequences has the potential to help us identify people predisposed to common diseases, solve rare diseases, and enable clinicians to personalize prescriptions and dosages to the individual.
In this three-part blog, we will provide a primer on genome sequencing and its potential. We will focus on genome variant analysis – that is, the differences between genome sequences – and how it can be accelerated with Apache Spark and ADAM (a scalable API and CLI for genome processing) on Databricks Community Edition. Finally, we will execute a k-means clustering algorithm on genomic variant data and build a model that predicts an individual's geographic population of origin based on those variants.
This post will focus on predicting geographic population using genome variants and k-means. You can also review the refresher, Genome Sequencing in a Nutshell, or the details behind Parallelizing Genome Variant Analysis.
Before you start, install and attach the following libraries to your cluster:
The org.bdgenomics.utils.misc Hadoop library (utils-misc_2.10-0.2.4.jar). For this exact version, search for org.bdgenomics.utils:utils-misc_2.10:0.2.4.
The org.bdgenomics.adam.core library (adam-core_2.10-0.19.0.jar). For this exact version, search for org.bdgenomics.adam:adam-core_2.10:0.19.0.
The lightning-viz Python client, available here: https://pypi.python.org/pypi/lightning-python.
Note: the datasets for this notebook do not need to be downloaded, as they are already available on DBFS at the file paths below. The sources of these files can be found in the bullet points below.
The original source for the VCF data is the 1000 Genomes Project. All of the data for the 1000 Genomes Project is also publicly available via AWS S3.
The abbreviated sample of chromosome 6 data used in this analysis can be downloaded (full path on S3). You can also create your own subsample using tabix. We are using a subsampled VCF for tutorial purposes; ultimately, one could run the same code on the whole-chromosome VCF and filter the ADAM records by position.
The entire chromosome 6 dataset that was sampled from can be found on FTP or S3.
The panel file is available via FTP and S3.
>
// Set file locations
val vcf_path = "/databricks-datasets/samples/adam/samples/6-sample.vcf"
val tmp_path = "/tmp/adam/6-sample.adam"
val panel_path = "/databricks-datasets/samples/adam/panel/integrated_call_samples_v3.20130502.ALL.panel"
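The genotypes will later be read back from tmp_path as ADAM Parquet files, so the VCF first needs to be converted into ADAM's Parquet format. Below is a minimal sketch of that conversion, assuming the ADAM 0.19 API (loadGenotypes returning an RDD[Genotype], and the adamParquetSave method supplied by ADAMContext's implicits); verify the method names against the version attached to your cluster.
>
// A sketch of the VCF -> ADAM Parquet conversion (API names per ADAM 0.19;
// verify against the library version attached to your cluster)
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Genotype

val gts: RDD[Genotype] = sc.loadGenotypes(vcf_path)
gts.adamParquetSave(tmp_path)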
VCF data contains sample IDs, but not population codes. Although we are using an unsupervised algorithm in this analysis, we still need the response variable in order to filter our samples and to estimate our prediction error. We can get the population code for each sample from the panel file.
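The panel-loading step itself isn't shown above, so here is a minimal sketch. It assumes the standard tab-separated 1000 Genomes panel layout (sample ID in the first column, population code in the second) and uses three illustrative population codes; bPanel is the broadcast map that the loading step below relies on.
>
// A sketch (not the original notebook code): build a sampleID -> population
// map from the panel file. The three population codes are illustrative;
// substitute the ones you are modeling.
val populations = Set("GBR", "ASW", "CHB")

val panel: Map[String, String] = sc.textFile(panel_path)
  .map(line => {
    val tokens = line.split("\t")
    (tokens(0), tokens(1)) // (sampleID, population code)
  })
  .filter { case (_, pop) => populations.contains(pop) } // also drops the header row
  .collect()
  .toMap

// Broadcast the small map so every executor can filter genotypes locally
val bPanel = sc.broadcast(panel)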
6) Read some of the ADAM data into RDDs to begin parallel processing of genotypes.
Parquet files enable predicate pushdown, so it will be efficient to apply our panel filter when we load the data from the ADAM Parquet files, loading only the data for the people in our 3 populations.
>
// Keep only genotypes whose sample ID appears in our panel, and cache the
// RDD since we will run multiple actions over it below
val popFilteredGts: RDD[Genotype] = sc.loadGenotypes(tmp_path)
  .filter(genotype => bPanel.value.contains(genotype.getSampleId))
  .cache()
We know our data comes from 3 populations, but let's do some exploratory analysis to see which genomic locations our data covers. In this case, we know our data is only from chromosome 6. The entire chromosome 6 is around 170 million bp (base pairs) long, but our data, for now, is only a subset of that.
>
// Check the range of positions in our data. This is the first time we run an
// action on the popFilteredGts RDD, so it will take some time, but the RDD
// will then be cached in memory for subsequent actions.
val startRDD = popFilteredGts.map(genotype => genotype.getVariant.getStart)
val minstart = startRDD.reduce((a, b) => if (a < b) a else b)
val maxstart = startRDD.reduce((a, b) => if (a > b) a else b)
Filter 1 -- If we are missing data for a variant, or if the variant is triallelic, we want to remove it from our training data.
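As a hedged sketch, these two conditions can be expressed as predicates over a genotype's alleles. The GenotypeAllele constant names below follow the bdg-formats release bundled with ADAM 0.19; verify them against your version.
>
// Sketch of the Filter 1 predicates (GenotypeAllele constants per the
// bdg-formats release used by ADAM 0.19; verify against your version)
import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro.{Genotype, GenotypeAllele}

// A no-call allele means the genotype is missing data at that site
def hasMissingData(g: Genotype): Boolean =
  g.getAlleles.asScala.contains(GenotypeAllele.NoCall)

// An allele that is neither the reference nor the primary alternate
// indicates a multi-allelic (e.g. triallelic) site
def isMultiallelic(g: Genotype): Boolean =
  g.getAlleles.asScala.contains(GenotypeAllele.OtherAlt)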
The variants in the VCF don't come with unique identifiers, so we will construct IDs that we can hash in order to filter the variants efficiently.
>
import scala.collection.JavaConverters._
import org.bdgenomics.formats.avro._

// Create a unique ID for each variant: a combination of chromosome, start, and end position
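// A minimal sketch of one such ID (accessor names follow the bdg-formats Avro
// schema shipped with ADAM 0.19): a "contig:start:end" string we can hash
def variantId(g: Genotype): String = {
  val name = g.getVariant.getContig.getContigName
  val start = g.getVariant.getStart
  val end = g.getVariant.getEnd
  s"$name:$start:$end"
}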