Over the last few years, we have seen a rapid reduction in costs and time of genome sequencing.The potential of understanding the variations in genome sequences range from assisting us in identifying people who are predisposed to common diseases, solving rare diseases, and enabling clinicians to personalize prescription and dosage to the individual.
In this three-part blog, we will provide a primer of genome sequencing and its potential.We will focus on genome variant analysis – that is the differences between genome sequences – and how it can be accelerated by making use of Apache Spark and ADAM (a scalable API and CLI for genome processing) using Databricks Community Edition.Finally, we will execute a k-means clustering algorithm on genomic variant data and build a model that will predict the individual’s geographic population of originbased on those variants.
This post will focus on predicting geographic population using genome variants and k-means.You can also review the refresher Genome Sequencing in a Nutshell or more details behind Parallelizing Genome Variant Analysis.