Original poster: ReneeBK

【Use Case】Predicting Geographic Population using Genome Variants and K-Means

11
ReneeBK posted on 2017-5-28 03:27:36
9) Prepare data for K-means clustering
>

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector=>MLVector, Vectors}

// key each genotype by sample ID, pairing the hash of its variant ID with its altAsDouble value
val sampleToData: RDD[(String, (Int, Double))] = finalGts.map { g => (g.getSampleId.toString, (variantId(g).hashCode, altAsDouble(g))) }

// group our data by sample
val groupedSampleToData = sampleToData.groupByKey

// make an MLVector for each sample, which contains the variants in the exact same order
def makeSortedVector(g: Iterable[(Int,Double)]): MLVector = Vectors.dense( g.toArray.sortBy(_._1).map(_._2) )

val dataPerSampleId:RDD[(String, MLVector)] =
    groupedSampleToData.mapValues { it =>
        makeSortedVector(it)
    }
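
The grouping and sorting in step 9 can be illustrated without Spark. The following is a minimal sketch, not the notebook's code: `build_sample_vectors` and the toy records are hypothetical, and it assumes every sample carries a value for every variant. It groups (sample, (variant key, value)) pairs by sample, sorts each group by variant key, and keeps only the values, so that position i refers to the same variant in every vector.

```python
from collections import defaultdict

def build_sample_vectors(records):
    """Group (sample_id, (variant_key, value)) records into one vector per sample.

    Hypothetical helper for illustration; it mirrors groupByKey followed by
    sortBy(_._1).map(_._2) in the Scala code.
    """
    grouped = defaultdict(list)
    for sample_id, (variant_key, value) in records:
        grouped[sample_id].append((variant_key, value))
    # Sorting by variant key puts the variants in the same order for every sample.
    return {
        sample_id: [value for _, value in sorted(pairs, key=lambda p: p[0])]
        for sample_id, pairs in grouped.items()
    }

# Toy data (made up): two samples, three variants keyed by an integer hash.
records = [
    ("S1", (30, 1.0)), ("S1", (10, 0.0)), ("S1", (20, 2.0)),
    ("S2", (20, 0.0)), ("S2", (30, 1.0)), ("S2", (10, 1.0)),
]
vectors = build_sample_vectors(records)
print(vectors)
```

Because the sort key is a hash of the variant ID, two distinct variants could in principle collide on the same key; this sketch, like the Scala code, does not guard against that.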

12
ReneeBK posted on 2017-5-28 03:28:10
10) Run K-means clustering to build model
>

import org.apache.spark.mllib.clustering.{KMeans,KMeansModel}

// features is not defined in this post; assumed here to be the per-sample vectors built in step 9
val features: RDD[MLVector] = dataPerSampleId.values

// Cluster the data into three classes using KMeans
val numClusters = 3
val numIterations = 20
val clusters:KMeansModel = KMeans.train(features, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(features)
println(s"Compute Cost: ${WSSSE}")
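
The number reported by computeCost is the sum, over all points, of the squared distance to the nearest cluster center. As a rough illustration of that definition only, here is a minimal pure-Python sketch; the function names `squared_distance` and `wssse` and the toy points and centers are assumptions for the example, not part of the notebook or of Spark.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wssse(points, centers):
    """Within Set Sum of Squared Errors: each point is charged its
    squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Toy data (made up): two tight pairs of points and their two centers.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers = [[0.0, 1.0], [10.0, 11.0]]
cost = wssse(points, centers)
print(cost)  # 4.0, since every point lies at distance 1 from its center
```

The value is mainly useful for comparison, for example between runs with different seeds or different numbers of clusters; on its own it says nothing about whether the clusters match the populations.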

13
ReneeBK posted on 2017-5-28 03:28:26
11) Predict populations, compute the confusion matrix.
>

// Create predictionRDD that utilizes clusters.predict method to output the model's predictions
val predictionRDD: RDD[(String, Int)] = dataPerSampleId.map { case (sampleId, vector) =>
    (sampleId, clusters.predict(vector))
}

// Convert to DataFrame to more easily query the data
val predictDF = predictionRDD.toDF("sample","prediction")
predictionRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[2546] at map at <console>:105
predictDF: org.apache.spark.sql.DataFrame = [sample: string, prediction: int]
>

// Join back to the filterPanel to get the original label
val resultsDF = filterPanel.join(predictDF, "sample")
display(resultsDF)
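
The step title mentions a confusion matrix, while the code above stops at the join. One way to tabulate it, sketched here in plain Python under assumptions: the rows are (popcode, prediction) pairs such as could be collected from resultsDF, the column name popcode is taken from the query in step 12, and the function name `confusion_matrix` and the sample rows are made up for illustration.

```python
from collections import Counter

def confusion_matrix(rows):
    """Count how often each (true label, predicted cluster) pair occurs.

    Hypothetical helper: rows is an iterable of (label, cluster) tuples.
    """
    return Counter((label, cluster) for label, cluster in rows)

# Made-up (popcode, prediction) pairs standing in for collected result rows.
rows = [("GBR", 0), ("GBR", 0), ("ASW", 1), ("ASW", 1), ("CHB", 2), ("CHB", 0)]
matrix = confusion_matrix(rows)

labels = sorted({label for label, _ in rows})
clusters = sorted({cluster for _, cluster in rows})
for label in labels:
    # One row per population, one column per cluster ID.
    print(label, [matrix[(label, c)] for c in clusters])
```

K-means cluster IDs are arbitrary, so the table is read by checking which cluster dominates each population's row rather than by looking at the diagonal.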

14
ReneeBK posted on 2017-5-28 03:32:00
12) Visualize the clusters with a force graph on lightning-viz.
Create the input data vectors for lightning-viz:
  - a list of the group memberships (the populations)
  - a list of the people (the sample IDs)
  - a nested list of the graph links, with each person linked to their predicted cluster
>

%python

# prepare our data into a suitable format for our viz

# row_number is the current name of rowNumber, which has been deprecated since Spark 1.6
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# ensure that our data arrays come out in the same order

df = sqlContext.sql("select sample, popcode, prediction from final_results_table").coalesce(1)
# order by sample so the row numbers are deterministic (the original post used an empty orderBy())
w = Window.orderBy("sample")
df = df.withColumn("rownumber", row_number().over(w))

pop = df.select("popcode", "rownumber").collect()
peeps = df.select("sample", "rownumber").collect()
preds = df.select("prediction", "rownumber").collect()

pop = [(str(x), str(y)) for (x, y) in pop]
peeps = [(str(x), str(y)) for (x, y) in peeps]
# each prediction becomes (cluster, rownumber, link weight of 1)
preds = [(x, str(y), 1) for (x, y) in preds]


# sort all three lists by the same row number key so they stay aligned
def getKey(item):
  return item[1]

g = sorted(pop, key=getKey)
l = sorted(peeps, key=getKey)
pr = sorted(preds, key=getKey)

# add 3 points for our force graph centers

groups = ["0","1","2"] + [x[0] for x in g]
labels = ["0","1","2"] + [x[0] for x in l]
predictions = [[x[0], x[2]] for x in pr]

# prepend each person's node index; people start at 3, after the three cluster centers
for i, sublist in enumerate(predictions, start=3):
  sublist.insert(0, i)
>

%python

# create the viz

from lightning import Lightning

lgn = Lightning(host='http://public.lightning-viz.org')
lgn.create_session("new")
viz = lgn.force(predictions, group=groups, labels=labels)
viz.get_public_link()
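
The index bookkeeping in step 12 is easier to follow without the Spark plumbing. The sketch below is an illustration under assumptions, not the notebook's code: `build_force_inputs` is a hypothetical helper, the rows are made up, and the rows are assumed to be already collected in one consistent order. Nodes 0 to 2 are the cluster centers, people start at node 3, and each link is [person index, predicted cluster, weight].

```python
def build_force_inputs(rows, num_clusters=3):
    """Build the groups, labels and links lists for a force graph.

    Hypothetical helper: rows is a list of (sample, popcode, prediction)
    tuples in a single, consistent order.
    """
    # The first num_clusters nodes are the cluster centers themselves.
    centers = [str(c) for c in range(num_clusters)]
    groups = centers + [popcode for _, popcode, _ in rows]
    labels = centers + [sample for sample, _, _ in rows]
    # Person i sits at node index num_clusters + i and links to its cluster node.
    links = [
        [num_clusters + i, prediction, 1]
        for i, (_, _, prediction) in enumerate(rows)
    ]
    return groups, labels, links

# Made-up rows standing in for the collected query result.
rows = [("sampleA", "GBR", 0), ("sampleB", "CHB", 2)]
groups, labels, links = build_force_inputs(rows)
print(groups)
print(labels)
print(links)
```

Building all three lists from one collected result in a single pass avoids the need to re-align separately collected columns by row number.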

15
钱学森64 posted on 2017-5-28 10:58:55
Thanks for sharing.
