package thunder.streaming
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import thunder.util.LoadStreaming
import thunder.util.Save
import thunder.util.io.Keys
import org.apache.spark.Logging
import scala.math.sqrt
import cern.colt.matrix.DoubleFactory2D
import cern.colt.matrix.DoubleFactory1D
import cern.colt.matrix.DoubleMatrix2D
import cern.colt.matrix.DoubleMatrix1D
import cern.jet.math.Functions.{plus, minus, bindArg2, pow}
import cern.colt.matrix.linalg.Algebra.DEFAULT.{inverse, mult, transpose}
/** Class for representing parameters and sufficient statistics for a running linear regression model */
class FittedModel(
var count: Double,
var mean: Double,
var sumOfSquaresTotal: Double,
var sumOfSquaresError: Double,
var XX: DoubleMatrix2D,
var Xy: DoubleMatrix1D,
var beta: DoubleMatrix1D) extends Serializable {
def variance = sumOfSquaresTotal / (count - 1)
def std = sqrt(variance)
def R2 = 1 - sumOfSquaresError / sumOfSquaresTotal
def intercept = beta.toArray()(0)
def weights = beta.toArray.drop(1)
}
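// A quick illustration (not part of the original listing) of how the derived
// statistics read out, assuming a fitted intercept of 1.0 and weight of 0.5:
//
// val beta = DoubleFactory1D.dense.make(Array(1.0, 0.5))
// val m = new FittedModel(3.0, 2.0, 8.0, 0.5,
//   DoubleFactory2D.dense.make(2, 2), DoubleFactory1D.dense.make(2), beta)
// m.variance   // 8.0 / (3 - 1) = 4.0
// m.R2         // 1 - 0.5 / 8.0 = 0.9375
// m.intercept  // 1.0
// m.weights    // Array(0.5)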
/**
* Stateful linear regression on streaming data
*
* The underlying model is that every batch of streaming
* data contains a set of records with unique keys,
* one of which holds the feature, and the rest hold labels
* to be predicted from that feature. We update the sufficient
* statistics of the features and of each labeled record
* to compute a running estimate of the linear regression
* model for each key. Returns a state stream of fitted models.
*
* Features and labels from different batches
* can have different lengths.
*
* See also: StreamingLinearRegression
*/
class StatefulLinearRegression (
var featureKeys: Array[Int])
extends Serializable with Logging
{
def this() = this(Array(0))
/** Set which keys correspond to features. Default: Array(0). */
def setFeatureKeys(featureKeys: Array[Int]): StatefulLinearRegression = {
this.featureKeys = featureKeys
this
}
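/**
* Update function for updateStateByKey: folds one batch of label values for a
* single key into its running FittedModel, using the batch's shared feature
* vector. Only the sufficient statistics X'X and X'y are accumulated, so the
* least-squares solution beta = (X'X)^-1 * X'y, along with the running mean,
* SST, and SSE, can be refreshed without revisiting earlier batches.
*/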
val runningLinearRegression = (input: Seq[Array[Double]], state: Option[FittedModel], features: Array[Double]) => {
val y = input.flatten.toArray
val currentCount = y.size
val n = features.size
val updatedState = state.getOrElse(new FittedModel(0.0, 0.0, 0.0, 0.0, DoubleFactory2D.dense.make(2, 2),
DoubleFactory1D.dense.make(2), DoubleFactory1D.dense.make(2)))
// within a batch, the feature vector and each label series must have equal length n
if ((currentCount != 0) && (n != 0)) {
// build the design matrix: a column of 1s for the intercept, then the feature values
val X = DoubleFactory2D.dense.make(n, 2)
for (i <- 0 until n) {
X.set(i, 0, 1)
X.set(i, 1, features(i))
}
// create matrix version of y
val ymat = DoubleFactory1D.dense.make(currentCount)
for (i <- 0 until currentCount) {
ymat.set(i, y(i))
}
// store values from previous iteration (needed for update equations)
val oldCount = updatedState.count
val oldMean = updatedState.mean
val oldXy = updatedState.Xy.copy
val oldXX = updatedState.XX.copy
val oldBeta = updatedState.beta
// compute current estimates of all statistics
val currentMean = y.sum / currentCount
val currentSumOfSquaresTotal = y.map(x => pow(x - currentMean, 2)).sum
val currentXy = mult(transpose(X), ymat)
val currentXX = mult(transpose(X), X)
// accumulate the sufficient statistics X'X and X'y, then refresh beta = (X'X)^-1 * X'y
val newXX = oldXX.copy.assign(currentXX, plus)
val newXy = oldXy.copy.assign(currentXy, plus)
val newBeta = mult(inverse(newXX), newXy)
// terms for the running SSE update:
// SSE_new = SSE_old + ||y - X*newBeta||^2
//         + newBeta'(oldXX)newBeta - oldBeta'(oldXX)oldBeta + 2*(oldBeta - newBeta)'oldXy
// (the correction terms re-evaluate the old data's residuals at the new beta)
val delta = currentMean - oldMean
val term1 = ymat.copy.assign(mult(X, newBeta), minus).assign(bindArg2(pow, 2)).zSum
val term2 = mult(mult(oldXX, newBeta), newBeta)
val term3 = mult(mult(oldXX, oldBeta), oldBeta)
val term4 = 2 * mult(oldBeta.copy.assign(newBeta, minus), oldXy)
// merge the batch statistics into the running model
// (pooled mean, and a pairwise update for the total sum of squares)
updatedState.count += currentCount
updatedState.mean += (delta * currentCount / (oldCount + currentCount))
updatedState.Xy = newXy
updatedState.XX = newXX
updatedState.beta = newBeta
updatedState.sumOfSquaresTotal += currentSumOfSquaresTotal + delta * delta * (oldCount * currentCount) / (oldCount + currentCount)
updatedState.sumOfSquaresError += term1 + term2 - term3 + term4
}
Some(updatedState)
}
def runStreaming(data: DStream[(Int, Array[Double])]): DStream[(Int, FittedModel)] = {
// the latest batch's feature vector, captured on the driver by the
// foreachRDD below and shipped with the update closure on each batch
var features = Array[Double]()
data.filter{case (k, _) => featureKeys.contains(k)}.foreachRDD{rdd =>
features = rdd.values.collect().flatten
}
data.filter{case (k, _) => !featureKeys.contains(k)}.updateStateByKey{
(x, y) => runningLinearRegression(x, y, features)}
}
}
/**
* Top-level methods for calling Stateful Linear Regression.
*/
object StatefulLinearRegression {
/**
* Train a Stateful Linear Regression model.
* We assume that in each batch of streaming data we receive
* one or more features and several vectors of labels, each
* with a unique key, and a subset of keys indicate the features.
* We fit separate linear models that relate the common features
* to the labels associated with each key.
*
* @param input DStream of (Int, Array[Double]) keyed data points
* @param featureKeys Array of keys associated with features
* @return DStream of (Int, FittedModel) with fitted regression models
*/
def trainStreaming(input: DStream[(Int, Array[Double])],
featureKeys: Array[Int]): DStream[(Int, FittedModel)] =
{
new StatefulLinearRegression().setFeatureKeys(featureKeys).runStreaming(input)
}
def main(args: Array[String]) {
if (args.length != 6) {
System.err.println(
"Usage: StatefulLinearRegression <master> <directory> <batchTime> <outputDirectory> <dims> <featureKeys>")
System.exit(1)
}
val (master, directory, batchTime, outputDirectory, dims, features) = (
args(0), args(1), args(2).toLong, args(3),
args(4).drop(1).dropRight(1).split(",").map(_.trim.toInt),
Array(args(5).drop(1).dropRight(1).split(",").map(_.trim.toInt)))
val conf = new SparkConf().setMaster(master).setAppName("StatefulLinearRegression")
if (!master.contains("local")) {
conf.setSparkHome(System.getenv("SPARK_HOME"))
.setJars(List("target/scala-2.10/thunder_2.10-0.1.0.jar"))
.set("spark.executor.memory", "100G")
.set("spark.default.parallelism", "100")
}
// Get feature keys with linear indexing
val featureKeys = Keys.subToInd(features, dims)
// Create Streaming Context
val ssc = new StreamingContext(conf, Seconds(batchTime))
ssc.checkpoint(System.getenv("CHECKPOINT"))
// Load streaming data
val data = LoadStreaming.fromTextWithKeys(ssc, directory, dims.size, dims)
// Train Linear Regression models
val state = StatefulLinearRegression.trainStreaming(data, featureKeys)
// Print output (for testing)
state.mapValues(x => "\n" + "mean: " + "%.5f".format(x.mean) +
"\n" + "count: " + "%.5f".format(x.count) +
"\n" + "variance: " + "%.5f".format(x.variance) +
"\n" + "beta: " + x.beta.toArray.mkString(",") +
"\n" + "R2: " + "%.5f".format(x.R2) +
"\n" + "SSE: " + "%.5f".format(x.sumOfSquaresError) +
"\n" + "SST: " + "%.5f".format(x.sumOfSquaresTotal) +
"\n" + "Xy: " + x.Xy.toArray.mkString(",")).print()
// Save output (for production)
//val out = state.mapValues(x => Array(x.R2))
//Save.saveStreamingDataAsText(out, outputDirectory, Seq("r2"))
ssc.start()
ssc.awaitTermination()
}
}
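
For a quick local sanity check, trainStreaming can be exercised with synthetic batches fed through a queue stream. The sketch below is illustrative and not part of Thunder: the object name, checkpoint path, and two-key layout (key 0 carries the feature; key 1 a label series following y = 1 + 2x) are all assumptions. With this exact data the printed output should show beta approaching (1.0, 2.0) and R2 approaching 1.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import thunder.streaming.StatefulLinearRegression

// Hypothetical smoke test, not part of the Thunder codebase
object StatefulLinearRegressionExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulLinearRegressionExample")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("checkpoint") // required by updateStateByKey
    // two synthetic batches; labels follow y = 1 + 2 * x exactly
    val queue = new mutable.Queue[RDD[(Int, Array[Double])]]()
    queue += ssc.sparkContext.parallelize(Seq(
      (0, Array(1.0, 2.0, 3.0)), (1, Array(3.0, 5.0, 7.0))))
    queue += ssc.sparkContext.parallelize(Seq(
      (0, Array(4.0, 5.0, 6.0)), (1, Array(9.0, 11.0, 13.0))))
    val models = StatefulLinearRegression.trainStreaming(
      ssc.queueStream(queue, oneAtATime = true), Array(0))
    models.mapValues(m => "beta: " + m.beta.toArray.mkString(",") +
      " R2: " + "%.3f".format(m.R2)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}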