Digit Classification using a Convolutional Neural Network
In [3]:
# Create a LeNet model
def build_model(class_num):
    model = Sequential()
    # Reshape the flattened 784-pixel input into a 1 x 28 x 28 image
    model.add(Reshape([1, 28, 28]))
    model.add(SpatialConvolution(1, 6, 5, 5).set_name('conv1'))
    model.add(Tanh())
    model.add(SpatialMaxPooling(2, 2, 2, 2).set_name('pool1'))
    model.add(Tanh())
    model.add(SpatialConvolution(6, 12, 5, 5).set_name('conv2'))
    model.add(SpatialMaxPooling(2, 2, 2, 2).set_name('pool2'))
    # Flatten the 12 feature maps of size 4 x 4 before the fully connected layers
    model.add(Reshape([12 * 4 * 4]))
    model.add(Linear(12 * 4 * 4, 100).set_name('fc1'))
    model.add(Tanh())
    model.add(Linear(100, class_num).set_name('score'))
    model.add(LogSoftMax())
    return model

lenet_model = build_model(10)
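The Reshape([12 * 4 * 4]) size follows directly from the layer arithmetic: each 5 x 5 convolution (stride 1, no padding) shrinks the spatial size by 4 and each 2 x 2 max pooling halves it, so 28 -> 24 -> 12 -> 8 -> 4. A quick, library-free check of that arithmetic (plain Python, not part of the original notebook):

# Spatial-size bookkeeping for the model above (pure arithmetic, no BigDL calls)
size = 28
size = size - 4   # conv1 (5x5, stride 1, no padding): 28 -> 24
size = size // 2  # pool1 (2x2, stride 2):             24 -> 12
size = size - 4   # conv2 (5x5, stride 1, no padding): 12 -> 8
size = size // 2  # pool2 (2x2, stride 2):              8 -> 4
print "final spatial size: %d, flattened length: %d" % (size, 12 * size * size)  # 4, 192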
In [4]:
# Create an Optimizer
optimizer = Optimizer(
    model=lenet_model,
    training_rdd=train_data,
    criterion=ClassNLLCriterion(),
    optim_method=SGD(learningrate=0.4, learningrate_decay=0.0002),
    end_trigger=MaxEpoch(20),
    batch_size=2048)

# Set the validation logic
optimizer.set_validation(
    batch_size=2048,
    val_rdd=test_data,
    trigger=EveryEpoch(),
    val_method=[Top1Accuracy()]
)

app_name = 'lenet-' + dt.datetime.now().strftime("%Y%m%d-%H%M%S")
train_summary = TrainSummary(log_dir='/tmp/bigdl_summaries',
                             app_name=app_name)
train_summary.set_summary_trigger("Parameters", SeveralIteration(50))
val_summary = ValidationSummary(log_dir='/tmp/bigdl_summaries',
                                app_name=app_name)
optimizer.set_train_summary(train_summary)
optimizer.set_val_summary(val_summary)
print "saving logs to", app_name
In [5]:
%%time
# Start the training process
trained_model = optimizer.optimize()
print "Optimization Done."
In [6]:
# Map a model output vector to its predicted digit (index of the largest activation)
def map_predict_label(l):
    return np.array(l).argmax()

# BigDL labels are 1-based, so subtract 1 to recover the original digit
def map_groundtruth_label(l):
    return l[0] - 1
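To make the two helpers concrete, here is a tiny check on dummy values (the numbers below are made up for illustration and are not taken from the dataset):

# An output vector whose largest activation sits at index 3 maps to digit 3,
# and a stored 1-based label of 8.0 maps back to the original digit 7.0
print map_predict_label([0.01, 0.02, 0.10, 0.80, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01])  # 3
print map_groundtruth_label([8.0])  # 7.0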
In [7]:
# Subtract 1 from each stored label to restore the original digit.
print "Ground Truth labels:"
print ', '.join([str(map_groundtruth_label(s.label)) for s in train_data.take(8)])
imshow(np.column_stack([np.array(s.features).reshape(28, 28) for s in train_data.take(8)]), cmap='gray'); axis('off')
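The trained model's predictions can be compared against the labels shown above. This is a minimal sketch assuming BigDL's Model.predict, which returns an RDD of per-sample output activations in the same order as the input samples; map_predict_label then turns each log-softmax output into a digit:

# Predicted digits for the first 8 training images shown above
# (assumes trained_model.predict preserves the order of train_data)
predictions = trained_model.predict(train_data)
print "Predicted labels:"
print ', '.join(str(map_predict_label(p)) for p in predictions.take(8))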
In [9]:
params = trained_model.parameters()
# weight shape axes: batch num, output_dim, input_dim, spatial dims
for layer_name, param in params.iteritems():
    print layer_name, param['weight'].shape, param['bias'].shape
In [10]:
# vis_square is borrowed from the Caffe examples
def vis_square(data):
    """Take an array of shape (n, height, width) or (n, height, width, 3)
    and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)"""
    # normalize data for display
    data = (data - data.min()) / (data.max() - data.min())

    # force the number of filters to be square
    n = int(np.ceil(np.sqrt(data.shape[0])))
    padding = (((0, n ** 2 - data.shape[0]),
                (0, 1), (0, 1))                 # add some space between filters
               + ((0, 0),) * (data.ndim - 3))   # don't pad the last dimension (if there is one)
    data = np.pad(data, padding, mode='constant', constant_values=1)  # pad with ones (white)

    # tile the filters into an image
    data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])

    plt.imshow(data, cmap='gray'); plt.axis('off')
In [11]:
filters_conv1 = params['conv1']['weight']
filters_conv1[0,0,0]
vis_square(np.squeeze(filters_conv1, axis=(0,)).reshape(1*6, 5, 5))
In [12]:
# each layer's parameters are returned as a dict holding its 'weight' and 'bias' arrays
filters_conv2 = params['conv2']['weight']
vis_square(np.squeeze(filters_conv2, axis=(0,)).reshape(12*6, 5, 5))

In [13]:
loss = np.array(train_summary.read_scalar("Loss"))
top1 = np.array(val_summary.read_scalar("Top1Accuracy"))
plt.figure(figsize = (12,12))
plt.subplot(2,1,1)
plt.plot(loss[:,0],loss[:,1],label='loss')
plt.xlim(0,loss.shape[0]+10)
plt.grid(True)
plt.title("loss")
plt.subplot(2,1,2)
plt.plot(top1[:,0],top1[:,1],label='top1')
plt.xlim(0,loss.shape[0]+10)
plt.title("top1 accuracy")
plt.grid(True)
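Beyond the loss and Top-1 accuracy curves, the same summary files can be mined for other scalars. The snippet below assumes the train summary also records a "Throughput" tag (which BigDL's TrainSummary typically logs alongside "Loss"); treat it as a sketch rather than a guaranteed part of the API.

# Plot training throughput over iterations (assumes a "Throughput" scalar exists;
# as with "Loss" above, column 0 is the iteration and column 1 the recorded value)
throughput = np.array(train_summary.read_scalar("Throughput"))
plt.figure(figsize=(12, 4))
plt.plot(throughput[:, 0], throughput[:, 1], label='throughput')
plt.xlabel("iteration"); plt.ylabel("throughput")
plt.title("training throughput")
plt.grid(True)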