Original poster: igs816

Learning PySpark by Tomasz Drabas

21
jackbrown (verified trading user) posted on 2017-3-29 22:55:56
I've been using PySpark all along, thanks for sharing.

22
Lisrelchen (non-verified trading user) posted on 2017-3-30 07:45:34
The .filter(...) transformation
Another of the most frequently used transformations is the .filter(...) method,
which lets you select the elements of your dataset that meet specified
criteria. As an example, from the data_from_file_conv dataset, let's
count how many people died in an accident in 2014:

data_filtered = data_from_file_conv.filter(
    lambda row: row[16] == '2014' and row[21] == '0')
data_filtered.count()
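To make the predicate's behaviour concrete, here is a minimal plain-Python sketch of what .filter(...) does to each row, runnable without a Spark installation. The toy rows and the make_row helper are hypothetical; only the column positions 16 and 21 and the compared values '2014' and '0' come from the excerpt above.

```python
# Plain-Python sketch of the .filter(...) semantics; no Spark needed.
# make_row is a hypothetical helper: it builds a 22-column row of strings
# and fills in only the two columns the predicate inspects.
def make_row(col16, col21):
    row = ['x'] * 22
    row[16] = col16  # column compared with '2014' in the excerpt
    row[21] = col21  # column compared with '0' in the excerpt
    return row

rows = [
    make_row('2014', '0'),
    make_row('2014', '1'),
    make_row('2013', '0'),
    make_row('2014', '0'),
]

# Same predicate as in the excerpt; both fields are compared as strings.
keep = lambda row: row[16] == '2014' and row[21] == '0'

# Counting the rows that pass the predicate mirrors rdd.filter(keep).count().
filtered = [row for row in rows if keep(row)]
print(len(filtered))  # 2
```

Note that the excerpt compares against the strings '2014' and '0'. If the fields had been parsed as integers instead, this predicate would silently match nothing, so it is worth checking the element types first.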

23
Lisrelchen (non-verified trading user) posted on 2017-3-30 07:46:28
The .flatMap(...) transformation
The .flatMap(...) method works similarly to .map(...), but it returns
a flattened result: the items of each returned sequence become separate
elements of the output instead of staying together as one list or tuple.
Consider the following code:

data_2014_flat = data_from_file_conv.flatMap(lambda row:
    (row[16], int(row[16]) + 1))
data_2014_flat.take(10)
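The difference from .map(...) is easiest to see side by side. The following is a plain-Python sketch of the two behaviours, not Spark code: the list `years` is a made-up stand-in for the row[16] values, and `pair` is the same function as the lambda in the excerpt, applied to that single field.

```python
from itertools import chain

# Toy stand-in for the row[16] column; the values are made up.
years = ['2014', '2013', '2014']

# Same shape as the lambda in the excerpt: return a 2-tuple per input.
pair = lambda year: (year, int(year) + 1)

# .map(...) keeps one output element per input element: a list of tuples.
mapped = [pair(y) for y in years]

# .flatMap(...) applies the same function, then concatenates the results.
flat_mapped = list(chain.from_iterable(pair(y) for y in years))

print(mapped)       # [('2014', 2015), ('2013', 2014), ('2014', 2015)]
print(flat_mapped)  # ['2014', 2015, '2013', 2014, '2014', 2015]
```

So with .flatMap(...) the output has two elements per input row, and the pairing between a year and its successor is no longer explicit in the structure.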

24
Lisrelchen (non-verified trading user) posted on 2017-3-30 07:47:04
The .distinct(...) transformation
This method returns the distinct values in a dataset; to get the distinct
values of a single column, select that column with .map(...) first. It is
extremely useful if you want to get to know your dataset or validate it.
Let's check whether the gender column contains only males and females; that
would verify that we parsed the dataset properly. Let's run the following
code:

distinct_gender = data_from_file_conv.map(
    lambda row: row[5]).distinct()
distinct_gender.collect()
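As a rough illustration of the validation idea, here is a plain-Python sketch of the same two steps (project onto one column, then deduplicate). The rows, the 'M'/'F' codes, and the '-99' placeholder for a badly parsed record are all assumptions made for this example; the excerpt does not say how the gender column is encoded.

```python
# Toy rows: index 5 plays the role of the gender column; values are made up.
rows = [
    ['a', 'b', 'c', 'd', 'e', 'M'],
    ['a', 'b', 'c', 'd', 'e', 'F'],
    ['a', 'b', 'c', 'd', 'e', 'M'],
    ['a', 'b', 'c', 'd', 'e', '-99'],  # hypothetical marker for a bad record
]

# Step 1, like .map(lambda row: row[5]): keep only the column of interest.
genders = [row[5] for row in rows]

# Step 2, like .distinct(): drop duplicates. A set has no guaranteed order,
# so sort before displaying.
distinct_gender = sorted(set(genders))
print(distinct_gender)  # ['-99', 'F', 'M']

# Validation: anything outside the expected codes points to a parsing problem.
expected = {'M', 'F'}  # assumed encoding
unexpected = set(genders) - expected
print(unexpected)  # {'-99'}
```

Seeing a third value here is exactly the signal the check is meant to produce: either the parser mishandled some rows or the column holds more categories than assumed.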

25
Lisrelchen (non-verified trading user) posted on 2017-3-30 07:47:51
The .sample(...) transformation
The .sample(...) method returns a randomized sample from the
dataset. The first parameter specifies whether the sampling should be
done with replacement, the second parameter defines the fraction of the
data to return, and the third is the seed for the pseudo-random number
generator:

fraction = 0.1
data_sample = data_from_file_conv.sample(False, fraction, 666)
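Two properties of this kind of sampling are worth seeing in action: the fraction is an expected size rather than an exact one, and a fixed seed makes the sample repeatable. The sketch below illustrates both in plain Python, assuming a simple scheme in which each element is kept independently with probability `fraction`. The bernoulli_sample function is a hypothetical helper written for this example, not Spark's implementation, and it will not reproduce the elements Spark would pick for the same seed.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the draw
    return [item for item in items if rng.random() < fraction]

data = list(range(10_000))  # made-up dataset
fraction = 0.1

sample_a = bernoulli_sample(data, fraction, 666)
sample_b = bernoulli_sample(data, fraction, 666)

# The same seed yields the same sample.
print(sample_a == sample_b)

# The size is close to fraction * len(data), but usually not exactly equal.
print(len(sample_a), fraction * len(data))

# Without replacement, no element can appear twice.
print(len(set(sample_a)) == len(sample_a))
```

If an exact number of rows is needed, a fraction-based sample like this is the wrong tool; the count should be checked after sampling rather than assumed.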

26
kile31920 (verified trading user) posted on 2017-3-31 18:17:33
Learning PySpark by Tomasz Drabas

27
sacromento (verified trading user, verified student) posted on 2017-5-24 06:24:09
Thanks for sharing!

28
zeldarxf (non-verified trading user) posted on 2018-2-12 21:23:35

Learning PySpark by Tomasz Drabas

29
bearfighting (non-verified trading user) posted on 2018-4-1 00:53:01
Good book, thanks.

30
yuezzyy (non-verified trading user) posted on 2019-7-5 17:00:40
Let me take a look.
