Original poster: igs816

Learning PySpark by Tomasz Drabas

21
jackbrown posted on 2017-3-29 22:55:56
I've been using PySpark all along; thanks for sharing.

22
Lisrelchen posted on 2017-3-30 07:45:34
The .filter(...) transformation

Another frequently used transformation is the .filter(...) method, which lets you select the elements of your dataset that meet specified criteria. As an example, from the data_from_file_conv dataset, let's count how many people died in an accident in 2014:

    # Keep only rows whose field 16 equals '2014' and field 21 equals '0'
    data_filtered = data_from_file_conv.filter(
        lambda row: row[16] == '2014' and row[21] == '0')

    # Action: count the rows that passed the filter
    data_filtered.count()
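For anyone who wants to see what the predicate does without starting Spark, here is a minimal plain-Python sketch. It is only an analogue of .filter(...) followed by .count(), not PySpark itself, and the rows and the make_row helper are made up for illustration; only the field positions 16 and 21 are taken from the post.

```python
# Plain-Python analogue of RDD.filter(...) followed by .count().
# The rows are made up; only indices 16 and 21 mirror the post.

def make_row(year, flag):
    """Build a hypothetical 22-field row with a year at index 16 and a flag at index 21."""
    row = [''] * 22
    row[16] = year
    row[21] = flag
    return row

rows = [
    make_row('2014', '0'),
    make_row('2014', '1'),
    make_row('2013', '0'),
    make_row('2014', '0'),
]

# Same predicate as in the post: a row is kept only if both conditions hold.
filtered = [row for row in rows if row[16] == '2014' and row[21] == '0']

# len(...) plays the role of data_filtered.count()
matching_count = len(filtered)
print(matching_count)
```

Note that both comparisons are against strings, so a row holding the integer 2014 instead of '2014' would be dropped silently.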


23
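Lisrelchen posted on 2017-3-30 07:46:28
The .flatMap(...) transformation

The .flatMap(...) method works similarly to .map(...), but it flattens the result: every item returned by the function becomes its own element of the output, instead of each row producing one list or tuple. For example:

    # Each row yields two elements: the value in field 16 and that value plus one
    data_2014_flat = data_from_file_conv.flatMap(
        lambda row: (row[16], int(row[16]) + 1))

    # Action: return the first 10 elements of the flattened result
    data_2014_flat.take(10)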
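The difference between .map(...) and .flatMap(...) is easy to see side by side. The sketch below is a plain-Python analogue (a list comprehension versus itertools.chain.from_iterable), not PySpark; the years list and the year_and_next helper are made-up stand-ins for row[16] and the lambda in the post.

```python
from itertools import chain

# Made-up stand-in for the row[16] values of a few rows.
years = ['2014', '2013', '2014']

def year_and_next(year):
    """Same shape as the lambda in the post: the value and the value plus one."""
    return (year, int(year) + 1)

# map-like behaviour: one tuple per input element.
mapped = [year_and_next(y) for y in years]

# flatMap-like behaviour: the tuples are unpacked into one flat sequence.
flat_mapped = list(chain.from_iterable(year_and_next(y) for y in years))

print(mapped)       # [('2014', 2015), ('2013', 2014), ('2014', 2015)]
print(flat_mapped)  # ['2014', 2015, '2013', 2014, '2014', 2015]
```

The flattened result mixes strings and integers because the function returns a string together with an integer.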


24
Lisrelchen posted on 2017-3-30 07:47:04
The .distinct(...) transformation

This method returns a new RDD containing only the distinct elements of the RDD it is called on; to get the distinct values of a single column, first .map(...) each row to that column. It is extremely useful when you want to get to know your dataset or validate it. Let's check whether the gender column contains only males and females, which would verify that we parsed the dataset properly:

    # Project each row to the gender field (index 5), then drop duplicates
    distinct_gender = data_from_file_conv.map(
        lambda row: row[5]).distinct()

    # Action: bring the distinct values back to the driver
    distinct_gender.collect()
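The validation idea can be sketched in plain Python as "project the column, deduplicate, compare with the values you expect". This is an analogue only, not PySpark; the rows, the GENDER_COL index, and the '?' value are all made up for illustration.

```python
# Plain-Python analogue of .map(lambda row: row[col]).distinct().
# Hypothetical two-field rows; the post uses field 5 of the real dataset.
GENDER_COL = 1

rows = [
    ('r1', 'M'),
    ('r2', 'F'),
    ('r3', 'M'),
    ('r4', '?'),  # deliberately unexpected value
    ('r5', 'F'),
]

# Like .map(...): keep only the gender field of each row.
gender_values = [row[GENDER_COL] for row in rows]

# Like .distinct(): duplicates collapse; a set carries no ordering guarantee.
distinct_gender = set(gender_values)

# Validation step: anything outside the expected values points to a parsing or data issue.
unexpected = distinct_gender - {'M', 'F'}

print(sorted(distinct_gender))
print(unexpected)
```

If unexpected is not empty, either the parsing is off or the data contains codes beyond the two you assumed.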


25
Lisrelchen posted on 2017-3-30 07:47:51
The .sample(...) transformation

The .sample(...) method returns a randomized sample from the dataset. The first parameter specifies whether the sampling should be with replacement, the second defines the fraction of the data to return (an expected fraction, so the sample size is approximate), and the third is the seed for the pseudo-random number generator:

    # Sample roughly 10% of the rows, without replacement, using seed 666
    fraction = 0.1
    data_sample = data_from_file_conv.sample(False, fraction, 666)
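One way to picture why the fraction is only approximate and why the seed matters is an independent keep-or-drop decision per element. The sketch below is plain Python with a hypothetical bernoulli_sample helper; it is not Spark's implementation, just an illustration of sampling without replacement under that assumption.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [item for item in items if rng.random() < fraction]

data = list(range(10_000))  # made-up dataset

sample_a = bernoulli_sample(data, 0.1, 666)
sample_b = bernoulli_sample(data, 0.1, 666)

# The size lands near 10% of the data, but is not exactly 1000.
print(len(sample_a))

# Same seed, same input: the two samples are identical.
print(sample_a == sample_b)
```

Because no element can be picked twice, the sample contains no duplicates beyond those already present in the data.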


26
kile31920 posted on 2017-3-31 18:17:33
Learning PySpark by Tomasz Drabas

27
sacromento (verified student) posted on 2017-5-24 06:24:09
Thanks for sharing!

28
zeldarxf posted on 2018-2-12 21:23:35
Learning PySpark by Tomasz Drabas

29
bearfighting posted on 2018-4-1 00:53:01
Great book, thanks.

30
yuezzyy posted on 2019-7-5 17:00:40
Let me take a look.
