Python + Hadoop: Real Python in Pig trunk

Posted by: Nicolle | Category: hadoop

For a long time, data scientists and engineers had to choose between leveraging the power of Hadoop and using Python’s amazing data science libraries (like NLTK, NumPy, and SciPy). It’s a painful decision, and one we thought should be eliminated.
So about a year ago, we solved this problem by extending Pig to work with CPython, allowing our users to take advantage of Hadoop with real Python (see our presentation here). To say Mortar users have loved that combination would be an understatement.
However, only Mortar users could use Pig and real Python together…until now.
As of this week, our work with Pig and CPython has now been committed into Apache Pig trunk. We’ve always been deeply dedicated to open source and have contributed as much as possible back to the community, so this is just one more example of that commitment.
Why is CPython support so exciting? To fully understand, you need to know a little bit about the previous options for writing Python code with Hadoop.
One common option is for people to use a Python-specific Hadoop framework like mrjob, Pydoop, or Dumbo. While these frameworks make it easy to write Python, you’re stuck writing low-level MapReduce jobs and thus miss out on most of Pig’s benefits as compared to MapReduce.
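To see what "low-level MapReduce" means in practice, here is a minimal word count written with mrjob (an illustrative sketch, not taken from the original post); every job has to be expressed as mapper and reducer functions, with none of Pig's joins, grouping, or schema handling:

# word_count.py -- a minimal mrjob job, shown only to illustrate the
# low-level mapper/reducer style these Python frameworks require.
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by the mappers.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()

You can run this locally with "python word_count.py input.txt" or on a cluster with "python word_count.py -r hadoop input.txt".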
So what about Python in Pig? Before CPython support, you had two options: Jython User Defined Functions (UDFs) or Pig streaming.
Jython UDFs are really easy to write in Pig and work well for a lot of common use cases. Unfortunately, they also have a couple of limitations. For serious data science work in Python, you often want to turn to libraries like NLTK, NumPy, and SciPy. However, using Jython means that all of these libraries that rely on C implementations are out of reach and unusable. Jython also lags behind CPython in support for new Python features, so porting any of your existing Python code to Jython isn’t always a pleasant experience.
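For context, a Jython UDF is just a Python function annotated with an output schema and registered from the Pig script. The sketch below (names are illustrative, not from the original post) works under Jython precisely because it uses no C-backed libraries:

# udfs.py -- sketch of a Jython UDF; pure-Python logic like this is fine,
# but anything that needs NumPy, SciPy, or NLTK (C extensions) is unusable.
@outputSchema('word:chararray')   # decorator provided by Pig's Jython runtime
def to_upper(word):
    if word is None:
        return None
    return word.upper()

# In the Pig script:
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   upper = FOREACH tweets GENERATE myudfs.to_upper(text);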
Streaming is a powerful and flexible tool that is Pig’s way of working with any external process. Unfortunately, streaming is difficult to use for all but the most trivial of Python scripts. It’s up to the user to write Python code to manage reading from the input stream and writing to the output stream, and that user needs to understand Pig’s serialization formats and write their own deserialization/serialization code in Python. Moreover, the serialization code lacks support for many common cases like data containing newline characters, parentheses, commas, etc. Errors in the Python code are hard to capture and send back to Pig, and even harder to diagnose and debug. It’s not a process for the faint of heart.
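As a rough illustration (not from the original post), this is the kind of plumbing a hand-rolled streaming script leaves to the user; note how the naive tab-splitting below breaks as soon as a record contains a tab, newline, or Pig's tuple delimiters:

#!/usr/bin/env python
# stream_script.py -- sketch of the plumbing Pig streaming pushes onto the user.
import sys

for line in sys.stdin:
    # Pig ships each tuple as a tab-delimited text line; deserialization,
    # type conversion, and error handling are entirely up to this script.
    fields = line.rstrip('\n').split('\t')
    place, text = fields[0], fields[1]
    # ... actual per-record logic would go here ...
    # Serialization back to Pig is also hand-rolled.
    sys.stdout.write('%s\t%d\n' % (place, len(text)))

# In the Pig script:
#   DEFINE stream_script `python stream_script.py` SHIP('stream_script.py');
#   results = STREAM tweets THROUGH stream_script;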
Looking at these alternatives, people who want to use Python and CPython libraries in Pig are stuck. But with CPython UDFs, users can leverage Pig to get the power and flexibility of streaming directly to a CPython process without the headaches associated with Pig streaming.
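From the user's side, the new mechanism looks almost like a Jython UDF: the Python file is registered with streaming_python instead of jython, and Pig owns all of the (de)serialization to and from a real CPython process. A minimal sketch, with illustrative names:

# cpython_udfs.py -- sketch of a CPython UDF registered via streaming_python.
from pig_util import outputSchema   # helper shipped with Pig 0.12+
import numpy as np                  # C-backed libraries are now fair game

@outputSchema('mean_score:double')
def mean_score(values):
    # values arrives as a bag of single-field tuples, e.g. {(1.0),(2.0)}
    return float(np.mean([v[0] for v in values]))

# In the Pig script:
#   REGISTER 'cpython_udfs.py' USING streaming_python AS my_udfs;
#   means = FOREACH grouped GENERATE group, my_udfs.mean_score(tweets.score);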
Here’s a quick example: Let’s say you want to use NLTK to find the 5 most common bigrams by place name in some Twitter data. Here’s how you can do that (using data from the Twitter gardenhose we provide as a public convenience):
Pig: https://gist.github.com/e67d6f280b68202471e4
Python: https://gist.github.com/085359fc93f337acc370
And that’s it. You get to focus just on the logic you need, and streaming Python takes care of all the plumbing. To run this yourself, you’ll need a Pig 0.12 build and a Hadoop cluster with Python and NLTK installed on it. If that’s too much hassle, you can run it locally with the Mortar framework or at scale on the Mortar platform for free.
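If the gists above are unavailable, the overall shape of the solution is roughly the following; this is an illustrative sketch, not the actual gist code, and the field and function names are assumptions:

# top_bigrams.py -- illustrative sketch of an NLTK-based CPython UDF.
from collections import Counter

import nltk
from pig_util import outputSchema

@outputSchema('top_bigrams:bag{t:(bigram:chararray, count:long)}')
def top_five_bigrams(texts):
    # texts is a bag of (tweet_text,) tuples, one bag per place name.
    counts = Counter()
    for (text,) in texts:
        tokens = nltk.word_tokenize(text.lower())
        counts.update(' '.join(pair) for pair in nltk.bigrams(tokens))
    return counts.most_common(5)

# In the Pig script (paths and field names are illustrative):
#   REGISTER 'top_bigrams.py' USING streaming_python AS nltk_udfs;
#   by_place = GROUP tweets BY place;
#   top = FOREACH by_place GENERATE group, nltk_udfs.top_five_bigrams(tweets.text);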