This is the code repository for Apache Spark 2 for Beginners, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

## Instructions and Navigations

All of the code is organized into folders. Each folder starts with a number followed by the application name. For example, Chapter02.
Detailed installation steps (software-wise) are listed below; they prepare the system environment to test the code samples in the book.

### 1. Apache Spark

a. Download the Spark version mentioned in the table.
b. Build Spark from source or use the binary download, following the detailed instructions given at http://spark.apache.org/docs/latest/building-spark.html
c. If building Spark from source, make sure that the R profile is also built; the instructions for this are given in the link in step b.
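As a rough sketch, the download-and-build path in steps a–c can look like the following shell session. The version number is a placeholder (substitute the version from the table), and the `make-distribution.sh` invocation with the `-Psparkr` profile follows the Spark build documentation linked in step b:

```shell
# Placeholder version -- use the Spark version mentioned in the table.
SPARK_VERSION=2.0.0

# Step a: download the source release from the Apache archive.
wget https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}.tgz
tar -xzf spark-${SPARK_VERSION}.tgz
cd spark-${SPARK_VERSION}

# Steps b and c: build a distribution, including the SparkR profile (-Psparkr)
# so that the R support mentioned in step c is also built.
./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phive -Phive-thriftserver
```

Using the pre-built binary download instead of building from source skips the last step entirely.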
### 2. Apache Kafka

a. Download the Kafka version mentioned in the table.
b. The "quick start" section of the Kafka documentation gives the instructions to set up Kafka: http://kafka.apache.org/documentation.html#quickstart
c. Apart from the installation instructions, topic creation and the other Kafka setup prerequisites are covered in detail in the relevant chapter of the book.
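A typical quickstart session following the Kafka documentation looks roughly like the commands below; the paths assume the extracted Kafka distribution directory, and the topic name `test` is only an example, not a name taken from the book:

```shell
# Start ZooKeeper and then the Kafka broker (both ship with the Kafka distribution).
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &

# Create a topic ("test" is an example name).
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic test

# Console producer: the messages typed here are what the Spark
# streaming application consumes.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
```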
The code will look like the following:

```
Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
```
Spark 2.0.0 or above needs to be installed, at least on a standalone machine, to run the code samples and to do further activities to learn more about the subject. For Spark Stream Processing, Kafka needs to be installed and configured as a message broker, with its command-line producer producing messages and the application developed using Spark acting as a consumer of those messages.
```python
# Aggregate the per-genre counts across the whole movies table.
genreDF = spark.sql("SELECT sum(unknown) as unknown, sum(action) as action, sum(adventure) as adventure, sum(animation) as animation, sum(childrens) as childrens, sum(comedy) as comedy, sum(crime) as crime, sum(documentary) as documentary, sum(drama) as drama, sum(fantasy) as fantasy, sum(filmNoir) as filmNoir, sum(horror) as horror, sum(musical) as musical, sum(mystery) as mystery, sum(romance) as romance, sum(sciFi) as sciFi, sum(thriller) as thriller, sum(war) as war, sum(western) as western FROM movies")

# Top ten release years by overall movie count.
yearDF = spark.sql("SELECT substring(releaseDate,8,4) as releaseYear, count(*) as movieCount FROM movies GROUP BY substring(releaseDate,8,4) ORDER BY movieCount DESC LIMIT 10")

# Action movie counts per release year, registered as the temp view "action".
yearActionDF = spark.sql("SELECT substring(releaseDate,8,4) as actionReleaseYear, count(*) as actionMovieCount FROM movies WHERE action = 1 GROUP BY substring(releaseDate,8,4) ORDER BY actionReleaseYear DESC LIMIT 10")
yearActionDF.show()
yearActionDF.createOrReplaceTempView("action")

# Drama movie counts per release year, registered as the temp view "drama".
yearDramaDF = spark.sql("SELECT substring(releaseDate,8,4) as dramaReleaseYear, count(*) as dramaMovieCount FROM movies WHERE drama = 1 GROUP BY substring(releaseDate,8,4) ORDER BY dramaReleaseYear DESC LIMIT 10")
yearDramaDF.show()
yearDramaDF.createOrReplaceTempView("drama")

# Join the two temp views on release year.
yearCombinedDF = spark.sql("SELECT a.actionReleaseYear as releaseYear, a.actionMovieCount, d.dramaMovieCount FROM action a, drama d WHERE a.actionReleaseYear = d.dramaReleaseYear ORDER BY a.actionReleaseYear DESC LIMIT 10")
```
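The group-by-then-join logic behind `yearCombinedDF` can be exercised outside Spark with Python's standard `sqlite3` module; the tiny `movies` table below is a made-up stand-in for the book's data, not a sample of it:

```python
import sqlite3

# In-memory database with a hypothetical, minimal movies table:
# releaseDate in the "DD-MMM-YYYY" shape the substring(...,8,4) trick assumes,
# plus 0/1 genre flags for action and drama.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (releaseDate TEXT, action INTEGER, drama INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("01-Jan-1995", 1, 0),
        ("01-Jan-1995", 0, 1),
        ("01-Jan-1996", 1, 0),
        ("01-Jan-1996", 1, 1),
    ],
)

# Same shape as the Spark SQL above: count action and drama movies per
# release year, then join the two result sets on the year.
action = ("SELECT substr(releaseDate, 8, 4) AS y, count(*) AS c "
          "FROM movies WHERE action = 1 GROUP BY y")
drama = ("SELECT substr(releaseDate, 8, 4) AS y, count(*) AS c "
         "FROM movies WHERE drama = 1 GROUP BY y")
combined = conn.execute(
    f"SELECT a.y, a.c, d.c FROM ({action}) a JOIN ({drama}) d "
    "ON a.y = d.y ORDER BY a.y DESC"
).fetchall()
print(combined)  # [('1996', 2, 1), ('1995', 1, 1)]
```

Note that SQLite spells the function `substr` rather than Spark SQL's `substring`; the 1-indexed start position and length arguments behave the same way.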