If you’ve been following recent news in the Big Data world, you’ve probably heard about Apache Flink. This platform for batch and stream processing, which is built on a few significant technical innovations, can become a real game changer and it is starting to compete with existing products like Apache Spark.
In this post, I would like to show how to implement a simple batch processing algorithm using Apache Flink. We will work with a dataset of movie ratings and will produce a distribution of user ratings. In the process, I’ll show few tricks that you can use to improve the performance of your Flink applications.
Creating an Apache Flink project is pretty straightforward. The Apache Flink developers created a project template for us, so all we need to do is to use the Maven archetype:generate command:
It will generate a pom.xml file and several example Flink applications. To write our own, we need to create a new Java class with a main method. It will work in both development mode and on a Flink cluster.