In September 2016, RStudio announced sparklyr, a new R interface to Apache Spark that follows the popular dplyr syntax. At Databricks, we provide the best place to run Apache Spark and all applications and packages powered by it, in every language that Spark supports. sparklyr’s addition to the Spark ecosystem not only complements SparkR but also extends Spark’s reach to new users and communities.
Today, we are happy to announce that sparklyr can be seamlessly used in Databricks clusters running Apache Spark 2.2 or higher. In this blog post, we show how you can install and configure sparklyr in Databricks. We also introduce some of the latest improvements in Databricks R Notebooks.
sparklyr is under active development, and new versions are released regularly with new APIs and bug fixes. We do not pre-install sparklyr, so that users can install and enjoy the latest version of the package. You can install the latest development version from GitHub:
devtools::install_github("rstudio/sparklyr")
Once sparklyr 0.6 is released to CRAN, installation will become even simpler: you will be able to install the package directly with install.packages("sparklyr").
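Once the package is installed, the next step is to open a connection. Here is a minimal sketch for a Databricks notebook, assuming sparklyr 0.6 or later; the method = "databricks" argument attaches sparklyr to the Spark session that the cluster already manages:

library(sparklyr)

# Attach to the Spark session that Databricks manages for the cluster;
# no master URL or local Spark installation is needed in a Databricks notebook.
sc <- spark_connect(method = "databricks")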
After setting up the sparklyr connection, you can use the full sparklyr API. You can combine sparklyr with dplyr for data manipulation or with Spark MLlib for machine learning, as in the sketch below. You can also use sparklyr extensions; note that if an extension package includes third-party JARs, you may need to install those JARs as libraries in your workspace.
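As an illustrative sketch, assuming the connection sc from above and the built-in mtcars data set (the table and variable names here are our own examples):

library(dplyr)

# Copy a local data frame to Spark; mtcars_tbl is a remote dplyr table.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed on the cluster.
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()

# MLlib algorithms are available through sparklyr's ml_* functions.
fit <- ml_linear_regression(mtcars_tbl, response = "mpg", features = c("wt", "cyl"))
summary(fit)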
We find SparkR and sparklyr complementary: you can use them side by side in a single notebook or job by importing SparkR along with sparklyr in your Databricks notebook. The SparkR connection is pre-configured in the notebook, and once the package is imported you can start using the SparkR API. Keep in mind that some SparkR functions mask a number of dplyr functions, as R reports when you attach SparkR:
library(SparkR)
The following objects are masked from ‘package:dplyr’:
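The masking only affects unqualified calls. To keep a mixed notebook unambiguous, you can qualify conflicting functions with their package namespaces. A sketch, reusing the sparklyr connection sc from above:

library(SparkR)
library(dplyr)

# Both packages define arrange(), filter(), and others;
# qualify each call with its namespace to pick an implementation explicitly.
df <- SparkR::createDataFrame(faithful)            # SparkR DataFrame
head(SparkR::arrange(df, df$waiting))

faithful_tbl <- dplyr::copy_to(sc, faithful, "faithful_tbl")  # sparklyr table
faithful_tbl %>% dplyr::arrange(waiting) %>% head()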