Hands-on examples of scalar and predicate type of subqueries
Davies LiuHerman van Hövell by Davies Liu and Herman van Hövell Posted in ENGINEERING BLOG June 17, 2016
Try this notebook in Databricks
In the upcoming Apache Spark 2.0 release, we have substantially expanded the SQL standard capabilities. In this brief blog post, we will introduce subqueries in Apache Spark 2.0, including their limitations, potential pitfalls and future expansions, and through a notebook, we will explore both the scalar and predicate type of subqueries, with short examples that you can try yourself.
A subquery is a query that is nested inside of another query. A subquery as a source (inside a SQL FROM clause) is technically also a subquery, but it is beyond the scope of this post. There are basically two kinds of subqueries: scalar and predicate subqueries. And within scalar and predicate queries, there are uncorrelated scalar and correlated scalar queries and nested predicate queries respectively.
For brevity, we will let you jump and explore the notebook, which is more an interactive experience rather than an exposition here in the blog. Click on this diagram below to view and explore the subquery notebook with Apache Spark 2.0 preview on Databricks.
An uncorrelated subquery returns the same single value for all records in a query. Uncorrelated subqueries are executed by the Spark enging before the main query is executed. The SQL below shows an example of an uncorrelated scalar subquery, here we add the maximum age in table employee to the select.
Subqueries can be correlated, this means that the subquery contains references to the outer query. These outer references are typically used in filter clauses (SQL WHERE clause). Spark 2.0 currently only supports this case. The SQL below shows an example of a correlated scalar subquery, here we add the maximum age in an employee’s department to the select list using A.dep_id = B.dep_id as the correlated condition.
%sql
SELECT A.dep_id,
A.employee_id,
A.age,
(SELECT MAX(age) FROM employee B WHERE A.dep_id = B.dep_id) max_age