Every MySQL database programmer should learn and apply the newly added MariaDB and MySQL Window Functions and Common Table Expressions(CTEs) in their daily work. Both CTEs and window functions enable easy solutions to many query challenges that in prior releases have been difficult and sometimes impossible to surmount. Mastering these features opens the door to query solutions that are more robust, execute faster, and are easier to maintain over time than prior solutions using older techniques.
In our last blog we compared User Roles in MySQL 8 vs. MariaDB. Today, we will here compare Window Functions and Common Table Expressions in both databases.
Window Functions
While all database administrators are familiar with aggregate functions like COUNT(), SUM(), and AVG(), far less people make use of window functions in their queries. Unlike aggregate functions, which operate on an entire table, window functions operate on a set of rows and return a single aggregated value for each row.
The main advantage of using window functions over regular aggregate functions is that window functions do not cause rows to become grouped into a single output row. Instead, the rows retain their separate identities and an aggregated value is added to each row.
Window Functions in MariaDB
Windowing functions were added to the ANSI/ISO Standard SQL:2003 and then extended in ANSI/ISO Standard SQL:2008. DB2, Oracle, Sybase, PostgreSQL and other products have had full implementations for years. Other vendors added support for window functions later on. Case in point, when Microsoft introduced Window Functions in SQL Server 2005, it only included a handful of functions, namely ROW_NUMBER, RANK, NTILE and DENSE_RANK. It was not until SQL Server 2012 that they implemented a full range of Window Functions.
After numerous wishes and feature requests for window functions over the years, they were finally introduced in MariaDB 10.2.0 to great fanfare! Now MariaDB includes window functions such as ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, COUNT, SUM, AVG, BIT_OR, BIT_AND and BIT_XOR.
The Syntax
Window function queries are characterised by the OVER keyword, following which the set of rows used for the calculation is specified. By default, the set of rows used for the calculation (the “window”) is the entire dataset, which can be ordered with the ORDER BY clause. The PARTITION BY clause is then employed to reduce the window to a particular group within the dataset.
Here’s an example:
Given the following table of student test scores:
+------------+---------+--------+ | name | test | score | +------------+---------+--------+ | Steve | SQL | 75 | +------------+---------+--------+ | Robert | SQL | 43 | +------------+---------+--------+ | Tracy | SQL | 56 | +------------+---------+--------+ | Tatiana | SQL | 87 | +------------+---------+--------+ | Steve | Tuning | 73 | +------------+---------+--------+ | Robert | Tuning | 31 | +------------+---------+--------+ | Tracy | Tuning | 88 | +------------+---------+--------+ | Tatiana | Tuning | 83 | +------------+---------+--------+
The following two queries return the average test scores partitioned by test and by name respectively – in other words, aggregated by test and by name:
SELECT name, test, score, AVG(score) OVER (PARTITION BY test) AS average_by_test FROM test_scores; +----------+--------+-------+-----------------+ | name | test | score | average_by_test | +----------+--------+-------+-----------------+ | Steve | SQL | 75 | 65.2500 | | Steve | Tuning | 73 | 68.7500 | | Robert | SQL | 43 | 65.2500 | | Robert | Tuning | 31 | 68.7500 | | Tracy | SQL | 56 | 65.2500 | | Tracy | Tuning | 88 | 68.7500 | | Tatiana | SQL | 87 | 65.2500 | | Tatiana | Tuning | 83 | 68.7500 | +----------+--------+-------+-----------------+
SELECT name, test, score, AVG(score) OVER (PARTITION BY name) AS average_by_name FROM student; +---------+--------+-------+-----------------+ | name | test | score | average_by_name | +---------+--------+-------+-----------------+ | Steve | SQL | 75 | 74.0000 | | Steve | Tuning | 73 | 74.0000 | | Robert | SQL | 43 | 37.0000 | | Robert | Tuning | 31 | 37.0000 | | Tracy | SQL | 56 | 72.0000 | | Tracy | Tuning | 88 | 72.0000 | | Tatiana | SQL | 87 | 85.0000 | | Tatiana | Tuning | 83 | 85.0000 | +---------+--------+-------+-----------------+
In both cases, note that the original scores are still available to each row.
Window Functions in MySQL 8
MySQL has been even later in adopting the Window Functions standard, with it being part of version 8.0 that is expected to be released later this year.
MySQL employs the same ANSI/ISO Standard as other DBMSes whereby Window function queries are characterised by the OVER keyword and the PARTITION BY clause is employed to reduce the window to a specific group within the result set.
The currently supported functions include:
Name | Description |
CUME_DIST() | Cumulative distribution value |
DENSE_RANK() | Rank of current row within its partition, without gaps |
FIRST_VALUE() | Value of argument from the first row of window frame |
LAG() | Value of argument from row lagging current row within partition |
LAST_VALUE() | Value of argument from the last row of window frame |
LEAD() | Value of argument from row leading current row within partition |
NTH_VALUE() | Value of argument from N-th row of window frame |
NTILE() | Bucket number of the current row within its partition. |
PERCENT_RANK() | Percentage rank value |
RANK() | Rank of current row within its partition, with gaps |
ROW_NUMBER() | Number of current row within its partition |
As an example, we’ll explore the CUME_DIST() function.
It returns the cumulative distribution of a value within a group of values; that is, the percentage of partition values less than or equal to the value in the current row. This represents the number of rows preceding or peer with the current row in the window ordering of the window partition divided by the total number of rows in the window partition. Return values range from 0 to 1.
This function is usually used with ORDER BY to sort partition rows into the desired order. Without ORDER BY, all rows are peers having a value of 1.
The following query shows, for the set of values in the val column, the CUME_DIST() value for each row, as well as the percentage rank value returned by the similar PERCENT_RANK() function. For reference, the query also displays row numbers using ROW_NUMBER():
SELECT val, ROW_NUMBER() OVER w AS 'row_number', CUME_DIST() OVER w AS 'cume_dist', PERCENT_RANK() OVER w AS 'percent_rank' FROM numbers WINDOW w AS (ORDER BY val); +------+------------+--------------------+--------------+ | val | row_number | cume_dist | percent_rank | +------+------------+--------------------+--------------+ | 1 | 1 | 0.2222222222222222 | 0 | | 1 | 2 | 0.2222222222222222 | 0 | | 2 | 3 | 0.3333333333333333 | 0.25 | | 3 | 4 | 0.6666666666666666 | 0.375 | | 3 | 5 | 0.6666666666666666 | 0.375 | | 3 | 6 | 0.6666666666666666 | 0.375 | | 4 | 7 | 0.8888888888888888 | 0.75 | | 4 | 8 | 0.8888888888888888 | 0.75 | | 5 | 9 | 1 | 1 | +------+------------+--------------------+--------------+
The OVER clause is permitted for many aggregate functions, including:
- AVG()
- COUNT()
- MAX()
- MIN()
- STDDEV_POP(), STDDEV(), STD()
- STDDEV_SAMP()
- SUM()
- VAR_POP(), VARIANCE()
- VAR_SAMP()
These can be used as window or non-window functions, depending on whether the OVER clause is present or absent:
MySQL also supports non-aggregate functions that are used only as window functions. For these, the OVER clause is mandatory:
- CUME_DIST()
- DENSE_RANK()
- FIRST_VALUE()
- LAG()
- LAST_VALUE()
- LEAD()
- NTH_VALUE()
- NTILE()
- PERCENT_RANK()
- RANK()
- ROW_NUMBER()
As an example of a non-aggregate window functions, this query uses ROW_NUMBER(), which produces the row number of each row within its partition. In this case, rows are numbered per country. By default, partition rows are unordered and row numbering is indeterminate. To sort partition rows, include an ORDER BY clause within the window definition. The query uses unordered and ordered partitions (the row_num1 and row_num2 columns) to illustrate the difference that omitting and including ORDER BY makes:
SELECT year, country, product, profit, ROW_NUMBER() OVER(PARTITION BY country) AS row_num1, ROW_NUMBER() OVER(PARTITION BY country ORDER BY year, product) AS row_num2 FROM sales; +------+---------+------------+--------+----------+----------+ | year | country | product | profit | row_num1 | row_num2 | +------+---------+------------+--------+----------+----------+ | 2000 | Finland | Computer | 1500 | 2 | 1 | | 2000 | Finland | Phone | 100 | 1 | 2 | | 2001 | Finland | Phone | 10 | 3 | 3 | | 2000 | India | Calculator | 75 | 2 | 1 | | 2000 | India | Calculator | 75 | 3 | 2 | | 2000 | India | Computer | 1200 | 1 | 3 | | 2000 | USA | Calculator | 75 | 5 | 1 | | 2000 | USA | Computer | 1500 | 4 | 2 | | 2001 | USA | Calculator | 50 | 2 | 3 | | 2001 | USA | Computer | 1500 | 3 | 4 | | 2001 | USA | Computer | 1200 | 7 | 5 | | 2001 | USA | TV | 150 | 1 | 6 | | 2001 | USA | TV | 100 | 6 | 7 | +------+---------+------------+--------+----------+----------+
Common Table Expressions (CTEs)
A common table expression (CTE) can be thought of as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. However, unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query.
A CTE can be used to:
- Create a recursive query.
- Substitute for a view when the general use of a view is not required; that is, you do not have to store the definition in metadata.
- Enable grouping by a column that is derived from a scalar subselect, or a function that is either not deterministic or has external access.
- Reference the resulting table multiple times in the same statement.
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries. The query can be divided into separate, simple, logical building blocks. These simple blocks can then be used to build more complex, interim CTEs until the final result set is generated.
Recursive CTEs
Recursive common table expressions (CTEs) were an implementation of standard SQL:1999 for hierarchical queries. The first implementations of Recursive CTEs began appearing in 2007. The recursive CTEs from the standard were relatively close to the existing implementation in IBM DB2 version 2. Recursive CTEs were eventually supported by Microsoft SQL Server (since SQL Server 2008 R2), Firebird 2.1, PostgreSQL 8.4+, SQLite 3.8.3+, Oracle 11g Release 2, and IBM Informix version 11.50+.
Without Common-table-expressions or a connected-by clause it is still possible to achieve hierarchical queries with user-defined recursive functions, but these tend to result in very complex SQL.
CTEs in MariaDB
In MariaDB, a non-recursive CTE is basically considered to be a query-local VIEW whose syntax is more readable than nested FROM (SELECT …). A CTE can refer to another and it can be referenced from multiple places.
Thus, CTEs are similar to derived tables. For example,
SQL with derived table:
SELECT * FROM ( SELECT * FROM employees WHERE dept = 'Engineering' ) AS engineers WHERE ...
SQL with CTE:
WITH engineers AS ( SELECT * FROM employees WHERE dept = 'Engineering' ) SELECT * FROM engineers WHERE ...
SQL is generally poor at recursion. One of the advantages of CTEs is that they permit a query to reference itself, hence recursive SQL. A recursive CTE will repeatedly execute subsets of the data until it obtains the complete result set. This makes it particularly useful for handling hierarchical or tree-structured data.
With recursive CTEs you can achieve things that would be very difficult to do with standard SQL and at a faster execution speed. They can help solve many types of business problems and even simplify some complex SQL/application logic down to a simple recursive call to the database.
Some example uses for recursive CTE are to find gaps in data, create organization charts and create test data.
WITH RECURSIVE signifies a recursive CTE. It is given a name, followed by a body (the main query) as follows:
Below is a recursive CTE that counts from 1 to 50. WITH cte AS (SELECT 1 AS n -- anchor member UNION ALL SELECT n + 1 -- recursive member FROM cte WHERE n < 50 -- terminator ) SELECT n FROM cte;
The above statement prints a number series from 1 to 49.
CTEs in MySQL
MySQL 8.0 adds CTEs via the standard WITH keyword, in much the same way it is implemented in competing products.
To specify common table expressions, use a WITH clause that has one or more comma-separated subclauses. Each subclause provides a subquery that produces a result set, and associates a name with the subquery. The following example defines CTEs named cte1 and cte2 in the WITH clause, and refers to them in the top-level SELECT that follows the WITH clause:
WITH cte1 AS (SELECT a, b FROM table1), cte2 AS (SELECT c, d FROM table2) SELECT b, d FROM cte1 JOIN cte2 WHERE cte1.a = cte2.c;
Immediately preceding SELECT for statements that include a SELECT statement:
- INSERT … WITH … SELECT …
- REPLACE … WITH … SELECT …
- CREATE TABLE … WITH … SELECT …
- CREATE VIEW … WITH … SELECT …
- DECLARE CURSOR … WITH … SELECT …
- EXPLAIN … WITH … SELECT …
A recursive common table expression is one having a subquery that refers to its own name. For example:
WITH RECURSIVE cte (n) AS ( SELECT 1 UNION ALL SELECT n + 1 FROM cte WHERE n < 5 ) SELECT * FROM cte; +------+ | n | +------+ | 1 | | 2 | | 3 | | 4 | | 5 | +------+
Conclusion
Window Functions and Common Table Expressions (CTEs) have been a mainstay of many popular database products for some time now. With the release of MySQL version 8 and MariaDB 10.2.0, both vendors have caught up with competing DBMSes such as SQL Server and Oracle.