eXtremeSQL Window Functions

A “Window function” is a variation on an aggregation function. Where an aggregation function, like SUM() and AVG(), takes n inputs and returns a single value, a window function returns n values.

For example, consider the simple table defined as:

 
    CREATE TABLE test(i int);
    INSERT INTO test VALUES(1),(2),(3),(4),(5);
     

The SUM() and AVG()aggregate functions return the following values:

 
    SELECT SUM(i), AVG(i) FROM test;
    #1      #2
    ------------------------------------------------------------------------------
    15      3
    Selected records: 1
     

In an SQL SELECT statement, the OVER keyword indicates that the preceding function is to be applied to a window representing a group of rows relative to the current row in the result set of the query. For example, the window functions SUM() and AVG() return the following values:

 
    SELECT i, SUM(i) OVER(), AVG(i) OVER() FROM test;
    i       #2      #3
    ------------------------------------------------------------------------------
    1       1       1
    2       3       1.5
    3       6       2
    4       10      2.5
    5       15      3
    Selected records: 5
     

The output of a window function depends on a number of input values (function arguments, computed values from other rows in the frame, etc.), so they don’t include functions that work element-wise, like + or ROUND(). Window functions include variations on aggregate functions, like SUM(), AVG(), MIN() and MAX(), functions for ranking and ordering, like RANK(), and functions for taking offsets, like LEAD() and LAG().

Note that these SQL window functions are not to be confused with the “Window_Agg” functions in the Vector-based Analytics Library. The “Window_Agg” functions operate only on sequences (i.e. fields of type sequence) where the “window” is specified as a fixed number of elements (i.e. interval) to perform computations on.

Window function syntax

A window function call always contains an OVER clause directly following the window function's name and argument(s). This is what syntactically distinguishes it from a regular function or aggregate function. The OVER clause determines exactly how the rows of the query are to be split up for processing by the window function. A PARTITION BY list within the OVER clause can be used for dividing the rows into groups, or partitions that share the same values specified in the PARTITION BY expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.

For a given window we can specify:

The following window functions are supported. Note that those below “for ordered rows” require an ORDER BY clause in the OVER specification:

  • count – the cumulative row count
  • row_number – the row number
  • min – the minimum value
  • max – the maximum value
  • sum – the cumulative sum of this and preceding rows
  • avg – the average of this row and all previous rows
  • median - the median of this row and all previous rows
  • first_value – the first value
  • last_value – the last value
  • nth_value – the specified nth value
  • lead – access a row at a specified offset after the current row
  • lag – access a row at a specified offset before the current row
  • rank – the rank of the value in this row (there can be gaps in the sequence of ranks)
  • dense_rank – the rank of the value in this row (with no gaps in the sequence of ranks)
  • percent_rank – used for computing the percentile of this row
  • ntile - distributes the rows into a specified number of groups

Specifying window frames

The window frame for each row of the result set is determined by specifying BETWEEN bounds. These BETWEEN bounds can be specified by either ROWS or RANGE options. The ROWS option defines a fixed number of rows preceding and following the current row; and the rows that will comprise the frame will be determined by the ORDER BY clause. If no bounds are explicitly specified, the default frame is from the first row (UNBOUNDED PRECEDING) to the current row.

Note that the BETWEEN bounds must be specified with explicit constant values, not with parameters. For example, the following statement is not valid:

     
    SELECT stamp, SUM(price*quantity) OVER w AS pq1, SUM(quantity) OVER w AS vol, pq1/vol AS VWAP
    FROM Quotes
    WINDOW w AS (PARTITION BY pkid ORDER BY stamp RANGE BETWEEN ? PRECEDING AND 0 FOLLOWING)
     

Here the parameter substitution "BETWEEN ? PRECEDING" is not allowed.

Instead the following statement using constant value 5 is valid:

 
    SELECT stamp, SUM(price*quantity) OVER w AS pq1, SUM(quantity) OVER w AS vol, pq1/vol AS VWAP
    FROM Quotes
    WINDOW w AS (PARTITION BY pkid ORDER BY stamp RANGE BETWEEN 5 PRECEDING AND 0 FOLLOWING)
     

If the RANGE option is used, then all rows within the frame that have the same ORDER BY value as the current row will be included in that row’s function computation. To illustrate the difference between physical ROWS BETWEEN frames and logical RANGE BETWEEN frames, we will add a column to the example table test and insert values as follows:

 
    ALTER TABLE test ADD val int;
    UPDATE test SET val=100 WHERE i=1;
    UPDATE test SET val=200 WHERE i=2;
    UPDATE test SET val=200 WHERE i=3;
    UPDATE test SET val=200 WHERE i=4;
    UPDATE test SET val=300 WHERE i=5;
    SELECT * FROM test;
    i       val
    ------------------------------------------------------------------------------
    1       100
    2       200
    3       200
    4       200
    5       300
    Selected records: 5
     

Now to compute the sum over the physical frame of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (the default frame) we could perform a query like the following:

 
    SELECT i, val, SUM(val) OVER(ORDER BY val ROWS BETWEEN UNBOUNDED PRECEDING AND
    CURRENT ROW) AS Row_Sum FROM test;
    i       val     Row_Sum
    ------------------------------------------------------------------------------
    1       100     100
    2       200     300
    3       200     500
    4       200     700
    5       300     1000
    Selected records: 5
     

Note that SUM(val) is calculated as the sum of the value val for the current row plus its value for all preceding rows. To compute the sum over the logical frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW we could perform a query like the following:

 
    SELECT i, val, SUM(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND
    CURRENT ROW) AS Range_Sum FROM test;
    i       val     Range_Sum
    ------------------------------------------------------------------------------
    1       100     100
    2       200     700
    3       200     700
    4       200     700
    5       300     1000
    Selected records: 5
     

Note that the ORDER BY value of column val is the same for rows 2, 3 and 4; so SUM(val) here is calculated for these rows as 100 + 200 + 200 + 200 = 700.

 

Examples

The following examples use data from Washington DC’s Capital Bikeshare Program, which publishes detailed trip-level historical data on their website. Each row represents one ride. The following import script was used to create the table bikes, the index bike_num_idx to speed up lookups by bike_num, and then insert the sample data:

 
    CREATE TABLE bikes(duration int, start_date string, end_date string, start_st_num int,
    start_st string, end_st_num int, end_st string, bike_num string,
    member_type string);
    INSERT INTO bikes SELECT * from foreign table(path='2016-Q1-Trips-History-Data.csv', delimiter=',', skip=1) AS bikes;
    CREATE INDEX bike_num_idx ON bikes(bike_num);
     

A simple practical example of a Window function is calculating a running total. For example the following query creates an aggregation (running_total) as the cumulative sum of field duration (note that this is without using a GROUP BY clause):

 
    SELECT duration, start_date, SUM(duration) OVER (ORDER BY start_date) AS running_total
    FROM bikes WHERE bike_num = 'W22196' AND start_date >= '3/31/2016';
     

The first part of the above aggregation, SUM(duration), looks similar to any other aggregation. Adding the keyword OVER designates it as a window function. So the above aggregation could be read as “take the sum of duration over the entire result set, in order of start_date”. Here, for brevity of output, we have specified a WHERE clause for a bike_num (W22196) and a single day (3/31/2016) with a limited number of trips. The result set shows how the running_total displays the cumulative sum:

 
    duration        start_date        running_total
    ------------------------------------------------------------------------------
    456465  	 3/31/2016 0:05    456465
    300889  	 3/31/2016 14:25   757354
    1001144 	 3/31/2016 23:51   1758498
    577035  	 3/31/2016 8:20    2335533
    Selected records: 4
     

The following query demonstrates the use of a PARTITION BY clause to calculate the cumulative sum for blocks of result set rows grouped by the start terminal number start_st_num and ordered by start_date:

 
    SELECT start_st_num, duration, start_date, SUM(duration)
    OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_total
    FROM bikes WHERE bike_num = 'W22196' AND start_st_num BETWEEN 31610 and 31625;
    start_st_num    duration   start_date        running_total
    ------------------------------------------------------------------------------
    31624   	 1375776    1/12/2016 18:28   1375776
    31624   	 396755     1/21/2016 8:47    1772531
    31624   	 988750     1/7/2016 19:38    2761281
    31624   	 203844     2/29/2016 10:26   2965125
    31624   	 987070     3/11/2016 16:30   3952195
    31624   	 372239     3/11/2016 18:09   4324434
    31623   	 602551     1/14/2016 7:02    602551
    31623   	 691224     1/16/2016 21:03   1293775
    31623   	 372176     1/18/2016 12:40   1665951
    31623   	 278196     1/20/2016 13:48   1944147
    31623   	 389189     1/7/2016 7:42     2333336
    31623  		 387578     3/29/2016 18:44   2720914
    31622   	 382357     1/18/2016 8:24    382357
    31622   	 797607     1/19/2016 8:36    1179964
    31622   	 508194     3/30/2016 9:09    1688158
    31621   	 837364     1/13/2016 17:37   837364
    31621   	 143947     3/1/2016 8:52     981311
    31620   	 188661     3/27/2016 12:22   188661
    31620   	 300889     3/31/2016 14:25   489550
    31618 		 3336939    3/30/2016 14:09   3336939
    31615   	 216014     1/6/2016 16:36    216014
    31615   	 577035     3/31/2016 8:20    793049
    31613   	 1493051    3/30/2016 10:18   1493051
    Selected records: 23
     

This query groups the results by start_st_num. Note that within each value of start_st_num, the running total sums duration across the current row and all previous rows of that window. This is the effect of the PARTITION BY clause. Also note that the optional ORDER BY clause simply orders by the designated column(s) the same way that an ORDER BY clause would for the SELECT statement, except that it treats every partition separately.

Note that you cannot use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a GROUP BY clause.

As the SUM() function calculates the cumulative sum, similarly the COUNT()and AVG() functions calculate the cumulative number of rows in the partition and their average. For example:

 
    SELECT start_st_num, duration, start_date,
    SUM(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_total,
    COUNT(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_count,
    AVG(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_avg
    FROM bikes WHERE bike_num = 'W22196' and start_st_num BETWEEN 31610 and 31625;
    start_st_num    duration start_date  running_total running_count  running_avg
    ------------------------------------------------------------------------------
    31624   1375776 1/12/2016 18:28 	1375776 	1       	1375776
    31624   396755  1/21/2016 8:47  	1772531 	2       	886265.5
    31624   988750  1/7/2016 19:38  	2761281 	3       	920427
    31624   203844  2/29/2016 10:26 	2965125 	4       	741281.25
    31624   987070  3/11/2016 16:30 	3952195 	5       	790439
    31624   372239  3/11/2016 18:09 	4324434 	6       	720739
    31623   602551  1/14/2016 7:02  	602551  	1       	602551
    31623   691224  1/16/2016 21:03 	1293775 	2       	646887.5
    31623   372176  1/18/2016 12:40 	1665951 	3       	555317
    31623   278196  1/20/2016 13:48 	1944147 	4       	486036.75
    31623   389189  1/7/2016 7:42   	2333336 	5       	466667.2
    31623   387578  3/29/2016 18:44 	2720914 	6       	453485.666666667
    31622   382357  1/18/2016 8:24  	382357  	1       	382357
    31622   797607  1/19/2016 8:36  	1179964 	2       	589982
    31622   508194  3/30/2016 9:09  	1688158 	3       	562719.333333333
    31621   837364  1/13/2016 17:37 	837364  	1       	837364
    31621   143947  3/1/2016 8:52   	981311  	2       	490655.5
    31620   188661  3/27/2016 12:22 	188661  	1       	188661
    31620   300889  3/31/2016 14:25 	489550  	2       	244775
    31618   3336939 3/30/2016 14:09 	3336939 	1       	3336939
    31615   216014  1/6/2016 16:36  	216014  	1       	216014
    31615   577035  3/31/2016 8:20  	793049  	2       	396524.5
    31613   1493051 3/30/2016 10:18 	1493051 	1       	1493051
    Selected records: 23
     

To demonstrate the next set of functions we will use a simple table and dataset created by the following script:

 
    CREATE TABLE test(i int);
    INSERT INTO test VALUES(1),(2),(3),(4),(5);
     

The LAG() and LEAD()functions return the value for the specified row following or preceding the current row, respectively:

 
    SELECT i,
    LAG(i,1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
    LEAD(i,1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test;
    i       #2      #3
    ------------------------------------------------------------------------------
    1       null    2
    2       1       3
    3       2       4
    4       3       5
    5       4       null
    Selected records: 5
     

The MIN() and MAX()functions return the minimal and the maximum values respectively for the specified frames. For example, when using the default bounds (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), the MIN() and MAX()are determined by the ever increasing frame. For example:

 
    SELECT i, MIN(i) OVER (ORDER BY i), MAX(i) OVER (ORDER BY i) FROM test;
    i       #2      #3
    ------------------------------------------------------------------------------
    1       1       1
    2       1       2
    3       1       3
    4       1       4
    5       1       5
    Selected records: 5
     

But when an unbounded frame is specified (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) the minimum and maximum values are for the entire window:

 
    SELECT i,
    MIN(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
    MAX(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    FROM test;
    i       #2      #3
    ------------------------------------------------------------------------------
    1       1       5
    2       1       5
    3       1       5
    4       1       5
    5       1       5
    Selected records: 5
     

The functions LAST_VALUE(), FIRST_VALUE(), and NTH_VALUE() with no boundaries specified and ordering by the field i will produce 5 frames and the value of LAST_VALUE() will be the value of i for the current row. This is because the default frame for window functions is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. So by not specifying a frame explicitly, FIRST_VALUE() returns the value of the first row in the table and LAST_VALUE()returns the value in the current row. The NTH_VALUE() function returns the value of the nth row (zero-based) of the current frame or NULL if there is no nth row in the current frame. For example:

 
    SELECT i, FIRST_VALUE(i) OVER (ORDER BY i), LAST_VALUE(i) OVER (ORDER BY i),
    NTH_VALUE(i, 1) OVER (ORDER BY i) FROM test;
    i       #2      #3      #4
    ------------------------------------------------------------------------------
    1       1       1       null
    2       1       2       2
    3       1       3       2
    4       1       4       2
    5       1       5       2
    Selected records: 5
     

If we specify an unbounded window the value of LAST_VALUE() will be the last value in the entire window:

 
    SELECT i,
    FIRST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
    LAST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
    NTH_VALUE(i, 1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test;
    i       #2      #3      #4
    ------------------------------------------------------------------------------
    1       1       5       2
    2       1       5       2
    3       1       5       2
    4       1       5       2
    5       1       5       2
    Selected records: 5
     

And if we specify a bounded window the value of LAST_VALUE() will be the last value in the specified frame (range of rows):

 
    SELECT i, FIRST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
    LAST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
    NTH_VALUE(i, 2) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM test;
    i       #2      #3      #4
    ------------------------------------------------------------------------------
    1       1       2       null
    2       1       3       3
    3       2       4       4
    4       3       5       5
    5       4       5       null
    Selected records: 5
     

To demonstrate ranking window functions we will use another simple table and dataset created by the following script:

 
    CREATE TABLE t(v string);
    INSERT INTO t(v) VALUES ('a'),('a'),('a'),('b'),('c'),('c'),('d'),('e');
     

The ROW_NUMBER() function assigns unique numbers to each row within the partition. For example:

 
    SELECT v, ROW_NUMBER() OVER(ORDER BY v) FROM t;
    v       #2
    ------------------------------------------------------------------------------
    a       1
    a       2
    a       3
    b       4
    c       5
    c       6
    d       7
    e       8
    Selected records: 8
     

The RANK() function behaves like ROW_NUMBER() except that “equal” rows are ranked the same. For example:

 
    SELECT v, RANK() OVER(ORDER BY v) FROM t;
    v       #2
    ------------------------------------------------------------------------------
    a       1
    a       1
    a       1
    b       4
    c       5
    c       5
    d       7
    e       8
    Selected records: 8
     

The DENSE_RANK() function calculates a rank with no gaps, i.e. “dense”. For example:

 
    SELECT v, DENSE_RANK() OVER(ORDER BY v) FROM t;
    v       #2
    ------------------------------------------------------------------------------
    a       1
    a       1
    a       1
    b       2
    c       3
    c       3
    d       4
    e       5
    Selected records: 8
     

The PERCENT_RANK() function calculates the relative rank of the current row to others in the partition. This function is used in statistical analysis to determine the percentile for a group of values. The actual formula used to calculate the percent_rank is:

 
    PERCENT_RANK() = ( RANK() – 1 ) / ( TotalRows - 1 )
     

To demonstrate this we create yet another table with the following script:

 
    CREATE TABLE salaries(DepartmentID int, Salary int);
    INSERT INTO salaries VALUES (1, 15000), (1, 18000), (1, 23000), (1, 23000), (1, 25000);
     

The RANK() and PERCENT_RANK() functions are applied to the Salary column in the following query:

 
    SELECT DepartmentID, Salary,
    ROW_NUMBER() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED
    PRECEDING AND UNBOUNDED FOLLOWING) AS RowNumber,
    RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED
    PRECEDING AND UNBOUNDED FOLLOWING) AS RowRank,
    PERCENT_RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED
    PRECEDING AND UNBOUNDED FOLLOWING) AS PercentRank
    FROM salaries;
    DepartmentID    Salary  RowNumber RowRank PercentRank
    ------------------------------------------------------------------------------
    1       	 15000   1		1       0
    1       	 18000   2		2       0.25
    1       	 23000   3		3       0.5
    1       	 23000   4		3       0.5
    1       	 25000   5		5       1
    Selected records: 5
     

Here we have 4 different Salary values in 5 rows. The values returned by PERCENT_RANK() are computed as follows:

 
    1.	RANK = 1 so PERCENT_RANK = ( 1 – 1 ) / 4 = 0
    2.	RANK = 2 so PERCENT_RANK = ( 2 – 1 ) / 4 = 0.25
    3.	RANK = 3 so PERCENT_RANK = ( 3 – 1 ) / 4 = 0.5
    4.	RANK = 3 so PERCENT_RANK = ( 3 – 1 ) / 4 = 0.5
    5.	RANK = 5 so PERCENT_RANK = ( 5 – 1 ) / 4 = 1
     

The NTILE()function distributes result set rows into a specified number of groups. To demonstrate we will create yet another table with the following script:

 
    CREATE TABLE students(StudentID char(2), Marks integer);
    INSERT INTO students VALUES('S1', 75);
    INSERT INTO students VALUES('S2', 83);
    INSERT INTO students VALUES('S3', 91);
    INSERT INTO students VALUES('S4', 83);
    INSERT INTO students VALUES('S5', 93);
     

We can apply the NTILE()function to this table with a script like the following:

 
    SELECT StudentID, Marks, NTILE(2) OVER(ORDER BY Marks ROWS BETWEEN UNBOUNDED PRECEDING
    AND UNBOUNDED FOLLOWING) FROM students;
    StudentID      Marks   #3
    ------------------------------------------------------------------------------
    S1      	75      1
    S2      	83      1
    S4      	83      1
    S3      	91      2
    S5      	93      2
    Selected records: 5
     

Note that the group number 1 has three rows while group 2 has two simply because there is an uneven number of result rows.