A “Window function” is a variation on an aggregation function. Where an aggregation function, like
SUM()
andAVG()
, takesn
inputs and returns a single value, a window function returnsn
values.For example, consider the simple table defined as:
CREATE TABLE test(i int); INSERT INTO test VALUES(1),(2),(3),(4),(5);
The
SUM()
andAVG()
aggregate functions return the following values:
SELECT SUM(i), AVG(i) FROM test; #1 #2 ------------------------------------------------------------------------------ 15 3 Selected records: 1
In an SQL
SELECT
statement, theOVER
keyword indicates that the preceding function is to be applied to awindow
representing a group of rows relative to the current row in the result set of the query. For example, the window functionsSUM()
andAVG()
return the following values:
SELECT i, SUM(i) OVER(), AVG(i) OVER() FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 1 1 2 3 1.5 3 6 2 4 10 2.5 5 15 3 Selected records: 5
The output of a window function depends on a number of input values (function arguments, computed values from other rows in the frame, etc.), so they don’t include functions that work element-wise, like + or
ROUND()
. Window functions include variations on aggregate functions, likeSUM()
,AVG()
,MIN()
andMAX()
, functions for ranking and ordering, likeRANK()
, and functions for taking offsets, likeLEAD()
andLAG()
.
Note that these SQL window functions are not to be confused with the “Window_Agg” functions in the Vector-based Analytics Library. The “Window_Agg” functions operate only on sequences (i.e. fields of type sequence) where the “window” is specified as a fixed number of elements (i.e. interval) to perform computations on.
Window function syntax
A window function call always contains an
OVER
clause directly following the window function's name and argument(s). This is what syntactically distinguishes it from a regular function or aggregate function. TheOVER
clause determines exactly how the rows of the query are to be split up for processing by the window function. APARTITION BY
list within theOVER
clause can be used for dividing the rows into groups, or partitions that share the same values specified in thePARTITION BY
expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.For a given window we can specify:
- the order of the rows in the window (the
ORDER BY
clause)- a subset of the table data to operate on (the
PARTITION BY
clause)- the window defined by the previous two clauses can be further narrowed down into frames of rows by use of the
ROWS BETWEEN
clause which can be:
- bounded – eg.
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
- partially bounded – eg.
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
- unbounded – eg.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- “logical”
frames
can also be defined through theRANGE BETWEEN
option (see “Specifying window frames” below)The following window functions are supported. Note that those below “for ordered rows” require an
ORDER BY
clause in theOVER
specification:
- aggregate functions
count
– the cumulative row countrow_number
– the row numbermin
– the minimum valuemax
– the maximum valuesum
– the cumulative sum of this and preceding rowsavg
– the average of this row and all previous rowsmedian
- the median of this row and all previous rows
- for ordered rows
first_value
– the first valuelast_value
– the last valuenth_value
– the specified nth valuelead
– access a row at a specified offset after the current rowlag
– access a row at a specified offset before the current rowrank
– the rank of the value in this row (there can be gaps in the sequence of ranks)dense_rank
– the rank of the value in this row (with no gaps in the sequence of ranks)percent_rank
– used for computing the percentile of this rowntile
- distributes the rows into a specified number of groupsSpecifying window frames
The window frame for each row of the result set is determined by specifying
BETWEEN
bounds. TheseBETWEEN
bounds can be specified by eitherROWS
orRANGE
options. TheROWS
option defines a fixed number of rows preceding and following the current row; and the rows that will comprise the frame will be determined by theORDER BY
clause. If no bounds are explicitly specified, the default frame is from the first row (UNBOUNDED PRECEDING
) to the current row.
Note that the
BETWEEN
bounds must be specified with explicit constant values, not with parameters. For example, the following statement is not valid:SELECT stamp, SUM(price*quantity) OVER w AS pq1, SUM(quantity) OVER w AS vol, pq1/vol AS VWAP FROM Quotes WINDOW w AS (PARTITION BY pkid ORDER BY stamp RANGE BETWEEN ? PRECEDING AND 0 FOLLOWING)Here the parameter substitution "
BETWEEN ? PRECEDING
" is not allowed.Instead the following statement using constant value
5
is valid:SELECT stamp, SUM(price*quantity) OVER w AS pq1, SUM(quantity) OVER w AS vol, pq1/vol AS VWAP FROM Quotes WINDOW w AS (PARTITION BY pkid ORDER BY stamp RANGE BETWEEN 5 PRECEDING AND 0 FOLLOWING)If the
RANGE
option is used, then all rows within the frame that have the sameORDER BY
value as the current row will be included in that row’s function computation. To illustrate the difference betweenphysical
ROWS BETWEEN
frames andlogical
RANGE BETWEEN
frames, we will add a column to the example table test and insert values as follows:
ALTER TABLE test ADD val int; UPDATE test SET val=100 WHERE i=1; UPDATE test SET val=200 WHERE i=2; UPDATE test SET val=200 WHERE i=3; UPDATE test SET val=200 WHERE i=4; UPDATE test SET val=300 WHERE i=5; SELECT * FROM test; i val ------------------------------------------------------------------------------ 1 100 2 200 3 200 4 200 5 300 Selected records: 5
Now to compute the sum over the physical frame of
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(the default frame) we could perform a query like the following:
SELECT i, val, SUM(val) OVER(ORDER BY val ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Row_Sum FROM test; i val Row_Sum ------------------------------------------------------------------------------ 1 100 100 2 200 300 3 200 500 4 200 700 5 300 1000 Selected records: 5
Note that
SUM(val)
is calculated as the sum of the value val for the current row plus its value for all preceding rows. To compute the sum over the logical frameRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
we could perform a query like the following:
SELECT i, val, SUM(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Range_Sum FROM test; i val Range_Sum ------------------------------------------------------------------------------ 1 100 100 2 200 700 3 200 700 4 200 700 5 300 1000 Selected records: 5
Note that the
ORDER BY
value of column val is the same for rows 2, 3 and 4; soSUM(val)
here is calculated for these rows as100 + 200 + 200 + 200 = 700
.
Examples
The following examples use data from Washington DC’s Capital Bikeshare Program, which publishes detailed trip-level historical data on their website. Each row represents one ride. The following import script was used to create the table bikes, the index
bike_num_idx
to speed up lookups bybike_num
, and then insert the sample data:
CREATE TABLE bikes(duration int, start_date string, end_date string, start_st_num int, start_st string, end_st_num int, end_st string, bike_num string, member_type string); INSERT INTO bikes SELECT * from foreign table(path='2016-Q1-Trips-History-Data.csv', delimiter=',', skip=1) AS bikes; CREATE INDEX bike_num_idx ON bikes(bike_num);
A simple practical example of a Window function is calculating a running total. For example the following query creates an aggregation (
running_total
) as the cumulative sum of field duration (note that this is without using aGROUP BY
clause):
SELECT duration, start_date, SUM(duration) OVER (ORDER BY start_date) AS running_total FROM bikes WHERE bike_num = 'W22196' AND start_date >= '3/31/2016';
The first part of the above aggregation,
SUM(duration)
, looks similar to any other aggregation. Adding the keywordOVER
designates it as a window function. So the above aggregation could be read as “take the sum of duration over the entire result set, in order ofstart_date
”. Here, for brevity of output, we have specified aWHERE
clause for abike_num (W22196)
and a single day (3/31/2016) with a limited number of trips. The result set shows how therunning_total
displays the cumulative sum:
duration start_date running_total ------------------------------------------------------------------------------ 456465 3/31/2016 0:05 456465 300889 3/31/2016 14:25 757354 1001144 3/31/2016 23:51 1758498 577035 3/31/2016 8:20 2335533 Selected records: 4
The following query demonstrates the use of a
PARTITION BY
clause to calculate the cumulative sum for blocks of result set rows grouped by the start terminal numberstart_st_num
and ordered bystart_date
:
SELECT start_st_num, duration, start_date, SUM(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_total FROM bikes WHERE bike_num = 'W22196' AND start_st_num BETWEEN 31610 and 31625; start_st_num duration start_date running_total ------------------------------------------------------------------------------ 31624 1375776 1/12/2016 18:28 1375776 31624 396755 1/21/2016 8:47 1772531 31624 988750 1/7/2016 19:38 2761281 31624 203844 2/29/2016 10:26 2965125 31624 987070 3/11/2016 16:30 3952195 31624 372239 3/11/2016 18:09 4324434 31623 602551 1/14/2016 7:02 602551 31623 691224 1/16/2016 21:03 1293775 31623 372176 1/18/2016 12:40 1665951 31623 278196 1/20/2016 13:48 1944147 31623 389189 1/7/2016 7:42 2333336 31623 387578 3/29/2016 18:44 2720914 31622 382357 1/18/2016 8:24 382357 31622 797607 1/19/2016 8:36 1179964 31622 508194 3/30/2016 9:09 1688158 31621 837364 1/13/2016 17:37 837364 31621 143947 3/1/2016 8:52 981311 31620 188661 3/27/2016 12:22 188661 31620 300889 3/31/2016 14:25 489550 31618 3336939 3/30/2016 14:09 3336939 31615 216014 1/6/2016 16:36 216014 31615 577035 3/31/2016 8:20 793049 31613 1493051 3/30/2016 10:18 1493051 Selected records: 23
This query groups the results by
start_st_num
. Note that within each value ofstart_st_num
, the running total sums duration across the current row and all previous rows of that window. This is the effect of thePARTITION BY
clause. Also note that the optionalORDER BY
clause simply orders by the designated column(s) the same way that anORDER BY
clause would for theSELECT
statement, except that it treats every partition separately.
Note that you cannot use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a
GROUP BY
clause.As the
SUM()
function calculates the cumulative sum, similarly theCOUNT()
andAVG()
functions calculate the cumulative number of rows in the partition and their average. For example:
SELECT start_st_num, duration, start_date, SUM(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_total, COUNT(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_count, AVG(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_avg FROM bikes WHERE bike_num = 'W22196' and start_st_num BETWEEN 31610 and 31625; start_st_num duration start_date running_total running_count running_avg ------------------------------------------------------------------------------ 31624 1375776 1/12/2016 18:28 1375776 1 1375776 31624 396755 1/21/2016 8:47 1772531 2 886265.5 31624 988750 1/7/2016 19:38 2761281 3 920427 31624 203844 2/29/2016 10:26 2965125 4 741281.25 31624 987070 3/11/2016 16:30 3952195 5 790439 31624 372239 3/11/2016 18:09 4324434 6 720739 31623 602551 1/14/2016 7:02 602551 1 602551 31623 691224 1/16/2016 21:03 1293775 2 646887.5 31623 372176 1/18/2016 12:40 1665951 3 555317 31623 278196 1/20/2016 13:48 1944147 4 486036.75 31623 389189 1/7/2016 7:42 2333336 5 466667.2 31623 387578 3/29/2016 18:44 2720914 6 453485.666666667 31622 382357 1/18/2016 8:24 382357 1 382357 31622 797607 1/19/2016 8:36 1179964 2 589982 31622 508194 3/30/2016 9:09 1688158 3 562719.333333333 31621 837364 1/13/2016 17:37 837364 1 837364 31621 143947 3/1/2016 8:52 981311 2 490655.5 31620 188661 3/27/2016 12:22 188661 1 188661 31620 300889 3/31/2016 14:25 489550 2 244775 31618 3336939 3/30/2016 14:09 3336939 1 3336939 31615 216014 1/6/2016 16:36 216014 1 216014 31615 577035 3/31/2016 8:20 793049 2 396524.5 31613 1493051 3/30/2016 10:18 1493051 1 1493051 Selected records: 23
To demonstrate the next set of functions we will use a simple table and dataset created by the following script:
CREATE TABLE test(i int); INSERT INTO test VALUES(1),(2),(3),(4),(5);
The
LAG()
andLEAD()
functions return the value for the specified row following or preceding the current row, respectively:
SELECT i, LAG(i,1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LEAD(i,1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 null 2 2 1 3 3 2 4 4 3 5 5 4 null Selected records: 5
The
MIN()
andMAX()
functions return the minimal and the maximum values respectively for the specified frames. For example, when using the default bounds (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
), theMIN()
andMAX()
are determined by the ever increasing frame. For example:
SELECT i, MIN(i) OVER (ORDER BY i), MAX(i) OVER (ORDER BY i) FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 Selected records: 5
But when an unbounded frame is specified (
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) the minimum and maximum values are for the entire window:
SELECT i, MIN(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), MAX(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 1 5 2 1 5 3 1 5 4 1 5 5 1 5 Selected records: 5
The functions
LAST_VALUE()
,FIRST_VALUE()
, andNTH_VALUE()
with no boundaries specified and ordering by the fieldi
will produce 5 frames and the value ofLAST_VALUE()
will be the value ofi
for the current row. This is because the default frame for window functions isROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
. So by not specifying a frame explicitly,FIRST_VALUE()
returns the value of the first row in the table andLAST_VALUE()
returns the value in the current row. TheNTH_VALUE()
function returns the value of thenth
row (zero-based) of the current frame orNULL
if there is nonth
row in the current frame. For example:
SELECT i, FIRST_VALUE(i) OVER (ORDER BY i), LAST_VALUE(i) OVER (ORDER BY i), NTH_VALUE(i, 1) OVER (ORDER BY i) FROM test; i #2 #3 #4 ------------------------------------------------------------------------------ 1 1 1 null 2 1 2 2 3 1 3 2 4 1 4 2 5 1 5 2 Selected records: 5
If we specify an unbounded window the value of
LAST_VALUE()
will be the last value in the entire window:
SELECT i, FIRST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LAST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), NTH_VALUE(i, 1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test; i #2 #3 #4 ------------------------------------------------------------------------------ 1 1 5 2 2 1 5 2 3 1 5 2 4 1 5 2 5 1 5 2 Selected records: 5
And if we specify a bounded window the value of
LAST_VALUE()
will be the last value in the specified frame (range of rows):
SELECT i, FIRST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), LAST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), NTH_VALUE(i, 2) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM test; i #2 #3 #4 ------------------------------------------------------------------------------ 1 1 2 null 2 1 3 3 3 2 4 4 4 3 5 5 5 4 5 null Selected records: 5
To demonstrate ranking window functions we will use another simple table and dataset created by the following script:
CREATE TABLE t(v string); INSERT INTO t(v) VALUES ('a'),('a'),('a'),('b'),('c'),('c'),('d'),('e');
The
ROW_NUMBER()
function assigns unique numbers to each row within the partition. For example:
SELECT v, ROW_NUMBER() OVER(ORDER BY v) FROM t; v #2 ------------------------------------------------------------------------------ a 1 a 2 a 3 b 4 c 5 c 6 d 7 e 8 Selected records: 8
The
RANK()
function behaves likeROW_NUMBER()
except that “equal” rows are ranked the same. For example:
SELECT v, RANK() OVER(ORDER BY v) FROM t; v #2 ------------------------------------------------------------------------------ a 1 a 1 a 1 b 4 c 5 c 5 d 7 e 8 Selected records: 8
The
DENSE_RANK()
function calculates a rank with no gaps, i.e. “dense”. For example:
SELECT v, DENSE_RANK() OVER(ORDER BY v) FROM t; v #2 ------------------------------------------------------------------------------ a 1 a 1 a 1 b 2 c 3 c 3 d 4 e 5 Selected records: 8
The
PERCENT_RANK()
function calculates the relative rank of the current row to others in the partition. This function is used in statistical analysis to determine the percentile for a group of values. The actual formula used to calculate thepercent_rank
is:
PERCENT_RANK() = ( RANK() – 1 ) / ( TotalRows - 1 )
To demonstrate this we create yet another table with the following script:
CREATE TABLE salaries(DepartmentID int, Salary int); INSERT INTO salaries VALUES (1, 15000), (1, 18000), (1, 23000), (1, 23000), (1, 25000);
The
RANK()
andPERCENT_RANK()
functions are applied to theSalary
column in the following query:
SELECT DepartmentID, Salary, ROW_NUMBER() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS RowNumber, RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS RowRank, PERCENT_RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS PercentRank FROM salaries; DepartmentID Salary RowNumber RowRank PercentRank ------------------------------------------------------------------------------ 1 15000 1 1 0 1 18000 2 2 0.25 1 23000 3 3 0.5 1 23000 4 3 0.5 1 25000 5 5 1 Selected records: 5
Here we have 4 different
Salary
values in 5 rows. The values returned byPERCENT_RANK()
are computed as follows:
1. RANK = 1 so PERCENT_RANK = ( 1 – 1 ) / 4 = 0 2. RANK = 2 so PERCENT_RANK = ( 2 – 1 ) / 4 = 0.25 3. RANK = 3 so PERCENT_RANK = ( 3 – 1 ) / 4 = 0.5 4. RANK = 3 so PERCENT_RANK = ( 3 – 1 ) / 4 = 0.5 5. RANK = 5 so PERCENT_RANK = ( 5 – 1 ) / 4 = 1
The
NTILE()
function distributes result set rows into a specified number of groups. To demonstrate we will create yet another table with the following script:
CREATE TABLE students(StudentID char(2), Marks integer); INSERT INTO students VALUES('S1', 75); INSERT INTO students VALUES('S2', 83); INSERT INTO students VALUES('S3', 91); INSERT INTO students VALUES('S4', 83); INSERT INTO students VALUES('S5', 93);
We can apply the
NTILE()
function to this table with a script like the following:
SELECT StudentID, Marks, NTILE(2) OVER(ORDER BY Marks ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM students; StudentID Marks #3 ------------------------------------------------------------------------------ S1 75 1 S2 83 1 S4 83 1 S3 91 2 S5 93 2 Selected records: 5
Note that the group number
1
has three rows while group2
has two simply because there is an uneven number of result rows.