A “Window function” is a variation on an aggregation function. Where an aggregation function, like
SUM()andAVG(), takesninputs and returns a single value, a window function returnsnvalues.For example, consider the simple table defined as:
CREATE TABLE test(i int); INSERT INTO test VALUES(1),(2),(3),(4),(5);
The
SUM()andAVG()aggregate functions return the following values:
SELECT SUM(i), AVG(i) FROM test; #1 #2 ------------------------------------------------------------------------------ 15 3 Selected records: 1
In an SQL
SELECTstatement, theOVERkeyword indicates that the preceding function is to be applied to awindowrepresenting a group of rows relative to the current row in the result set of the query. For example, the window functionsSUM()andAVG()return the following values:
SELECT i, SUM(i) OVER(), AVG(i) OVER() FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 1 1 2 3 1.5 3 6 2 4 10 2.5 5 15 3 Selected records: 5
The output of a window function depends on a number of input values (function arguments, computed values from other rows in the frame, etc.), so they don’t include functions that work element-wise, like + or
ROUND(). Window functions include variations on aggregate functions, likeSUM(),AVG(),MIN()andMAX(), functions for ranking and ordering, likeRANK(), and functions for taking offsets, likeLEAD()andLAG().
Note that these SQL window functions are not to be confused with the “Window_Agg” functions in the Vector-based Analytics Library. The “Window_Agg” functions operate only on sequences (i.e. fields of type sequence) where the “window” is specified as a fixed number of elements (i.e. interval) to perform computations on.
Window function syntax
A window function call always contains an
OVERclause directly following the window function's name and argument(s). This is what syntactically distinguishes it from a regular function or aggregate function. TheOVERclause determines exactly how the rows of the query are to be split up for processing by the window function. APARTITION BYlist within theOVERclause can be used for dividing the rows into groups, or partitions that share the same values specified in thePARTITION BYexpression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.For a given window we can specify:
- the order of the rows in the window (the
ORDER BYclause)- a subset of the table data to operate on (the
PARTITION BYclause)- the window defined by the previous two clauses can be further narrowed down into frames of rows by use of the
ROWS BETWEENclause which can be:
- bounded – eg.
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW- partially bounded – eg.
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING- unbounded – eg.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING- “logical”
framescan also be defined through theRANGE BETWEENoption (see “Specifying window frames” below)The following window functions are supported. Note that those below “for ordered rows” require an
ORDER BYclause in theOVERspecification:
- aggregate functions
count– the cumulative row countrow_number– the row numbermin– the minimum valuemax– the maximum valuesum– the cumulative sum of this and preceding rowsavg– the average of this row and all previous rowsmedian- the median of this row and all previous rows
- for ordered rows
first_value– the first valuelast_value– the last valuenth_value– the specified nth valuelead– access a row at a specified offset after the current rowlag– access a row at a specified offset before the current rowrank– the rank of the value in this row (there can be gaps in the sequence of ranks)dense_rank– the rank of the value in this row (with no gaps in the sequence of ranks)percent_rank– used for computing the percentile of this rowntile- distributes the rows into a specified number of groupsSpecifying window frames
The window frame for each row of the result set is determined by specifying
BETWEENbounds. TheseBETWEENbounds can be specified by eitherROWSorRANGEoptions. TheROWSoption defines a fixed number of rows preceding and following the current row; and the rows that will comprise the frame will be determined by theORDER BYclause. If no bounds are explicitly specified, the default frame is from the first row (UNBOUNDED PRECEDING) to the current row.
Note that the
BETWEENbounds must be specified with explicit constant values, not with parameters. For example, the following statement is not valid:SELECT stamp, SUM(price*quantity) OVER w AS pq1, SUM(quantity) OVER w AS vol, pq1/vol AS VWAP FROM Quotes WINDOW w AS (PARTITION BY pkid ORDER BY stamp RANGE BETWEEN ? PRECEDING AND 0 FOLLOWING)Here the parameter substitution "
BETWEEN ? PRECEDING" is not allowed.Instead the following statement using constant value
5is valid:SELECT stamp, SUM(price*quantity) OVER w AS pq1, SUM(quantity) OVER w AS vol, pq1/vol AS VWAP FROM Quotes WINDOW w AS (PARTITION BY pkid ORDER BY stamp RANGE BETWEEN 5 PRECEDING AND 0 FOLLOWING)If the
RANGEoption is used, then all rows within the frame that have the sameORDER BYvalue as the current row will be included in that row’s function computation. To illustrate the difference betweenphysicalROWS BETWEENframes andlogicalRANGE BETWEENframes, we will add a column to the example table test and insert values as follows:
ALTER TABLE test ADD val int; UPDATE test SET val=100 WHERE i=1; UPDATE test SET val=200 WHERE i=2; UPDATE test SET val=200 WHERE i=3; UPDATE test SET val=200 WHERE i=4; UPDATE test SET val=300 WHERE i=5; SELECT * FROM test; i val ------------------------------------------------------------------------------ 1 100 2 200 3 200 4 200 5 300 Selected records: 5
Now to compute the sum over the physical frame of
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(the default frame) we could perform a query like the following:
SELECT i, val, SUM(val) OVER(ORDER BY val ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Row_Sum FROM test; i val Row_Sum ------------------------------------------------------------------------------ 1 100 100 2 200 300 3 200 500 4 200 700 5 300 1000 Selected records: 5
Note that
SUM(val)is calculated as the sum of the value val for the current row plus its value for all preceding rows. To compute the sum over the logical frameRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWwe could perform a query like the following:
SELECT i, val, SUM(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Range_Sum FROM test; i val Range_Sum ------------------------------------------------------------------------------ 1 100 100 2 200 700 3 200 700 4 200 700 5 300 1000 Selected records: 5
Note that the
ORDER BYvalue of column val is the same for rows 2, 3 and 4; soSUM(val)here is calculated for these rows as100 + 200 + 200 + 200 = 700.
Examples
The following examples use data from Washington DC’s Capital Bikeshare Program, which publishes detailed trip-level historical data on their website. Each row represents one ride. The following import script was used to create the table bikes, the index
bike_num_idxto speed up lookups bybike_num, and then insert the sample data:
CREATE TABLE bikes(duration int, start_date string, end_date string, start_st_num int, start_st string, end_st_num int, end_st string, bike_num string, member_type string); INSERT INTO bikes SELECT * from foreign table(path='2016-Q1-Trips-History-Data.csv', delimiter=',', skip=1) AS bikes; CREATE INDEX bike_num_idx ON bikes(bike_num);
A simple practical example of a Window function is calculating a running total. For example the following query creates an aggregation (
running_total) as the cumulative sum of field duration (note that this is without using aGROUP BYclause):
SELECT duration, start_date, SUM(duration) OVER (ORDER BY start_date) AS running_total FROM bikes WHERE bike_num = 'W22196' AND start_date >= '3/31/2016';
The first part of the above aggregation,
SUM(duration), looks similar to any other aggregation. Adding the keywordOVERdesignates it as a window function. So the above aggregation could be read as “take the sum of duration over the entire result set, in order ofstart_date”. Here, for brevity of output, we have specified aWHEREclause for abike_num (W22196)and a single day (3/31/2016) with a limited number of trips. The result set shows how therunning_totaldisplays the cumulative sum:
duration start_date running_total ------------------------------------------------------------------------------ 456465 3/31/2016 0:05 456465 300889 3/31/2016 14:25 757354 1001144 3/31/2016 23:51 1758498 577035 3/31/2016 8:20 2335533 Selected records: 4
The following query demonstrates the use of a
PARTITION BYclause to calculate the cumulative sum for blocks of result set rows grouped by the start terminal numberstart_st_numand ordered bystart_date:
SELECT start_st_num, duration, start_date, SUM(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_total FROM bikes WHERE bike_num = 'W22196' AND start_st_num BETWEEN 31610 and 31625; start_st_num duration start_date running_total ------------------------------------------------------------------------------ 31624 1375776 1/12/2016 18:28 1375776 31624 396755 1/21/2016 8:47 1772531 31624 988750 1/7/2016 19:38 2761281 31624 203844 2/29/2016 10:26 2965125 31624 987070 3/11/2016 16:30 3952195 31624 372239 3/11/2016 18:09 4324434 31623 602551 1/14/2016 7:02 602551 31623 691224 1/16/2016 21:03 1293775 31623 372176 1/18/2016 12:40 1665951 31623 278196 1/20/2016 13:48 1944147 31623 389189 1/7/2016 7:42 2333336 31623 387578 3/29/2016 18:44 2720914 31622 382357 1/18/2016 8:24 382357 31622 797607 1/19/2016 8:36 1179964 31622 508194 3/30/2016 9:09 1688158 31621 837364 1/13/2016 17:37 837364 31621 143947 3/1/2016 8:52 981311 31620 188661 3/27/2016 12:22 188661 31620 300889 3/31/2016 14:25 489550 31618 3336939 3/30/2016 14:09 3336939 31615 216014 1/6/2016 16:36 216014 31615 577035 3/31/2016 8:20 793049 31613 1493051 3/30/2016 10:18 1493051 Selected records: 23
This query groups the results by
start_st_num. Note that within each value ofstart_st_num, the running total sums duration across the current row and all previous rows of that window. This is the effect of thePARTITION BYclause. Also note that the optionalORDER BYclause simply orders by the designated column(s) the same way that anORDER BYclause would for theSELECTstatement, except that it treats every partition separately.
Note that you cannot use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a
GROUP BYclause.As the
SUM()function calculates the cumulative sum, similarly theCOUNT()andAVG()functions calculate the cumulative number of rows in the partition and their average. For example:
SELECT start_st_num, duration, start_date, SUM(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_total, COUNT(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_count, AVG(duration) OVER (PARTITION BY start_st_num ORDER BY start_date) AS running_avg FROM bikes WHERE bike_num = 'W22196' and start_st_num BETWEEN 31610 and 31625; start_st_num duration start_date running_total running_count running_avg ------------------------------------------------------------------------------ 31624 1375776 1/12/2016 18:28 1375776 1 1375776 31624 396755 1/21/2016 8:47 1772531 2 886265.5 31624 988750 1/7/2016 19:38 2761281 3 920427 31624 203844 2/29/2016 10:26 2965125 4 741281.25 31624 987070 3/11/2016 16:30 3952195 5 790439 31624 372239 3/11/2016 18:09 4324434 6 720739 31623 602551 1/14/2016 7:02 602551 1 602551 31623 691224 1/16/2016 21:03 1293775 2 646887.5 31623 372176 1/18/2016 12:40 1665951 3 555317 31623 278196 1/20/2016 13:48 1944147 4 486036.75 31623 389189 1/7/2016 7:42 2333336 5 466667.2 31623 387578 3/29/2016 18:44 2720914 6 453485.666666667 31622 382357 1/18/2016 8:24 382357 1 382357 31622 797607 1/19/2016 8:36 1179964 2 589982 31622 508194 3/30/2016 9:09 1688158 3 562719.333333333 31621 837364 1/13/2016 17:37 837364 1 837364 31621 143947 3/1/2016 8:52 981311 2 490655.5 31620 188661 3/27/2016 12:22 188661 1 188661 31620 300889 3/31/2016 14:25 489550 2 244775 31618 3336939 3/30/2016 14:09 3336939 1 3336939 31615 216014 1/6/2016 16:36 216014 1 216014 31615 577035 3/31/2016 8:20 793049 2 396524.5 31613 1493051 3/30/2016 10:18 1493051 1 1493051 Selected records: 23
To demonstrate the next set of functions we will use a simple table and dataset created by the following script:
CREATE TABLE test(i int); INSERT INTO test VALUES(1),(2),(3),(4),(5);
The
LAG()andLEAD()functions return the value for the specified row following or preceding the current row, respectively:
SELECT i, LAG(i,1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LEAD(i,1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 null 2 2 1 3 3 2 4 4 3 5 5 4 null Selected records: 5
The
MIN()andMAX()functions return the minimal and the maximum values respectively for the specified frames. For example, when using the default bounds (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), theMIN()andMAX()are determined by the ever increasing frame. For example:
SELECT i, MIN(i) OVER (ORDER BY i), MAX(i) OVER (ORDER BY i) FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 Selected records: 5
But when an unbounded frame is specified (
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) the minimum and maximum values are for the entire window:
SELECT i, MIN(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), MAX(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test; i #2 #3 ------------------------------------------------------------------------------ 1 1 5 2 1 5 3 1 5 4 1 5 5 1 5 Selected records: 5
The functions
LAST_VALUE(),FIRST_VALUE(), andNTH_VALUE()with no boundaries specified and ordering by the fieldiwill produce 5 frames and the value ofLAST_VALUE()will be the value ofifor the current row. This is because the default frame for window functions isROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. So by not specifying a frame explicitly,FIRST_VALUE()returns the value of the first row in the table andLAST_VALUE()returns the value in the current row. TheNTH_VALUE()function returns the value of thenthrow (zero-based) of the current frame orNULLif there is nonthrow in the current frame. For example:
SELECT i, FIRST_VALUE(i) OVER (ORDER BY i), LAST_VALUE(i) OVER (ORDER BY i), NTH_VALUE(i, 1) OVER (ORDER BY i) FROM test; i #2 #3 #4 ------------------------------------------------------------------------------ 1 1 1 null 2 1 2 2 3 1 3 2 4 1 4 2 5 1 5 2 Selected records: 5
If we specify an unbounded window the value of
LAST_VALUE()will be the last value in the entire window:
SELECT i, FIRST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LAST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), NTH_VALUE(i, 1) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test; i #2 #3 #4 ------------------------------------------------------------------------------ 1 1 5 2 2 1 5 2 3 1 5 2 4 1 5 2 5 1 5 2 Selected records: 5
And if we specify a bounded window the value of
LAST_VALUE()will be the last value in the specified frame (range of rows):
SELECT i, FIRST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), LAST_VALUE(i) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), NTH_VALUE(i, 2) OVER (ORDER BY i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM test; i #2 #3 #4 ------------------------------------------------------------------------------ 1 1 2 null 2 1 3 3 3 2 4 4 4 3 5 5 5 4 5 null Selected records: 5
To demonstrate ranking window functions we will use another simple table and dataset created by the following script:
CREATE TABLE t(v string); INSERT INTO t(v) VALUES ('a'),('a'),('a'),('b'),('c'),('c'),('d'),('e');
The
ROW_NUMBER()function assigns unique numbers to each row within the partition. For example:
SELECT v, ROW_NUMBER() OVER(ORDER BY v) FROM t; v #2 ------------------------------------------------------------------------------ a 1 a 2 a 3 b 4 c 5 c 6 d 7 e 8 Selected records: 8
The
RANK()function behaves likeROW_NUMBER()except that “equal” rows are ranked the same. For example:
SELECT v, RANK() OVER(ORDER BY v) FROM t; v #2 ------------------------------------------------------------------------------ a 1 a 1 a 1 b 4 c 5 c 5 d 7 e 8 Selected records: 8
The
DENSE_RANK()function calculates a rank with no gaps, i.e. “dense”. For example:
SELECT v, DENSE_RANK() OVER(ORDER BY v) FROM t; v #2 ------------------------------------------------------------------------------ a 1 a 1 a 1 b 2 c 3 c 3 d 4 e 5 Selected records: 8
The
PERCENT_RANK()function calculates the relative rank of the current row to others in the partition. This function is used in statistical analysis to determine the percentile for a group of values. The actual formula used to calculate thepercent_rankis:
PERCENT_RANK() = ( RANK() – 1 ) / ( TotalRows - 1 )
To demonstrate this we create yet another table with the following script:
CREATE TABLE salaries(DepartmentID int, Salary int); INSERT INTO salaries VALUES (1, 15000), (1, 18000), (1, 23000), (1, 23000), (1, 25000);
The
RANK()andPERCENT_RANK()functions are applied to theSalarycolumn in the following query:
SELECT DepartmentID, Salary, ROW_NUMBER() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS RowNumber, RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS RowRank, PERCENT_RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS PercentRank FROM salaries; DepartmentID Salary RowNumber RowRank PercentRank ------------------------------------------------------------------------------ 1 15000 1 1 0 1 18000 2 2 0.25 1 23000 3 3 0.5 1 23000 4 3 0.5 1 25000 5 5 1 Selected records: 5
Here we have 4 different
Salaryvalues in 5 rows. The values returned byPERCENT_RANK()are computed as follows:
1. RANK = 1 so PERCENT_RANK = ( 1 – 1 ) / 4 = 0 2. RANK = 2 so PERCENT_RANK = ( 2 – 1 ) / 4 = 0.25 3. RANK = 3 so PERCENT_RANK = ( 3 – 1 ) / 4 = 0.5 4. RANK = 3 so PERCENT_RANK = ( 3 – 1 ) / 4 = 0.5 5. RANK = 5 so PERCENT_RANK = ( 5 – 1 ) / 4 = 1
The
NTILE()function distributes result set rows into a specified number of groups. To demonstrate we will create yet another table with the following script:
CREATE TABLE students(StudentID char(2), Marks integer); INSERT INTO students VALUES('S1', 75); INSERT INTO students VALUES('S2', 83); INSERT INTO students VALUES('S3', 91); INSERT INTO students VALUES('S4', 83); INSERT INTO students VALUES('S5', 93);
We can apply the
NTILE()function to this table with a script like the following:
SELECT StudentID, Marks, NTILE(2) OVER(ORDER BY Marks ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM students; StudentID Marks #3 ------------------------------------------------------------------------------ S1 75 1 S2 83 1 S4 83 1 S3 91 2 S5 93 2 Selected records: 5
Note that the group number
1has three rows while group2has two simply because there is an uneven number of result rows.