June 23rd 2023

Windowing functions are a powerful feature of SQL that allow you to perform calculations over a group of rows, such as running totals, moving averages, rankings, percentiles, and more. Unlike aggregate functions, which return a single value for a group of rows, Windowing Functions return a single value for each Row in the group, while preserving the original row order and structure.

Unveiling the Secrets of Seamless Clo…
ØªØ¹Ù„Ù… Ø£Ø³Ø§Ø³ÙŠØ§Øª Ø§Ù„Ù‚Ø±Ø§Ø¡Ø…
best projectors for home
Konzumacija jaja moÅ¾e pomoÄ‡i u prev…
Fully Electric Vehicles vs. Plug-in H…

Before diving in, let’s clarify some terms:

A window is a group of related rows with shared common traits. Think of it as a timeframe, like rows from the same month or rows from a specific city.
A partition is a group of rows within a window that have the same value for a specified expression. For instance, if we group rows by customer ID, each partition contains all the rows for that customer.
An order sorts the rows within a partition or a window based on a specified expression, like sorting rows by date within each partition or window.
A frame is a group of adjacent rows within a partition or a window related to the current row based on the specified order. If we define a frame as the current row and the previous two rows, then each row in the partition or window has a frame of three rows.

Windowing functions are amazing for data analysis because they let you perform complex calculations over a group of rows without joining multiple tables or using subqueries. For example, you can use windowing functions to:

Calculate running totals or cumulative sums over time or categories.
Calculate moving averages or standard deviations over a fixed or variable number of rows.
Rank rows based on custom criteria and groupings, like sales performance or product popularity.
Calculate percentiles or quartiles over a distribution of values.
Compare values across different partitions or windows, like year-over-year or month-over-month changes.

In this article, we will explore how to use windowing functions in Snowflake, one of the leading cloud data platforms. We will cover the syntax and types of Snowflake windowing functions supported, including navigation functions(we will go deeper into navigation functions), numbering functions, and analytic functions.

We will also show how to use a low code wizard tool (such as Datameer) to perform windowing functions without writing SQL code. By the end of this article, you will be able to leverage Snowflake windowing functions easily and effectively in your data analysis projects.

Snowflake Windowing Functions

Let’s dive into using windowing functions in Snowflake, a fantastic cloud data platform. Snowflake supports various windowing functions, which we can group into three categories: navigation functions, numbering functions, and analytic functions. We’ll go through the syntax and usage of each function type, complete with examples, code snippets, and results.

Snowflake Windowing Functions: Navigation Functions

Navigation functions are windowing functions that return a value based on a specific location criteria within the window. For instance, you can use these functions to get the first or last value in a window or the value of a row a certain number of rows before or after the current row. Snowflake supports these navigation functions:

FIRST_VALUE: Returns the value for the first row in the current window frame, with optional null value handling.
LAST_VALUE: Returns the value for the last row in the current window frame, with optional null value handling.
NTH_VALUE: Returns the value for the Nth row of the current window frame, with optional counting direction and null value handling.
LEAD: Returns the value of an expression on the row after the current one, with optional offset, default value, and null value handling.
LAG: Returns the value of an expression on the row before the current one, with optional offset, default value, and null value handling.

Here’s the syntax for navigation functions:

function> ( expression> [ , offset> , default> ] ) [ { IGNORE | RESPECT } NULLS ] OVER ( [ PARTITION BY partition_expression> ] ORDER BY order_expression> [ { ASC | DESC } ] [ window_frame> ] )

Let’s break it down:

‘‘ is one of FIRST_VALUE, LAST_VALUE, NTH_VALUE, LEAD, or LAG.
‘’ is any valid SQL expression that evaluates to a scalar value.
‘’ is an optional integer argument specifying how many rows before or after the current row to return. The default is 1.
‘’ is an optional argument specifying what value to return if there’s no such row as specified by the offset. The default is NULL.
‘{ IGNORE | RESPECT } NULLS’ is an optional clause specifying whether to ignore or respect null values in the expression. The default is RESPECT NULLS.
‘OVER’ is a mandatory clause specifying how to define the window for the function.
‘PARTITION BY ’ is an optional clause specifying how to divide rows into partitions based on an expression value. If omitted, the entire table is treated as a single partition.
‘ORDER BY [ { ASC | DESC } ]’ is a mandatory clause specifying how to sort rows within each partition based on an expression value. You can optionally specify the sort order as ascending or descending. The default is ascending.
‘’ is an optional clause specifying how to define the window frame for the function. The window frame can be cumulative or sliding. A cumulative window frame includes all rows from the partition start up to the current row. A sliding window frame includes a subset of rows around the current row based on a specified range or number of rows. We’ll cover the syntax and options for the window frame clause later.

Let’s go through some examples of navigation functions using a sample dataset of sales transactions containing these columns: date, customer_id, product_id, quantity, and price.

Example 1: FIRST_VALUE

The following query returns the first product purchased by each customer in each month, along with the date and price of the purchase.

SELECT DATE_TRUNC('month', date) AS month, customer_id, FIRST_VALUE(product_id) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) FROM Table1;

The result is:

month	customer_id	first_product	first_date	first_price
2023-01	1	101	2023-01-02	50.00
2023-01	2	102	2023-01-05	40.00
2023-02	1	103	2023-02-01	30.00
2023-02	3	101	2023-02-10	50.00

In this example, we use the DATE_TRUNC function to truncate the date column to the month level and use it as a partition expression. We also use the date column as an order expression to sort the rows within each partition chronologically. We then use the FIRST_VALUE function to return the product_id, date, and price of the first row in each partition.

Example 2: LAST_VALUE

This example demonstrates a query that returns the last product purchased by each customer in each month, along with the date and price of the purchase.

SELECT DATE_TRUNC('month', date) AS month, customer_id, LAST_VALUE(product_id) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS last_product, LAST_VALUE(date) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS last_date, LAST_VALUE(price) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS last_price FROM sales;

The result is:

month	customer_id	last_product	last_date	last_price
2023-01	1	102	2023-01-10	40.00
2023-01	2	NULL	NULL	NULL
2023-02	1	104	2023-02-05	25.00
2023-02	3	NULL	NULL	NULL

In this example, the partition and order expressions remain the same as the previous example. However, the LAST_VALUE function is used to return the product_id, date, and price of the last row in each partition. Since the default window frame for LAST_VALUE is the current row only, we need to specify a different window frame using the ROWS BETWEEN clause. We use the UNBOUNDED FOLLOWING option to include all rows from the current row to the end of the partition in the window frame.

Example 3: NTH_VALUE

This query returns the second product purchased by each customer in each month, along with the date and price of the purchase. If there is no such product, it returns NULL.

SELECT DATE_TRUNC('month', date) AS month, customer_id, NTH_VALUE(product_id, 2) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS second_product, NTH_VALUE(date, 2) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS second_date, NTH_VALUE(price, 2) OVER ( PARTITION BY DATE_TRUNC('month', date), customer_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS second_price FROM sales;

Sample result:

month	customer_id	second_product	second_date	second_price
2023-01	1	103	2023-01-28	30.00
2023-01	2	104	2023-01-20	25.00
2023-02	1	101	2023-02-15	50.00
2023-02	3	102	2023-02-08	40.00

In this example, the partition and order expressions are the same as in the previous examples. The NTH_VALUE function is used to return the product_id, date, and price of the second row in each partition. The ROWS BETWEEN clause with the UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING options is utilized to include all rows in the partition in the window frame. The default FROM FIRST option indicates counting from the first row in the window frame.

Example 4: LEAD

This query returns the product purchased by each customer on their next purchase date, along with the date and price of the purchase. If there is no such product, it returns NULL.

SELECT date AS current_date, customer_id, product_id AS current_product, price AS current_price, LEAD(product_id) OVER ( PARTITION BY customer_id ORDER BY date ) AS next_product, LEAD(date) OVER ( PARTITION BY customer_id ORDER BY date ) AS next_date, LEAD(price) OVER ( PARTITION BY customer_id ORDER BY date ) AS next_price FROM sales;

The result is:

current_date	customer_id	current_product	current_price	next_product	next_date	next_price
2023-01-02	1	101	50.00	102	2023-01-10	40.00
2023-01-10	1	102	40.00	103	2023-01-28	30.00
2023-01-20	2	104	25.00	NULL	NULL	NULL
2023-01-05	1	104	25.00	101	2023-02-15	50.00

Here, the rows are partitioned by customer_id and ordered by date. The LEAD function is used to return the product_id, date, and price of the subsequent row in each partition. The default offset of 1 indicates the next row, and the default RESPECT NULLS option specifies to return NULL if there is no such row.

Example 5: LAG

This query returns the product purchased by each customer on their previous purchase date, along with the date and price of the purchase. If there is no such product, it returns NULL.

SELECT date AS current_date, customer_id, product_id AS current_product, price AS current_price, LAG(product_id) OVER ( PARTITION BY customer_id ORDER BY date ) AS previous_product, LAG(date) OVER ( PARTITION BY customer_id ORDER BY date ) AS previous_date, LAG(price) OVER ( PARTITION BY customer_id ORDER BY date ) AS previous_price FROM sales;

Sample result:

current_date	customer_id	current_product	current_price	previous_product	previous_date	previous_price
2023-01-02	1	101	50.00	NULL	NULL	NULL
2023-01-10	1	102	40.00	101	2023-01-02	50.00
2023-01-20	2	104	25.00	NULL	NULL	NULL
2023-01-05	1	104	25.00	103	2023-01-28	30.00

In this example, the rows are partitioned by customer_id and ordered by date. The LAG function is used to return the product_id, date, and price of the preceding row in each partition. The default offset of 1 indicates the previous row, and the default RESPECT NULLS option specifies to return NULL if there is no such row.

These examples demonstrate various Snowflake windowing functions, such as FIRST_VALUE, LAST_VALUE, NTH_VALUE, LEAD, and LAG, which can be used to analyze and process data based on specific requirements. The partition and order expressions, along with the ROWS BETWEEN clause, allow you to customize the window frame and provide flexibility in defining the range of rows considered for each function.

Snowflake Windowing Functions: Numbering Functions

Numbering functions are a type of windowing functions that assign a number to each row based on their position in the window. You can use them to rank rows according to custom criteria and groupings, like sales performance, customer satisfaction, or product popularity. Snowflake supports a variety of numbering functions:

ROW_NUMBER: This function assigns a sequential number to each row in the window, starting from 1. The row order is determined by the order expression. In case of duplicate values in the order expression, the order of rows with the same value is arbitrary.
RANK: Ranks each row in the window, starting from 1. The order expression determines the row order. Rows with duplicate values in the order expression share the same rank, and the next rank is skipped (e.g., if two rows have rank 2, the next rank is 4).
DENSE_RANK: Similar to RANK, but it doesn’t skip the next rank when there are duplicate values (e.g., if two rows have rank 2, the next rank is 3).
PERCENT_RANK: Provides a percentage rank for each row in the window, ranging from 0 to 1. The order expression determines the row order. The percentage rank is calculated as (rank — 1) / (total rows — 1). Rows with the same value in the order expression have the same percentage rank.
CUME_DIST: Calculates a cumulative distribution for each row in the window, ranging from 0 to 1. The order expression determines the row order. The cumulative distribution is calculated as (number of rows preceding or peer with current row) / (total rows). Rows with the same value in the order expression have the same cumulative distribution.
NTILE: Assigns a bucket number to each row in the window, ranging from 1 to N. The order expression determines the row order. The bucket number is calculated by dividing the rows into N roughly equal groups and assigning each group a number. If N isn’t a divisor of the total number of rows, some buckets will have one more row than others.

The syntax for numbering functions looks like this:

 ( [  ] ) OVER ( [ PARTITION BY  ] ORDER BY [ { ASC | DESC } ] )

where:

‘’ is one of ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, or NTILE.
‘’ is an optional argument that specifies a value or an expression for some functions. For NTILE, it specifies the number of buckets to divide the rows into. For other functions, it is ignored.
‘OVER’ is a mandatory clause that specifies how to define the window for the function.
‘PARTITION BY ’ is an optional clause that specifies how to divide the rows into partitions based on the value of an expression. If omitted, the entire table is treated as a single partition.
‘ORDER BY [ { ASC | DESC } ]’ is a mandatory clause that specifies how to sort the rows within each partition based on the value of an expression. You can optionally specify the sort order as ascending or descending. The default is ascending.

Snowflake Windowing Functions: Analytics Functions

Analytics functions are a type of Snowflake windowing function that perform calculations on a set of values within a window. You can use them to figure out things like running totals, cumulative sums over time or categories, moving averages, standard deviations over a specific number of rows, percentiles, quartiles, and so on. Snowflake has a bunch of analytics functions available:

SUM: Calculates the sum of an expression over the window.
AVG: Gets the average of an expression over the window.
MIN: Finds the minimum value of an expression over the window.
MAX: Identifies the maximum value of an expression over the window.
COUNT: Counts the number of rows in the window.
COUNT_IF: Counts the rows in the window that meet a specific condition.
STDDEV: Calculates the standard deviation of an expression over the window.
STDDEV_POP: Computes the population standard deviation of an expression over the window.
STDDEV_SAMP: Determines the sample standard deviation of an expression over the window.
VAR: Calculates the variance of an expression over the window.
VAR_POP: Computes the population variance of an expression over the window.
VAR_SAMP: Determines the sample variance of an expression over the window.
COVAR: Calculates the covariance of two expressions over the window.
COVAR_POP: Computes the population covariance of two expressions over the window.
COVAR_SAMP: Determines the sample covariance of two expressions over the window.
CORR: Finds the correlation coefficient of two expressions over the window.
LISTAGG: Concatenates a string of values from an expression over the window, separated by a delimiter.
PERCENTILE_CONT: Computes a continuous percentile value of an expression over the window, based on a specified fraction.
PERCENTILE_DISC: Calculates a discrete percentile value of an expression over the window, based on a specified fraction.

The syntax for analytics functions looks like this:

function> ( expression> [ , arguments> ] ) OVER ( [ PARTITION BY partition_expression> ] [ ORDER BY order_expression> [ { ASC | DESC } ] [ window_frame> ] ] )

Here, ‘’ can be any of the functions listed above. ‘’ is any valid SQL expression that results in a scalar value. For functions like COUNT and COUNT_IF, you can use * to indicate all rows in the window. ‘’ are optional and specify extra parameters for some functions (e.g., the delimiter for LISTAGG or the fraction for PERCENTILE_CONT and PERCENTILE_DISC). The ‘OVER’ clause is mandatory and defines the window for the function.

The ‘PARTITION BY ’ clause is optional and divides the rows into partitions based on the value of an expression; if omitted, the entire table is treated as a single partition. Finally, the ‘ORDER BY [ { ASC | DESC } ]’ clause is optional and sorts the rows within each partition based on the value of an expression, with an optional sort order of ascending or descending (the default is ascending). The ‘’ is an optional specification that defines the range or rows within the partition used for calculation.

Low Code Wizard Approach

Let’s now talk about leveraging Snowflake Windowing Functions using a low code wizard approach. One great low code tool for this purpose is Datameer. Low code wizard tools help users create and execute windowing functions without writing SQL code. This approach comes with a variety of benefits, such as ease of use, speed, flexibility, and accuracy. However, it’s essential to understand that low code tools may have limitations when it comes to handling more complex or custom calculations.

Using Datameer to Perform Windowing Functions

Let’s demonstrate how to use Datameer, a low code wizard tool, to perform Snowflake windowing functions on a sample dataset. Here’s a step-by-step guide to using Datameer:

Select the dataset: First, choose the dataset you want to perform windowing functions on.
Choose the windowing function: Datameer provides an intuitive interface to select the desired windowing function, such as SUM, AVG, RANK, ROW_NUMBER, and others.
Specify the partition, order, and frame: Customize the window’s partition, order, and frame according to your analysis requirements. You can easily do this through the user-friendly interface.
Execute the function: Run the windowing function on the dataset, and Datameer will generate the desired output.

Comparing Low Code Wizard Results to SQL Code

Once you’ve obtained the results from Datameer, you can compare them with the results obtained from the SQL code version. Datameer generates the SQL code behind the scenes, so you can compare it to the manual code you wrote earlier. If you find any differences or discrepancies, you can investigate the cause and resolve them accordingly.

Conclusion

In this article, we’ve explored Snowflake windowing functions and their various types, including navigation, numbering, and analytic functions. We’ve also seen how a low code wizard approach, like Datameer, can simplify the process of using windowing functions.

Some tips and best practices for using Snowflake windowing functions include choosing the appropriate windowing function for your analysis goal, avoiding unnecessary or redundant calculations, and optimizing performance and memory usage.

We encourage you to try out Datameer and experiment with different windowing functions and scenarios to see how they affect your results. Feel free to share your feedback, and don’t hesitate to explore more articles or videos on related topics.

References:
Window Functions | Snowflake Syntax and Examples | Count

https://docs.snowflake.com/en/sql-reference/functions-analytic

The post How to Leverage Snowflake Windowing functions Easily Using a Low Code Wizard Approach appeared first on Datameer.

This post first appeared on Hadoop Blog, Big Data Analytics Blog - Datameer, please read the originial post: here