Peng Jing

Presto SQL Pre-validation Rules and Optimisation Tips

The newly launched Query Pre-Validation System for Presto introduces a comprehensive approach to enhancing query efficiency and saving valuable user resources. With this feature in place, every query submitted to Presto undergoes validation against a set of predefined rules. If you encounter a validation error, do not worry: below are the definitions of, and suggestions for, each validation rule.

AsteriskColumnRule

When using SQL's "*" in a scan, note that the operation returns all columns of the table by default. This can add unnecessary scan data and hurt performance. To ensure efficient data retrieval, it is recommended to specify the desired columns explicitly. To make data exploration and previewing a small dataset convenient, we allow only the simple pattern "SELECT * FROM xxx LIMIT yyy". Any other query that selects all columns will be blocked to prevent excessive scanning and to improve query performance.

For example, the query:

  • "SELECT * FROM xxx WHERE a = 1 ORDER BY c DESC LIMIT 100" should be transformed to: "SELECT col1, col3 FROM xxx WHERE a = 1 ORDER BY c DESC LIMIT 100" by explicitly specifying the desired columns (col1 and col3) instead of using "*".
  • Similarly, when using the "WITH" clause, consider modifying the query as follows: "WITH t1 as (SELECT * FROM xxx) SELECT col1, sum(col2) FROM t1 GROUP BY col1 ORDER BY col1 LIMIT 10" should be revised to: "WITH t1 as (SELECT col1, col2 FROM xxx) SELECT col1, sum(col2) FROM t1 GROUP BY col1 ORDER BY col1 LIMIT 10"

By specifying the required columns explicitly, you ensure that only the necessary data is scanned, leading to improved query performance and efficient data retrieval. Remember, while "*" can be handy for data detection and limited previews, it is recommended to specify the desired columns explicitly in all other queries to optimize resource usage and enhance overall query execution.

OrderByLimitRule

When utilizing the ORDER BY clause in a query, it is often necessary to include the LIMIT clause to restrict the number of rows returned by the query. In distributed computing scenarios, ORDER BY can introduce a significant amount of exchange (shuffle) computation, which can be expensive. By adding the LIMIT clause, the engine can terminate the query pipeline early at each independent stage. This early termination optimizes query execution and, at minimum, reduces the amount of data that needs to be computed.

In certain suitable cases, the Presto engine can optimize further by pushing down the LIMIT clause. By pushing down the LIMIT, the engine can efficiently limit the amount of resources consumed during query execution. This optimization can result in significant savings in terms of computational resources and improve the overall performance of the query.

For example, consider the query:

  • "SELECT col1, col2 FROM (SELECT ... FROM xxx) ORDER BY col1" should be changed to "SELECT col1, col2 FROM (SELECT ... FROM xxx) ORDER BY col1 LIMIT yyy".

  • Similarly, the query "SELECT col1, sum(col2) FROM (SELECT ... FROM xxx) GROUP BY col1 ORDER BY col1" should be changed to "SELECT col1, sum(col2) FROM (SELECT ... FROM xxx) GROUP BY col1 ORDER BY col1 LIMIT yyy". For such a query the LIMIT may not improve performance much on its own, but in more complicated scenarios, for example when the whole query is used as a subquery and joined with other tables, adding LIMIT tells the optimizer exactly what the maximum output of the subquery node is, which lets it choose a better plan for the whole query, as sketched below.
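
For instance, a minimal sketch of that subquery-plus-join case (the table other_table, its column col3, and the LIMIT value 1000 are illustrative assumptions, not part of the rule itself):

WITH t_limited AS (SELECT col1, sum(col2) s FROM (SELECT ... FROM xxx) GROUP BY col1 ORDER BY col1 LIMIT 1000) SELECT a.col1, a.s, b.col3 FROM t_limited a JOIN other_table b ON a.col1 = b.col1

Here the LIMIT caps t_limited at 1000 rows, so the optimizer knows the maximum size of that side of the join and can, for example, prefer broadcasting it.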

JoinConditionRule

Presto offers robust support for JOIN operations, and there are several optimizations specifically designed for JOIN queries. However, certain practices can hinder these optimizations from achieving the best possible results, particularly applying functions within the JOIN conditions. Below, F1 and F2 represent arbitrary, unspecified functions.

Consider the following query as an example:

WITH t1 AS (SELECT col... FROM tbl1), t2 AS (SELECT col... FROM tbl2) SELECT t1.col c1, t2.col c2 FROM t1 JOIN t2 ON F1(t1.col) = F2(t2.col)

The query may appear fine if the engine can partially push down the functions F1 and F2 to tables t1 and t2, respectively. However, whether a function can be pushed down depends on its type; in certain cases F1 and F2 may not be pushable, for example when another connector is involved or because of limitations in the function's semantics or in the optimizers. The complexity increases when the query becomes more intricate, such as:

WITH t(c1, c2) AS ( SELECT F1(c1), c2 FROM ( WITH t1 AS (SELECT col... FROM tbl1), t2 AS (SELECT col... FROM tbl2) SELECT t1.col c1, t2.col c2 FROM t1 JOIN t2 ON F1(t1.col) = F2(t2.col) ) ) SELECT c1, SUM(c2) FROM t GROUP BY c1

In this case, the F1 applied in the definition of table t cannot be pushed down, because the engine treats the F1 in t and the F1 in t1 as two distinct function applications. The recommended approach is to rewrite the query as follows:

WITH t(c1, c2) AS ( SELECT c1, c2 FROM ( WITH t1 AS (SELECT F1(col) col... FROM tbl1), t2 AS (SELECT col, F2(col) col2... FROM tbl2) SELECT t1.col c1, t2.col c2 FROM t1 JOIN t2 ON t1.col = t2.col2 ) ) SELECT c1, SUM(c2) FROM t GROUP BY c1

By doing this, the functions F1 and F2 participate only in the scan stage, and the JOIN and subsequent calculations operate directly on the already-computed data in memory.

Furthermore, it's important to consider the computational aspects. Generally, the more complex an operator is, the more time it takes. The ideal scenario is for each operator in a pipeline to take roughly the same amount of time on average. JOIN operators tend to be relatively intricate already, so applying functions within the JOIN conditions (even though a projection node is usually generated for them) can cause the JOIN operator to become bloated and execute more slowly than the other operators. This can potentially increase the scheduling and memory costs of the query, as additional memory is reserved for the calculations.

LongInListRule

Using the IN clause can provide convenience when filtering data in a query. When the list of values in the IN clause is relatively small, the In List operation tends to be faster. In this case, Presto can directly traverse the values in the comparison expression, optimizing the query execution.

However, when the number of values in the IN clause is relatively large, Presto tends to transform the execution into a semi join, similar to a JOIN. At that point many optimizer rules can be applied to the query plan and execution; for example, the query is likely to be optimized into a broadcast join, and filters may also be pushed down.

For example, consider the following query:

SELECT c1, c2 ... FROM xxx WHERE c1 IN (a, b, c, ...)[very large list]

It is recommended to define the logic that produces the list as an intermediate result using a WITH clause or a subquery. By doing so, you can leverage various optimization points that would otherwise be missed.

The revised query could look like this:

SELECT c1, c2 ... FROM xxx WHERE c1 IN (logic that the IN list value comes from)
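
As a concrete sketch, assuming the IN-list values originally came from some source table (id_source, its column id, and the grass_date filter below are hypothetical names used only for illustration):

WITH id_list AS (SELECT id FROM id_source WHERE grass_date = date '2023-01-01') SELECT c1, c2 ... FROM xxx WHERE c1 IN (SELECT id FROM id_list)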

By defining the logic that generates the list values separately and then using it in the query, you enable the Presto engine to apply optimization techniques such as filter pushdown, resulting in improved query performance. This approach is particularly beneficial when dealing with large IN lists, as it allows for better query optimization and efficient execution. One more step after you update the SQL, if you have the chance, is to run ANALYZE on the tables that the new logic references so that their statistics are up to date.

MaxJoinCountRule

The JOIN clause in a query is considered a computationally heavy operation. Therefore, it is generally not recommended to use it excessively within a single query. The frequent use of JOIN can significantly increase the complexity of the query and potentially impact its performance.

To mitigate the impact of JOIN and reduce the overall complexity of a query, it is advisable to employ certain optimization techniques. One approach is to perform calculations in advance wherever possible: by computing intermediate results or transforming data before performing JOIN operations, you can simplify the subsequent JOIN clauses and potentially improve query performance, as in the sketch below. The most thorough fix, however, is to re-analyze your business logic and reduce the number of JOINs within a single query.
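
For example, a minimal sketch of pre-computing before a join (orders and users are hypothetical tables used purely for illustration): instead of joining the raw tables and then aggregating, aggregate one side first so the JOIN processes far fewer rows:

WITH order_totals AS (SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id) SELECT u.region, SUM(t.total) FROM order_totals t JOIN users u ON t.user_id = u.id GROUP BY u.region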

MaxConsecutiveJoinCountRule

The complexity of performing consecutive JOIN operations grows exponentially as the number of tables involved increases (in both planning and execution). Therefore, we need to minimize the use of consecutive JOINs whenever possible. In cases where the business requirements necessitate complex JOINs, the first things you need to consider are:

  • Do I really need so many consecutive joins? Is this the right way to go?
  • Are there any alternative tables that could reduce the complexity?
  • Am I writing the correct SQL?

If the answer to all of these is yes, then an alternative approach to consider is splitting the query into multiple stages, utilizing intermediate tables.

For instance, let's take an example query:

SELECT c1, c2 ... FROM t1 JOIN t2 ON t1.c1 = t2.c1 JOIN t3 ...

To mitigate the challenges posed by consecutive JOIN complexity, the following steps can be taken:

  • Create an intermediate table t using CTAS or CREATE + INSERT INTO:
    • CREATE TABLE t AS SELECT c1, c2 ... FROM t1 JOIN t2 ON t1.c1 = t2.c1
  • Update the query by leveraging the intermediate table:
    • SELECT ... FROM t JOIN t3 ... JOIN t4 ...

And do remember to keep all the filters while creating the intermediate table, even though the example above does not show them; a sketch with a filter follows below.
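
For instance, assuming the original query filtered t1 on a hypothetical partition key grass_date, the intermediate-table step should carry that filter along:

CREATE TABLE t AS SELECT c1, c2 ... FROM t1 JOIN t2 ON t1.c1 = t2.c1 WHERE t1.grass_date = date '2022-02-02'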

By splitting the complex consecutive JOIN operation into stages, with the help of an intermediate table, you can mitigate the exponential increase in complexity and optimize the execution of the query. This approach allows for a more manageable and efficient way to handle complex JOIN operations in cases where business requirements demand it.
Note that simply defining the intermediate result with a WITH clause inside the same SQL statement does not reduce the execution complexity of the query.

PartitionCountAndInputRule

PartitionCountAndInputRule is a guideline that helps ensure optimal performance when working with partitioned tables. If a validation error related to partition count and data amount occurs, it indicates that the current query might be causing performance issues. In such cases, it is recommended to add a partition filter (for example, pk = value) to efficiently filter the data and improve query performance.

Partitioning is a technique used to divide a large table into smaller, more manageable partitions based on certain criteria, such as a range of values or specific conditions. Each partition can then be accessed and processed independently, which can significantly enhance query performance. Because of this, you should naturally filter on as many partition keys as you can in any query.

Some tables are really large, especially fact tables. If the number of partitions or the amount of data within each partition becomes too large, it can negatively impact performance; sometimes such a scan cannot even finish in a reasonable time. The PartitionCountAndInputRule advises against excessive partition counts and data sizes to avoid such issues.

When encountering a validation error related to partition count and data amount, it's important to evaluate the current partitioning scheme. You can use SHOW CREATE TABLE to get the table definition, and then consider adding a partition filter, e.g. pk = value or pk BETWEEN min AND max, so that data is retrieved only from specific partitions, effectively reducing the amount of data that needs to be processed.
Note, usually the table has a partition key called grass_date, so you may consider narrowing down the search time range on this partition key, as in the sketch below.
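
For example, a minimal sketch (the table name some_fact_table and the date range are illustrative assumptions):

SHOW CREATE TABLE some_fact_table
SELECT c1, c2 FROM some_fact_table WHERE grass_date BETWEEN date '2023-01-01' AND date '2023-01-07'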

FilterApplyAllPartitionRule

This rule is similar to the PartitionCountAndInputRule above but stricter. By applying filters on all partition keys, you can narrow down the dataset and significantly improve query performance. This rule is based on the expectation that a partitioned table must have filters on all of its partition keys when it is queried.

Suppose we have a table named "t" that is partitioned based on the keys pk1, pk2, and pk3. When querying this partitioned table, it is recommended to include filters on all partition keys to ensure efficient data retrieval. For example, consider the query:

SELECT c1, c2, c3... FROM t WHERE pk1 = date '2022-02-02' AND pk2 > 2 AND pk3 IN ('SG', 'TW', ...)

In this query, the filters are applied to all three partition keys (pk1, pk2, pk3). By doing so, the query optimizer can determine the specific partitions that satisfy the given filter conditions, allowing it to access only the relevant partitions and retrieve the necessary data. This approach reduces the amount of data that needs to be scanned, resulting in improved query performance.

Applying filters on all the partition keys is particularly useful when working with large partitioned tables, as it helps to minimize the amount of data processed and improves overall query efficiency.

NonPartitionInputRule

In general, non-partitioned tables should not be excessively large, as loading them into memory can adversely affect query performance because such scans usually cannot be computed in parallel. Therefore, it is recommended to minimize the usage of large non-partitioned tables in queries. If an error related to this rule occurs, you might consider the following options:

  1. Modifying the table's schema to transform it into a partitioned table: By reorganizing the data into partitions based on specific criteria (such as date ranges or categories), you can optimize query performance by limiting the amount of data loaded into memory during queries (see the sketch after this list).
  2. Creating aggregated(dimension) tables for querying instead of directly querying the non-partitioned table: Aggregated tables store pre-computed results or summaries that are derived from the original data. By querying these aggregated tables instead of the non-partitioned table, you can achieve faster query performance, as the data is already processed and optimized for specific analytical tasks.
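
As an illustration of option 1, here is a minimal sketch using the Hive connector's partitioned_by table property (the names sales, sales_partitioned, and grass_date are hypothetical, and the exact property depends on the connector you use):

CREATE TABLE sales_partitioned WITH (partitioned_by = ARRAY['grass_date']) AS SELECT c1, c2, grass_date FROM sales

Note that in the Hive connector the partition columns must appear last in the column list, which is why grass_date is the final column selected.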