Predicate Pushdown in hive is a feature to Push your predicate ( where condition) further up in the query. It tries to execute the expression as early as possible in plan.
Let’s try to understand this by example. let’s consider we have two tables, product and sales and we want to answer following question.
How many products of brand Washington has been sold so far?
Non-Optimized Query
Following query will answer the above question. However, if you are familiar with sql you will realize that above query is not optimized. It applies first joins the two table and then applies the condition (predicate).
select sum(s.unit_sales) from foodmart.product p join foodmart.sales_fact_dec_1998 s on p.product_id = s.product_id where p.brand_name = "Washington"
Optimized Query
We could easily optimize this above query by applying condition first on product table and then joining it to sales table as mentioned below.
SELECT sum(s.unit_sales) FROM foodmart.sales_fact_dec_1998 s JOIN ( SELECT product_id, brand_name FROM foodmart.product WHERE brand_name = "Washington" ) p ON p.product_id = s.product_id
This is what PPD (predicate pushdown) does internally. if you have ppd enabled your first query will automatically be converted to a second optimized query.
Let’s see this in action. The product table has total 1560 rows (product) with only 11 products with the brand name Washington.
For better understanding, I have disabled the vectorization. If you are not sure what vectorization is, please read the following blog post – What is vectorization?
Running Query with PPD Disabled
Following is the DAG of the first query with PPD disabled.
Please set the following parameter to false, to disable the PPD.
set hive.optimize.ppd=false;
if you notice, it’s reading all rows from the product table and then passing it to the reducer for join.
Running Query with PPD Enabled.
And Following is the DAG of the same query with PPD Enabled.
Please set the following parameter to true, to enable the PPD.
set hive.optimize.ppd=true;
Once, we enable the PPD, it first applies the condition on product table and sends only 11 rows to the reducer for join.
Hi,Can you please tell me how can i view the same DAG in my hive?
Why Duration is high when applied PPD?