Skip to content

What is Predicate PushDown in Hive?

Predicate Pushdown in hive is a feature to Push your predicate ( where condition) further up in the query. It tries to execute the expression as early as possible in plan. 

Let’s try to understand this by example. let’s consider we have two tables, product and sales and we want to answer following question. 

How many products of brand Washington has been sold so far?

Non-Optimized Query

Following query will answer the above question. However, if you are familiar with sql you will realize that above query is not optimized.  It applies first joins the two table and then  applies the condition (predicate).

select sum(s.unit_sales) from foodmart.product p 
join 
	foodmart.sales_fact_dec_1998 s 
on 
	p.product_id = s.product_id
where 
	p.brand_name = "Washington"

Optimized Query

We could easily optimize this above query by applying condition first on product table and then joining it to sales table as mentioned below.

SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN (
	SELECT product_id, brand_name
	FROM foodmart.product
	WHERE 
		brand_name = "Washington"
	) p
ON 
	p.product_id = s.product_id

This is what PPD (predicate pushdown) does internally.  if you have ppd enabled your first query will automatically be converted to a second optimized query.

Let’s see this in action.  The product table has total 1560 rows (product) with only 11 products with the brand name Washington.

For better understanding, I have disabled the vectorization.  If you are not sure what vectorization is, please read the following blog post – What is vectorization? 

Running Query with PPD Disabled

Following is the DAG of the first query with PPD disabled. 
Please set the following parameter to false, to disable the PPD.

set hive.optimize.ppd=false;

if you notice, it’s reading all rows from the product table and then passing it to the reducer for join. 

DAG when PPD (predicate pushdown) is disabled.
DAG for first query when PPD is disabled

Running Query with PPD Enabled.

And Following is the DAG of the same query with PPD Enabled.
Please set the following parameter to true, to enable the PPD.

set hive.optimize.ppd=true;

Once, we enable the PPD, it first applies the condition on product table and sends only 11 rows to the reducer for join.

DAG when PPD (predicate pushdown) is enabled.
DAG for first query when PPD is enabled

Published inHadoophive

2 Comments

  1. Harshul Harshul

    Hi,Can you please tell me how can i view the same DAG in my hive?

  2. shyam kumar shyam kumar

    Why Duration is high when applied PPD?

Leave a Reply

Your email address will not be published. Required fields are marked *