
What is Vectorization in Hive?

Vectorization in Hive is a feature (available from Hive 0.13.0) that, when enabled, reads a block of 1024 rows at a time instead of one row at a time. This reduces CPU usage for operations such as scans, filters, joins, and aggregations.

Note that vectorization requires the data to be stored in ORC format. (Newer Hive releases also vectorize reads of some other formats, such as Parquet.)
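If a table is not already stored as ORC, one way to get an ORC copy is a CREATE TABLE ... AS SELECT. The sketch below is only an illustration: foodmart.product_orc is a hypothetical table name, and it assumes the foodmart.product table used later in this post.

```sql
-- Check the current storage format (look at the InputFormat / SerDe rows).
DESCRIBE FORMATTED foodmart.product;

-- Assumption: foodmart.product_orc is a hypothetical name for the ORC copy.
CREATE TABLE foodmart.product_orc
STORED AS ORC
AS SELECT * FROM foodmart.product;
```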

How to Enable Vectorized Execution?

To enable vectorized execution:

set hive.vectorized.execution.enabled = true;

To disable vectorized execution:

set hive.vectorized.execution.enabled = false;
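To see which value is in effect for the current session, the same set command can be issued without a value. This is a small sketch; the exact output layout may differ between the Hive CLI and Beeline.

```sql
-- Prints the property together with its current value for this session.
set hive.vectorized.execution.enabled;
```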

Difference Between Vectorized and Non-Vectorized Queries

I have a Product table with 1560 rows, and I want to know how many products have Washington in their name.

Non-Vectorized Query

set hive.vectorized.execution.enabled = false;
select count(*) from foodmart.product 
where 
	product.product_name like "%Washington%";

In the image below, notice that INPUT_RECORDS_PROCESSED is 1560.

Figure: Non-Vectorized Query DAG. The non-vectorized Hive query shows all 1560 records being read.

Vectorized Query

set hive.vectorized.execution.enabled = true;
select count(*) from foodmart.product 
where 
	product.product_name like "%Washington%";

In the image below, INPUT_RECORDS_PROCESSED is only 2. With vectorization enabled, Hive processes rows in blocks of up to 1024 rather than one at a time, so the counter reflects blocks instead of individual rows. The 1560 rows need two blocks: one full block of 1024 rows and a second block with the remaining 536 rows. In other words, 1560/1024 ≈ 1.52, rounded up to the next whole number, gives 2.
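The block arithmetic can be checked directly in Hive. This is just a sketch of the calculation with the numbers from this example (1560 rows, blocks of 1024); the column aliases are made up for illustration.

```sql
-- batches: number of blocks needed, rounded up to a whole number (2).
-- rows_in_last_batch: rows left over for the final, partial block (536).
select ceil(1560 / 1024) as batches,
       1560 % 1024 as rows_in_last_batch;
```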

Figure: Vectorized Query DAG. The vectorized Hive query shows only two records being read.
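Besides the counters, the query plan is another place to confirm that vectorization kicked in. As a rough sketch: running EXPLAIN on the same query should, in the Hive versions I am aware of, include a line such as "Execution mode: vectorized" under the map-side work when the query is vectorized. Treat the exact wording of the plan output as an assumption, since it varies by version and execution engine.

```sql
-- Inspect the plan for the same query with vectorization enabled.
set hive.vectorized.execution.enabled = true;

EXPLAIN
select count(*) from foodmart.product
where
	product.product_name like "%Washington%";
```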
Published in Hadoop, hive
