Vectorization in Hive is a feature (available from Hive 0.13.0) which, when enabled, reads a block of 1024 rows at a time instead of one row at a time. This improves CPU usage for operations like scans, filters, joins, and aggregations.
Note that vectorization is only available if the data is stored in ORC format.
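For example, here is a sketch of creating an ORC-backed copy of a table so vectorized reads can apply (table and column names are illustrative):

```sql
-- Illustrative: store a table as ORC so vectorized execution can apply
CREATE TABLE product_orc (
  product_id   INT,
  product_name STRING
)
STORED AS ORC;

-- Copy data in from an existing, possibly non-ORC, table
INSERT INTO TABLE product_orc
SELECT product_id, product_name FROM product;
```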
How to Enable Vectorized Execution?
To enable vectorization:
set hive.vectorized.execution.enabled = true;
To disable vectorization:
set hive.vectorized.execution.enabled = false;
Vectorized vs. Non-Vectorized Queries
I have a product table with 1560 rows, and I want to know how many products have "Washington" in their name.
Non-Vectorized Query.
set hive.vectorized.execution.enabled = false;
select count(*) from foodmart.product
where
product.product_name like "%Washington%";
In the image below, you will notice that INPUT_RECORDS_PROCESSED is 1560.
Vectorized Query
set hive.vectorized.execution.enabled = true;
select count(*) from foodmart.product
where
product.product_name like "%Washington%";
In the image below, you will see that INPUT_RECORDS_PROCESSED is only 2. This is because, with vectorization enabled, Hive processes a block of 1024 rows at a time rather than one row at a time. Dividing 1560 by 1024 and rounding up gives two blocks: ceil(1560/1024) = 2.
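On Hive 2.3 and later, you can also ask the planner directly whether a query will be vectorized, without running it:

```sql
-- Hive 2.3+: report whether the query plan is vectorized
EXPLAIN VECTORIZATION SUMMARY
SELECT count(*) FROM foodmart.product
WHERE product_name LIKE '%Washington%';
```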
There are two types of joins: the common join, also known as a distributed join, and the mapjoin, also known as a map-side join.
Before we jump to mapjoin, let me give you an overview of the common join. A common join distributes all the rows across the nodes based on the join key. Rows with the same key value end up on the same node, and the join happens in the final reducer step.
Common joins are good when both tables are really huge. However, what if one table has 1 TB of data and the other just 10 MB? A common join would spend most of its time distributing rows.
In such cases, a mapjoin helps a lot. Rather than distributing rows from both tables, it loads the small table into memory, and every mapper reading the big table joins against that in-memory copy. This process doesn't require a reducer, hence the name mapjoin or map-side join.
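Besides the automatic conversion discussed below, a map join can also be requested explicitly with a hint (honored only when hive.ignore.mapjoin.hint is set to false; the table names here are illustrative):

```sql
-- Illustrative: explicitly hint that small_table should be held in memory
SELECT /*+ MAPJOIN(small_table) */ *
FROM big_table b JOIN small_table s ON b.id = s.id;
```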
Let’s see this with an example.
Let's consider two tables, emp and emp_location. emp is a huge table, and emp_location is small enough to fit in memory.
By default, mapjoins are enabled, so if you join the two tables above, a mapjoin is going to happen.
select * from emp join emp_location on emp.id = emp_location.id;
This is how the DAG looks for the above mapjoin.
Now let's disable the mapjoin and see what happens when we try to join the same two tables.
set hive.auto.convert.join = false;
select * from emp join emp_location on emp.id = emp_location.id;
And this is how the DAG looks for the common join, or distributed join.
Parameters Affecting Mapjoin
The following four parameters affect the mapjoin.
hive.auto.convert.join
Whether Hive enables the optimization of converting a common join into a mapjoin based on the input file size. Default value: true
hive.mapjoin.smalltable.filesize
Applicable only if the above parameter is set to true. The threshold (in bytes) on the input file size of the small table; if the file size is smaller than this threshold, Hive will try to convert the common join into a mapjoin. Default value: 25000000 (25 MB)
hive.auto.convert.join.noconditionaltask
Whether Hive enables the optimization of converting a common join into a mapjoin based on the input file size. If this parameter is on, and the sum of sizes of n-1 of the tables/partitions in an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size, the join is directly converted to a mapjoin (there is no conditional task). Default value: true
hive.auto.convert.join.noconditionaltask.size
If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of sizes of n-1 of the tables/partitions in an n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). Default value: 10000000 (10 MB)
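Putting these together, a session that tunes all four parameters might look like this (the 50 MB thresholds are purely illustrative, not recommendations):

```sql
-- Illustrative session settings: auto-convert common joins to map joins
-- whenever the small side is under ~50 MB
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 50000000;
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 50000000;
```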
Hive is an append-only database, so update and delete are not supported on Hive external and managed tables.
From Hive 0.14, a new feature called transactional tables was introduced, which gives a particular Hive table ACID properties and allows deletes and updates. But let's keep transactional tables for another post.
Here, let's discuss how to update a Hive table which is not transactional, either external or managed (an external table cannot be transactional).
Chances are, if you have tried to update an external or managed (non-transactional) Hive table, you got an error like the one below, depending on your Hive version.
select * from temp.test_update;
+-----------------+-------------------+--+
| test_update.id | test_update.name |
+-----------------+-------------------+--+
| 1 | test user 1 |
| 2 | test user 2 |
| 2 | test user 3 |
+-----------------+-------------------+--+
delete from temp.test_update where id=1;
Error: Error while compiling statement: FAILED:
SemanticException [Error 10297]: Attempt to do update or
delete on table temp.test_update that does not use an
AcidOutputFormat or is not bucketed (state=42000,code=10297)
So the question is: how do you update or delete a record in a Hive table?
Deleting Records in Hive Table
Deleting records is easy: you can use the insert overwrite syntax. Let's say we want to delete the record from the above Hive table whose name is "test user 3". We select all the records whose name is not "test user 3" and overwrite them into the same table.
insert overwrite table temp.test_update
select * from temp.test_update
where name != "test user 3";
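If the overwrite kept every row except "test user 3" as intended, reading the table back should show only the two remaining rows:

```sql
-- Verify the delete: "test user 3" should be gone
SELECT * FROM temp.test_update;
-- Expected rows: (1, 'test user 1') and (2, 'test user 2')
```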
Update Records in Hive Table
Updating a record consists of the three steps below. Let's say in the temp.test_update table we want to set the id to 3 for all the records whose name is "test user 3".
Create a temporary table which holds the updated record(s)
create temporary table temp.test
as select 3 as id, name from temp.test_update
where name="test user 3";
Delete the records you want to update from the original table
insert overwrite table temp.test_update
select * from temp.test_update
where name != "test user 3";
Insert the Updated Record(s)
insert into table temp.test_update select * from temp.test;
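Assuming the three steps ran as intended on the sample data above, reading the table back should show the updated id:

```sql
-- Verify the update: "test user 3" now has id 3
SELECT * FROM temp.test_update;
-- Expected rows: (1, 'test user 1'), (2, 'test user 2'), (3, 'test user 3')
```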