Skip to content

How MapJoin works in hive?

There are two types of join common join also knows as distributed join and a mapjoin or also knows as mapside join. 

Before we jump to mapjoin let me give you an overview of common join. How common join works is, it distributes all the rows based on your join key on all the nodes.  After this all the keys with same value ends on the same node and the in the final reducer step the join happens.

Common joins are good when I have both the tables really huge. However, what if I have one table with 1 TB of data and other table with just 10MB if data. Common join would take more time to distribute the row.

In such case, mapjoin help a lot. How it works is rather the distributing the rows from both the tables, it keeps the small table into memory and for every mapper of big table it join it reads the small table from memory. This process doesn’t required any reducer to join and so the name map-join or map side join. 

Let’s see this with an example. 

Let’s consider we have two tables employee and employee_location.  employee is a huge table and employee_location is a small enough to fit in memory

By Default MapJoins are enabled and so if you you will join above two tables, mapjoin is going to happen. 

select * from emp join emp_location on emp.id == emp_location.id 

This is how DAG looks for above mapjoin. 

mapjoin
DAG for mapjoin (mapside join)

Now let’s disable the map join and see what happens when we try to join same two tables. 

set hive.auto.convert.join = false;
select * from emp join emp_location on emp.id == emp_location.id;

And this is how DAG looks like for common join or distributed join. 

Distributed join DAG.
DAG for distributed join

Parameters Affecting Mapjoin

Following are four parameters which affects the join.

hive.auto.convert.join

Whether Hive enables the optimization about converting common join into mapjoin based on the input file size.
Default Value: true

hive.mapjoin.smalltable.filesize

Applicable only if above parameter is set to  true. The threshold (in bytes) for the input file size of the small tables; if the file size is smaller than this threshold, it will try to convert the common join into map join.
Default Value: 25000000 (25 MB)

hive.auto.convert.join.noconditionaltask

Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size, the join is directly converted to a mapjoin (there is no conditional task).
Default Value: True

hive.auto.convert.join.noconditionaltask.size

If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of size for n-1 of the tables/partitions for an n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task).
Default Value: 10 MB

Published inHadoophive

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *