<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hive Archives - Big Data Processing</title>
	<atom:link href="https://bigdataproc.com/tag/hive/feed/" rel="self" type="application/rss+xml" />
	<link>https://bigdataproc.com/tag/hive/</link>
	<description>Big Data Solution for GCP, AWS, Azure and on-prem</description>
	<lastBuildDate>Sun, 15 Jan 2023 04:58:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.3.2</generator>
	<item>
		<title>Hive Lateral view explode vs posexplode</title>
		<link>https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hive-lateral-view-explode-vs-posexplode</link>
					<comments>https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 06 Feb 2019 17:46:13 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=116</guid>

					<description><![CDATA[<p>Lateral view Explode Lateral view explode, explodes the array data into multiple rows. for example, let&#8217;s say our table look like this, where Telephone is&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/">Continue reading<span class="screen-reader-text">Hive Lateral view explode vs posexplode</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/">Hive Lateral view explode vs posexplode</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Lateral view Explode </h2>



<p>Lateral view explode expands array data into multiple rows.  For example, let&#8217;s say our table looks like this, where phone_numbers and cities are arrays of strings.</p>



<table class="wp-block-table is-style-regular"><tbody><tr><td><strong>name</strong></td><td><strong>phone_numbers</strong></td><td><strong>cities</strong></td></tr><tr><td>AAA</td><td>[&#8220;365-889-1234&#8221;, &#8220;365-887-2232&#8221;]</td><td>[&#8220;Hamilton&#8221;, &#8220;Burlington&#8221;]</td></tr><tr><td>BBB</td><td>[&#8220;232-998-3232&#8221;, &#8220;878-998-2232&#8221;]</td><td>[&#8220;Toronto&#8221;, &#8220;Stoney Creek&#8221;]</td></tr></tbody></table>



<p>Applying a lateral view explode on the above table will expand both phone_numbers and cities and do a cross join; your final table will look like this. </p>



<table class="wp-block-table is-style-regular"><tbody><tr><td><strong>name</strong></td><td><strong>phone_numbers</strong></td><td><strong>cities</strong></td></tr><tr><td>AAA</td><td>365-889-1234</td><td>Hamilton</td></tr><tr><td>AAA</td><td>365-887-2232</td><td>Hamilton</td></tr><tr><td>AAA</td><td>365-889-1234</td><td>Burlington</td></tr><tr><td>AAA</td><td>365-887-2232</td><td>Burlington</td></tr><tr><td>BBB</td><td>232-998-3232</td><td>Toronto</td></tr><tr><td>BBB</td><td>878-998-2232</td><td>Toronto</td></tr><tr><td>BBB</td><td>232-998-3232</td><td>Stoney Creek</td></tr><tr><td>BBB</td><td>878-998-2232</td><td>Stoney Creek</td></tr></tbody></table>
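<p>For reference, the cross join above can be produced with a plain lateral view explode on both array columns. A minimal sketch (assuming the same table name used in the posexplode example below):</p>



<pre class="lang:sql theme:twilight" title="lateral view explode on two arrays">select 
    name, 
    phone_number, 
    city 
from temp.test_laterla_view_posexplode
lateral view explode(phone_numbers) pn as phone_number
lateral view explode(cities) ct as city
</pre>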



<h2 class="wp-block-heading">Lateral View POSExplode</h2>



<p>However, this is probably not what you want. If you want to map the first phone number to the first city, the second to the second, and so on for every record, then you can use <strong>posexplode&nbsp;(</strong>positional explode). </p>



<p>posexplode gives you an index along with the value when you expand an array, and you can then use these indexes to map the values to each other, as shown below. </p>



<pre class="lang:sql theme:twilight" title="when to use lateral view posexplode in hive">select 
    name, 
    phone_number, 
    city 
from temp.test_laterla_view_posexplode
lateral view posexplode(phone_numbers) pn as pos_phone, phone_number
lateral view posexplode(cities) ct as pos_city, city 
where 
    pos_phone = pos_city
</pre>



<p>With the above query you will get the following result, where each phone number is mapped to its corresponding city. </p>



<table class="wp-block-table is-style-regular"><tbody><tr><td><strong>name</strong></td><td><strong>phone_number</strong></td><td><strong>city</strong></td></tr><tr><td>AAA</td><td>365-889-1234</td><td>Hamilton</td></tr><tr><td>AAA</td><td>365-887-2232</td><td>Burlington</td></tr><tr><td>BBB</td><td>232-998-3232</td><td>Toronto</td></tr><tr><td>BBB</td><td>878-998-2232</td><td>Stoney Creek</td></tr></tbody></table>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/">Hive Lateral view explode vs posexplode</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>When to use lateral view explode in hive</title>
		<link>https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=when-to-use-lateral-view-explode-in-hive</link>
					<comments>https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 12 Dec 2018 19:13:48 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=111</guid>

					<description><![CDATA[<p>if you have a table with one or more column with array datatype&#160;&#160;and if you want it to expand into multiple rows, you can use&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/">Continue reading<span class="screen-reader-text">When to use lateral view explode in hive</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/">When to use lateral view explode in hive</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you have a table with one or more columns of<strong> array datatype</strong>&nbsp;and you want to expand them into multiple rows, you can use the lateral view explode function.&nbsp;</p>



<p>Let&#8217;s consider the following table, where each employee has multiple phone numbers stored as an array (list).&nbsp;</p>



<table class="wp-block-table"><tbody><tr><td><strong>emp_name</strong></td><td><strong>phone_numbers</strong></td></tr><tr><td>user1</td><td>[&#8220;546-487-3384&#8221;, &#8220;383-767-2238&#8221;]</td></tr><tr><td>user2</td><td>[&#8220;373-384-1192&#8221;, &#8220;374-282-1289&#8221;, &#8220;332-453-5566&#8221;]</td></tr></tbody></table>
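<p>For reference, a table like this can be declared with an <strong>array&lt;string&gt;</strong> column. A minimal sketch (the table name matches the query below; adjust to your own schema):</p>



<pre class="lang:sql theme:twilight" title="create a table with an array column">create table temp.test_laterla_view_explode (
    emp_name string,
    phone_numbers array&lt;string&gt;
);
</pre>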



<p>If we want to convert this array (list) into multiple rows in the output, we can use the lateral view explode function as shown below.&nbsp;</p>



<pre class="lang:sql theme:twilight" title="when to use lateral view explode in hive">select emp_name, phone_number 
from 
    temp.test_laterla_view_explode
lateral view explode(phone_numbers) p as phone_number
</pre>



<p>This generates the output shown below.</p>



<table class="wp-block-table"><tbody><tr><td><strong>emp_name</strong></td><td><strong>phone_number</strong></td></tr><tr><td>user2</td><td>373-384-1192</td></tr><tr><td>user2</td><td>374-282-1289</td></tr><tr><td>user2</td><td>332-453-5566</td></tr><tr><td>user1</td><td>546-487-3384</td></tr><tr><td>user1</td><td>383-767-2238</td></tr></tbody></table>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/">When to use lateral view explode in hive</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Import Data from Netezza to Hive using sqoop</title>
		<link>https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=import-data-from-netezza-to-hive-using-sqoop</link>
					<comments>https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 07 Nov 2018 15:56:18 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Sqoop]]></category>
		<category><![CDATA[Hive]]></category>
		<category><![CDATA[netezza]]></category>
		<category><![CDATA[sqoop]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=83</guid>

					<description><![CDATA[<p>following is the syntax for importing data from netezza to hive.&#160; sqoop import \ --connect jdbc:netezza://:/ \ --username= \ --password= \ --table \ --hcatalog-database \&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/">Continue reading<span class="screen-reader-text">Import Data from Netezza to Hive using sqoop</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/">Import Data from Netezza to Hive using sqoop</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Following is the syntax for importing data from Netezza into Hive.&nbsp;</p>



<pre class="lang:shell theme:twilight" title="Import data from netezza to hive using sqoop">
sqoop import \
--connect jdbc:netezza://<netezza_host>:<port>/<database> \
--username=<USERNAME IN ALL CAPS> \
--password=<password> \
--table <netezza_table_name> \
--hcatalog-database <hive_database_name> \
--hcatalog-table <hive_table_name> \
-m 1
</pre>



<p>Provide the username in ALL CAPS; otherwise Netezza will throw an authentication error.</p>



<h2 class="wp-block-heading">Create Hive Table</h2>



<p>If you don&#8217;t have the corresponding Hive table, you can use the following option, which will create the Hive table if it doesn&#8217;t exist.&nbsp;</p>



<pre class="lang:shell theme:twilight" title="create hive table while sqoop import">--create-hcatalog-table 
</pre>



<h2 class="wp-block-heading">Sqoop import change hive table format and properties</h2>



<p>If you want to change the storage format or properties of the Hive table, you can do so with the following option.&nbsp;</p>



<pre class="lang:shell theme:twilight" title="sqoop import change hive table format and properties">--hcatalog-storage-stanza \
'stored as orc tblproperties ("orc.compress"="SNAPPY")'
</pre>
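<p>Putting it all together, a full import that creates the Hive table as ORC in one go might look like this (hosts, credentials, and table names are placeholders):</p>



<pre class="lang:shell theme:twilight" title="sqoop import creating an ORC hive table">
sqoop import \
--connect jdbc:netezza://<netezza_host>:<port>/<database> \
--username=<USERNAME IN ALL CAPS> \
--password=<password> \
--table <netezza_table_name> \
--hcatalog-database <hive_database_name> \
--hcatalog-table <hive_table_name> \
--create-hcatalog-table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m 1
</pre>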
<p>The post <a rel="nofollow" href="https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/">Import Data from Netezza to Hive using sqoop</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>What is Predicate PushDown in Hive?</title>
		<link>https://bigdataproc.com/what-is-predicate-pushdown-in-hive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-is-predicate-pushdown-in-hive</link>
					<comments>https://bigdataproc.com/what-is-predicate-pushdown-in-hive/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 26 Sep 2018 20:07:37 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=57</guid>

					<description><![CDATA[<p>Predicate Pushdown in hive is a feature to Push your predicate ( where condition) further up in the query. It tries to execute the expression&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/what-is-predicate-pushdown-in-hive/">Continue reading<span class="screen-reader-text">What is Predicate PushDown in Hive?</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/what-is-predicate-pushdown-in-hive/">What is Predicate PushDown in Hive?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Predicate pushdown in Hive is an optimization that pushes your predicate (the where condition) closer to the data, so that the filter is executed as early as possible in the query plan.&nbsp;</p>



<p>Let&#8217;s try to understand this with an example. Consider two tables, <strong>product</strong> and <strong>sales</strong>, and the following question.&nbsp;<br></p>



<p>How many products of brand <strong>Washington</strong> have been sold so far?<br></p>



<h2 class="wp-block-heading">Non-Optimized Query</h2>



<p>The following query answers the question. However, if you are familiar with SQL you will notice that it is&nbsp;<strong>not optimized</strong>: it first joins the two tables and then applies the condition (predicate).</p>



<pre class="lang:mysql theme:twilight" title="Sample Query for PPD (Predicate Pushdown)">select sum(s.unit_sales) from foodmart.product p 
join 
	foodmart.sales_fact_dec_1998 s 
on 
	p.product_id = s.product_id
where 
	p.brand_name = "Washington"
</pre>



<h2 class="wp-block-heading">Optimized Query</h2>



<p>We can easily optimize the query by applying the condition on the product table first and then joining the result to the sales table, as shown below.</p>



<pre class="lang:mysql theme:twilight">SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN (
	SELECT product_id, brand_name
	FROM foodmart.product
	WHERE 
		brand_name = "Washington"
	) p
ON 
	p.product_id = s.product_id
</pre>



<p>This is what PPD (predicate pushdown) does internally: if you have PPD enabled, the first query is automatically converted into the second, optimized query.</p>



<p>Let&#8217;s see this in action.  The product table has 1560 rows in total, of which only 11 products have the brand name Washington.</p>



<p>For better understanding, I have disabled the <strong>vectorization</strong>.  If you are not sure what vectorization is, please read the following blog post &#8211; <a href="http://bigdataproc.com/what-is-vectorization-in-hive/" target="_blank" rel="noreferrer noopener">What is vectorization? </a></p>



<h2 class="wp-block-heading">Running Query with PPD Disabled</h2>



<p>Following is the DAG of the first query with <strong>PPD disabled.<br></strong>Set the following parameter to false to disable PPD.</p>



<pre class="lang:mysql theme:twilight" title="disable ppd">set hive.optimize.ppd=false;
</pre>



<p>If you notice, it reads all the rows from the product table and then passes them to the reducer for the join. </p>



<figure class="wp-block-image"><a href="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Disabled.png"><img decoding="async" fetchpriority="high" width="300" height="266" src="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Disabled-300x266.png" alt="DAG when PPD (predicate pushdown) is disabled. " class="wp-image-59" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled-300x266.png 300w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled-768x682.png 768w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled-1024x909.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled.png 1090w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption>DAG for first query when PPD is disabled</figcaption></figure>



<h2 class="wp-block-heading" id="mce_24">Running Query with PPD Enabled.</h2>



<p>And following is the DAG of the same query with <strong>PPD enabled.<br></strong>Set the following parameter to true to enable PPD.</p>



<pre class="lang:mysql theme:twilight" title="enable ppd">set hive.optimize.ppd=true;
</pre>



<p>Once we enable PPD, it applies the condition on the product table first and sends <strong>only 11 rows</strong> to the reducer for the join.</p>



<figure class="wp-block-image"><a href="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Enabled.png"><img decoding="async" width="300" height="273" src="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Enabled-300x273.png" alt="DAG when PPD (predicate pushdown) is enabled. " class="wp-image-60" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled-300x273.png 300w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled-768x699.png 768w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled-1024x932.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled.png 1092w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption>DAG for first query when PPD is enabled</figcaption></figure>



<p>The post <a rel="nofollow" href="https://bigdataproc.com/what-is-predicate-pushdown-in-hive/">What is Predicate PushDown in Hive?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/what-is-predicate-pushdown-in-hive/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>What is vectorization in hive?</title>
		<link>https://bigdataproc.com/what-is-vectorization-in-hive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-is-vectorization-in-hive</link>
					<comments>https://bigdataproc.com/what-is-vectorization-in-hive/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 19 Sep 2018 16:51:51 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=53</guid>

					<description><![CDATA[<p>Vectorization in hive is a feature (available from&#160;Hive 0.13.0) which when enabled rather than reading one row at a time it reads a block on&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/what-is-vectorization-in-hive/">Continue reading<span class="screen-reader-text">What is vectorization in hive?</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/what-is-vectorization-in-hive/">What is vectorization in hive?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Vectorization in Hive is a feature (available from&nbsp;Hive 0.13.0) which, when enabled, reads a block of <strong>1024 rows</strong> at a time rather than one row at a time.&nbsp;This improves CPU usage for operations like scan, filter, join, and aggregation.&nbsp;<br></p>



<p>Note that vectorization is only available if the data is stored in <strong>ORC format</strong>.</p>



<h2 class="wp-block-heading">How to Enable Vectorized Execution?</h2>



<p>To enable vectorized execution:</p>



<pre class="lang:text theme:twilight">set hive.vectorized.execution.enabled = true;
</pre>



<p>To disable vectorized execution:</p>



<pre class="lang:text theme:twilight">set hive.vectorized.execution.enabled = false;
</pre>



<h2 class="wp-block-heading">Vectorized vs Non-Vectorized Queries</h2>



<p>I have a product table with <strong>1560 rows</strong> and I want to know how many products have &#8220;Washington&#8221; in their name.&nbsp;</p>



<h3 class="wp-block-heading" id="mce_29">Non-Vectorized Query</h3>



<pre class="lang:mysql theme:twilight">set hive.vectorized.execution.enabled = false;
select count(*) from foodmart.product 
where 
	product.product_name like "%Washington%"
</pre>



<p>In the below image you will notice that <strong>INPUT_RECORDS_PROCESSED is 1560.&nbsp;&nbsp;</strong></p>



<figure class="wp-block-image"><a href="http://allabouthadoop.net/wp-content/uploads/2018/09/Non-Vectorizied-Query-Execution.png"><img decoding="async" width="300" height="273" src="http://allabouthadoop.net/wp-content/uploads/2018/09/Non-Vectorizied-Query-Execution-300x273.png" alt="Non-Vectorized Hive Query shows all 1560 records being read." class="wp-image-54" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/Non-Vectorizied-Query-Execution-300x273.png 300w, https://bigdataproc.com/wp-content/uploads/2018/09/Non-Vectorizied-Query-Execution-768x698.png 768w, https://bigdataproc.com/wp-content/uploads/2018/09/Non-Vectorizied-Query-Execution-1024x930.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/09/Non-Vectorizied-Query-Execution.png 1092w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption>Non-Vectorized Query DAG</figcaption></figure>



<h3 class="wp-block-heading">Vectorized Query</h3>



<pre class="lang:mysql theme:twilight">set hive.vectorized.execution.enabled = true;
select count(*) from foodmart.product 
where 
	product.product_name like "%Washington%"
</pre>



<p>In the below image you will see that&nbsp;<strong>INPUT_RECORDS_PROCESSED is only 2.</strong> This is because vectorization processes 1024 rows per block rather than one row at a time. Dividing 1560 rows into blocks of 1024 gives two blocks, since the block count is rounded up to a whole number:<strong> ceil(1560/1024) = 2</strong>.&nbsp;</p>



<figure class="wp-block-image"><a href="http://allabouthadoop.net/wp-content/uploads/2018/09/Vectorized-Query-Execution.png"><img decoding="async" loading="lazy" width="300" height="272" src="http://allabouthadoop.net/wp-content/uploads/2018/09/Vectorized-Query-Execution-300x272.png" alt="Vectorized Hive Query shows only two records being read." class="wp-image-55" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/Vectorized-Query-Execution-300x272.png 300w, https://bigdataproc.com/wp-content/uploads/2018/09/Vectorized-Query-Execution-768x697.png 768w, https://bigdataproc.com/wp-content/uploads/2018/09/Vectorized-Query-Execution-1024x930.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/09/Vectorized-Query-Execution.png 1086w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption>Vectorized Query DAG.</figcaption></figure>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/what-is-vectorization-in-hive/">What is vectorization in hive?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/what-is-vectorization-in-hive/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>How MapJoin works in hive?</title>
		<link>https://bigdataproc.com/how-mapjoin-works-in-hive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-mapjoin-works-in-hive</link>
					<comments>https://bigdataproc.com/how-mapjoin-works-in-hive/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 12 Sep 2018 19:38:34 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=43</guid>

					<description><![CDATA[<p>There are two types of join common join also knows as distributed join and a mapjoin or also knows as mapside join.  Before we jump&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/how-mapjoin-works-in-hive/">Continue reading<span class="screen-reader-text">How MapJoin works in hive?</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/how-mapjoin-works-in-hive/">How MapJoin works in hive?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>There are two types of joins in Hive: the common join, also known as the distributed join, and the mapjoin, also known as the map-side join. </p>



<p>Before we jump to mapjoin, let me give you an overview of the common join. The common join works by distributing all the rows across the nodes based on your join key.  After this, all the keys with the same value end up on the same node, and the join happens in the final reducer step.</p>



<p>Common joins are good when both tables are really huge. However, what if one table has 1 TB of data and the other only 10 MB? A common join would spend most of its time distributing the rows.</p>



<p>In such cases, a mapjoin helps a lot. Rather than distributing the rows from both tables, it keeps the small table in memory, and every mapper of the big table reads the small table from memory to perform the join. This process doesn&#8217;t require any reducer, hence the name map-join or map-side join. </p>



<p>Let&#8217;s see this with an example. </p>



<p>Let&#8217;s consider two tables, emp and emp_location: emp is a huge table and emp_location is small enough to fit in memory.</p>



<p>By default, mapjoins are enabled, so if you join the above two tables, a mapjoin is going to happen. </p>



<pre class="lang:mysql theme:twilight">select * from emp join emp_location on emp.id == emp_location.id 
</pre>



<p>This is how the DAG looks for the above mapjoin. </p>



<figure class="wp-block-image"><img decoding="async" loading="lazy" width="194" height="300" src="http://allabouthadoop.net/wp-content/uploads/2018/09/Screen-Shot-2018-09-12-at-2.19.05-PM-194x300.png" alt="mapjoin" class="wp-image-44" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/Screen-Shot-2018-09-12-at-2.19.05-PM-194x300.png 194w, https://bigdataproc.com/wp-content/uploads/2018/09/Screen-Shot-2018-09-12-at-2.19.05-PM.png 458w" sizes="(max-width: 194px) 100vw, 194px" /><figcaption>DAG for mapjoin (mapside join)</figcaption></figure>



<p>Now let&#8217;s disable the mapjoin and see what happens when we try to join the same two tables. </p>



<pre class="lang:mysql theme:twilight">set hive.auto.convert.join = false;
select * from emp join emp_location on emp.id == emp_location.id;
</pre>



<p>And this is how the DAG looks for the common (distributed) join. </p>



<figure class="wp-block-image"><img decoding="async" loading="lazy" width="203" height="300" src="http://allabouthadoop.net/wp-content/uploads/2018/09/Screen-Shot-2018-09-12-at-2.58.11-PM-203x300.png" alt="Distributed join DAG." class="wp-image-45" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/Screen-Shot-2018-09-12-at-2.58.11-PM-203x300.png 203w, https://bigdataproc.com/wp-content/uploads/2018/09/Screen-Shot-2018-09-12-at-2.58.11-PM.png 474w" sizes="(max-width: 203px) 100vw, 203px" /><figcaption>DAG for distributed join</figcaption></figure>



<h2 class="wp-block-heading">Parameters Affecting Mapjoin</h2>



<p>Following are four parameters which affect the join.</p>



<h3 class="wp-block-heading">hive.auto.convert.join</h3>



<p>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size.<br/><strong>Default Value:</strong> true</p>



<h3 class="wp-block-heading">hive.mapjoin.smalltable.filesize</h3>



<p>Applicable only if the above parameter is set to true. The threshold (in bytes) for the input file size of the small tables; if the file size is smaller than this threshold, Hive will try to convert the common join into a map join.<br/><strong>Default Value</strong>: 25000000 (25 MB)</p>



<h3 class="wp-block-heading">hive.auto.convert.join.noconditionaltask</h3>



<p>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size, the join is directly converted to a mapjoin (there is no conditional task).<br/><strong>Default Value:</strong> True</p>



<h3 class="wp-block-heading">hive.auto.convert.join.noconditionaltask.size<br/></h3>



<p>If <strong>hive.auto.convert.join.noconditionaltask</strong> is off, this parameter does not take effect. However, if it is on, and the sum of size for n-1 of the tables/partitions for an n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). <br/><strong>Default Value:</strong> 10 MB</p>
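<p>In practice you set these together in your session before running the join. For example (the values shown are the defaults listed above):</p>



<pre class="lang:mysql theme:twilight" title="mapjoin related parameters">set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
</pre>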



<p>The post <a rel="nofollow" href="https://bigdataproc.com/how-mapjoin-works-in-hive/">How MapJoin works in hive?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/how-mapjoin-works-in-hive/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Update Hive Table</title>
		<link>https://bigdataproc.com/update-hive-table/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=update-hive-table</link>
					<comments>https://bigdataproc.com/update-hive-table/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 05 Sep 2018 00:30:34 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=11</guid>

					<description><![CDATA[<p>Hive is a append only database and so update and delete is not supported on hive external and managed table.&#160; From hive version 0.14 the&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/update-hive-table/">Continue reading<span class="screen-reader-text">Update Hive Table</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/update-hive-table/">Update Hive Table</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Hive is an append-only database, so update and delete are not supported on Hive external and managed tables.&nbsp;</p>



<p>Hive version 0.14 introduced a new feature called <strong>transactional tables</strong>, which provides ACID properties for a table and allows delete and update. But let&#8217;s keep transactional tables for another post.&nbsp;</p>



<p>Here let&#8217;s discuss how to update a Hive table which is not transactional, either external or managed (an external table cannot be transactional).</p>



<p>Chances are, if you have tried to update a Hive table, external or managed (non-transactional), you got an error like the one below, depending on your Hive version.</p>



<pre class="lang:mysql theme:twilight"> select * from temp.test_update;
+-----------------+-------------------+--+
| test_update.id  | test_update.name  |
+-----------------+-------------------+--+
| 1               | test user 1       |
| 2               | test user 2       |
| 2               | test user 3       |
+-----------------+-------------------+--+
delete from temp.test1 where id=1;
Error: Error while compiling statement: FAILED:
SemanticException [Error 10297]: Attempt to do update or
delete on table temp.test1 that does not use an
AcidOutputFormat or is not bucketed (state=42000,code=10297)</pre>



<p>Then the question is: how do you update or delete a record in a Hive table?</p>



<h2 class="wp-block-heading">Deleting Records in Hive Table</h2>



<p>Deleting records is easy:&nbsp;you can use the <strong>insert overwrite </strong>syntax for this. Let&#8217;s say we want to delete the record from the above Hive table which has the name &#8220;test user 3&#8221;.&nbsp;Then we need to select all the records which do <em>not</em> have the name &#8220;test user 3&#8221; and overwrite them into the same table.&nbsp;</p>



<pre class="lang:mysql theme:twilight"> 
insert overwrite table temp.test_update
select * from temp.test_update
where name != "test user 3";
</pre>






<h2 class="wp-block-heading">Update Records in Hive Table</h2>



<p>Updating a record consists of three steps, as mentioned below. Let&#8217;s say in the <strong>temp.test_update&nbsp;</strong>table we want to update the <strong>id</strong>&nbsp;to<strong> 3&nbsp;</strong>for all the records which have the name <strong>&#8220;test user 3&#8221;</strong>.</p>



<h3 class="wp-block-heading">Create a temporary table which has updated record</h3>



<pre class="lang:mysql theme:twilight"> 
create temporary table temp.test
as select 3 as id, name from temp.test_update
where name="test user 3";
</pre>



<h3 class="wp-block-heading">Delete the records which you want to update from original table</h3>



<pre class="lang:mysql theme:twilight">insert overwrite table temp.test_update
select * from temp.test_update
where name != "test user 3";
</pre>



<h3 class="wp-block-heading">Insert the Updated Record(s)</h3>



<pre class="lang:mysql theme:twilight"> insert into table temp.test_update select * from temp.test;

</pre>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/update-hive-table/">Update Hive Table</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/update-hive-table/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
	</channel>
</rss>
