<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hadoop Archives - Big Data Processing</title>
	<atom:link href="https://bigdataproc.com/category/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>https://bigdataproc.com/category/hadoop/</link>
	<description>Big Data Solution for GCP, AWS, Azure and on-prem</description>
	<lastBuildDate>Thu, 22 Feb 2024 17:54:51 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Distcp to Copy your HDFS data to GCP Cloud Storage</title>
		<link>https://bigdataproc.com/distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage</link>
					<comments>https://bigdataproc.com/distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Thu, 22 Feb 2024 17:53:13 +0000</pubDate>
				<category><![CDATA[GCP]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<guid isPermaLink="false">https://bigdataproc.com/?p=461</guid>

					<description><![CDATA[<p>Copy HDFS data from on-prem to cloud storage using distcp</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage/">Continue reading<span class="screen-reader-text">Distcp to Copy your HDFS data to GCP Cloud Storage</span></a></div>
<p>The post <a href="https://bigdataproc.com/distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage/">Distcp to Copy your HDFS data to GCP Cloud Storage</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>A while back, I found myself deeply immersed in a Hadoop migration project where our cloud platform of choice was Google Cloud Platform (GCP). Our mission? To seamlessly transition data from on-premises infrastructure to the cloud. Due to various constraints, using dedicated hardware wasn&#8217;t a viable option, so I embarked on a quest to explore software solutions to tackle this challenge.</p>



<p>For one-off migrations, Spark emerged as a favorable choice. It facilitated direct data migration to BigQuery, bypassing the intermediary step of storing it in cloud storage. However, there was a caveat: Spark lacked the ability to detect changes, necessitating a full refresh each time. This approach proved less than ideal, especially when dealing with substantial datasets.</p>



<p>My gaze then turned to Cloudera BDR, but alas, it didn&#8217;t support integration with Google Cloud. Left with no alternative, I delved into Distcp. In this blog post, I&#8217;ll guide you through the setup process for Distcp, enabling seamless data transfer from an on-prem HDFS cluster to Google Cloud Storage.</p>



<h2 class="wp-block-heading">Service Account Setup</h2>



<p>To begin, create a GCP service account with read/write permissions for the designated Google Cloud Storage bucket. Obtain the JSON key associated with this service account; it needs to be distributed to every node involved in the migration. For instance, I&#8217;ve opted to store it at <code>/tmp/sa-datamigonpremtobigquery.json</code>. Also make sure the user who will run the distcp command has read access to this path. </p>



<h2 class="wp-block-heading">HDFS.conf</h2>



<p>Store the following file as <code>hdfs.conf</code> on the edge node, in your home directory. Replace the value of <strong>fs.gs.project.id</strong> with your project ID.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="xml" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">&lt;configuration>
  &lt;property>
    &lt;name>fs.AbstractFileSystem.gs.impl&lt;/name>
    &lt;value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS&lt;/value>
    &lt;description>The AbstractFileSystem for 'gs:' URIs.&lt;/description>
  &lt;/property>
  &lt;property>
    &lt;name>fs.gs.project.id&lt;/name>
    &lt;value>your-project-id&lt;/value>
    &lt;description>
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.
    &lt;/description>
  &lt;/property>
  &lt;property>
    &lt;name>google.cloud.auth.type&lt;/name>
    &lt;value>SERVICE_ACCOUNT_JSON_KEYFILE&lt;/value>
    &lt;description>
      Authentication type to use for GCS access.
    &lt;/description>
  &lt;/property>
  &lt;property>
    &lt;name>google.cloud.auth.service.account.json.keyfile&lt;/name>
    &lt;value>/tmp/sa-datamigonpremtobigquery.json&lt;/value>
    &lt;description>
      The JSON keyfile of the service account used for GCS
      access when google.cloud.auth.type is SERVICE_ACCOUNT_JSON_KEYFILE.
    &lt;/description>
  &lt;/property>

  &lt;property>
    &lt;name>fs.gs.checksum.type&lt;/name>
    &lt;value>CRC32C&lt;/value>
    &lt;description>
          https://cloud.google.com/architecture/hadoop/validating-data-transfers
  &lt;/description>
  &lt;/property>

  &lt;property>
    &lt;name>dfs.checksum.combine.mode&lt;/name>
    &lt;value>COMPOSITE_CRC&lt;/value>
    &lt;description>
          https://cloud.google.com/architecture/hadoop/validating-data-transfers
  &lt;/description>
  &lt;/property>
&lt;/configuration>
</pre>



<h2 class="wp-block-heading">Executing Transfer</h2>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">hadoop --debug distcp --conf hdfs.conf -pc -update -v -log hdfs:///tmp/distcp_log hdfs:///tmp/ gs://raw-bucket/ </pre>
<p>The post <a href="https://bigdataproc.com/distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage/">Distcp to Copy your HDFS data to GCP Cloud Storage</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/distcp-to-copy-your-hdfs-data-to-gcp-cloud-storage/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Spark &#8211; How to rename multiple columns in DataFrame</title>
		<link>https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=spark-how-to-rename-multiple-columns-in-dataframe</link>
					<comments>https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Sat, 23 May 2020 17:01:00 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[apache-spark]]></category>
		<category><![CDATA[pyspark]]></category>
		<category><![CDATA[spark dataframe]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=161</guid>

					<description><![CDATA[<p>In the last post we show how to apply a function to multiple columns. And if you have done that, you might have multiple column&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/">Continue reading<span class="screen-reader-text">Spark &#8211; How to rename multiple columns in DataFrame</span></a></div>
<p>The post <a href="https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/">Spark &#8211; How to rename multiple columns in DataFrame</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In the last <a rel="noreferrer noopener" href="http://allabouthadoop.net/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/" target="_blank">post</a> we showed how to apply a function to multiple columns. If you have done that, you might now have multiple columns with the desired data, but want to rename them back to their original names. </p>



<p>Let&#8217;s say you have the following dataframe, and you want to rename all the columns. </p>



<pre class="lang:python theme:twilight" title="rename multiple columns in pyspark dataframe">&gt;&gt;&gt; df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- joining_dt: date (nullable = true)
</pre>



<p>The first thing you need is a <code>map</code> from <code>old names</code> to <code>new names</code>, plus a little functional programming. </p>



<h2 class="wp-block-heading">How to rename multiple columns in Pyspark </h2>



<pre class="lang:python theme:twilight" title="rename multiple columns in pyspark dataframe">from pyspark.sql.functions import col
col_rename = {"age":"new_age", "name":"new_name", "joining_dt":"new_joining_dt"}
df_with_col_renamed = df.select([col(c).alias(col_rename.get(c,c)) for c in df.columns])
</pre>



<pre class="lang:python theme:twilight" title="rename multiple columns in pyspark dataframe">&gt;&gt;&gt; df_with_col_renamed.printSchema()
root
 |-- new_name: string (nullable = true)
 |-- new_age: integer (nullable = true)
 |-- new_joining_dt: date (nullable = true)

</pre>
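

<p>The heavy lifting here is done by <code>col_rename.get(c, c)</code>: any column missing from the map keeps its original name. Here is a plain-Python sketch of that fallback (the extra <code>dept</code> column is hypothetical, added just to show a column that is not renamed):</p>



<pre class="lang:python theme:twilight" title="rename mapping fallback in plain python">col_rename = {"age": "new_age", "name": "new_name", "joining_dt": "new_joining_dt"}
columns = ["name", "age", "joining_dt", "dept"]

# dict.get(c, c) falls back to the original name, so "dept",
# which is absent from the mapping, passes through unchanged
renamed = [col_rename.get(c, c) for c in columns]
</pre>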



<h2 class="wp-block-heading">How to rename multiple columns in spark using Scala</h2>



<pre class="lang:scala theme:twilight" title="How to rename multiple columns in spark using Scala">val colToRename = Map("age"-&gt;"new_age", 
					  "name"-&gt;"new_name", 
					  "joining_dt"-&gt;"new_joining_dt")
val newDf = df.select(
				df.columns.map{
						oldName=&gt;col(oldName).alias(colToRename.getOrElse(oldName, oldName))
				}: _*)
</pre>
<p>The post <a href="https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/">Spark &#8211; How to rename multiple columns in DataFrame</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Spark &#8211; How to apply a function to multiple columns on DataFrame?</title>
		<link>https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe</link>
					<comments>https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Sun, 17 May 2020 20:10:15 +0000</pubDate>
				<category><![CDATA[spark]]></category>
		<category><![CDATA[apache-spark]]></category>
		<category><![CDATA[spark dataframe]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=130</guid>

					<description><![CDATA[<p>let&#8217;s see that you have a spark dataframe and you want to apply a function to multiple columns. One way is to use WithColumn multiple&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/">Continue reading<span class="screen-reader-text">Spark &#8211; How to apply a function to multiple columns on DataFrame?</span></a></div>
<p>The post <a href="https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/">Spark &#8211; How to apply a function to multiple columns on DataFrame?</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Let&#8217;s say you have a spark dataframe and you want to apply a function to multiple columns. One way is to call <code>withColumn</code> multiple times. However, that only works well when you have a few columns and know the column names in advance; otherwise it&#8217;s tedious and error-prone. </p>



<p>So let&#8217;s see how to do that. </p>



<pre class="wp-block-preformatted">val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
+-------------+------------+------+
|selling_price|market_price|profit|
+-------------+------------+------+
|         $100|         $90|   $10|
+-------------+------------+------+

</pre>



<p></p>



<p>Let&#8217;s consider you have a spark dataframe as above with more than 50 such columns, and you want to remove the <code>$</code> character and convert the datatype to <code>Decimal</code>. Rather than writing 50 lines of code, you can do it using <code>fold</code> in fewer than 5 lines. </p>



<p>First, create a list of pairs of the new column name (yes, you need a new column name) and the function you want to apply. I just added <code>_new</code> to the existing column name so it&#8217;s easier to rename later.<br>Next, use the <code>foldLeft</code> method to apply each function from the list to the dataframe, one column at a time. </p>



<pre class="lang:scala theme:twilight" title="apply function to multiple columns on spark dataframe">import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.types.DataTypes._
val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.show
val operations =  ListBuffer[(String, org.apache.spark.sql.Column)]()
val colNames = df.columns
val DecimalType = createDecimalType(10, 4)
colNames.foreach{colName =&gt;
  val operation = (s"${colName}_new", regexp_replace(col(colName), lit("\\$"), lit("")).cast(DecimalType))
  operations += operation
}

val dfWithNewColumns = operations.foldLeft(df) { (tempDF, listValue) =&gt;
  tempDF.withColumn(listValue._1, listValue._2)
}

dfWithNewColumns.show</pre>



<p>Let&#8217;s see if that worked. </p>



<pre class="lang:scala theme:twilight" title="apply function to multiple columns on spark dataframe"> 
scala&gt; dfWithNewColumns.printSchema
root
 |-- selling_price: string (nullable = true)
 |-- market_price: string (nullable = true)
 |-- profit: string (nullable = true)
 |-- selling_price_new: decimal(10,4) (nullable = true)
 |-- market_price_new: decimal(10,4) (nullable = true)
 |-- profit_new: decimal(10,4) (nullable = true)


scala&gt; dfWithNewColumns.show
+-------------+------------+------+-----------------+----------------+----------+
|selling_price|market_price|profit|selling_price_new|market_price_new|profit_new|
+-------------+------------+------+-----------------+----------------+----------+
|         $100|         $90|   $10|         100.0000|         90.0000|   10.0000|
+-------------+------------+------+-----------------+----------------+----------+

</pre>
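

<p>The same fold pattern can be sketched in plain Python, with a dict standing in for the dataframe and <code>functools.reduce</code> in place of <code>foldLeft</code>. This is just an illustration of the shape of the code, not Spark itself:</p>



<pre class="lang:python theme:twilight" title="the fold pattern in plain python">from decimal import Decimal
from functools import reduce

# a single row, standing in for the dataframe above
row = {"selling_price": "$100", "market_price": "$90", "profit": "$10"}

# build (new_column_name, source_column) pairs, as in the Scala code
operations = [(name + "_new", name) for name in row]

# fold the operations over the row, adding one new key per step --
# the same shape as foldLeft over withColumn calls
result = reduce(
    lambda acc, op: {**acc, op[0]: Decimal(acc[op[1]].lstrip("$"))},
    operations,
    dict(row),
)
</pre>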
<p>The post <a href="https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/">Spark &#8211; How to apply a function to multiple columns on DataFrame?</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Hive Lateral view explode vs posexplode</title>
		<link>https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hive-lateral-view-explode-vs-posexplode</link>
					<comments>https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 06 Feb 2019 17:46:13 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=116</guid>

					<description><![CDATA[<p>Lateral view Explode Lateral view explode, explodes the array data into multiple rows. for example, let&#8217;s say our table look like this, where Telephone is&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/">Continue reading<span class="screen-reader-text">Hive Lateral view explode vs posexplode</span></a></div>
<p>The post <a href="https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/">Hive Lateral view explode vs posexplode</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Lateral view Explode </h2>



<p>Lateral view explode expands array data into multiple rows. For example, let&#8217;s say our table looks like this, where both phone_numbers and cities are arrays of strings.</p>



<table class="wp-block-table is-style-regular"><tbody><tr><td><strong>name</strong></td><td><strong>phone_numbers</strong></td><td><strong>cities</strong></td></tr><tr><td>AAA</td><td>[&#8220;365-889-1234&#8221;, &#8220;365-887-2232&#8221;]</td><td>[&#8220;Hamilton&#8221;, &#8220;Burlington&#8221;]</td></tr><tr><td>BBB</td><td>[&#8220;232-998-3232&#8221;, &#8220;878-998-2232&#8221;]</td><td>[&#8220;Toronto&#8221;, &#8220;Stoney Creek&#8221;]</td></tr></tbody></table>



<p>Applying a lateral view explode on the above table will expand both phone_numbers and cities and do a cross join, so your final table will look like this. </p>



<table class="wp-block-table is-style-regular"><tbody><tr><td><strong>name</strong></td><td><strong>phone_numbers</strong></td><td><strong>cities</strong></td></tr><tr><td>AAA</td><td>365-889-1234</td><td>Hamilton</td></tr><tr><td>AAA</td><td>365-887-2232</td><td>Hamilton</td></tr><tr><td>AAA</td><td>365-889-1234</td><td>Burlington</td></tr><tr><td>AAA</td><td>365-887-2232</td><td>Burlington</td></tr><tr><td>BBB</td><td>232-998-3232</td><td>Toronto</td></tr><tr><td>BBB</td><td>878-998-2232</td><td>Toronto</td></tr><tr><td>BBB</td><td>232-998-3232</td><td>Stoney Creek</td></tr><tr><td>BBB</td><td>878-998-2232</td><td>Stoney Creek</td></tr></tbody></table>



<h2 class="wp-block-heading">Lateral View POSExplode</h2>



<p>However, this is probably not what you want. If you want to map the first telephone number to the first city, the second to the second, and so on for all the records, you can use <strong>posexplode&nbsp;</strong>(positional explode). </p>



<p>posexplode gives you an index along with the value when you expand an array, and you can then use these indexes to map the values to each other, as shown below. </p>



<pre class="lang:sql theme:twilight" title="when to use lateral view posexplode in hive">select 
    name, 
    phone_number, 
    city 
from temp.test_laterla_view_posexplode
lateral view posexplode(phone_numbers) pn as pos_phone, phone_number
lateral view posexplode(cities) ct as pos_city, city 
where 
    pos_phone == pos_city
</pre>



<p>With the above query you will get the following results, where each phone number is mapped to its corresponding city. </p>



<table class="wp-block-table is-style-regular"><tbody><tr><td><strong>name</strong></td><td><strong>phone_number</strong></td><td><strong>city</strong></td></tr><tr><td>AAA</td><td>365-889-1234</td><td>Hamilton</td></tr><tr><td>AAA</td><td>365-887-2232</td><td>Burlington</td></tr><tr><td>BBB</td><td>232-998-3232</td><td>Toronto</td></tr><tr><td>BBB</td><td>878-998-2232</td><td>Stoney Creek</td></tr></tbody></table>
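

<p>Conceptually, the <code>pos_phone == pos_city</code> filter undoes the cross join and turns it into a zip of the two arrays. A plain-Python sketch of one record (AAA) makes that clear:</p>



<pre class="lang:python theme:twilight" title="posexplode index matching in plain python">phone_numbers = ["365-889-1234", "365-887-2232"]
cities = ["Hamilton", "Burlington"]

# posexplode attaches a position to every element; keeping only the
# rows where the two positions match is equivalent to zip
matched = [
    (phone, city)
    for pos_phone, phone in enumerate(phone_numbers)
    for pos_city, city in enumerate(cities)
    if pos_phone == pos_city
]
</pre>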
<p>The post <a href="https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/">Hive Lateral view explode vs posexplode</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/hive-lateral-view-explode-vs-posexplode/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>When to use lateral view explode in hive</title>
		<link>https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=when-to-use-lateral-view-explode-in-hive</link>
					<comments>https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 12 Dec 2018 19:13:48 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=111</guid>

					<description><![CDATA[<p>if you have a table with one or more column with array datatype&#160;&#160;and if you want it to expand into multiple rows, you can use&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/">Continue reading<span class="screen-reader-text">When to use lateral view explode in hive</span></a></div>
<p>The post <a href="https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/">When to use lateral view explode in hive</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you have a table with one or more columns of<strong> array datatype</strong>&nbsp;and you want to expand them into multiple rows, you can use the lateral view explode function.&nbsp;</p>



<p>Let&#8217;s consider the following table, where each employee has multiple phone numbers stored as an array (list).&nbsp;</p>



<table class="wp-block-table"><tbody><tr><td><strong>emp_name</strong></td><td><strong>phone_numbers</strong></td></tr><tr><td>user1</td><td>[&#8220;546-487-3384&#8221;, &#8220;383-767-2238&#8221;]</td></tr><tr><td>user2</td><td>[&#8220;373-384-1192&#8221;, &#8220;374-282-1289&#8221;, &#8220;332-453-5566&#8221;]</td></tr></tbody></table>



<p>To convert this array (list) into multiple rows in the output, we can use the lateral view explode function as mentioned below.&nbsp;</p>



<pre class="lang:sql theme:twilight" title="when to use lateral view explode in hive">select emp_name, phone_number 
from 
    temp.test_laterla_view_explode
lateral view explode(phone_numbers) p as phone_number
</pre>



<p>This will generate the output as mentioned below</p>



<table class="wp-block-table"><tbody><tr><td><strong>emp_name</strong></td><td><strong>phone_number</strong></td></tr><tr><td>user2</td><td>373-384-1192</td></tr><tr><td>user2</td><td>374-282-1289</td></tr><tr><td>user2</td><td>332-453-5566</td></tr><tr><td>user1</td><td>546-487-3384</td></tr><tr><td>user1</td><td>383-767-2238</td></tr></tbody></table>
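

<p>Conceptually, explode is just a flatten: the scalar column is repeated once per array element. In plain Python, the same transformation looks like this:</p>



<pre class="lang:python theme:twilight" title="explode as a flatten in plain python">rows = [
    ("user1", ["546-487-3384", "383-767-2238"]),
    ("user2", ["373-384-1192", "374-282-1289", "332-453-5566"]),
]

# repeat emp_name once per phone number, flattening the arrays into rows
exploded = [(emp_name, phone) for emp_name, phones in rows for phone in phones]
</pre>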
<p>The post <a href="https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/">When to use lateral view explode in hive</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/when-to-use-lateral-view-explode-in-hive/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Hive &#8211; Convert JSON to complex Data Type</title>
		<link>https://bigdataproc.com/hive-convert-json-to-complex-data-type/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hive-convert-json-to-complex-data-type</link>
					<comments>https://bigdataproc.com/hive-convert-json-to-complex-data-type/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Tue, 04 Dec 2018 19:39:02 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=97</guid>

					<description><![CDATA[<p>if you have a small (not complex) json file and need to create a corresponding hive table, it&#8217;s easy.&#160; { "country":"Switzerland", "languages":["German","French","Italian"], "religions": { "catholic":[10,20],&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/hive-convert-json-to-complex-data-type/">Continue reading<span class="screen-reader-text">Hive &#8211; Convert JSON to complex Data Type</span></a></div>
<p>The post <a href="https://bigdataproc.com/hive-convert-json-to-complex-data-type/">Hive &#8211; Convert JSON to complex Data Type</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>If you have a small (not complex) JSON file and need to create a corresponding Hive table, it&#8217;s easy.&nbsp;</p>



<pre class="lang:json theme:twilight" title="sample json for hive jsonserde">{
	"country":"Switzerland",
	"languages":["German","French","Italian"],
	"religions":
		{
			"catholic":[10,20],
			"protestant":[40,50]
		}
}
</pre>
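

<p>One caveat worth noting: with a text-backed table and a JSON serde, each line of the file is typically parsed as one record, so the object above should sit on a single line in the actual file. A small Python sketch that produces such a file:</p>



<pre class="lang:python theme:twilight" title="write sample.json as a single line">import json

sample = {
    "country": "Switzerland",
    "languages": ["German", "French", "Italian"],
    "religions": {"catholic": [10, 20], "protestant": [40, 50]},
}

# json.dumps emits the whole object on one line -- one record per line
with open("sample.json", "w") as f:
    f.write(json.dumps(sample) + "\n")
</pre>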



<p>However, that&#8217;s hardly the case in real life. We get JSON files with hundreds of nested fields.&nbsp; Manually mapping such a file to a Hive table is a tedious task.&nbsp;</p>



<p>To ease the work you can take the help of <strong>Spark</strong>.&nbsp; Don&#8217;t worry, it&#8217;s just two lines of code <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" />&nbsp;</p>



<h2 class="wp-block-heading">First, put your file in an HDFS location&nbsp;</h2>



<pre class="lang:text theme:twilight" title="sample json for hive jsonserde">hdfs dfs -put sample.json /tmp/
</pre>



<h2 class="wp-block-heading">Fetch Schema for Hive Table&nbsp;</h2>



<pre class="lang:text theme:twilight" title="convert json to hive table">
>>> df = spark.read.json("/tmp/sample.json")
>>> df
DataFrame[country: string, languages: array&lt;string&gt;, religions: struct&lt;catholic:array&lt;bigint&gt;,protestant:array&lt;bigint&gt;&gt;]
</pre>



<h2 class="wp-block-heading">Hive table</h2>



<p>Your final Hive table will look like this, with minor modifications to the schema and the addition of the JSON serde and other properties.&nbsp;</p>



<pre class="lang:sql theme:twilight" title="hive table using json serde">
CREATE TABLE `temp.test_json`(
	  `country` string, 
	  `languages` array&lt;string&gt;, 
	  `religions` struct&lt;catholic:array&lt;bigint&gt;,protestant:array&lt;bigint&gt;&gt;)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
location '/tmp/test_json/table/'
</pre>



<p>If you don&#8217;t like modifying the schema, you can alternatively save the table to Hive from Spark and get the schema from that.&nbsp;</p>



<pre class="lang:text theme:twilight" title="hive table using json serde">df.write.saveAsTable("temp.test_json")
</pre>

<p>Then run the following in Hive:</p>
<pre class="lang:sql theme:twilight" title="hive table using json serde">show create table temp.test_json
</pre>



<h2 class="wp-block-heading">Hive Data</h2>



<p>Either way, this is how the data looks.</p>



<figure class="wp-block-image"><img fetchpriority="high" decoding="async" width="1024" height="79" src="http://allabouthadoop.net/wp-content/uploads/2018/12/Screen-Shot-2018-12-04-at-2.28.45-PM-1024x79.png" alt="" class="wp-image-105" srcset="https://bigdataproc.com/wp-content/uploads/2018/12/Screen-Shot-2018-12-04-at-2.28.45-PM-1024x79.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/12/Screen-Shot-2018-12-04-at-2.28.45-PM-300x23.png 300w, https://bigdataproc.com/wp-content/uploads/2018/12/Screen-Shot-2018-12-04-at-2.28.45-PM-768x59.png 768w, https://bigdataproc.com/wp-content/uploads/2018/12/Screen-Shot-2018-12-04-at-2.28.45-PM.png 1216w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
<p>The post <a href="https://bigdataproc.com/hive-convert-json-to-complex-data-type/">Hive &#8211; Convert JSON to complex Data Type</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/hive-convert-json-to-complex-data-type/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>How to Access Hive With Python script?</title>
		<link>https://bigdataproc.com/how-to-access-hive-with-python-script/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-access-hive-with-python-script</link>
					<comments>https://bigdataproc.com/how-to-access-hive-with-python-script/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Thu, 15 Nov 2018 15:06:20 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=87</guid>

					<description><![CDATA[<p>You can read hive tables using pyhive python library.&#160; Install PyHive library pip install pyhive Connect to Hive using LDAP from pyhive import hive connection&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/how-to-access-hive-with-python-script/">Continue reading<span class="screen-reader-text">How to Access Hive With Python script?</span></a></div>
<p>The post <a href="https://bigdataproc.com/how-to-access-hive-with-python-script/">How to Access Hive With Python script?</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>You can read hive tables using pyhive python library.&nbsp;</p>



<h2 class="wp-block-heading">Install PyHive library</h2>



<pre class="lang:shell theme:twilight" title="Install Pyhive to read hive tables using python">pip install pyhive
</pre>



<h2 class="wp-block-heading">Connect to Hive using LDAP</h2>



<pre class="lang:python theme:twilight" title="Pyhive Connect using LDAP Authentication">from pyhive import hive
connection = hive.connect(host='HIVE_HOST',
                          port=10000,
                          database='temp',
                          username='HIVE_USERNAME',
                          password='HIVE_PASSWORD',
                          auth='CUSTOM')	
</pre>



<h2 class="wp-block-heading">Connect to Hive using Kerberos</h2>



<pre class="lang:python theme:twilight" title="Pyhive Connect using Kerberos Authentication">from pyhive import hive
connection = hive.connect(host='HIVE_HOST',
                          port=10000,
                          database='temp',
                          username='HIVE_USERNAME',
                          auth='KERBEROS',
                          kerberos_service_name='hive')	
</pre>



<p>To connect using Kerberos, you don&#8217;t need to supply a password; however, you do need to provide the Kerberos service name.&nbsp;&nbsp;</p>



<h2 class="wp-block-heading">Execute hive Query using PyHive</h2>



<pre class="lang:python theme:twilight" title="Execute hive query using PyHive">query="select * from temp.test_table"
cur = connection.cursor()
cur.execute(query)
res = cur.fetchall()
</pre>
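

<p><code>fetchall()</code> returns plain tuples; pairing them with <code>cursor.description</code> gives you dicts keyed by column name. The sketch below uses hypothetical stand-in values in place of a live connection:</p>



<pre class="lang:python theme:twilight" title="pairing rows with cursor description"># hypothetical stand-ins for cur.description and cur.fetchall()
description = [("name", "STRING"), ("age", "INT")]
res = [("AAA", 30), ("BBB", 25)]

# the first element of each description entry is the column name
columns = [d[0] for d in description]
records = [dict(zip(columns, row)) for row in res]
</pre>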



<p></p>
<p>The post <a href="https://bigdataproc.com/how-to-access-hive-with-python-script/">How to Access Hive With Python script?</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/how-to-access-hive-with-python-script/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Import Data from Netezza to Hive using sqoop</title>
		<link>https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=import-data-from-netezza-to-hive-using-sqoop</link>
					<comments>https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 07 Nov 2018 15:56:18 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Sqoop]]></category>
		<category><![CDATA[Hive]]></category>
		<category><![CDATA[netezza]]></category>
		<category><![CDATA[sqoop]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=83</guid>

					<description><![CDATA[<p>following is the syntax for importing data from netezza to hive.&#160; sqoop import \ --connect jdbc:netezza://:/ \ --username= \ --password= \ --table \ --hcatalog-database \&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/">Continue reading<span class="screen-reader-text">Import Data from Netezza to Hive using sqoop</span></a></div>
<p>The post <a href="https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/">Import Data from Netezza to Hive using sqoop</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The following is the syntax for importing data from Netezza to Hive using Sqoop.</p>



<pre class="lang:shell theme:twilight" title="Import data from netezza to hive using sqoop">
sqoop import \
--connect jdbc:netezza://<netezza_host>:<port>/<database> \
--username=<USERNAME IN ALL CAPS> \
--password=<password> \
--table <netezza_table_name> \
--hcatalog-database <hive_database_name> \
--hcatalog-table <hive_table_name> \
-m 1
</pre>



<p>Provide the username in all caps; otherwise, Netezza will throw an authentication error.</p>



<h2 class="wp-block-heading">Create Hive Table</h2>



<p>If you don&#8217;t have the corresponding Hive table, you can use the following option, which will create the Hive table if it doesn&#8217;t exist.</p>



<pre class="lang:shell theme:twilight" title="create hive table while sqoop import">--create-hcatalog-table 
</pre>



<h2 class="wp-block-heading">Change Hive Table Format and Properties during Sqoop Import</h2>



<p>If you want to change the storage format or table properties of the Hive table, you can do so with the following option.</p>



<pre class="lang:shell theme:twilight" title="sqoop import change hive table format and properties">--hcatalog-storage-stanza \
'stored as orc tblproperties ("orc.compress"="SNAPPY")'
</pre>
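<p>Putting the options together, the full command gets long, so I sometimes assemble it from a small script. The following is only a sketch in Python; the host, port, credentials, and table names are placeholder values:</p>

```python
def build_sqoop_import(host, port, database, username, password,
                       nz_table, hive_db, hive_table,
                       create_table=False, storage_stanza=None, mappers=1):
    """Assemble the sqoop import argument list shown above."""
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:netezza://%s:%s/%s" % (host, port, database),
        "--username=%s" % username.upper(),  # Netezza expects the username in all caps
        "--password=%s" % password,
        "--table", nz_table,
        "--hcatalog-database", hive_db,
        "--hcatalog-table", hive_table,
    ]
    if create_table:
        cmd.append("--create-hcatalog-table")
    if storage_stanza:
        cmd += ["--hcatalog-storage-stanza", storage_stanza]
    cmd += ["-m", str(mappers)]
    return cmd


# Placeholder values for illustration only
cmd = build_sqoop_import("nz-host", 5480, "SALESDB", "loader", "secret",
                         "ORDERS", "staging", "orders", create_table=True,
                         storage_stanza='stored as orc tblproperties ("orc.compress"="SNAPPY")')
print(" ".join(cmd))
```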
<p>The post <a href="https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/">Import Data from Netezza to Hive using sqoop</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/import-data-from-netezza-to-hive-using-sqoop/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>NiFi API to filter Processor Groups</title>
		<link>https://bigdataproc.com/nifi-api-to-filter-processor-groups/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=nifi-api-to-filter-processor-groups</link>
					<comments>https://bigdataproc.com/nifi-api-to-filter-processor-groups/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 24 Oct 2018 14:49:02 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NiFi]]></category>
		<category><![CDATA[NiFi-API]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=77</guid>

					<description><![CDATA[<p>Recently I was working on NiFi and realize that our Dev Instance is running too slow, reason being Developers forgot to cleanup their work.&#160; And&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/nifi-api-to-filter-processor-groups/">Continue reading<span class="screen-reader-text">NiFi API to filter Processor Groups</span></a></div>
<p>The post <a href="https://bigdataproc.com/nifi-api-to-filter-processor-groups/">NiFi API to filter Processor Groups</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Recently I was working on NiFi and realized that our dev instance was running too slow; the reason being that developers forgot to clean up their work.&nbsp;</p>



<p>So I used the NiFi API and wrote code in Python to identify some of the processor groups that we can delete. I will walk through the code here.</p>



<h2 class="wp-block-heading">How to Get an API Token?</h2>



<p>To access a NiFi endpoint behind security, you need an API token, which you then send in the header of every request.&nbsp;</p>



<p>The following code takes the username and password and, based on the response status, either returns the access token or prints the error and exits.</p>



<pre class="lang:python theme:twilight" title="NiFi API get API token">import requests


def get_token(username, password, host):
    url = "https://%s:8080/nifi-api/access/token" % host
    header = {"Content-Type": "application/x-www-form-urlencoded;charset=UTF-8"}
    data = {"username": username, "password": password}

    resp = requests.post(url, data=data, headers=header, verify=False)

    if resp.status_code not in (200, 201):
        print(resp.reason)
        print(resp.text)
        exit(1)
    # On success, the response body is the bearer token itself
    return resp.text
</pre>



<h2 class="wp-block-heading">How to Find Stopped Processor Groups?</h2>



<p>The following code finds every processor group that doesn&#8217;t have any processors running.</p>



<pre class="lang:python theme:twilight" title="Find Stopped processor Group using NiFi API">
import requests

import utils  # local helper module holding ACCESS_TOKEN


def find_stopped_processor(group_id, host):
    url = "https://%s:8080/nifi-api/process-groups/%s" % (host, group_id)
    header = {"Authorization": "Bearer %s" % utils.ACCESS_TOKEN}
    r = requests.get(url, headers=header, verify=False)
    resp = r.json()
    running_count = resp.get("runningCount")
    if int(running_count) == 0:
        # Report the id and name of the fully stopped group
        print("%s,%s" % (group_id, resp.get("status").get("name")))


def find_processor_group_stopped_processor(parent_processor, host):
    for processor_group in parent_processor:
        p = processor_group.get("processGroupStatusSnapshot")
        if len(p.get("processGroupStatusSnapshots")) > 0:
            # Recurse into nested process groups
            find_processor_group_stopped_processor(p.get("processGroupStatusSnapshots"), host)
        else:
            find_stopped_processor(p.get("id"), host)
</pre>
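<p>If you already have the full status snapshot for the flow, the same filtering can be done as a pure function without one API call per group. The following is only a sketch of that idea, run against a hypothetical payload whose field names follow the processGroupStatusSnapshot structure used above:</p>

```python
def collect_stopped_groups(snapshots, stopped=None):
    """Walk nested processGroupStatusSnapshots and collect the ids of
    leaf groups whose runningCount is zero."""
    if stopped is None:
        stopped = []
    for entry in snapshots:
        p = entry.get("processGroupStatusSnapshot")
        children = p.get("processGroupStatusSnapshots", [])
        if children:
            collect_stopped_groups(children, stopped)
        elif int(p.get("runningCount", 0)) == 0:
            stopped.append(p.get("id"))
    return stopped


# Hypothetical payload: one running leaf group and one fully stopped one
sample = [
    {"processGroupStatusSnapshot": {
        "id": "root", "runningCount": 3,
        "processGroupStatusSnapshots": [
            {"processGroupStatusSnapshot": {
                "id": "etl", "runningCount": 3,
                "processGroupStatusSnapshots": []}},
            {"processGroupStatusSnapshot": {
                "id": "old-demo", "runningCount": 0,
                "processGroupStatusSnapshots": []}},
        ]}},
]
print(collect_stopped_groups(sample))
# ['old-demo']
```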



<p>The post <a href="https://bigdataproc.com/nifi-api-to-filter-processor-groups/">NiFi API to filter Processor Groups</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/nifi-api-to-filter-processor-groups/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>What is Predicate PushDown in Hive?</title>
		<link>https://bigdataproc.com/what-is-predicate-pushdown-in-hive/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-is-predicate-pushdown-in-hive</link>
					<comments>https://bigdataproc.com/what-is-predicate-pushdown-in-hive/#comments</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Wed, 26 Sep 2018 20:07:37 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Hive]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=57</guid>

					<description><![CDATA[<p>Predicate Pushdown in hive is a feature to Push your predicate ( where condition) further up in the query. It tries to execute the expression&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/what-is-predicate-pushdown-in-hive/">Continue reading<span class="screen-reader-text">What is Predicate PushDown in Hive?</span></a></div>
<p>The post <a href="https://bigdataproc.com/what-is-predicate-pushdown-in-hive/">What is Predicate PushDown in Hive?</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Predicate pushdown in Hive is a feature that pushes your predicate (the where condition) closer to the data source, so the expression is evaluated as early as possible in the query plan.&nbsp;</p>



<p>Let&#8217;s try to understand this with an example. Consider two tables, <strong>product</strong> and <strong>sales</strong>, and the following question:&nbsp;<br></p>



<p>How many products of the brand <strong>Washington</strong> have been sold so far?<br></p>



<h2 class="wp-block-heading">Non-Optimized Query</h2>



<p>The following query answers the question above. However, if you are familiar with SQL, you will realize that this query is&nbsp;<strong>not optimized:</strong>&nbsp;it first joins the two tables and then applies the condition (predicate).</p>



<pre class="lang:mysql theme:twilight" title="Sample Query for PPD (Predicate Pushdown)">select sum(s.unit_sales) from foodmart.product p 
join 
	foodmart.sales_fact_dec_1998 s 
on 
	p.product_id = s.product_id
where 
	p.brand_name = "Washington"
</pre>



<h2 class="wp-block-heading">Optimized Query</h2>



<p>We could easily optimize the query above by applying the condition on the product table first and then joining the result to the sales table, as shown below.</p>



<pre class="lang:mysql theme:twilight">SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN (
	SELECT product_id, brand_name
	FROM foodmart.product
	WHERE 
		brand_name = "Washington"
	) p
ON 
	p.product_id = s.product_id
</pre>



<p>This is what PPD (predicate pushdown) does internally: if you have PPD enabled, your first query will automatically be converted into the second, optimized query.</p>



<p>Let&#8217;s see this in action. The product table has a total of 1560 rows (products), of which only 11 have the brand name Washington.</p>
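<p>To see why the row counts matter, here is a toy simulation in Python with made-up rows (not the real foodmart data): filtering before the join sends only the matching rows to the join stage, instead of every product row.</p>

```python
# 1560 products, 11 of them branded "Washington" (a toy stand-in for foodmart.product)
products = [{"product_id": i,
             "brand_name": "Washington" if i < 11 else "Other"}
            for i in range(1560)]

# Join-then-filter: every product row reaches the join stage
rows_to_join_unoptimized = len(products)

# Filter-then-join (what PPD produces): only matching rows reach the join
filtered = [p for p in products if p["brand_name"] == "Washington"]
rows_to_join_optimized = len(filtered)

print(rows_to_join_unoptimized, rows_to_join_optimized)
# 1560 11
```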



<p>For a better understanding, I have disabled <strong>vectorization</strong>. If you are not sure what vectorization is, please read the following blog post &#8211; <a href="http://bigdataproc.com/what-is-vectorization-in-hive/" target="_blank" rel="noreferrer noopener">What is vectorization? </a></p>



<h2 class="wp-block-heading">Running Query with PPD Disabled</h2>



<p>The following is the DAG of the first query with <strong>PPD disabled. <br></strong>Set the following parameter to false to disable PPD.</p>



<pre class="lang:mysql theme:twilight" title="disable ppd">set hive.optimize.ppd=false;
</pre>



<p>Notice that it reads all the rows from the product table and then passes them to the reducer for the join.</p>



<figure class="wp-block-image"><a href="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Disabled.png"><img decoding="async" width="300" height="266" src="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Disabled-300x266.png" alt="DAG when PPD (predicate pushdown) is disabled. " class="wp-image-59" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled-300x266.png 300w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled-768x682.png 768w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled-1024x909.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Disabled.png 1090w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption>DAG for first query when PPD is disabled</figcaption></figure>



<h2 class="wp-block-heading" id="mce_24">Running Query with PPD Enabled</h2>



<p>And the following is the DAG of the same query with <strong>PPD enabled.<br></strong>Set the following parameter to true to enable PPD.</p>



<pre class="lang:mysql theme:twilight" title="enable ppd">set hive.optimize.ppd=true;
</pre>



<p>Once we enable PPD, Hive first applies the condition on the product table and sends <strong>only 11 rows</strong> to the reducer for the join.</p>



<figure class="wp-block-image"><a href="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Enabled.png"><img decoding="async" width="300" height="273" src="http://allabouthadoop.net/wp-content/uploads/2018/09/DAG_PPD_Enabled-300x273.png" alt="DAG when PPD (predicate pushdown) is enabled. " class="wp-image-60" srcset="https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled-300x273.png 300w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled-768x699.png 768w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled-1024x932.png 1024w, https://bigdataproc.com/wp-content/uploads/2018/09/DAG_PPD_Enabled.png 1092w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption>DAG for first query when PPD is enabled</figcaption></figure>



<p>The post <a href="https://bigdataproc.com/what-is-predicate-pushdown-in-hive/">What is Predicate PushDown in Hive?</a> appeared first on <a href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/what-is-predicate-pushdown-in-hive/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
