<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>spark Archives - Big Data Processing</title>
	<atom:link href="https://bigdataproc.com/tag/spark/feed/" rel="self" type="application/rss+xml" />
	<link>https://bigdataproc.com/tag/spark/</link>
	<description>Big Data Solution for GCP, AWS, Azure and on-prem</description>
	<lastBuildDate>Sun, 15 Jan 2023 04:56:09 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.3.2</generator>
	<item>
		<title>Spark &#8211; How to rename multiple columns in DataFrame</title>
		<link>https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=spark-how-to-rename-multiple-columns-in-dataframe</link>
					<comments>https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Sat, 23 May 2020 17:01:00 +0000</pubDate>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[apache-spark]]></category>
		<category><![CDATA[pyspark]]></category>
		<category><![CDATA[spark dataframe]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=161</guid>

					<description><![CDATA[<p>In the last post we showed how to apply a function to multiple columns. And if you have done that, you might have multiple column&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/">Continue reading<span class="screen-reader-text">Spark &#8211; How to rename multiple columns in DataFrame</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/">Spark &#8211; How to rename multiple columns in DataFrame</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In the last <a rel="noreferrer noopener" href="http://allabouthadoop.net/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/" target="_blank">post</a> we showed how to apply a function to multiple columns. If you have done that, you now have several new columns holding the desired data, and you may want to rename them back to the original names.</p>



<p>Let&#8217;s say you have the following DataFrame and you want to rename all of its columns.</p>



<pre class="lang:python theme:twilight" title="rename multiple columns in pyspark dataframe">&gt;&gt;&gt; df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- joining_dt: date (nullable = true)
</pre>



<p>The first thing you need is a <code>map</code> from <code>old names</code> to <code>new names</code>, plus a little functional programming.</p>



<h2 class="wp-block-heading">How to rename multiple columns in PySpark</h2>



<pre class="lang:python theme:twilight" title="rename multiple columns in pyspark dataframe">from pyspark.sql.functions import col
# Map each old column name to its new name
col_rename = {"age": "new_age", "name": "new_name", "joining_dt": "new_joining_dt"}
# Alias every column; names missing from the map keep their original name
df_with_col_renamed = df.select([col(c).alias(col_rename.get(c, c)) for c in df.columns])
</pre>



<pre class="lang:python theme:twilight" title="rename multiple columns in pyspark dataframe">&gt;&gt;&gt; df_with_col_renamed.printSchema()
root
 |-- new_name: string (nullable = true)
 |-- new_age: integer (nullable = true)
 |-- new_joining_dt: date (nullable = true)

</pre>
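


<p>The key detail in the snippet above is <code>col_rename.get(c, c)</code>: a column that is missing from the map keeps its original name. Here is a minimal plain-Python sketch of just that lookup, with no Spark involved. The column list, including the extra <code>dept</code> column, is made up for illustration.</p>

```python
# Hypothetical column list standing in for df.columns; "dept" is not in the map.
columns = ["name", "age", "joining_dt", "dept"]
col_rename = {"age": "new_age", "name": "new_name", "joining_dt": "new_joining_dt"}

# dict.get(key, default) returns the default when the key is absent,
# so unmapped columns fall through unchanged.
new_names = [col_rename.get(c, c) for c in columns]
print(new_names)
```

<p>This means the map only has to list the columns you actually want to rename.</p>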



<h2 class="wp-block-heading">How to rename multiple columns in Spark using Scala</h2>



<pre class="lang:scala theme:twilight" title="How to rename multiple columns in spark using Scala">import org.apache.spark.sql.functions.col

// Map each old column name to its new name
val colToRename = Map("age" -&gt; "new_age",
                      "name" -&gt; "new_name",
                      "joining_dt" -&gt; "new_joining_dt")

// Alias every column; names missing from the map are kept as-is
val newDf = df.select(
  df.columns.map { oldName =&gt;
    col(oldName).alias(colToRename.getOrElse(oldName, oldName))
  }: _*)
</pre>
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/spark-how-to-rename-multiple-columns-in-dataframe/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Spark &#8211; How to apply a function to multiple columns on DataFrame?</title>
		<link>https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe</link>
					<comments>https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/#respond</comments>
		
		<dc:creator><![CDATA[Gaurang]]></dc:creator>
		<pubDate>Sun, 17 May 2020 20:10:15 +0000</pubDate>
				<category><![CDATA[spark]]></category>
		<category><![CDATA[apache-spark]]></category>
		<category><![CDATA[spark dataframe]]></category>
		<guid isPermaLink="false">http://allabouthadoop.net/?p=130</guid>

					<description><![CDATA[<p>Let&#8217;s say you have a Spark DataFrame and you want to apply a function to multiple columns. One way is to use withColumn multiple&#8230;</p>
<div class="more-link-wrapper"><a class="more-link" href="https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/">Continue reading<span class="screen-reader-text">Spark &#8211; How to apply a function to multiple columns on DataFrame?</span></a></div>
<p>The post <a rel="nofollow" href="https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/">Spark &#8211; How to apply a function to multiple columns on DataFrame?</a> appeared first on <a rel="nofollow" href="https://bigdataproc.com">Big Data Processing </a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Let&#8217;s say you have a Spark DataFrame and you want to apply a function to multiple columns. One way is to call <code>withColumn</code> multiple times. That works when you have only a few columns and you know the column names in advance. Otherwise, it&#8217;s tedious and error-prone.</p>



<p>So let&#8217;s see how to do that.</p>



<pre class="wp-block-preformatted">val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
+-------------+------------+------+
|selling_price|market_price|profit|
+-------------+------------+------+
|         $100|         $90|   $10|
+-------------+------------+------+

</pre>






<p>Suppose you have a Spark DataFrame like the one above, but with more than 50 such columns, and you want to remove the <code>$</code> character and convert the datatype to <code>Decimal</code>. Rather than writing 50 lines of code, you can do that using <code>fold</code> in fewer than 5 lines.</p>



<p>First, create a list that pairs each new column name (yes, you need a new column name) with the function you want to apply. I just appended <code>_new</code> to the existing column name so it&#8217;s easier to rename later.<br>Next, use the <code>foldLeft</code> method to apply each function from the list to the DataFrame in turn, carrying the updated DataFrame from one step to the next.</p>



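<pre class="lang:scala theme:twilight" title="apply function to multiple columns on spark dataframe">import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, regexp_replace}
import org.apache.spark.sql.types.DataTypes._
// toDF needs spark.implicits._, which spark-shell imports automatically

val df = List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.show

// Target type for the cleaned columns: decimal(10,4)
val decimalType = createDecimalType(10, 4)

// Build one (new column name, expression) pair per existing column
val operations = ListBuffer[(String, Column)]()
df.columns.foreach { colName =&gt;
  // Strip the literal "$" and cast the result to decimal
  val operation = (s"${colName}_new", regexp_replace(col(colName), lit("\\$"), lit("")).cast(decimalType))
  operations += operation
}

// Fold over the list, adding one column to the DataFrame per step
val dfWithNewColumns = operations.foldLeft(df) { (tempDF, listValue) =&gt;
  tempDF.withColumn(listValue._1, listValue._2)
}

dfWithNewColumns.show</pre>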



<p>Let&#8217;s see if that worked.</p>



<pre class="lang:scala theme:twilight" title="apply function to multiple columns on spark dataframe"> 
scala&gt; dfWithNewColumns.printSchema
root
 |-- selling_price: string (nullable = true)
 |-- market_price: string (nullable = true)
 |-- profit: string (nullable = true)
 |-- selling_price_new: decimal(10,4) (nullable = true)
 |-- market_price_new: decimal(10,4) (nullable = true)
 |-- profit_new: decimal(10,4) (nullable = true)


scala&gt; dfWithNewColumns.show
+-------------+------------+------+-----------------+----------------+----------+
|selling_price|market_price|profit|selling_price_new|market_price_new|profit_new|
+-------------+------------+------+-----------------+----------------+----------+
|         $100|         $90|   $10|         100.0000|         90.0000|   10.0000|
+-------------+------------+------+-----------------+----------------+----------+

</pre>
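


<p>If the fold itself is the unfamiliar part, the following plain-Python sketch may help. It is not Spark code: a plain dict stands in for a single row, <code>functools.reduce</code> plays the role of <code>foldLeft</code>, and the helper names <code>strip_dollar</code> and <code>add_column</code> are made up for illustration.</p>

```python
from decimal import Decimal
from functools import reduce

# A plain dict stands in for one DataFrame row; this is not Spark code.
row = {"selling_price": "$100", "market_price": "$90", "profit": "$10"}

def strip_dollar(value):
    # Remove the "$" sign and convert the remaining text to a Decimal.
    return Decimal(value.replace("$", ""))

# One (new column name, source column name) pair per column,
# mirroring the `operations` list in the Scala version.
operations = [(f"{name}_new", name) for name in row]

def add_column(acc, op):
    # Return a new dict with one extra key, much like withColumn
    # returns a new DataFrame instead of modifying the old one.
    new_name, source = op
    return {**acc, new_name: strip_dollar(acc[source])}

# reduce threads the accumulator through every operation, left to right,
# starting from the original row.
result = reduce(add_column, operations, row)
print(result)
```

<p>Each step receives the result of the previous one, so the original columns are kept and the three <code>_new</code> columns are added one after another.</p>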
]]></content:encoded>
					
					<wfw:commentRss>https://bigdataproc.com/spark-how-to-apply-a-function-to-multiple-columns-on-spark-dataframe/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
