
Spark – How to rename multiple columns in DataFrame

In the last post we showed how to apply a function to multiple columns. If you did that, you now have multiple columns with the desired data, but under new names, and you may want to rename them back to the original names.

Let’s say you have the following DataFrame and you want to rename all of its columns.

>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- joining_dt: date (nullable = true)

The first thing you need is a map from the old column names to the new ones, plus a small bit of functional programming.

How to rename multiple columns in PySpark

from pyspark.sql.functions import col

# Map old column names to new ones; columns not in the map keep their name.
col_rename = {"age": "new_age", "name": "new_name", "joining_dt": "new_joining_dt"}

# Select every column, aliasing it to its mapped name (or itself if unmapped).
df_with_col_renamed = df.select([col(c).alias(col_rename.get(c, c)) for c in df.columns])
>>> df_with_col_renamed.printSchema()
root
 |-- new_name: string (nullable = true)
 |-- new_age: integer (nullable = true)
 |-- new_joining_dt: date (nullable = true)

How to rename multiple columns in Spark using Scala

import org.apache.spark.sql.functions.col

// Map old column names to new ones; columns not in the map keep their name.
val colToRename = Map("age" -> "new_age",
                      "name" -> "new_name",
                      "joining_dt" -> "new_joining_dt")

// Select every column, aliasing it to its mapped name (or itself if unmapped).
val newDf = df.select(
  df.columns.map { oldName =>
    col(oldName).alias(colToRename.getOrElse(oldName, oldName))
  }: _*
)
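
If you prefer withColumnRenamed, you can fold the same map over the DataFrame instead of building a select. A minimal sketch, assuming the df and colToRename defined above (renamedDf is a hypothetical name):

// Alternative sketch: fold withColumnRenamed over the rename map.
// withColumnRenamed is a no-op for column names that don't exist, so extra keys are harmless.
val renamedDf = colToRename.foldLeft(df) { case (tempDF, (oldName, newName)) =>
  tempDF.withColumnRenamed(oldName, newName)
}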

Spark – How to apply a function to multiple columns on DataFrame?

Let’s say you have a Spark DataFrame and you want to apply a function to multiple columns. One way is to call withColumn multiple times. That works when you have only a few columns and you know the column names in advance; otherwise it is tedious and error-prone.

So let’s see how to do that.

val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
+-------------+------------+------+
|selling_price|market_price|profit|
+-------------+------------+------+
|         $100|         $90|   $10|
+-------------+------------+------+
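
For contrast, the manual approach mentioned above would look like this; a sketch with a hypothetical manualDf name, repeating the same expression once per column:

import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.DataTypes.createDecimalType

// One withColumn call per column: fine for three columns, unmanageable for fifty.
val manualDf = df
  .withColumn("selling_price_new", regexp_replace(col("selling_price"), "\\$", "").cast(createDecimalType(10, 4)))
  .withColumn("market_price_new", regexp_replace(col("market_price"), "\\$", "").cast(createDecimalType(10, 4)))
  .withColumn("profit_new", regexp_replace(col("profit"), "\\$", "").cast(createDecimalType(10, 4)))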

Let’s say you have a Spark DataFrame like the one above, but with more than 50 such columns, and you want to remove the $ character and convert the data type to Decimal. Rather than writing 50 lines of code, you can do it with a fold in fewer than 5 lines.

First, create a list of pairs holding a new column name (yes, you need a new column name) and the expression you want to apply. I just appended _new to the existing column name so it’s easier to rename back later. Then use the foldLeft method to apply each entry of that list to the DataFrame, one withColumn call at a time.

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.DataTypes._

val df = List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.show

// Build (new column name, expression) pairs: strip the $ and cast to Decimal.
val operations = ListBuffer[(String, org.apache.spark.sql.Column)]()
val colNames = df.columns
val decimalType = createDecimalType(10, 4)
colNames.foreach { colName =>
  // "\\$" escapes the $ so the regex matches a literal dollar sign
  val operation = (s"${colName}_new", regexp_replace(col(colName), "\\$", "").cast(decimalType))
  operations += operation
}

// foldLeft threads the intermediate DataFrame through one withColumn call per entry.
val dfWithNewColumns = operations.foldLeft(df) { (tempDF, listValue) =>
  tempDF.withColumn(listValue._1, listValue._2)
}

dfWithNewColumns.show

Let’s see if that worked.

 
scala> dfWithNewColumns.printSchema
root
 |-- selling_price: string (nullable = true)
 |-- market_price: string (nullable = true)
 |-- profit: string (nullable = true)
 |-- selling_price_new: decimal(10,4) (nullable = true)
 |-- market_price_new: decimal(10,4) (nullable = true)
 |-- profit_new: decimal(10,4) (nullable = true)


scala> dfWithNewColumns.show
+-------------+------------+------+-----------------+----------------+----------+
|selling_price|market_price|profit|selling_price_new|market_price_new|profit_new|
+-------------+------------+------+-----------------+----------------+----------+
|         $100|         $90|   $10|         100.0000|         90.0000|   10.0000|
+-------------+------------+------+-----------------+----------------+----------+
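
If you then want to drop the original string columns and restore the original names, the select/alias trick from the rename post above finishes the job. A minimal sketch, assuming the dfWithNewColumns and colNames from this example (cleanedDf is a hypothetical name):

// Keep only the converted columns, stripping the _new suffix.
val cleanedDf = dfWithNewColumns.select(
  colNames.map(c => col(s"${c}_new").alias(c)): _*
)
cleanedDf.printSchema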
