
Spark – How to rename multiple columns in DataFrame

In the last post, we showed how to apply a function to multiple columns. If you followed that approach, you now have multiple columns with the desired data but new names, and you might want to rename them back to the original names.

Let’s say you have the following DataFrame and you want to rename all of its columns to different names.

>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- joining_dt: date (nullable = true)

The first thing you need is a map from the old names to the new names, plus a little functional programming.

How to rename multiple columns in PySpark

from pyspark.sql.functions import col

# old name -> new name; col_rename.get(c, c) falls back to the
# original name for any column that is not in the map
col_rename = {"age": "new_age", "name": "new_name", "joining_dt": "new_joining_dt"}
df_with_col_renamed = df.select([col(c).alias(col_rename.get(c, c)) for c in df.columns])
>>> df_with_col_renamed.printSchema()
root
 |-- new_name: string (nullable = true)
 |-- new_age: integer (nullable = true)
 |-- new_joining_dt: date (nullable = true)

How to rename multiple columns in Spark using Scala

import org.apache.spark.sql.functions.col

// old name -> new name; getOrElse keeps the original name for any
// column that is not in the map
val colToRename = Map("age" -> "new_age",
                      "name" -> "new_name",
                      "joining_dt" -> "new_joining_dt")
val newDf = df.select(
  df.columns.map { oldName =>
    col(oldName).alias(colToRename.getOrElse(oldName, oldName))
  }: _*)
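
As an aside, Spark 3.4 added a built-in withColumnsRenamed (in both the Scala and Python APIs) that accepts the same kind of map, so on a new enough version you can skip the select entirely. A minimal sketch, assuming Spark 3.4+:

// Spark 3.4+: rename several columns in one call; columns missing
// from the map keep their original names
val newDf2 = df.withColumnsRenamed(colToRename)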

Spark – How to apply a function to multiple columns on DataFrame?

Let’s say you have a Spark DataFrame and you want to apply a function to multiple columns. One way is to call withColumn once per column. That’s fine when you have only a few columns and you know their names in advance; otherwise it’s tedious and error-prone.

So let’s see how to do that.

val df = List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.show
+-------------+------------+------+
|selling_price|market_price|profit|
+-------------+------------+------+
|         $100|         $90|   $10|
+-------------+------------+------+

Let’s say you have a DataFrame like the one above, but with more than 50 such columns, and you want to remove the $ character and convert the datatype to Decimal. Done naively, that means one withColumn call per column, 50 lines of near-identical code; with a fold you can do it in a few lines.
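
For reference, the repetitive version looks something like this (a sketch I added for illustration, not from the original post):

import org.apache.spark.sql.functions.{col, regexp_replace}

// the repetitive approach: one withColumn call per column
val naive = df
  .withColumn("selling_price_new",
    regexp_replace(col("selling_price"), "\\$", "").cast("decimal(10,4)"))
  .withColumn("market_price_new",
    regexp_replace(col("market_price"), "\\$", "").cast("decimal(10,4)"))
  .withColumn("profit_new",
    regexp_replace(col("profit"), "\\$", "").cast("decimal(10,4)"))
// ...and so on, once per column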

First, create a list of pairs: a new column name (yes, you need a new name) and the expression you want to apply to that column. I just appended _new to each existing column name so it’s easier to rename later.
Then use the foldLeft method to apply each entry of that list to the DataFrame, one withColumn call at a time.

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.DataTypes._

val df = List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.show

// one (new column name, expression) pair per column
val operations = ListBuffer[(String, org.apache.spark.sql.Column)]()
val colNames = df.columns
val decimalType = createDecimalType(10, 4)
colNames.foreach { colName =>
  // "\\$" escapes the dollar sign, which is otherwise a regex anchor
  val operation = (s"${colName}_new",
    regexp_replace(col(colName), "\\$", "").cast(decimalType))
  operations += operation
}

// fold the list of operations into the DataFrame, one withColumn at a time
val dfWithNewColumns = operations.foldLeft(df) { (tempDF, listValue) =>
  tempDF.withColumn(listValue._1, listValue._2)
}

dfWithNewColumns.show

Let’s see if that worked.

 
scala> dfWithNewColumns.printSchema
root
 |-- selling_price: string (nullable = true)
 |-- market_price: string (nullable = true)
 |-- profit: string (nullable = true)
 |-- selling_price_new: decimal(10,4) (nullable = true)
 |-- market_price_new: decimal(10,4) (nullable = true)
 |-- profit_new: decimal(10,4) (nullable = true)


scala> dfWithNewColumns.show
+-------------+------------+------+-----------------+----------------+----------+
|selling_price|market_price|profit|selling_price_new|market_price_new|profit_new|
+-------------+------------+------+-----------------+----------------+----------+
|         $100|         $90|   $10|         100.0000|         90.0000|   10.0000|
+-------------+------------+------+-----------------+----------------+----------+
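
Two closing notes. Each withColumn call adds a projection to the query plan, so on a very wide DataFrame a single select is often cheaper; and since every new column just has a _new suffix, the alias trick from the rename post gets you back to the original names. A minimal sketch of both (dfSelect and dfRenamedBack are my names, and the single-select variant is an alternative I’m adding, not part of the original walkthrough):

import org.apache.spark.sql.functions.{col, regexp_replace}

// same result in one projection instead of one withColumn per column
val dfSelect = df.select(colNames.map { c =>
  regexp_replace(col(c), "\\$", "").cast(decimalType).alias(s"${c}_new")
}: _*)

// keep only the converted columns, renamed back to the original names
val dfRenamedBack = dfWithNewColumns.select(colNames.map { c =>
  col(s"${c}_new").alias(c)
}: _*)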
