Apache Spark DataFrame: Basic Data Manipulation Using Scala

Overview of this tutorial

  • Replace data with a new value in a DataFrame
  • Filter rows with basic conditions in a DataFrame
  • Type cast a column value in a DataFrame

To start Apache Spark and read data from a CSV file, follow this post.
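
If that post is not handy, here is a minimal sketch of loading a CSV file into a DataFrame; the SparkSession setup and the file name data.csv are assumptions for illustration:

import org.apache.spark.sql.SparkSession

// Assumed setup: a local SparkSession (data.csv is a placeholder path)
val spark = SparkSession.builder()
  .appName("DataFrameBasics")
  .master("local[*]")
  .getOrCreate()

// Without a header row, Spark names the columns _c0, _c1, _c2, ...
val csv = spark.read.csv("data.csv")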

Replace data with a new value in a DataFrame

import org.apache.spark.sql.functions._

var data = csv.toDF() // see the earlier post for loading the DataFrame from CSV
data = data.select("*")
  .withColumn("_c1", when(data("_c1") === "NA", 0.0)
    .otherwise(data("_c1")))
  .toDF()
  • import org.apache.spark.sql.functions._: imports the Spark SQL functions, such as when, used below
  • select("*"): like SELECT * in SQL, this selects all the columns of the DataFrame
  • withColumn(name, value): the snippet above overwrites the column named _c1; passing a name that does not exist yet creates a new column instead (see the sketch after this list)
  • when() and otherwise() work together like an if/else
  • toDF() returns a new DataFrame, which we assign back to the mutable variable data
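
As noted above, a name that is not already a column makes withColumn add a new column instead of overwriting one. A minimal sketch, where _c1_clean is a hypothetical column name:

// Keep the original _c1 and add a cleaned copy alongside it
val cleaned = data.withColumn("_c1_clean",
  when(data("_c1") === "NA", 0.0).otherwise(data("_c1")))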

Filter rows with basic conditions in a DataFrame

data.select("_c0", "_c1", "_c22")
.filter(data("_c1") =!= "Sub Total - SCS")
.toDF();
  • select("_c0", "_c1", "_c22") can be translated as select a,b,c as traditional sql
  • filter (the condition) filter taken the condition to filter on
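
Conditions can also be combined: && and || on columns behave like AND and OR in SQL. A sketch, where the second condition is a hypothetical example:

// Combine filter conditions with && (AND) or || (OR)
data
  .filter(data("_c1") =!= "Sub Total - SCS" && data("_c0") =!= "NA")
  .toDF()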

Type cast a column value in a DataFrame

import org.apache.spark.sql.types.FloatType

data
  .withColumn("_c22", data("_c22").cast(FloatType))
  .toDF()
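
To confirm the cast took effect, printSchema shows each column's type:

// Inspect the schema; _c22 should now be reported as float
data
  .withColumn("_c22", data("_c22").cast(FloatType))
  .printSchema()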

Hopefully this code is self-explanatory.