Apache Spark DataFrame: Basic Data Manipulation Using Scala
Overview of this tutorial
- Replace data with a new value in a DataFrame
- Filter row values with basic conditions in a DataFrame
- Type cast a column value in a DataFrame
To start Apache Spark and read data from a CSV file, follow this post.
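If you just want something to run against, here is a minimal sketch of loading a CSV into a DataFrame (the file path, app name, and options below are assumptions, not taken from that post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameBasics")   // any app name works; this one is just an example
  .master("local[*]")           // local mode, handy for trying things out
  .getOrCreate()

// "data.csv" is a placeholder path; with no header row Spark names the columns _c0, _c1, ...
val csv = spark.read
  .option("header", "false")
  .csv("data.csv")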
Replace data with a new value in a DataFrame
import org.apache.spark.sql.functions._

var data = csv.toDF() // refer to the earlier tutorial (or the sketch above) to load the DataFrame from a CSV
data = data.select("*")
  .withColumn("_c1", when(data("_c1") === "NA", 0.0)
  .otherwise(data("_c1"))).toDF()
- import org.apache.spark.sql.functions._ : this import brings in the Spark SQL functions used below (when, otherwise, etc.)
- select("*") : same as select * in SQL, it selects all the columns of the DataFrame
- withColumn(name, value) : the snippet above overwrites the column named _c1; pass a new column name instead if you do not want to overwrite it (see the sketch after this list)
- when() and otherwise() : work together like an if/else condition
- toDF() : creates a new DataFrame, which is assigned back to the mutable variable data
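As mentioned above, if you would rather keep the original column untouched, pass a new column name to withColumn; a small sketch (the column name _c1_clean is just an illustration):

// write the cleaned values into a new column; _c1 stays as it was
data = data.withColumn("_c1_clean",
  when(data("_c1") === "NA", 0.0).otherwise(data("_c1")))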
Filter row values with basic conditions in a DataFrame
data.select("_c0", "_c1", "_c22")
.filter(data("_c1") =!= "Sub Total - SCS")
.toDF();
select("_c0", "_c1", "_c22")
can be translated as select a,b,c as traditional sqlfilter (the condition)
filter taken the condition to filter on
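Conditions can also be combined with boolean operators inside a single filter; a small sketch (the second condition and its value are only examples):

// keep rows where _c1 is not the sub-total label and _c0 is not "NA"
data.select("_c0", "_c1", "_c22")
  .filter((data("_c1") =!= "Sub Total - SCS") && (data("_c0") =!= "NA"))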
Type cast a column value in a DataFrame
import org.apache.spark.sql.types.FloatType

data
  .withColumn("_c22", data("_c22").cast(FloatType))
  .toDF()
Hope this part of the code is self-explanatory.
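If you want to double-check the cast, printSchema reports the column types; the cast can also be written with a type name string:

// same cast written with the "float" type name, then verify the schema
data = data.withColumn("_c22", data("_c22").cast("float"))
data.printSchema() // _c22 should now be listed as float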