Overview of this tutorial

  • Replace values with a new value in a DataFrame
  • Filter rows with basic conditions in a DataFrame
  • Type-cast a column value in a DataFrame

To start Apache Spark and read data from a CSV file, follow this post.

Replace values with a new value in a DataFrame

import org.apache.spark.sql.functions._

var data = csv.toDF() // refer to the earlier tutorial to load a DataFrame from CSV
data = data.select("*")
  .withColumn("_c1", when(data("_c1") === "NA", 0.0)
    .otherwise(data("_c1")))
  .toDF()
  • import org.apache.spark.sql.functions._: this import brings in the Spark SQL functions used here, such as when and otherwise
  • select("*"): like SELECT * in SQL, this selects all columns of the DataFrame
  • withColumn(name, condition): the snippet above overwrites the column named _c1; pass a new column name instead if you do not want to overwrite it
  • when() and otherwise() work together like an if/else: when the value of _c1 is "NA" it is replaced with 0.0, otherwise the original value is kept
  • toDF() returns a new DataFrame, which is assigned back to the mutable variable data
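The when/otherwise pair behaves like an if/else applied to every row of the column. A rough plain-Scala sketch of that behavior, using made-up sample values rather than the tutorial's CSV:

```scala
// Hypothetical sample values for column _c1
val c1 = Seq("12.5", "NA", "7.0")

// Same idea as when(_c1 === "NA", 0.0).otherwise(_c1), row by row
val replaced = c1.map(v => if (v == "NA") "0.0" else v)

println(replaced) // List(12.5, 0.0, 7.0)
```

The real Spark version runs the same branching inside the engine, without pulling rows into a local collection.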

Filter rows with basic conditions in a DataFrame

data.select("_c0", "_c1", "_c22")
  .filter(data("_c1") =!= "Sub Total - SCS")
  .toDF()
  • select("_c0", "_c1", "_c22") translates to SELECT a, b, c in traditional SQL
  • filter(condition) keeps only the rows that satisfy the condition; =!= is Spark's column inequality operator, so rows where _c1 equals "Sub Total - SCS" are dropped
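The filter step is a row-wise predicate: each row is kept or dropped based on the condition. A plain-Scala sketch of the same idea, with made-up rows standing in for the tutorial's CSV data:

```scala
// Hypothetical rows in the shape (_c0, _c1, _c22)
val rows = Seq(
  ("A", "Detail", "1.5"),
  ("B", "Sub Total - SCS", "9.9"),
  ("C", "Detail", "2.5")
)

// Same idea as filter(data("_c1") =!= "Sub Total - SCS")
val kept = rows.filter { case (_, c1, _) => c1 != "Sub Total - SCS" }

println(kept.map(_._1)) // List(A, C)
```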

Type-cast a column value in a DataFrame

import org.apache.spark.sql.types.FloatType

data
  .withColumn("_c22", data("_c22").cast(FloatType))
  .toDF()
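Casting a string column to FloatType parses each value into a float; in Spark, values that cannot be parsed become null. A rough plain-Scala analogy of that parsing behavior, on hypothetical values (using Option in place of Spark's null):

```scala
// Hypothetical string values for column _c22
val c22 = Seq("1.5", "2.25", "NA")

// Rough analogy of cast(FloatType): parse to Float, None when parsing fails
val casted = c22.map(v => scala.util.Try(v.toFloat).toOption)

println(casted) // List(Some(1.5), Some(2.25), None)
```

This is why replacing "NA" before casting (as in the first section) can be useful if you want 0.0 instead of null.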

Hopefully this part of the code is self-explanatory.