Introduction to Apache Spark in Scala

What is Apache Spark?
Apache Spark is a big data processing framework developed under the Apache Software Foundation. Spark ships with several built-in libraries, such as Spark SQL, Spark Streaming, Spark MLlib, and GraphX, to handle big data workloads.
Overview of topics covered in this tutorial
- Adding Spark as a dependency to a Scala project
- Starting a standalone instance of Spark in Scala
- Creating a local Spark instance
- Reading a CSV file with Spark
- Selecting the required columns from a DataFrame
Adding Spark as a dependency in Scala
To learn how to start a Scala project with sbt, read here.
Replace the contents of build.sbt with the following snippet:
name := "scalaExample"
version := "0.1"
scalaVersion := "2.11.11"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1" % "compile",
  "org.apache.spark" %% "spark-mllib" % "2.3.1" % "compile"
)
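Once build.sbt is updated, you can fetch the dependencies and compile the project from the command line with the standard sbt commands (your IDE may do this automatically on reload):

sbt update
sbt compile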
Starting a standalone instance of Spark in Scala
Create an Application.scala file and paste the following snippet:
import org.apache.spark.sql.SparkSession

object Application extends App {
  // Build (or reuse) a Spark session that runs locally inside this JVM
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()
}
object Application extends App {}
is like the main method of an application, as we would write in Java; the body of the object runs when the program starts.
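For comparison, here is a minimal sketch of the equivalent explicit-main version (same session setup as above):

import org.apache.spark.sql.SparkSession

object Application {
  def main(args: Array[String]): Unit = {
    // identical Spark session setup to the App version above
    val spark = SparkSession
      .builder()
      .appName("test")
      .config("spark.master", "local")
      .getOrCreate()
  }
}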
val spark
is an immutable variable, comparable to a public static final field in Java: once assigned, its value cannot be modified. So the snippet above creates a running Spark session.
spark.close()
will close the Spark session.
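Note that reassigning the val is a compile error, while closing the session is an ordinary method call. A small sketch, assuming the spark value from the snippet above:

// spark = SparkSession.builder().getOrCreate()  // does not compile: reassignment to val
spark.close()  // stops the session and its underlying SparkContext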
Reading a CSV file with Spark
Create a separate file called Readcsv.scala and paste the following snippet:
import org.apache.spark.sql.{DataFrame, SparkSession}

class Readcsv(spark: SparkSession) {
  // Reads a headerless CSV file at the given path into a DataFrame
  def read(path: String): DataFrame = {
    spark.read.format("csv").option("header", false).load(path)
  }
}
The Readcsv class has a constructor parameter taking a SparkSession as input, and
a read function is defined with a String parameter holding the path where the file is stored; this function reads the CSV and returns a Spark SQL DataFrame object.
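If your file starts with a header row, Spark can use it for column names instead of the defaults _c0, _c1, and so on. A hedged variant you could add inside Readcsv (header and inferSchema are standard options of Spark's CSV data source; readWithHeader is a name invented here):

def readWithHeader(path: String): DataFrame = {
  spark.read
    .format("csv")
    .option("header", true)       // use the first line as column names
    .option("inferSchema", true)  // let Spark guess column types
    .load(path)
}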
Selecting the required columns from the DataFrame
val csvreader = new Readcsv(spark)
var data = csvreader.read("/path/to/file.csv")
data.show()
data = data.select("_c0", "_c1", "_c2")
val list = data.collectAsList()
We create a Readcsv instance by passing in the Spark session created in the Application object above, then declare a mutable variable called data and read the contents of the CSV into it.
data.show()
prints the first 20 rows.
data.select("_c0", "_c1", "_c2")
selects the columns we need.
data.collectAsList()
collects the resulting rows into a Java list on the driver.
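Once collected, the rows can be processed as plain Row objects on the driver. A minimal sketch, assuming the selected columns hold string data (JavaConverters is the standard bridge for walking a java.util.List from Scala 2.11):

import scala.collection.JavaConverters._

for (row <- list.asScala) {
  // getString(i) reads column i of the row as a String
  println(row.getString(0) + ", " + row.getString(1) + ", " + row.getString(2))
}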
What is a DataFrame?
As quoted from the mapr.com blog: "A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases."
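To make that definition concrete, here is a small sketch that builds a DataFrame from an in-memory sequence with named columns and runs a filter and an aggregate (the name and age columns are invented for this example; toDF and the $ column syntax come from spark.implicits):

import org.apache.spark.sql.SparkSession

object DataFrameExample extends App {
  val spark = SparkSession
    .builder()
    .appName("dfExample")
    .config("spark.master", "local")
    .getOrCreate()
  import spark.implicits._

  // A DataFrame with named columns built from a local sequence
  val people = Seq(("alice", 30), ("bob", 25), ("carol", 35)).toDF("name", "age")

  people.filter($"age" > 26).show()      // keep rows where age > 26
  people.groupBy("name").count().show()  // group and aggregate

  spark.close()
}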