Introduction to Apache Spark in Scala

What is Apache Spark?

Apache Spark is a big data processing framework developed under the Apache Software Foundation. Spark ships with built-in libraries such as Spark SQL, Spark Streaming, Spark MLlib, and GraphX to handle big data workloads.

Overview of what this tutorial covers

  • Adding the Spark dependency to a Scala project
  • Starting a standalone Spark instance in Scala
  • Creating a local Spark instance
  • Reading a CSV file with Spark
  • Selecting the required columns from a DataFrame

Adding Spark as a dependency in Scala

To start a Scala project with sbt, read here.

Replace the contents of build.sbt with the following snippet:

name := "scalaExample"

version := "0.1"

scalaVersion := "2.11.11"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1" % "compile",
   "org.apache.spark" %% "spark-mllib" % "2.3.1" % "compile"
)
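Note that SparkSession and DataFrame actually live in the spark-sql module. spark-mllib pulls it in transitively, but if you do not need MLlib you could declare spark-sql directly instead, for example:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1" % "compile",
  "org.apache.spark" %% "spark-sql" % "2.3.1" % "compile"
)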

Starting a standalone Spark instance in Scala

Create an Application.scala file and paste the following snippet:

import org.apache.spark.sql.SparkSession

object Application extends App {

  // Build a SparkSession that runs Spark locally inside this JVM
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

}

object Application extends App {} is Scala's equivalent of the main method we use in Java.

The variable spark is declared with val, so it is immutable, comparable to a public static final variable in Java: once assigned, it cannot be reassigned. The snippet above creates a running Spark session; spark.close() will close it.
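As a minimal sketch of that lifecycle, the session can be closed in a finally block so it shuts down even when the job fails (the try body here is just a placeholder):

import org.apache.spark.sql.SparkSession

object Application extends App {
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  try {
    // ... run your Spark work here ...
  } finally {
    spark.close() // releases the underlying SparkContext
  }
}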

Reading a CSV file with Spark

Create a separate file called Readcsv.scala:

import org.apache.spark.sql.{DataFrame, SparkSession}

class Readcsv(spark: SparkSession) {

  // Reads a headerless CSV file at the given path into a DataFrame
  def read(path: String): DataFrame =
    spark.read.format("csv").option("header", false).load(path)
}

The Readcsv class takes a SparkSession as a constructor parameter. Its read function takes a String parameter, the path where the file is stored; it reads the CSV and returns a Spark SQL DataFrame.
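If your CSV file has a header row, a variant might look like the following sketch (ReadcsvWithHeader and readWithHeader are hypothetical names; header and inferSchema are standard options of Spark's CSV reader):

import org.apache.spark.sql.{DataFrame, SparkSession}

class ReadcsvWithHeader(spark: SparkSession) {

  // Uses the first row as column names and lets Spark infer column types
  def readWithHeader(path: String): DataFrame =
    spark.read
      .format("csv")
      .option("header", true)      // first line holds the column names
      .option("inferSchema", true) // sample the data to guess column types
      .load(path)
}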

Selecting the required columns from the DataFrame

val csvreader = new Readcsv(spark)
var data = csvreader.read("/path/to/file.csv")
data.show()
data = data.select("_c0", "_c1", "_c2")
val rows = data.collect().toList

We create a Readcsv instance by passing in the Spark session created in the Application object above, and declare a mutable variable called data that holds the contents of the CSV.
data.show() prints the first 20 rows.
data.select("_c0", "_c1", "_c2") keeps only the columns we need (Spark names the columns of a headerless CSV _c0, _c1, and so on).
data.collect().toList brings the result back to the driver as a List of Row objects.
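Each of those Row objects can be read by position; as a small sketch (assuming the first column holds string values):

// rows is the List[Row] produced by data.collect().toList above
rows.take(5).foreach { row =>
  println(row.getString(0)) // positional access; index 0 is column _c0
}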

What is a DataFrame?

As quoted from the mapr.com blog: a Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
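For instance, here is a minimal sketch of building a DataFrame from an in-memory collection (the column names id and name are made up for this example):

import spark.implicits._ // spark is the SparkSession created earlier

// Build a small DataFrame with named columns from a local Seq
val people = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")
people.show()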