4 Moments: Skewness, Kurtosis, Mean and Variance in Scala and Apache Spark
A moment is a specific quantitative measure of the shape of a set of data. In statistics, moments are used to understand the characteristics of a probability distribution: we use them to characterise the data and to see how closely its shape matches a normal distribution. The first four moments measure the central tendency, dispersion, skewness and kurtosis of a distribution. So let's find out how to calculate these moments in Scala with Spark.
Central tendency
This is nothing but the mean, the average value of a distribution. To calculate the central tendency we can use Spark ML's Imputer or Spark SQL's aggregate functions.
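As a minimal sketch of the Spark SQL route, using the avg aggregate function (the in-memory data and the column name value here are illustrative; loading the tutorial's CSV would work the same way):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object CentralTendency {
  // avg() from org.apache.spark.sql.functions computes the arithmetic mean
  // of the given column.
  def meanOf(df: DataFrame, col: String): Double =
    df.agg(avg(df(col))).first().getDouble(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mean-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative dataset; in the tutorial this would come from a CSV.
    val data = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")
    println(s"mean = ${meanOf(data, "value")}") // mean of 1..5 is 3.0
    spark.stop()
  }
}
```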
Dispersion
This is nothing but the variance, a measure of how far the data set is spread out from the mean. To calculate the central tendency and dispersion, refer to this tutorial.
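A minimal sketch of computing the dispersion with Spark SQL's variance aggregate (again, the in-memory data and the column name value are illustrative assumptions; note that variance is Spark's sample variance, an alias of var_samp):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.variance

object Dispersion {
  // variance() is Spark SQL's sample-variance aggregate (alias of var_samp),
  // i.e. it divides by n - 1 rather than n.
  def varianceOf(df: DataFrame, col: String): Double =
    df.agg(variance(df(col))).first().getDouble(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("variance-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative dataset; in the tutorial this would come from a CSV.
    val data = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")
    println(s"variance = ${varianceOf(data, "value")}") // sample variance of 1..5 is 2.5
    spark.stop()
  }
}
```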
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point. Let's see how to calculate skewness in Spark with Scala.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object Main extends App {
  val spark = SparkSession.builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Load the CSV downloaded in the mean tutorial.
  var data = spark.read.format("csv")
    .option("header", true)
    .load("/<data-downloaded-from mean tutorial>.csv")

  // Number the rows ordered by year, then drop the first two rows.
  data = data
    .withColumn("rn", row_number().over(Window.orderBy("year")))
  data = data.filter(data("rn") > 2)

  // Exclude rows where value holds the non-numeric marker "C",
  // then compute the skewness of the remaining values.
  data.filter(data("value") =!= "C").agg(skewness(data("value"))).show()
}
Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers, while data sets with low kurtosis tend to have light tails and a lack of outliers; a uniform distribution is an extreme case of the latter. To calculate the kurtosis in Spark with Scala, refer to the code below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object Main extends App {
  val spark = SparkSession.builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Load the CSV downloaded in the mean tutorial.
  var data = spark.read.format("csv")
    .option("header", true)
    .load("/<data-downloaded-from mean tutorial>.csv")

  // Number the rows ordered by year, then drop the first two rows.
  data = data
    .withColumn("rn", row_number().over(Window.orderBy("year")))
  data = data.filter(data("rn") > 2)

  // Exclude rows where value holds the non-numeric marker "C",
  // then compute the kurtosis of the remaining values.
  data.filter(data("value") =!= "C").agg(kurtosis(data("value"))).show()
}