4 Moments: Skew, Kurtosis, Mean, Variance in Scala and Apache Spark

Moments are specific quantitative measures of the shape of a data set. In statistics, moments are used to understand the characteristics of a probability distribution. We usually use them to characterise the data and to judge how its shape compares with a normal distribution. Moments measure the central tendency, dispersion, skewness and kurtosis of a distribution. So let's find out how to calculate these moments in Scala with Spark.
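As a rough guide, the four moments covered here are commonly defined as follows for a random variable X. The symbols (mu, sigma, gamma_1, beta_2) are introduced here for illustration and do not come from the original text.

```latex
\begin{align*}
\text{mean:} \quad & \mu = \mathbb{E}[X] \\
\text{variance:} \quad & \sigma^2 = \mathbb{E}\big[(X - \mu)^2\big] \\
\text{skewness:} \quad & \gamma_1 = \frac{\mathbb{E}\big[(X - \mu)^3\big]}{\sigma^3} \\
\text{kurtosis:} \quad & \beta_2 = \frac{\mathbb{E}\big[(X - \mu)^4\big]}{\sigma^4}
\end{align*}
```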

Central tendency

This is simply the mean, the average value of a distribution. To calculate the central tendency, we can use Imputer or Spark SQL's stats function.

Dispersion

This is simply the variance, a measure of how far the data set is spread out. To calculate the central tendency and dispersion, refer to this tutorial.

Basic statistics concepts for machine learning in Scala spark
Before applying a distribution algorithm, probability density function or probability mass function, we need to understand some basic concepts of statistics. These concepts might have been taught in school; we shall start by brushing up on them and implementing them in Scala Spark. Just for an o…
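For readers who want a quick look without opening that tutorial, here is a minimal sketch of computing the mean and variance with Spark SQL aggregate functions. It is not the tutorial's code: the object name, the helper meanAndVariance and the in-memory sample values are all hypothetical, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, var_pop, var_samp}

object MeanVarianceSketch {

  // Returns (mean, population variance, sample variance) of a numeric column.
  def meanAndVariance(df: DataFrame, column: String): (Double, Double, Double) = {
    val row = df.agg(
      avg(col(column)),
      var_pop(col(column)),
      var_samp(col(column))
    ).first()
    (row.getDouble(0), row.getDouble(1), row.getDouble(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mean-variance-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory values standing in for the tutorial's CSV.
    val df = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0).toDF("value")

    val (mean, popVar, sampleVar) = meanAndVariance(df, "value")
    println(s"mean=$mean populationVariance=$popVar sampleVariance=$sampleVar")

    // describe() reports count, mean, stddev, min and max in one call.
    df.describe("value").show()

    spark.stop()
  }
}
```

For these eight values the mean is 5, the population variance is 4 and the sample variance is 32/7 (about 4.57), up to floating-point rounding. var_pop divides by n, while var_samp divides by n - 1.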

Skewness

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point. Let's see how to calculate skewness in Spark with Scala.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object Main extends App {

  // Run Spark locally inside this JVM.
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Read the CSV with a header row; every column is loaded as a string.
  val raw = spark.read.format("csv")
    .option("header", true)
    .load("/<data-downloaded-from mean tutorial>.csv")

  // Number the rows by year and drop the first two.
  val data = raw
    .withColumn("rn", row_number().over(Window.orderBy("year")))
    .filter(col("rn") > 2)

  // Skip rows whose value is the literal "C", then compute the skewness.
  // =!= replaces the deprecated !== operator.
  data.filter(col("value") =!= "C").agg(skewness(col("value"))).show()

}
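To make the number returned by skewness less of a black box, here is a plain Scala sketch of the population skewness formula, sqrt(n) * m3 / m2^1.5, where m2 and m3 are the sums of squared and cubed deviations from the mean. To the best of my understanding this is the formula Spark's skewness aggregate uses, but treat that as an assumption. The object name and the sample values are hypothetical.

```scala
object SkewnessSketch {

  // Population skewness: sqrt(n) * m3 / m2^1.5, where mk is the
  // sum of the k-th powers of the deviations from the mean.
  def skewness(xs: Seq[Double]): Double = {
    val n = xs.length.toDouble
    val mean = xs.sum / n
    val m2 = xs.map(x => math.pow(x - mean, 2)).sum
    val m3 = xs.map(x => math.pow(x - mean, 3)).sum
    math.sqrt(n) * m3 / math.pow(m2, 1.5)
  }

  def main(args: Array[String]): Unit = {
    // Symmetric sample: the cubed deviations cancel, so the skewness is zero.
    println(skewness(Seq(1.0, 2.0, 3.0, 4.0, 5.0)))
    // One large value on the right gives a long right tail and a positive skewness.
    println(skewness(Seq(1.0, 1.0, 1.0, 10.0)))
  }
}
```

A negative result would indicate the opposite shape, with the longer tail on the left. Note that the formula divides by zero when every value is identical.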

Kurtosis

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution would be the extreme case. To calculate the kurtosis in Spark with Scala, refer to the code below.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object Main extends App {

  // Run Spark locally inside this JVM.
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Read the CSV with a header row; every column is loaded as a string.
  val raw = spark.read.format("csv")
    .option("header", true)
    .load("/<data-downloaded-from mean tutorial>.csv")

  // Number the rows by year and drop the first two.
  val data = raw
    .withColumn("rn", row_number().over(Window.orderBy("year")))
    .filter(col("rn") > 2)

  // Skip rows whose value is the literal "C" (=!= replaces the deprecated !==).
  val values = data.filter(col("value") =!= "C")

  // Skewness, as in the previous section, followed by the kurtosis.
  values.agg(skewness(col("value"))).show()
  values.agg(kurtosis(col("value"))).show()
}
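When reading the output, it helps to know which kurtosis convention is in play. The plain Scala sketch below computes the excess kurtosis, n * m4 / m2^2 - 3, where m2 and m4 are the sums of squared and fourth-power deviations from the mean. As far as I know, Spark's kurtosis aggregate reports this excess form, in which a normal distribution scores 0, but treat that as an assumption. The object name and the sample values are hypothetical.

```scala
object KurtosisSketch {

  // Excess kurtosis: n * m4 / m2^2 - 3, where mk is the
  // sum of the k-th powers of the deviations from the mean.
  def excessKurtosis(xs: Seq[Double]): Double = {
    val n = xs.length.toDouble
    val mean = xs.sum / n
    val m2 = xs.map(x => math.pow(x - mean, 2)).sum
    val m4 = xs.map(x => math.pow(x - mean, 4)).sum
    n * m4 / (m2 * m2) - 3.0
  }

  def main(args: Array[String]): Unit = {
    // Evenly spread values have light tails, so the result is negative.
    println(excessKurtosis(Seq(1.0, 2.0, 3.0, 4.0, 5.0)))
    // A single outlier produces a heavy tail, so the result is positive.
    println(excessKurtosis(Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0)))
  }
}
```

Under this convention, positive values point to heavier tails than a normal distribution and negative values to lighter ones. Adding 3 to the result recovers the non-excess kurtosis.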