4 Moments: Skewness, Kurtosis, Mean and Variance in Scala and Apache Spark
A moment is a specific quantitative measure of the shape of a set of data. In statistics, moments are used to understand the characteristics of a probability distribution: we use them to characterise the data and to see how closely its shape matches a normal distribution. The first four moments measure the central tendency, dispersion, skewness and kurtosis of a distribution. So let's find out how to calculate these moments in Scala with Spark.
Central tendency
This is nothing but the mean, the average value of a distribution. To calculate the central tendency we can use Spark ML's Imputer or Spark SQL's aggregate functions.
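As a minimal sketch of the Spark SQL route, using the avg aggregate function (the in-memory data and the column name value here are illustrative; loading the tutorial's CSV would work the same way):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object CentralTendency {
  // avg() from org.apache.spark.sql.functions computes the arithmetic mean
  // of the given column.
  def meanOf(df: DataFrame, col: String): Double =
    df.agg(avg(df(col))).first().getDouble(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mean-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative dataset; in the tutorial this would come from a CSV.
    val data = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")
    println(s"mean = ${meanOf(data, "value")}") // mean of 1..5 is 3.0
    spark.stop()
  }
}
```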
Dispersion
This is nothing but the variance, a measure of how far the data set is spread out from the mean. To calculate the central tendency and dispersion, refer to this tutorial.
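A minimal sketch of computing the dispersion with Spark SQL's variance aggregate (again, the in-memory data and the column name value are illustrative assumptions; note that variance is Spark's sample variance, an alias of var_samp):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.variance

object Dispersion {
  // variance() is Spark SQL's sample-variance aggregate (alias of var_samp),
  // i.e. it divides by n - 1 rather than n.
  def varianceOf(df: DataFrame, col: String): Double =
    df.agg(variance(df(col))).first().getDouble(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("variance-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative dataset; in the tutorial this would come from a CSV.
    val data = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")
    println(s"variance = ${varianceOf(data, "value")}") // sample variance of 1..5 is 2.5
    spark.stop()
  }
}
```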
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point. Let's see how to calculate skewness in Spark with Scala.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object Main extends App {
  val spark = SparkSession.builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Load the CSV downloaded in the mean tutorial.
  var data = spark.read.format("csv")
    .option("header", true)
    .load("/<data-downloaded-from mean tutorial>.csv")

  // Number the rows ordered by year, then drop the first two rows.
  data = data
    .withColumn("rn", row_number().over(Window.orderBy("year")))
  data = data.filter(data("rn") > 2)

  // Exclude rows where value holds the non-numeric marker "C",
  // then compute the skewness of the remaining values.
  data.filter(data("value") =!= "C").agg(skewness(data("value"))).show()
}
Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers, while data sets with low kurtosis tend to have light tails and a lack of outliers; a uniform distribution is an extreme case of the latter. To calculate the kurtosis in Spark with Scala, refer to the code below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object Main extends App {
  val spark = SparkSession.builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Load the CSV downloaded in the mean tutorial.
  var data = spark.read.format("csv")
    .option("header", true)
    .load("/<data-downloaded-from mean tutorial>.csv")

  // Number the rows ordered by year, then drop the first two rows.
  data = data
    .withColumn("rn", row_number().over(Window.orderBy("year")))
  data = data.filter(data("rn") > 2)

  // Exclude rows where value holds the non-numeric marker "C",
  // then compute the kurtosis of the remaining values.
  data.filter(data("value") =!= "C").agg(kurtosis(data("value"))).show()
}