Apache Spark Imputer Usage In Scala

This tutorial explains what the Spark Imputer is, how to implement it, the basic terminology used while working with it, and the strategies the Imputer supports.


What is an Imputer?

The Imputer is an imputation estimator for completing missing values, using either the mean or the median of the columns in which the missing values are located.

While using the Imputer in Scala, the input columns must be of DoubleType or FloatType. To type cast a column in a DataFrame, follow this tutorial. And to start a basic Spark session and understand DataFrames, follow this tutorial.
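As a minimal sketch of the casting step (the column name value and the sample strings here are just placeholders for illustration), a string column can be cast to DoubleType like this:

import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical DataFrame with a string column "value"
val raw = Seq("1.5", "2.5", "0").toDF("value")

// Cast the column to DoubleType so the Imputer can consume it
val casted = raw.withColumn("value", col("value").cast(DoubleType))
casted.printSchema()   // value: double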

Spark's Imputer supports two strategies: mean and median.

The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words, it is the sum divided by the count.

The median is the middle value when the numbers are arranged in numerical order.
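As a quick plain-Scala illustration of the two strategies (the numbers below are made up for the example and have nothing to do with the Spark code later):

// Illustrative only: plain Scala, not the Spark Imputer
val nums = Seq(1.0, 2.0, 4.0, 5.0, 8.0)

// Mean: sum divided by count => 20.0 / 5 = 4.0
val mean = nums.sum / nums.size

// Median: middle value of the sorted sequence (odd count here) => 4.0
val sorted = nums.sorted
val median = sorted(sorted.size / 2)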

So let's implement the Imputer in Scala!

The code below demonstrates the following steps:

  1. First we create the DataFrame (df) from a Seq of tuples
  2. Create the Imputer instance with a strategy
  3. Fit the data to the Imputer
  4. Transform the data based on the strategy
import org.apache.spark.ml.feature.Imputer

// 1. Create the DataFrame (df) from a Seq of tuples;
//    0.0 marks a missing value in this example
val df = spark.createDataFrame(Seq(
      (1.0, 0.0),
      (2.0, 0.0),
      (0.0, 3.0),
      (4.0, 4.0),
      (5.0, 5.0)
    )).toDF("_c1", "_c2")

// 2. Create the Imputer instance with the median strategy
val imputer = new Imputer()
    .setStrategy("median")
    .setMissingValue(0.0)
    .setInputCols(Array("_c1", "_c2"))
    .setOutputCols(Array("_c1_out", "_c2_out"))

// 3. Fit the data to the Imputer (learns a surrogate for each input column)
val model = imputer.fit(df)

// 4. Transform the data based on the strategy
val data = model.transform(df)

Fit, or fitting a model, means making the algorithm learn the relationship between the inputs and the outcome so that it can predict future values. For the Imputer, fit computes the surrogate value (the mean or median of the non-missing values) for each input column.
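If you want to see what fit learned, the fitted ImputerModel keeps the surrogate values in a small DataFrame (this assumes Spark 2.2 or later, where surrogateDF is exposed):

// Inspect the surrogate (median) learned for each input column
model.surrogateDF.show()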

Transform applies the fitted model to the DataFrame: every missing value in the input columns is replaced with the surrogate learned during fit, and the filled-in values are written to the output columns.
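To see the result, print the transformed DataFrame. The table in the comments below is only illustrative of what the median strategy should produce for this sample data; the exact median Spark picks comes from an approximate quantile computation, so it may differ slightly.

data.show()
// Illustrative output:
// +---+---+-------+-------+
// |_c1|_c2|_c1_out|_c2_out|
// +---+---+-------+-------+
// |1.0|0.0|    1.0|    4.0|
// |2.0|0.0|    2.0|    4.0|
// |0.0|3.0|    2.0|    3.0|
// |4.0|4.0|    4.0|    4.0|
// |5.0|5.0|    5.0|    5.0|
// +---+---+-------+-------+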