Basic encoding: label encoding and one-hot encoding in Scala with Apache Spark

If you are starting out with machine learning, the step after cleaning the data is usually normalising it, and this is where encoding techniques come in handy. There are many data encoding techniques, but the two we hear about most are one-hot encoding and label encoding. Adoption of the Scala programming language is also picking up, and Spark is one of the industry-standard tools for ML and data science. In this post we will see what label encoding and one-hot encoding are and how to implement them in Scala with Spark.

Label Encoding

Let's take an example list of countries, ["india","us","china","us","india"]. Label encoding means assigning a unique numerical index to each distinct string, so applying it to the example above gives [1,2,3,2,1]. Now we shall see how to apply this in Apache Spark, using Scala as the programming language.
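Before moving to Spark, the idea can be sketched in plain Scala. This is only an illustration, not how Spark does it: the object and method names (`LabelEncodingSketch`, `labelEncode`) are made up for this post, and the choice of 1-based indices in order of first appearance is an assumption made to match the hand-worked result above.

```scala
object LabelEncodingSketch {

  // Assign each distinct string a 1-based index in order of first appearance
  // (an assumption chosen to match the hand-worked example, not Spark's rule).
  def labelEncode(values: Seq[String]): Seq[Int] = {
    val index: Map[String, Int] =
      values.distinct.zipWithIndex.map { case (value, i) => value -> (i + 1) }.toMap
    values.map(index)
  }

  def main(args: Array[String]): Unit = {
    val countries = Seq("india", "us", "china", "us", "india")
    println(labelEncode(countries).mkString("[", ",", "]")) // prints [1,2,3,2,1]
  }
}
```

The same string always maps to the same number, which is all label encoding promises; the particular numbers depend on the ordering rule the encoder uses.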

Before jumping into the implementation, follow this tutorial to get started with Apache Spark in Scala:

Introduction to Apache Spark in Scala

So now let's create some random sample data for a fantasy game played by a few countries, like this:

+-------+------+----------+----+
|country|points|contestant|year|
+-------+------+----------+----+
|  India|    30|         2|1991|
|     Us|    30|         1|1991|
|  India|    30|         2|1992|
|  China|    30|         3|1993|
|  India|    30|         2|1993|
|     Us|    30|         1|1992|
|  India|    30|         2|1992|
|  China|    30|         3|1993|
+-------+------+----------+----+

For the sample data above, we shall write a small piece of Spark code to apply label encoding.

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object Main extends App {

  // Sample rows: (country, points, contestant, year)
  val sampleData = Seq(
    ("India", 30, 2, 1991),
    ("Us", 30, 1, 1991),
    ("India", 30, 2, 1992),
    ("China", 30, 3, 1993),
    ("India", 30, 2, 1993),
    ("Us", 30, 1, 1992),
    ("India", 30, 2, 1992),
    ("China", 30, 3, 1993)
  )

  // Local Spark session for experimenting on a single machine
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  val sampleDataDf = spark.createDataFrame(sampleData).toDF("country", "points", "contestant", "year")

  // StringIndexer learns the distinct labels in fit() and writes their indices in transform()
  val sampleIndexedDf = new StringIndexer()
    .setInputCol("country")
    .setOutputCol("country_index")
    .fit(sampleDataDf)
    .transform(sampleDataDf)

  sampleDataDf.show()
  sampleIndexedDf.show()
}

If you run this snippet, you will get the following output.

+-------+------+----------+----+-------------+
|country|points|contestant|year|country_index|
+-------+------+----------+----+-------------+
|  India|    30|         2|1991|          0.0|
|     Us|    30|         1|1991|          2.0|
|  India|    30|         2|1992|          0.0|
|  China|    30|         3|1993|          1.0|
|  India|    30|         2|1993|          0.0|
|     Us|    30|         1|1992|          2.0|
|  India|    30|         2|1992|          0.0|
|  China|    30|         3|1993|          1.0|
+-------+------+----------+----+-------------+

If you look closely at the country_index column, every unique country has been assigned a unique numerical value. StringIndexer orders labels by frequency by default, so the most frequent country (India) gets index 0.0, rather than the indices following order of appearance as in the hand-worked example earlier. We use label encoding for categorical data, that is, variables whose values fall into a fixed set of groups; in our example, country is the categorical variable. Algorithms such as decision trees and random forests can work with label-encoded categorical variables just fine, particularly when the data is ordinal, e.g. (small car, sedan, SUV, bus).
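To see where the values 0.0, 1.0 and 2.0 in the output come from, here is a rough plain-Scala imitation of frequency-based indexing. It is a sketch under assumptions, not Spark's implementation: the names `FrequencyIndexSketch` and `frequencyIndex` are hypothetical, and the alphabetical tie-break is assumed from the behaviour documented for recent Spark versions.

```scala
object FrequencyIndexSketch {

  // The most frequent label gets 0.0, the next one 1.0, and so on.
  // Ties are broken alphabetically (assumed, to mirror recent Spark versions).
  def frequencyIndex(values: Seq[String]): Map[String, Double] = {
    val counts = values.groupBy(identity).map { case (label, occurrences) =>
      label -> occurrences.size
    }
    counts.toSeq
      .sortBy { case (label, count) => (-count, label) }
      .zipWithIndex
      .map { case ((label, _), i) => label -> i.toDouble }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // The country column of the sample data, in row order
    val countries = Seq("India", "Us", "India", "China", "India", "Us", "India", "China")
    // Print each label with its index, lowest index first
    frequencyIndex(countries).toSeq.sortBy(_._2).foreach(println)
  }
}
```

India appears four times and so gets 0.0; China and Us both appear twice, and the alphabetical tie-break puts China at 1.0 and Us at 2.0, which agrees with the table above.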

One Hot Encoding

This is another way of encoding categorical data. When we label encode, each category is assigned a numerical index, and in the example above the `country_index` column could create an ambiguity: the index could also be read as denoting an order or hierarchy among the countries. To eliminate this ambiguity, we expand the label index into a matrix using one-hot encoding. Let's dive in by extending the label-encoded data with a simple example.

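import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

object Main extends App {

  // Sample rows: (country, visitors)
  val sampleData = Seq(
    ("India", 1000),
    ("Us", 2000),
    ("China", 30000),
    ("Japan", 400),
    ("Korea", 500),
    ("canada", 600)
  )

  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  val sampleDataDf = spark.createDataFrame(sampleData).toDF("country", "visitors")

  // Label encode first: OneHotEncoder works on numeric category indices
  val sampleIndexedDf = new StringIndexer().setInputCol("country").setOutputCol("country_index").fit(sampleDataDf).transform(sampleDataDf)
  val oneHotEncoder = new OneHotEncoder().setInputCol("country_index").setOutputCol("country_vec")

  // In Spark 3.x OneHotEncoder is an Estimator, so it must be fitted before transform();
  // in Spark 2.x it was a plain Transformer and transform() could be called directly.
  val encoded = oneHotEncoder.fit(sampleIndexedDf).transform(sampleIndexedDf)
  encoded.show()
}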

The output looks like this:

+-------+--------+-------------+-------------+
|country|visitors|country_index|  country_vec|
+-------+--------+-------------+-------------+
|  India|    1000|          4.0|(5,[4],[1.0])|
|     Us|    2000|          1.0|(5,[1],[1.0])|
|  China|   30000|          2.0|(5,[2],[1.0])|
|  Japan|     400|          0.0|(5,[0],[1.0])|
|  Korea|     500|          5.0|    (5,[],[])|
| canada|     600|          3.0|(5,[3],[1.0])|
+-------+--------+-------------+-------------+

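If we look closely at `country_vec`, it is a sparse vector written as (size, [indices], [values]). In `(5,[4],[1.0])`, 5 is the length of the vector, and [4] is the position holding the value 1.0, which corresponds to `India` because India was assigned 4.0 during label encoding. The vector has length 5 rather than 6 because the encoder drops the last category by default (the `dropLast` parameter), so `Korea` (index 5.0) is encoded as the all-zeros vector `(5,[],[])`. If we draw the full one-hot matrix with one column per country, it looks like this: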

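+-----+--+-----+------+-----+-----+--------+
|Japan|Us|China|canada|India|Korea|visitors|
+-----+--+-----+------+-----+-----+--------+
|    0| 0|    0|     0|    1|    0|    1000|
|    0| 1|    0|     0|    0|    0|    2000|
|    0| 0|    1|     0|    0|    0|   30000|
|    1| 0|    0|     0|    0|    0|     400|
|    0| 0|    0|     0|    0|    1|     500|
|    0| 0|    0|     1|    0|    0|     600|
+-----+--+-----+------+-----+-----+--------+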
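The sparse notation in the `country_vec` column can be expanded by hand to check it against a dense row. The helper below is a plain-Scala sketch written for this post (`SparseVectorSketch` and `toDense` are hypothetical names, not Spark API); it only assumes the (size, indices, values) reading of the notation.

```scala
object SparseVectorSketch {

  // Expand a sparse (size, indices, values) triple into a dense array:
  // every position not listed in `indices` stays 0.0.
  def toDense(size: Int, indices: Seq[Int], values: Seq[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // India: (5,[4],[1.0])
    println(toDense(5, Seq(4), Seq(1.0)).mkString("[", ", ", "]"))
    // prints [0.0, 0.0, 0.0, 0.0, 1.0]

    // Korea: (5,[],[]) has no non-zero position at all
    println(toDense(5, Seq.empty, Seq.empty).mkString("[", ", ", "]"))
    // prints [0.0, 0.0, 0.0, 0.0, 0.0]
  }
}
```

Note that the expanded vectors have five positions, not six: the encoded output carries no separate column for the last category, so the all-zeros vector stands for Korea.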

One-hot encoding is often used for indicating the state of a state machine. In machine learning, we can apply OHE:

  1. When values that are close to each other in the label encoding correspond to target values that aren't close (non-linear data).
  2. When the categorical feature is not ordinal (house, car, mouse).
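One way to see why the encoding matters for non-ordinal features is to compare distances between categories under both schemes. The sketch below is plain Scala with made-up names (`EncodingDistanceSketch`, `oneHot`, `euclidean`), and the three-country index is just an assumed example; it is meant to illustrate the point, not to be a recommended preprocessing step.

```scala
object EncodingDistanceSketch {

  // Dense one-hot vector of the given size with a single 1.0 at `index`
  def oneHot(index: Int, size: Int): Vector[Double] =
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)

  // Plain Euclidean distance between two equally sized vectors
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    // Assumed label indices for three countries
    val index = Map("India" -> 0, "China" -> 1, "Us" -> 2)

    // Label encoded: India looks twice as far from Us as from China
    println(math.abs(index("India") - index("China"))) // prints 1
    println(math.abs(index("India") - index("Us")))    // prints 2

    // One-hot encoded: every pair of distinct countries is equally far apart
    val india = oneHot(index("India"), 3)
    val china = oneHot(index("China"), 3)
    val us    = oneHot(index("Us"), 3)
    println(euclidean(india, china) == euclidean(india, us)) // prints true
  }
}
```

With label indices, a distance-based or linear model would treat India as closer to China than to Us purely because of how the numbers were assigned; with one-hot vectors that artificial ordering disappears.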