Removing Stop Words in Apache Spark using Scala

Long ago I was working on a pet project where I scraped the description and title from a web URL and indexed the words for granular search and grouping. The project was in Java, and I had to remove a few words which I did not want to index, like "I", "The", and "In", because I wanted to focus mainly on context rather than grammar. To summarise: stop words are the common words of a language, and we generally remove them for better search and indexing. Apache Spark has a built-in feature for removing stop words, so let us try it out by writing some simple Scala code.

You can read this article to get started with Scala and SBT [https://blogs.ashrithgn.com/install-scala-and-scala-build-tools-ubuntu/]

You can also refer to my other tutorials related to Apache Spark [https://blogs.ashrithgn.com/tag/apache-spark/]
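
Before bringing Spark in, the idea itself can be sketched in a few lines of plain Scala. This is only an illustration: the tiny stopWords set and the removeStopWords helper are names made up for this sketch, not Spark's API or its real stop word list.

```scala
object StopWordSketch {
  // A tiny, hand-picked stop word list (an assumption for this sketch;
  // real lists contain many more words)
  val stopWords: Set[String] = Set("i", "the", "in", "a", "and", "to", "is")

  // Keep only the tokens that are not stop words, ignoring case
  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(token => stopWords.contains(token.toLowerCase))

  def main(args: Array[String]): Unit = {
    val tokens = "I installed Scala and sbt in the office".split(" ").toSeq
    println(removeStopWords(tokens).mkString(", "))
    // prints: installed, Scala, sbt, office
  }
}
```

Spark does essentially the same filtering, but on a DataFrame column and with a much larger default word list.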

1) Including the Spark dependency

In this tutorial I will be using Scala with SBT to set up the project, so the effective build.sbt looks like this.

name := "stop-word-remover"
organization := "com.ashrithgn.scala.tut"
version := "1.0"
scalaVersion := "2.11.8"

// Keep all Spark modules on the same version to avoid binary incompatibilities
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4",
  "org.apache.spark" %% "spark-mllib" % "2.4.4" % "compile"
)

The sbt syntax does the same job as Gradle or Maven, just written a bit differently: libraryDependencies takes a sequence (Seq) of dependencies and pulls them from the Maven repository. Another plus point of Scala is that we can use Java dependencies too.
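
To make the difference between Scala and Java dependencies concrete, here is a small sketch of a build.sbt fragment. The jsoup line is only a hypothetical example of a plain Java library; it is not needed for this tutorial, and the version shown is an assumption.

```scala
// build.sbt (fragment)
libraryDependencies ++= Seq(
  // %% appends the Scala binary version to the artifact name,
  // so with scalaVersion 2.11.x this resolves to spark-mllib_2.11
  "org.apache.spark" %% "spark-mllib" % "2.4.4",
  // A plain Java library has no Scala suffix, so a single % is used
  // (hypothetical example, not required by this tutorial)
  "org.jsoup" % "jsoup" % "1.13.1"
)
```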

2) Creating the data

For this example I have taken the description from one of my blog posts. To remove stop words we first have to tokenize the string into words, so the code below prepares the data by converting the string into an array of words.

object Main extends App {
  // Sample text: the description of one of my blog posts
  val siteDescription: String = "How to - install scala,and sbt(scala build tool) on ubuntu 18.04 " +
    "lts & ubuntu 16.04 lts, and start a hello world project in sbt and shell. ... " +
    "So Scala is packed object-oriented and functional programming in one succinct, " +
    "high-level language. ... hello world project using Scala build tool"

  // Strip a few special characters (dots are kept, so "18.04" stays intact),
  // split on spaces, and drop duplicate words
  val words: Array[String] = siteDescription.replaceAll("[-+^:,&]", "")
    .split(" ")
    .distinct
}

In the above code I am converting the string to an array of words and also removing some special characters which I don't need. Now we shall extend the code to start a Spark session and apply the stop word remover.

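import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

object Main extends App {
  // Sample text: the description of one of my blog posts
  val siteDescription: String = "How to - install scala,and sbt(scala build tool) on ubuntu 18.04 " +
    "lts & ubuntu 16.04 lts, and start a hello world project in sbt and shell. ... " +
    "So Scala is packed object-oriented and functional programming in one succinct, " +
    "high-level language. ... hello world project using Scala build tool"

  // Strip a few special characters (dots are kept, so "18.04" stays intact),
  // split on spaces, and drop duplicate words
  val words: Array[String] = siteDescription.replaceAll("[-+^:,&]", "")
    .split(" ")
    .distinct

  // Start a local Spark session
  val spark = SparkSession
    .builder()
    .appName("test")
    .master("local")
    .getOrCreate()

  // A single row: an id and the array of words to filter
  val data = spark.createDataFrame(Seq(
    (0, words)
  )).toDF("id", "raw")

  // StopWordsRemover drops the default English stop words from the "raw" column
  val remover = new StopWordsRemover()
    .setInputCol("raw")
    .setOutputCol("filtered")

  remover.transform(data).select("filtered").show(false)

  spark.stop()
}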

This code produces the output pasted below, in which you can observe that words like "How", "to", "in", and "and" have been removed.

[, install, scalaand, sbt(scala, build, tool), ubuntu, 18.04, lts, 16.04, start, hello, world, project, sbt, shell., ..., Scala, packed, objectoriented, functional, programming, one, succinct, highlevel, language., using, tool]
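
The output above still contains an empty token and fragments such as "sbt(scala" and "shell.", because the text was only split on single spaces. As a possible refinement (a sketch, not part of the original code), the text could be split on every run of characters that is not a letter, digit, or dot, with trailing dots and empty tokens dropped afterwards. The names TokenizeSketch and tokenize are made up for this example.

```scala
object TokenizeSketch {
  // Split on every run of characters that is not a letter, digit, or dot,
  // then trim trailing dots and drop empty tokens and duplicates
  def tokenize(text: String): Array[String] =
    text.split("[^A-Za-z0-9.]+")
      .map(_.replaceAll("\\.+$", ""))
      .filter(_.nonEmpty)
      .distinct

  def main(args: Array[String]): Unit = {
    val sample = "How to - install scala,and sbt(scala build tool) on ubuntu 18.04 lts. ..."
    println(tokenize(sample).mkString(", "))
    // prints: How, to, install, scala, and, sbt, build, tool, on, ubuntu, 18.04, lts
  }
}
```

Note that this variant splits hyphenated words such as "object-oriented" into two tokens, which may or may not be what you want for indexing.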