POS tagging a sentence in Scala using Spark NLP.

POS tagging is the process of marking up each word in a corpus with its corresponding part-of-speech tag, based on the word's context and definition. The task is not straightforward, because a particular word may take a different part of speech depending on the context in which it is used. In this tutorial we will use the Spark NLP library by John Snow Labs, so let's dive into the implementation.
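For example, "book" is a noun in "I read a book" but a verb in "Please book a table." The snippet below is a minimal sketch of that idea: it assumes the explain_document_dl pretrained pipeline and the sbt setup described in the rest of this tutorial, and the QuickCheck object name is just for illustration. Spark NLP's annotate helper for plain strings lets us pair each token with its tag:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import org.apache.spark.sql.SparkSession

object QuickCheck extends App {

  // A Spark session is required before the pretrained pipeline can be loaded
  val spark = SparkSession
    .builder()
    .appName("quickCheck")
    .config("spark.master", "local")
    .getOrCreate()

  // Downloads explain_document_dl on first use; it includes a POS tagger among its stages
  val pipeline = PretrainedPipeline("explain_document_dl")

  // annotate(String) returns a Map from output column name to annotation results
  val annotations = pipeline.annotate("I read a book. Please book a table.")

  // Pair each token with its tag; the two occurrences of "book" should receive
  // different part-of-speech tags (noun vs. verb) because of their context
  annotations("token").zip(annotations("pos")).foreach(println)
}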

Dependency


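// build.sbt -- project definition with the Spark and Spark NLP dependencies used in this tutorial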
name := "scalaExamples"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4",
  "org.apache.spark" %% "spark-mllib" % "2.4.4" % "compile",
  "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.3.6"
)

POS (part of speech) Tagging


import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

object Main extends App {

  // Start a local Spark session
  val spark = SparkSession
    .builder()
    .appName("test")
    .config("spark.master", "local")
    .getOrCreate()

  // Sample sentences to tag
  val sent = Seq((1, " I ate dinner."),
    (2, "We had a three-course meal."),
    (3, "Brad came to dinner with us."),
    (4, "He loves fish tacos."),
    (5, "we all felt like we ate too much."),
    (6, "In the end We all agreed; it was a magnificent evening."),
    (7, "the end We all agreed; it was a magnificent evening."))

  var df = spark.createDataFrame(sent).toDF("id", "sentence")
  df.show()

  // Optional: split each sentence into words with Spark ML's Tokenizer.
  // The pretrained pipeline below performs its own tokenization as well.
  df = new Tokenizer().setInputCol("sentence").setOutputCol("token").transform(df)

  // Download and apply the explain_document_dl pretrained pipeline,
  // which includes a part-of-speech tagger among its stages
  val pM = PretrainedPipeline("explain_document_dl")
  val result = pM.annotate(df, "sentence")
  result.select("pos").show()
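
  // The "pos" column holds Spark NLP annotation structs. To view each word next to
  // its tag instead of the raw structs, one option (a sketch, assuming the pipeline's
  // "token" and "pos" annotation columns are present in result) is to select the
  // "result" field of both columns side by side:
  result
    .selectExpr("id", "token.result as tokens", "pos.result as tags")
    .show(truncate = false)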
}

To get started with Scala and sbt, follow this tutorial:

Install Scala And Scala Build Tools IN UBUNTU-18.04LTS
How to - install scala,and sbt(scala build tool) on ubuntu 18.04 lts & ubuntu 16.04 lts, and start a hello world project in sbt and shell