
Jsoup for Web Scraping in Java

Ashrith G N

Dec 20, 2019 • 3 min read

Web scraping is the process of harvesting content from a website's URL. In a world of data-driven decision making, web scraping plays a major role: collecting data from public channels and processing it can help analyse a problem and fuel a decision. Let's begin with a small use case. Say you are building a locality suggestion feature for a real estate application that lists rental buildings and apartments, and clients want to know the crime rate in a locality. We would need to scrape regional news from reputable sources, then parse and analyse it, and this is where web scraping plays a major role. Projects of this kind are called data mining. While looking for a tool for this, I came across a simple Java library: Jsoup.

Jsoup

Jsoup is an open source Java HTML parser with DOM, CSS, and jQuery-like methods for easy data extraction. Version 0.2 was released in February 2010 and the last major update was in May 2019, so the library has been actively developed and supported for almost a decade. It is also widely accepted by the community. Now let's dive in: we will parse a simple web page and extract what we need using Java.

1) Let's add the dependency to the Java project.

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

2) Then we can extract the content from a URL by removing the unwanted content.

 public static Map<String, String> parseHtml(String url, int loopCount) {
        try {

            // Fragments of class names and ids that usually mark navigation, widgets, or other non-article content.
            List<String> unwatedClass = new ArrayList<String>();
            unwatedClass.add("nav");
            unwatedClass.add("navigation");
            unwatedClass.add("comment");
            unwatedClass.add("relatedposts");
            unwatedClass.add("banner");
            unwatedClass.add("infobox");
            unwatedClass.add("button");
            unwatedClass.add("menu");
            unwatedClass.add("recent");
            unwatedClass.add("tags");
            unwatedClass.add("social");
            unwatedClass.add("twitter");
            unwatedClass.add("share");
            unwatedClass.add("pagination");
            unwatedClass.add("references");
            unwatedClass.add("wikitable");
            unwatedClass.add("reflist");
            unwatedClass.add("links");
            unwatedClass.add("featured");
            unwatedClass.add("infobox");
            unwatedClass.add("footer");
            unwatedClass.add("noprint");
            unwatedClass.add("toc");
            unwatedClass.add("subscribe");
            unwatedClass.add("sd-title");
            unwatedClass.add("share");
            unwatedClass.add("like");
            unwatedClass.add("social");
            unwatedClass.add("menu");
            unwatedClass.add("search");
            unwatedClass.add("bio");
            unwatedClass.add("author");
            unwatedClass.add("archives");
            unwatedClass.add("categories");
            unwatedClass.add("subscription");
            unwatedClass.add("ticker-block");
            unwatedClass.add("related");
            unwatedClass.add("cite_ref");


            List<String> unwatedTag = new ArrayList<String>();
            unwatedTag.add("nav");
            unwatedTag.add("aside");
            unwatedTag.add("input");
            unwatedTag.add("footer");
            unwatedTag.add("button");
            unwatedTag.add("table");

            if (!url.contains("dzone.com")) {
                unwatedClass.add("widget");
            }


            Connection connection = Jsoup.connect(url);
            connection.userAgent("Mozilla");
            // A timeout of 0 means no timeout, so a stalled server can block this call indefinitely.
            connection.timeout(0);
            //connection.proxy(ip,80);
            connection.referrer("https://medium.com");
            connection.followRedirects(true);
            Map<String, String> data = new LinkedHashMap<>();
            data.put("imageUrl", "");
            try {
                Document doc = connection.get();

                data.put("title", doc.title());
                String body = "";
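                // Empty body: retriveUrl (a helper not shown in this post) extracts another URL
                // from the raw HTML, and loopCount allows only one such retry.
                if (doc.body().text().trim().isEmpty()) {
                    String urlNew = retriveUrl(doc.html());
                    System.out.println(urlNew);
                    if (loopCount == 0) {
                        return parseHtml(urlNew, loopCount + 1);
                    }
                }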

                // Use the first image on the page, if there is one, as the image URL.
                Element firstImage = doc.body().select("img").first();
                if (firstImage != null && !firstImage.attr("src").isEmpty()) {
                    data.put("imageUrl", firstImage.absUrl("src"));
                }

                for (String unwantedTag : unwatedTag) {
                    doc.select(unwantedTag).remove();
                }

                for (String unwatedC : unwatedClass) {
                    doc.select("[class~=(?i).*" + unwatedC + "]").remove();
                    doc.select("[id~=(?i).*" + unwatedC + "]").remove();
                }


                // Prefer the first <article> element; fall back to the whole body.
                Elements elements = null;
                Element article = doc.getElementsByTag("article").first();
                if (article != null) {
                    elements = article.select("h1,h2,h3,p,pre,ol,ul");
                }

                if (elements == null || elements.isEmpty()) {
                    elements = doc.getElementsByTag("body").select("h1,h2,h3,p,pre,ol,ul");
                }


                Set<String> content = new LinkedHashSet<>();

                for (Element ele : elements) {

                    String text = ele.text().trim();
                    if (!text.isEmpty()) {
                        content.add(text);
                    }


                }
                body = StringUtils.join(content, "####next###");
                data.put("content", body);
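            } catch (Exception e) {
                // Fetching or parsing failed; log it and return whatever was collected so far.
                e.printStackTrace();
            }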
            return data;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }
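
The class and id filters in parseHtml build selectors such as [class~=(?i).*nav]. The sketch below uses only java.util.regex to show what such a pattern accepts. It assumes that jsoup's [attr~=regex] selector looks for the pattern anywhere in the attribute value, which the selector documentation example img[src~=(?i)\.(png|jpe?g)] suggests. ClassFilterSketch, wouldBeRemoved, and the sample class names are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ClassFilterSketch {

    // Builds the same pattern text as parseHtml: "(?i).*" + fragment.
    // Assumption: the selector behaves like Matcher.find() on the attribute value.
    static boolean wouldBeRemoved(String attributeValue, String fragment) {
        Pattern pattern = Pattern.compile("(?i).*" + fragment);
        // find() succeeds when the pattern occurs anywhere in the value.
        return pattern.matcher(attributeValue).find();
    }

    public static void main(String[] args) {
        // Hypothetical class attribute values, not taken from any real site.
        List<String> classValues = Arrays.asList("Main-Navigation", "post-body", "stock-summary");
        for (String value : classValues) {
            System.out.println(value
                    + " | nav: " + wouldBeRemoved(value, "nav")
                    + " | toc: " + wouldBeRemoved(value, "toc"));
        }
    }
}
```

Because this is a case-insensitive substring search, short fragments can hit unrelated names: "toc" also matches "stock-summary". If the filter strips too much on a particular site, a stricter pattern or a shorter fragment list may be worth trying.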

The method above removes unwanted tags, classes, and ids, then retrieves the body text, the first image, and the page title. This piece of code is used in one of my live projects, a news aggregation app that reads news and articles from bookmarked URLs. You can read more about the app here: https://blogs.ashrithgn.com/link-note-tts-text-to-speech-flutter/

Or you can download the app from here:

https://play.google.com/store/apps/details?id=com.ashrithgn.apps.link_note
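
The map returned by parseHtml stores the page text under "content" as a single string, with the de-duplicated text blocks joined by ####next###. As a rough sketch of how a caller might build and then split such a string, the example below uses only the standard library. ContentBlocksSketch, joinBlocks, and splitBlocks are hypothetical names, and String.join stands in for the StringUtils.join call in the original method.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class ContentBlocksSketch {

    // Same delimiter that parseHtml uses between text blocks.
    static final String DELIMITER = "####next###";

    // Trims each block, drops empty ones, removes duplicates while keeping
    // first-seen order, and joins the rest with the delimiter.
    static String joinBlocks(List<String> rawBlocks) {
        Set<String> unique = new LinkedHashSet<>();
        for (String block : rawBlocks) {
            String text = block.trim();
            if (!text.isEmpty()) {
                unique.add(text);
            }
        }
        return String.join(DELIMITER, unique);
    }

    // Splits a joined content string back into its blocks.
    static List<String> splitBlocks(String content) {
        if (content == null || content.isEmpty()) {
            return Collections.emptyList();
        }
        // Pattern.quote makes split treat the delimiter as literal text.
        return Arrays.asList(content.split(Pattern.quote(DELIMITER)));
    }

    public static void main(String[] args) {
        // Made-up sample blocks standing in for the text of scraped elements.
        String content = joinBlocks(Arrays.asList("Title", " First paragraph ", "", "Title"));
        for (String block : splitBlocks(content)) {
            System.out.println(block);
        }
    }
}
```

The LinkedHashSet is what keeps repeated headings or paragraphs from appearing twice while preserving page order, which matters when the blocks are later read out one by one.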
