Introduction to Web Scraping With Java

  • The location can be found in a div tag with the class location

  • Once we have the details (title, price, and location), we print them on the screen.
    public static void main(String[] args) throws IOException {
        var searchQuery = "iphone 13";
        var searchUrl = "https://newyork.craigslist.org/search/moa?query=%s".formatted(URLEncoder.encode(searchQuery, StandardCharsets.UTF_8));
    
        System.out.println("searchUrl = " + searchUrl);
    
        try (var client = new WebClient()) {
            client.getOptions().setCssEnabled(false);
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setThrowExceptionOnFailingStatusCode(false);
            client.getOptions().setThrowExceptionOnScriptError(false);
    
            HtmlPage page = client.getPage(searchUrl);
            for (var htmlItem : page.<HtmlElement>getByXPath("//li[contains(@class,'cl-static-search-result')]")) {
                HtmlAnchor itemAnchor = htmlItem.getFirstByXPath(".//a");
                HtmlElement itemTitle = htmlItem.getFirstByXPath(".//div[@class='title']");
                HtmlElement itemPrice = htmlItem.getFirstByXPath(".//div[@class='price']");
                HtmlElement itemLocation = htmlItem.getFirstByXPath(".//div[@class='location']");
    
                if (itemAnchor != null && itemTitle != null) {
                    // price and location may be absent on some results, so guard against null before printing
                    System.out.printf("Name: %s, Price: %s, Location: %s, URL: %s%n", itemTitle.asNormalizedText(), (itemPrice == null) ? "N/A" : itemPrice.asNormalizedText(), (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText(), itemAnchor.getHrefAttribute());
                }
            }
        }
    }
    

    Voilà, we have parsed the whole page and managed to extract the individual product items!
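
    What you get back naturally depends on whatever is listed at that moment; with made-up values, the console output might look something like this:

    searchUrl = https://newyork.craigslist.org/search/moa?query=iphone+13
    Name: iPhone 13 128GB - unlocked, Price: $450, Location: Brooklyn, URL: https://newyork.craigslist.org/...
    Name: iPhone 13 Pro Max, Price: $700, Location: Queens, URL: https://newyork.craigslist.org/...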

    💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Please check out the documentation here for more information.

    Converting to JSON

    While the previous example provided an excellent overview of how to quickly scrape a website, we could take this a step further and convert the data into a structured, machine-readable format, such as JSON.

    For that, we just need to make small changes to our code and introduce a special object to hold our results.

    POJO

    We add an additional POJO (Plain Old Java Object) class, which will represent the JSON object and hold our data. In modern Java, we can use the record keyword to define a simple data holder class (recent versions of Jackson, 2.12 and later, support records out of the box):

    record Item(String title, BigDecimal price, String location, String url) {
    }
    

    Mapping

    Now, we need to instantiate our mapper (Jackson's ObjectMapper from the jackson-databind artifact):

    private final static ObjectMapper OBJECT_MAPPER = new ObjectMapper();
    

    and use it to convert our data into JSON:

    if (itemAnchor != null && itemTitle != null && itemPrice != null) {
        var itemName = itemTitle.asNormalizedText();
        var itemUrl = itemAnchor.getHrefAttribute();
        var itemPriceText = itemPrice.asNormalizedText();
        var itemLocationText = (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText();
        
        // strip the currency symbol and the thousands separator (e.g. "$1,299" -> "1299") before parsing
        var item = new Item(itemName, new BigDecimal(itemPriceText.replace("$", "").replace(",", "")), itemLocationText, itemUrl);
        System.out.println("item = " + OBJECT_MAPPER.writeValueAsString(item));
    }
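
    For a single result, the printed JSON might look like the following (the values here are made up for illustration):

    item = {"title":"iPhone 13 128GB - unlocked","price":450,"location":"Brooklyn","url":"https://newyork.craigslist.org/..."}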
    

    Let’s take it a step further

    So far, our project has given us a quick overview of what web scraping is, its fundamental concepts, and how to set up our own crawler using Java and XPath.

    For now, it’s a relatively simple example, taking a defined search term and returning all the products sold in the New York City area as JSON. What if we wanted to get data from more than one city? Let’s check it out.

    Multi-city support

    If you look closely at the URL we previously used for the search, you’ll notice that Craigslist catalogues its ads by city and keeps that information in the hostname of the URL.

    For example, our ads for New York City are all behind the following URL:

    https://newyork.craigslist.org
    

    If we wanted to fetch the ads relevant to Boston, we’d be using https://boston.craigslist.org instead.

    Now, let’s say we’d like to retrieve all iPhone 13 ads for the East Coast and, specifically, for New York, Boston, and Washington D.C. In that case, we’d simply revisit our code from Fetching the page and extend it a bit to support the other cities as well:

    public static void main(String[] args) throws IOException {
        var searchQuery = "iphone 13";
        var cities = List.of("newyork", "boston", "washingtondc");
    
        try (var client = new WebClient()) {
            client.getOptions().setCssEnabled(false);
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setThrowExceptionOnFailingStatusCode(false);
            client.getOptions().setThrowExceptionOnScriptError(false);
    
            for (String city : cities) {
                var searchUrl = "https://%s.craigslist.org/search/moa?query=%s".formatted(city, URLEncoder.encode(searchQuery, StandardCharsets.UTF_8));
    
                System.out.println("searchUrl = " + searchUrl);
    
                HtmlPage page = client.getPage(searchUrl);
                for (var htmlItem : page.<HtmlElement>getByXPath("//li[contains(@class,'cl-static-search-result')]")) {
                    HtmlAnchor itemAnchor = htmlItem.getFirstByXPath(".//a");
                    HtmlElement itemTitle = htmlItem.getFirstByXPath(".//div[@class='title']");
                    HtmlElement itemPrice = htmlItem.getFirstByXPath(".//div[@class='price']");
                    HtmlElement itemLocation = htmlItem.getFirstByXPath(".//div[@class='location']");
    
                    if (itemAnchor != null && itemTitle != null && itemPrice != null) {
                        var itemName = itemTitle.asNormalizedText();
                        var itemUrl = itemAnchor.getHrefAttribute();
                        var itemPriceText = itemPrice.asNormalizedText();
                        var itemLocationText = (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText();
    
                        // strip the currency symbol and the thousands separator before parsing
                        var item = new Item(itemName, new BigDecimal(itemPriceText.replace("$", "").replace(",", "")), itemLocationText, itemUrl);
                        System.out.println("item = " + OBJECT_MAPPER.writeValueAsString(item));
                    }
                }
            }
        }
    }
    

    All we did was add a list of cities and iterate over it, fetching the ads for each city.

    Voilà, we now run the request for each city individually!

    Output customisation

    You may encounter situations where your crawler has to support more than one output format.

    For example, you might have to support JSON and CSV. In that case, you could simply add a switch to your code, which changes the output format depending on its value:

    public static void main(String[] args) {
       var outputType = args.length == 1 ? args[0].toLowerCase() : "";
       var searchQuery = "iphone 13";
       var cities = List.of("newyork", "boston", "washingtondc");
    
       var results = fetchCities(cities, searchQuery);
    
       switch (outputType) {
          case "json" -> asJson(results);
          case "csv" -> asCsv(results);
          default -> System.out.println("unknown output type");
       }
    }
    
    private static void asCsv(Map<String, List<Item>> results) {
       System.out.println("city,title,price,location,url");
       for (Map.Entry<String, List<Item>> entry : results.entrySet()) {
          for (Item item : entry.getValue()) {
             // records expose their components via accessor methods, e.g. item.title()
             // (a real-world CSV export would also need to quote values containing commas)
             System.out.printf("%s,%s,%s,%s,%s%n", entry.getKey(), item.title(), item.price(), item.location(), item.url());
          }
       }
    }
    

    The fetchCities method wraps the per-city loop from the previous section and returns a map from each city to its list of items, and the asJson() and asCsv() methods convert the data into the respective format.
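
    The asCsv() implementation is shown above; asJson() is not spelled out here, but a minimal sketch could look like the following (assuming the OBJECT_MAPPER constant from the Mapping section and Jackson's JsonProcessingException import):

    private static void asJson(Map<String, List<Item>> results) {
       results.forEach((city, items) -> items.forEach(item -> {
          try {
             // one JSON object per entry, just as shown under "Mapping"
             System.out.println(OBJECT_MAPPER.writeValueAsString(item));
          } catch (JsonProcessingException e) {
             throw new RuntimeException(e);
          }
       }));
    }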

    If you now pass json as the first argument to your crawler call, it will print a JSON object for each entry (just as we originally showed under Mapping). If you pass csv, it will print a comma-separated line for each entry instead.
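
    With made-up data, the CSV output might look like this:

    city,title,price,location,url
    newyork,iPhone 13 128GB - unlocked,450,Brooklyn,https://newyork.craigslist.org/...
    boston,iPhone 13 mini,380,Cambridge,https://boston.craigslist.org/...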

    Increasing scale with parallelisation

    If you are planning to scrape a large number of sites, you might find that fetching them sequentially is too slow. In that case, you could parallelise your requests to speed up the process.

    Let’s start by expanding our previous example with additional cities:

    var cities = List.of("newyork", "boston", "washingtondc", "losangeles", "chicago", "sanfrancisco", "seattle", "miami", "dallas", "denver");
    

    Now, let’s measure the time it takes to fetch all the cities sequentially by wrapping the original code in a timed method:

    public static void main(String[] args) {
        timed(() -> {
            var outputType = args.length == 1 ? args[0].toLowerCase() : "";
            var searchQuery = "iphone 13";
            var cities = List.of("newyork", "boston", "washingtondc", "losangeles", "chicago", "sanfrancisco", "seattle", "miami", "dallas", "denver");
            var results = fetchCities(cities, searchQuery);
            switch (outputType) {
                case "json" -> asJson(results);
                case "csv" -> asCsv(results);
                default -> System.out.println("unknown output type");
            }
        });
    }
    
    private static void timed(Runnable action) {
        var start = System.currentTimeMillis();
        action.run();
        var end = System.currentTimeMillis();
        System.out.printf("time = %dms%n", end - start);
    }
    

    It turns out it runs in around 15 seconds:

    time = 15861ms
    

    Not great, not terrible, but this would not be acceptable for a larger number of cities.

    Let’s parallelise the requests by using Java’s virtual threads, which are great for I/O-bound tasks.
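
    If you haven’t worked with virtual threads before (they became a final feature in Java 21), here is a tiny, self-contained sketch of running tasks on a virtual-thread-per-task executor; the loop body is just a placeholder:

    import java.util.concurrent.Executors;
    
    public class VirtualThreadsDemo {
        public static void main(String[] args) {
            // the executor creates one new virtual thread per submitted task
            try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10; i++) {
                    int n = i;
                    executor.submit(() -> System.out.println("task " + n + " on " + Thread.currentThread()));
                }
            } // close() implicitly waits for all submitted tasks to finish
        }
    }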

    To do this, we need to change the fetchCities method to scrape each city on a separate virtual thread. This is done by wrapping the code in a CompletableFuture and running it on a virtual-thread-per-task executor:

    private static Map<String, List<Item>> fetchCities(List<String> cities, String searchQuery) {
        // Note: we share a single WebClient across the virtual threads for brevity;
        // for heavier workloads, it may be safer to give each thread its own client.
        try (var client = new WebClient();
             var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            client.getOptions().setCssEnabled(false);
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setThrowExceptionOnFailingStatusCode(false);
            client.getOptions().setThrowExceptionOnScriptError(false);
            
            return cities.stream().map(city -> Map.entry(city, CompletableFuture.supplyAsync(() -> {
                var searchUrl = "https://%s.craigslist.org/search/moa?query=%s".formatted(city, URLEncoder.encode(searchQuery, StandardCharsets.UTF_8));
                System.out.println("fetching: " + searchUrl);
                try {
                    var results = new ArrayList<Item>();
                    HtmlPage page = client.getPage(searchUrl);
                    for (var htmlItem : page.<HtmlElement>getByXPath("//li[contains(@class,'cl-static-search-result')]")) {
                        HtmlAnchor itemAnchor = htmlItem.getFirstByXPath(".//a");
                        HtmlElement itemTitle = htmlItem.getFirstByXPath(".//div[@class='title']");
                        HtmlElement itemPrice = htmlItem.getFirstByXPath(".//div[@class='price']");
                        HtmlElement itemLocation = htmlItem.getFirstByXPath(".//div[@class='location']");
                        
                        if (itemAnchor != null && itemTitle != null && itemPrice != null) {
                            var itemName = itemTitle.asNormalizedText();
                            var itemUrl = itemAnchor.getHrefAttribute();
                            var itemPriceText = itemPrice.asNormalizedText();
                            var itemLocationText = (itemLocation == null) ? "N/A" : itemLocation.asNormalizedText();
                            // strip the currency symbol and the thousands separator before parsing
                            var item = new Item(itemName, new BigDecimal(itemPriceText.replace("$", "").replace(",", "")), itemLocationText, itemUrl);
                            results.add(item);
                        }
                    }
                    return results;
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }, executor)))
                    // materialising the stream into a list first submits all futures before
                    // we start joining them; a fully lazy stream would effectively run the
                    // requests one after another again
                    .toList()
                    .stream()
                    .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().join()));
        }
    }
    

    Now, if you run it again, you’ll see that the time has been reduced significantly:

    time = 3473ms
    

    Next steps

    The examples mentioned so far provided some insight into how to scrape Craigslist, but there are certainly still a few areas that could be improved:

    • Pagination handling
    • Support for more than one criterion
    • and more

    Of course, there’s a lot more to scraping than just fetching a single HTML page and running a few XPath expressions. Especially when it comes to distributed scraping, fully handling JavaScript, and CAPTCHAs, the topic can quickly become very complex. If you’d like to have these things handled automatically, please check out our web scraping API. The first 1,000 API calls are on us!

    Even more

    We are almost at the end of this post, so thanks for staying with us until now, but we still have a couple of recommended articles for you.

    Don’t get blocked

    Also check out our recent blog post on Web Scraping without getting blocked, which goes into detail on how to optimise your scraping approach in order to avoid being blocked by anti-scraping measures.

    Scraping with Chrome and full JavaScript support

    While HtmlUnit is a wonderful headless browser, you may still want to check out our other article, the Introduction to Headless Chrome, as it will provide you with additional insight into how to use Chrome’s headless mode, which features full JavaScript support, just as you’d expect from your daily driver browser.

    One CSS selector, please

    CSS selectors are used for much more these days than just applying colours and spacing. Very often, they are used in the very same contexts as XPath expressions, and if you happen to prefer CSS selectors, you should definitely also check out our tutorial on HTML parsing with Java using jsoup.

    Python maybe?

    Python has been one of the most popular languages for years at this point and is, in fact, commonly used for web scraping as well. If Python is your language of choice, you might just like our guide on using Python for scraping web pages.

    Or Groovy?

    If you like Java, you’re going to LOVE Groovy. Check out our guide to web scraping with Groovy. You may also like our guide about web scraping with Kotlin.

    What about Scala?

    Of course, we didn’t forget about web scraping with Scala; you should check it out!

    Code sample

    You can find the full source code of this example in our GitHub repository.
