Web Scraping with Groovy

Groovy has been around for quite a while and has established itself as a reliable scripting language for tasks where you’d like to use the full power of Java and the JVM, but without all the verbosity.

While its typical use cases are build pipelines and automated testing, it works equally well for anything related to data extraction and web scraping. And that’s precisely what we are going to check out in this article. Let’s fasten our seatbelts and dive right into web scraping and handling HTTP requests with Groovy.



💡 Interested in web scraping with Java? Check out our guide to the best Java web scraping libraries

Prerequisites

Before we enter the realm of web scraping, we need to perform the usual housekeeping and make sure our Groovy environment is properly set up. For that, we need:

- a Java runtime (JDK)
- Groovy itself
- the Jodd HTTP and Jodd Lagarto libraries

For the sake of brevity, we do assume you already have Java set up 😊, so it’s just Groovy and Jodd. If you are using one of the mainstream IDEs (e.g. Eclipse or IntelliJ), you can typically add Groovy support quickly via their plugin configuration. Should you prefer a manual setup, you can always download and install Groovy yourself.

Once we have Groovy running, we just need our two Jodd libraries for HTTP and HTML parsing. Jodd is a fantastic collection of Java libraries, and Jodd HTTP and Jodd Lagarto in particular are two lean and elegant ones.

Groovy Grapes

Usually, we’d now be talking about build management tools (e.g. Maven or Gradle) or JVM classpaths. However, did we already mention how much Groovy simplifies many things? Welcome to Groovy Grape.

All we need to do is annotate our import statements with @Grab annotations, and Groovy automatically takes care of fetching the libraries on the fly.
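For example, pulling in both Jodd modules takes nothing more than two annotations. A minimal sketch (the jodd-lagarto version and the jodd.jerry.Jerry package name are assumptions; verify them against Maven Central and the Jodd docs):

@Grab('org.jodd:jodd-http:6.2.1')
@Grab('org.jodd:jodd-lagarto:6.0.6')   // version is an assumption, check Maven Central
import jodd.http.HttpRequest
import jodd.jerry.Jerry

On the first run, Grape resolves the artifacts from Maven Central and caches them locally (under ~/.groovy/grapes by default), so subsequent runs start without any downloads.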

All right, with the basics set up, we should be ready to go on our HTTP adventure. 🧙🏻‍♂️

Introduction to Groovy and Jodd HTTP

Sending a Plain GET Request

The simplest request is always a GET request. It’s pretty much the standard request type and barely requires any additional configuration.

@Grab('org.jodd:jodd-http:6.2.1')
import jodd.http.HttpRequest;

def request = HttpRequest.get('https://example.com');
def response = request.send();

println(response);

What we did here was create a new HttpRequest object, set up a GET call, send() it, store the response in our variable, and print it using println.

Pretty simple, right? If we really wanted to be succinct, we could even have skipped most of the declarative ritual and turned it into a true one-liner:

println(jodd.http.HttpRequest.get('https://example.com').send())

But do we want to be that succinct?

Let’s save that to a Groovy script file BasicHTTPRequest.groovy and run it.

groovy BasicHTTPRequest.groovy

Perfect. If everything went all right, we should now have some output similar to the following.

HTTP/1.1 200 OK
Cache-Control: max-age=604800
Connection: close
Content-Length: 1256
Content-Type: text/html; charset=UTF-8


Lovely! As per convention, when we passed response to println(), it called the object’s toString() method, which in turn provided us with the original response.

The response object has a number of methods that provide access to most of the other response details. For example, we could have used statusCode() to get the HTTP status code of the response, in our case 200.
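Here is a quick sketch of a few of these accessors (the method names come from Jodd HTTP’s HttpResponse; the printed values are just illustrative):

println(response.statusCode())              // 200
println(response.contentType())             // text/html; charset=UTF-8
println(response.header('Content-Length'))  // 1256
println(response.bodyText())                // the response body as plain text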

POSTing To The Server

Should we step it up a notch? Sure, let’s do a POST coupled with Groovy’s native JSON parser.



@Grab('org.jodd:jodd-http:6.2.1')
import jodd.http.HttpRequest;
import groovy.json.JsonSlurper;

def request = HttpRequest.post('https://httpbin.org/post');

request.form('param1', 'value1', 'param2', 'value2');

def response = request.send();

def json = new JsonSlurper().parseText(response.bodyText());

println(json);

Quite similar to our previous example. The main differences: we used post() to create a POST request, we appended form data with form(), and we didn’t print response directly, but passed only the response body (a JSON document in this example) to Groovy’s JsonSlurper, which did all the heavy JSON lifting for us and provided us with a beautifully pre-populated map object.

Save the code in POSTRequest.groovy and run it once again with

groovy POSTRequest.groovy

Voilà, we should have output similar to this

[args:[:], data:, files:[:], form:[param1:value1, param2:value2], headers:[Content-Length:27, Content-Type:application/x-www-form-urlencoded, Host:httpbin.org, User-Agent:Jodd HTTP], json:null, url:https://httpbin.org/post]

Here, we also notice the beauty of httpbin.org: everything we send to it, it sends back to us. And sure enough, we find the two form parameters we previously passed in our request in the JSON’s form field. Not bad, is it?
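And since JsonSlurper hands us a plain map, we can drill into individual fields with Groovy’s property access. A small sketch based on the output above:

assert json.form.param1 == 'value1'
println(json.headers.Host)   // httpbin.org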

Quick summary

The combination of Groovy and Jodd HTTP allows us to write really concise and mostly boilerplate-free code that still leverages the full potential of the Java platform.

Compared to Java, Groovy makes many things quite a bit easier (e.g. built-in handling of JSON) and its syntax is overall a lot less verbose (e.g. no mandatory class declarations, and == compares values rather than references).
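A quick illustration of both points in a standalone script:

// In Groovy, == delegates to equals(), unlike Java's reference comparison
assert 'gro' + 'ovy' == 'groovy'

// JSON parsing ships with the language, no external dependency needed
def data = new groovy.json.JsonSlurper().parseText('{"name": "Groovy"}')
assert data.name == 'Groovy'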

Jodd HTTP, on the other hand, takes a rather straightforward and pragmatic approach to HTTP and avoids boilerplate as well. Features like method chaining are intrinsic to the library and allow one to compose a fairly complex HTTP request with just a handful of lines of code.
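For instance, a minimal sketch of such a chained request (query() and header() are existing HttpRequest methods; the endpoint and values are made up for illustration):

def response = HttpRequest
    .get('https://example.com/search')   // hypothetical endpoint
    .query('q', 'groovy')                // append a query parameter
    .header('User-Agent', 'MyScraper')   // set a request header
    .send()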

But enough of theory and simple examples, let’s scrape the web.

Real World Web Scraping in Groovy

In this part of the article, we are going to take a deep dive right into scraping and crawling code.

We will show three sample projects. The first example focuses on anonymous data extraction: we will learn how to send HTTP requests with Jodd from Groovy, how to handle HTML content with Lagarto and Jerry, and how to extract information with CSS selectors.

With the second example, we will find out more about POST requests and how to use them to log into a page. On top of that, we’ll learn how to keep track of session cookies and how to use them for subsequent requests.

The third example will then be a proper deep dive into headless browsing, where we will learn how to control an entire browser instance from Groovy.

If you have already read a couple of our other articles, you’ll have noticed we like to crawl Hacker News in our demos. It’s a great and resourceful site that always has fresh content to satisfy a hungry crawler. In short: perfect.

This time, however, we are not just going to scrape the homepage for new articles, but instead, we want to crawl all the comments of an article, along with their metadata – and Groovy and Jodd allow us to do that in less than 50 lines of code.

Analyzing the DOM tree

But before we jump right into coding, let’s first check out the page structure, so that we know where the information we’d like to access is located and how to extract it. So let’s aim for that F12 key, open the developer tools, select Elements/Inspector, and scroll down to the comment section of our article.
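As a preview of where this is heading, here is a minimal, hedged sketch of fetching a Hacker News discussion and printing its comment texts with Jerry and a CSS selector (the item ID is a placeholder, and the .commtext class, the Jerry API calls, and the jodd-lagarto version should all be verified against the live page and your library version):

@Grab('org.jodd:jodd-http:6.2.1')
@Grab('org.jodd:jodd-lagarto:6.0.6')
import jodd.http.HttpRequest
import jodd.jerry.Jerry

// Fetch the discussion page of an article (placeholder item ID)
def html = HttpRequest.get('https://news.ycombinator.com/item?id=1').send().bodyText()

// Parse the HTML and iterate over all comment bodies via their CSS class
for (comment in Jerry.of(html).s('.commtext')) {
    println(comment.text())
}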
