The Java Web Scraping Handbook

list. Then we will iterate through this list and, for each item, select the title, the URL, the author, etc. with a relative XPath, and then print the text content or value.

HackerNewsScraper.java

HtmlPage page = client.getPage(baseUrl);
List<HtmlElement> itemList = page.getByXPath("//tr[@class='athing']");
if(itemList.isEmpty()){
    System.out.println("No item found");
}
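When the list is not empty, we loop over it and pick each field with a relative XPath. Here is a minimal sketch of what the loop body could look like; the XPath expressions below assume the Hacker News markup of the time (a rank span and a title link inside the athing row, the author in the following subtext row) and may need adjusting:

for(HtmlElement htmlItem : itemList){
    int id = Integer.parseInt(htmlItem.getAttribute("id"));
    // rank, e.g. "1." in the first cell of the row (assumed class name)
    int position = Integer.parseInt(((HtmlElement) htmlItem
        .getFirstByXPath("./td/span[@class='rank']")).asText().replace(".", ""));
    // title link inside the row (assumed markup)
    HtmlAnchor itemLink = htmlItem.getFirstByXPath(".//td[@class='title']/a");
    // the author lives in the following "subtext" row
    HtmlElement author = htmlItem
        .getFirstByXPath("./following-sibling::tr[1]//a[@class='hnuser']");
    System.out.println(String.format("%d. %s (%s) by %s", position,
        itemLink.asText(), itemLink.getHrefAttribute(),
        author == null ? "unknown" : author.asText()));
}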

Printing the result in your IDE is cool, but exporting it to JSON or another well-formatted, reusable format is better. We will use JSON, with the Jackson library, to map the items to JSON.

First we need a POJO (Plain Old Java Object) to represent the Hacker News items:

HackerNewsItem.java

public class HackerNewsItem {
    private String title;
    private String url;
    private String author;
    private int score;
    private int position;
    private int id;

    public HackerNewsItem(String title, String url, String author, int score, int position, int id) {
        this.title = title;
        this.url = url;
        this.author = author;
        this.score = score;
        this.position = position;
        this.id = id;
    }
    //getters and setters
}

Then add the Jackson dependency to your pom.xml:

pom.xml


<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.7.0</version>
</dependency>

Now all we have to do is create a HackerNewsItem, set its attributes, and convert it to a JSON string (or a file …). Replace the old System.out.println() with this:

HackerNewsScraper.java

HackerNewsItem hnItem = new HackerNewsItem(title, url, author, score, position, id);
ObjectMapper mapper = new ObjectMapper();
String jsonString = mapper.writeValueAsString(hnItem);
// print or save to a file
System.out.println(jsonString);
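If you would rather write straight to a file, Jackson can do that too. A one-line sketch (the file name is arbitrary):

// serialize the item directly to a file (requires java.io.File)
mapper.writeValue(new File("hn_item.json"), hnItem);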

And that’s it. You should have a nice list of JSON-formatted items.

Go further

You can find the full code in this GitHub repository.

3. Handling forms

In this chapter, we are going to see how to handle forms on the web. Knowing how to submit forms can be critical to extract information behind a login form, or to perform actions that require authentication. Here are some examples of actions that require submitting a form:

  • Create an account
  • Authentication
  • Post a comment on a blog
  • Upload an image or a file
  • Search and Filtering on a website
  • Collecting a user email
  • Collecting payment information from a user
  • Any user-generated content!

Form Theory

Form diagram

There are two parts to a functional HTML form: the user interface (defined by its HTML code and CSS) with its different inputs, and the backend code, which processes the values the user entered, for example by storing them in a database, or charging the credit card in the case of a payment form.

Form tag

Form diagram 2

HTML forms begin with a <form> tag. The form tag has many attributes; the most important ones are the action and method attributes.

The action attribute represents the URL where the HTTP request will be sent, and the method attribute specifies which HTTP method to use.

Generally, the POST method is used when you create or modify something, for example:

  • Login forms
  • Account creation
  • Add a comment to a blog

Form inputs

In order to collect user inputs, the <input> element is used. It is this element that makes the text field appear. The <input> element has different attributes:

  • type: email, text, radio, file, date…
  • name: the name associated with the value that will be sent
  • many more

Let’s take an example of a typical login form:

Classic login form

And here is the corresponding HTML code (CSS code is not included):

<form action="login" method="POST">
  <div class="imgcontainer">
    <img src="img_avatar2.png" alt="Avatar" class="avatar">
  </div>

  <div class="container">
    <label for="uname"><b>Username</b></label>
    <input type="text" placeholder="Enter Username" name="uname" required>

    <label for="psw"><b>Password</b></label>
    <input type="password" placeholder="Enter Password" name="psw" required>

    <button type="submit">Login</button>

  </div>

</form>

When a user fills the form with his credentials, let’s say username and my_great_password, and clicks the submit button, the request sent by the browser will look like this:

POST /login HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded

uname=username&psw=my_great_password

Cookies

After the POST request is made, if the credentials are valid, the server will generally set cookies in the response headers, to allow the user to navigate the website.

This cookie is often named one of the following (the name depends on the technology/framework used by the website’s backend):

  • session_id
  • session
  • JSESSIONID
  • PHPSESSID

This cookie will be sent with each subsequent request by the browser, and the website’s backend will check its presence and validity to authorize requests. Cookies are not only used for login; they serve lots of different use cases:

  • Shopping carts
  • User preferences
  • Tracking user behavior

Cookies are small key/value pairs stored in the browser, or in an HTTP client, that look like this:

cookie_name=cookie_value

An HTTP response that sets a cookie looks like this:

HTTP/1.0 200 OK
Content-type: text/html
Set-Cookie: cookie_name=cookie_value

An HTTP request with a cookie looks like this:

GET /sample_page.html HTTP/1.1
Host: www.example.org
Cookie: cookie_name=cookie_value

A cookie can have different attributes:

  • Expires: expiration date; by default, cookies expire when the browser is closed
  • Secure: only sent over HTTPS
  • HttpOnly: inaccessible to Javascript's Document.cookie, to help prevent session hijacking and XSS attacks
  • Domain: Specifies which host is allowed to receive the cookie
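HtmlUnit manages cookies for you through its CookieManager, and you can also read or set them manually. A short sketch, assuming a WebClient named client as in the examples of this chapter (the domain and values are made up):

// list the cookies HtmlUnit stored after a request
for(Cookie cookie : client.getCookieManager().getCookies()){
    System.out.println(cookie.getName() + "=" + cookie.getValue());
}

// or inject one manually, e.g. a session id you already have
client.getCookieManager().addCookie(
    new Cookie("www.example.com", "session_id", "deadbeef123"));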

Login forms

To study login forms, let me introduce the website I made to apply some of the examples in this book: https://www.javawebscrapingsandbox.com

This website will serve for lots of different examples in the rest of the book, starting with this authentication example. Let’s take a look at the login form HTML:

Login form screenshot

Basically, our scraper needs to:

  • Get to the login page
  • Fill the inputs with the right credentials
  • Submit the form
  • Check if there is an error message or if we are logged in

There are two “difficult” things here: the XPath expressions to select the different inputs, and how to submit the form.

Selecting the email input is quite simple: we have to select the first input inside a form whose name attribute is equal to email, so this XPath expression should do: //form//input[@name="email"].

Same for the password input: //form//input[@name="password"]

To submit the form, HtmlUnit provides a great method to select the enclosing form: HtmlForm loginForm = input.getEnclosingForm().

Once you have the form object, you can generate the POST request for this form using loginForm.getWebRequest(null). That’s all you have to do 🙂

Let’s take a look at the full code:

public class Authentication {

    static final String baseUrl = "https://www.javawebscrapingsandbox.com/";
    static final String loginUrl = "account/login";
    static final String email = "test@test.com";
    static final String password = "test";

    public static void main(String[] args) throws FailingHttpStatusCodeException,
            MalformedURLException, IOException, InterruptedException {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setUseInsecureSSL(true);
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);

        // Get the login page
        HtmlPage page = client.getPage(String.format("%s%s", baseUrl, loginUrl));

        // Select the email input
        HtmlInput inputEmail = page.getFirstByXPath("//form//input[@name='email']");

        // Select the password input
        HtmlInput inputPassword = page.getFirstByXPath("//form//input[@name='password']");

        // Set the value for both inputs
        inputEmail.setValueAttribute(email);
        inputPassword.setValueAttribute(password);

        // Select the form
        HtmlForm loginForm = inputPassword.getEnclosingForm();

        // Generate the POST request with the form
        page = client.getPage(loginForm.getWebRequest(null));

        if (!page.asText().contains("You are now logged in")) {
            System.err.println("Error: Authentication failed");
        } else {
            System.out.println("Success! Logged in");
        }
    }
}

This method works for almost every website. Sometimes, if the website uses a Javascript framework, HtmlUnit will not be able to execute the Javascript code (even with setJavaScriptEnabled(true)) and you will have to either inspect the HTTP POST request in Chrome Dev Tools and recreate it, or use Headless Chrome, which I will cover in the next chapter.

Let’s take a look at the POST request created by HtmlUnit when we call loginForm.getWebRequest(null). To view it, launch the main method in debug mode and inspect the content (Ctrl/Cmd + Shift + D in Eclipse):

WebRequest[]

We have a lot going on here. You can see that instead of just two parameters sent to the server (email and password), we also have a csrf_token parameter, and its value changes every time we submit the form. This parameter is hidden, as you can see in the form’s HTML:

CSRF token

CSRF stands for Cross-Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks. You can learn more about CSRF attacks here. Now let’s create our own POST request with HtmlUnit.

The first thing we need to do is create a WebRequest object. Then we need to set the URL, the HTTP method, headers, and parameters. Adding a request header to a WebRequest object is quite simple: all you need to do is call the setAdditionalHeader method. Adding parameters to your request must be done with the setRequestParameters method, which takes a list of NameValuePair. As discussed earlier, we have to add the csrf_token to the parameters, which can be selected easily with this XPath expression: //form//input[@name="csrf_token"]

HtmlInput csrfToken = page.getFirstByXPath("//form//input[@name='csrf_token']");
WebRequest request = new WebRequest(
    new URL("http://www.javawebscrapingsandbox.com/account/login"), HttpMethod.POST);
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new NameValuePair("csrf_token", csrfToken.getValueAttribute()));
params.add(new NameValuePair("email", email));
params.add(new NameValuePair("password", password));

request.setRequestParameters(params);
request.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded");
request.setAdditionalHeader("Accept-Encoding", "gzip, deflate");

page = client.getPage(request);

Case study: Hacker News authentication

Let’s say you want to create a bot that logs into a website, to submit a link or perform an action that requires being authenticated.

Here is the login form and the associated DOM:

Hacker News login form

Now we can implement the login algorithm:

public static WebClient autoLogin(String loginUrl, String login, String password)
        throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    WebClient client = new WebClient();
    client.getOptions().setCssEnabled(false);
    client.getOptions().setJavaScriptEnabled(false);

    HtmlPage page = client.getPage(loginUrl);

    HtmlInput inputPassword = page.getFirstByXPath("//input[@type='password']");
    // The first preceding input that is not hidden
    HtmlInput inputLogin = inputPassword.getFirstByXPath(".//preceding::input[not(@type='hidden')]");

    inputLogin.setValueAttribute(login);
    inputPassword.setValueAttribute(password);

    // get the enclosing form
    HtmlForm loginForm = inputPassword.getEnclosingForm();

    // submit the form
    page = client.getPage(loginForm.getWebRequest(null));

    // return the cookie-filled client :)
    return client;
}

Then the main method, which:

  • Calls autoLogin with the right parameters

  • Goes to https://news.ycombinator.com

  • Checks the logout link presence to verify we’re logged in

  • Prints the cookies to the console

public static void main(String[] args) {

    String baseUrl = "https://news.ycombinator.com";
    String loginUrl = baseUrl + "/login?goto=news";
    String login = "login";
    String password = "password";

    try {
        System.out.println("Starting autoLogin on " + loginUrl);
        WebClient client = autoLogin(loginUrl, login, password);
        HtmlPage page = client.getPage(baseUrl);

        HtmlAnchor logoutLink = page.getFirstByXPath(
            String.format("//a[@href='user?id=%s']", login));

        if (logoutLink != null) {
            System.out.println("Successfully logged in!");
            // printing the cookies
            for (Cookie cookie : client.getCookieManager().getCookies()) {
                System.out.println(cookie.toString());
            }
        } else {
            System.err.println("Wrong credentials");
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
}

You can find the code in this GitHub repo.

Go further

There are many cases where this method will not work: Amazon, DropBox… and all other two-step/captcha-protected login forms.

Things that could be improved with this code:

File Upload

File upload is not something often used in web scraping, but it can be useful to know how to upload files, for example if you want to test your own website or to automate some tasks on websites.

There is nothing complicated about it. Here is a little form on the sandbox website (you need to be authenticated):

File upload form

Here is the HTML code for the form:

<div class="ui text container">
  <h1>Upload Your Files Bro</h1>

  <form action="/upload_file" method="POST" enctype="multipart/form-data">

    <label for="user_file">Upload Your File</label>
    <br>
    <input type="file" name="user_file">
    <br>
    <button type="submit">Upload</button>

  </form>
</div>

As usual, the goal here is to select the form. If there were a name attribute we could use the method getFormByName(), but in this case there isn’t, so we will use a good old XPath expression. Then we have to select the file input and set our file name on it. Note that you have to be authenticated to post this form.

String fileName = "file.png";
page = client.getPage(baseUrl + "upload_file");
HtmlForm uploadFileForm = page.getFirstByXPath("//form[@action='/upload_file']");
HtmlFileInput fileInput = uploadFileForm.getInputByName("user_file");

fileInput.setValueAttribute(fileName);
fileInput.setContentType("image/png");

HtmlElement button = page.getFirstByXPath("//button");
page = button.click();

if (page.asText().contains("Your file was successful uploaded")) {
    System.out.println("File successfully uploaded");
} else {
    System.out.println("Error uploading the file");
}

Other forms

Search Forms

Another common need when doing web scraping is submitting search forms. Websites with a large database, like marketplaces, often provide a search form to look for a specific set of items.
There are generally three different ways search forms are implemented:

  • When you submit the form, a POST request is sent to the server
  • A GET request is sent with query parameters
  • An AJAX call is made to the server

As an example, I’ve set up a search form on the sandbox website:

Search Form

It is a simple form; there is nothing complicated. As usual, we have to select the input fields, fill them with the values we want, and submit the form. We could also reproduce the POST request manually, as we saw at the beginning of the chapter. When the server sends the response back, I chose to loop over the results and print them in the console (the whole code is available in the repo as usual).

HtmlPage page = client.getPage(baseUrl + "product/search");

HtmlInput minPrice = page.getHtmlElementById("min_price");
HtmlInput maxPrice = page.getHtmlElementById("max_price");

// set the min/max values
minPrice.setValueAttribute(MINPRICE);
maxPrice.setValueAttribute(MAXPRICE);
HtmlForm form = minPrice.getEnclosingForm();

page = client.getPage(form.getWebRequest(null));

HtmlTable table = page.getFirstByXPath("//table");
for (HtmlTableRow elem : table.getBodies().get(0).getRows()) {
    System.out.println(String.format("Name : %s Price: %s",
        elem.getCell(0).asText(), elem.getCell(2).asText()));
}

And here is the result:

Name : ClosetMaid 1937440 SuiteS Price: 319.89 $
Name : RWS Model 34 .22 Caliber Price: 314.97 $
Name : Neato Botvac D5 Connected Price: 549.00 $
Name : Junghans Men's 'Max Bill' Price: 495.00 $
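By the way, when a search form uses the second method (a GET request with query parameters), you can often skip the form entirely and build the URL yourself. A hedged sketch, assuming hypothetical min_price/max_price query parameters on the sandbox search endpoint:

// build the search URL by hand instead of submitting the form
String searchUrl = String.format("%sproduct/search?min_price=%s&max_price=%s",
    baseUrl, MINPRICE, MAXPRICE);
HtmlPage resultPage = client.getPage(searchUrl);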

Basic Authentication

In the 90s, basic authentication was everywhere. Nowadays, it’s rare, but you can still find it on corporate websites. It’s one of the simplest forms of authentication. The server will check the credentials in the Authorization header sent by the client, or, in the case of a web browser, issue a prompt.

If the credentials are not correct, the server will respond with a 401 (Unauthorized) response status.

Basic Authentication

Here is the URL on the sandbox website: https://www.javawebscrapingsandbox.com/basic_auth

The username is: basic

The password is: auth

It’s really simple to use basic auth with HtmlUnit: all you have to do is format your URL with this pattern: https://username:password@www.example.com

HtmlPage page = client.getPage(String.format("https://%s:%s@www.javawebscrapingsandbox.com/basic_auth", username, password));
System.out.println(page.asText());
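Alternatively, you can set the Authorization header yourself instead of embedding the credentials in the URL. A minimal sketch using java.util.Base64 (and java.nio.charset.StandardCharsets):

// basic auth is just base64("username:password") in the Authorization header
String encoded = Base64.getEncoder()
    .encodeToString((username + ":" + password).getBytes(StandardCharsets.UTF_8));
client.addRequestHeader("Authorization", "Basic " + encoded);
HtmlPage page = client.getPage("https://www.javawebscrapingsandbox.com/basic_auth");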

4. Dealing with Javascript

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites use frameworks like Angular, React, or Vue.js for their frontend. These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API, and HtmlUnit and other headless browsers do not commonly support these features.

So basically, the problem you will encounter is that your headless browser will download the HTML code and the Javascript code, but will not be able to execute the full Javascript code, and the webpage will not be totally rendered.

There are some solutions to these problems. The first one is to use a better headless browser. And the second one is to inspect the API calls that are made by the Javascript frontend and to reproduce them.

Javascript 101

Javascript is an interpreted scripting language. It is increasingly used to build “web applications” and “Single Page Applications”.

The goal of this chapter is not to teach you Javascript (to be honest, I’m a terrible Javascript developer), but I want you to understand how it is used on the web, with some examples.

The Javascript syntax is similar to C or Java, supporting common data types, like Boolean, Number, String, Arrays, Object… Javascript is loosely typed, meaning there is no need to declare the data type explicitly.

Here are some code examples:

function plusOne(number) {
    return number + 1;
}
var a = 4;
var b = plusOne(a);
console.log(b);
// will print 5 in the console

As we saw in chapter 2, Javascript is mainly used on the web to modify the DOM dynamically and to perform HTTP requests. Here is a sample code that uses a stock API to retrieve the latest Apple stock price when a button is clicked:


<html>
<head>
  <script>
    function refreshAppleStock(){
      fetch("https://api.iextrading.com/1.0/stock/aapl/batch?types=quote,news,chart&range=1m&last=10")
        .then(function(response){
          return response.json();
        }).then(function(data){
          document.getElementById('my_cell').innerHTML = '$' + data.quote.latestPrice;
        });
    }
  </script>
</head>
<body>
  <div>
    <h2>Apple stock price:</h2>
    <div id="my_cell">
    </div>
    <button id="refresh" onclick="refreshAppleStock()">Refresh</button>
  </div>
</body>
</html>

jQuery

jQuery is one of the most used Javascript libraries. It’s really old (the first version was written in 2006), and it is used for lots of things such as:

  • DOM manipulation
  • AJAX calls
  • Event handling
  • Animation
  • Plugins (Datepicker etc.)

Here is a jQuery version of the same Apple stock code (you can note that the jQuery version is not necessarily clearer than the vanilla Javascript one…):


<html>
<head>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
  <script>
    function refreshAppleStock(){
      $.get("https://api.iextrading.com/1.0/stock/aapl/batch?types=quote,news,chart&range=1m&last=10", function(data, status) {
          $('#my_cell').html('$' + data.quote.latestPrice);
      });
    }

    $(document).ready(function(){
      $("#refresh").click(function(){
          refreshAppleStock();
      });
    });
  </script>
</head>
<body>
  <div>
    <h2>Apple stock price:</h2>
    <div id="my_cell">
    </div>
    <button id="refresh">Refresh</button>
  </div>
</body>
</html>

If you want to know more about Javascript, I suggest this excellent book: Eloquent Javascript

Modern Javascript frameworks

There are several problems with jQuery. It is extremely difficult to write clean, maintainable code with it as the Javascript application grows. Most of the time, the codebase becomes full of “glue code”, and you have to be careful with every id or class name change. The other big concern is that it can be complicated to implement data-binding between Javascript models and the DOM.

The other problem with traditional server-side rendering is that it can be inefficient. Let’s say you are browsing a table on an old website. When you request the next page, the server renders the entire HTML page, with all the assets, and sends it back to your browser. With an SPA, only one HTTP request would have been made, the server would have sent back JSON containing the data, and the Javascript framework would have filled the HTML template it already has with the new values!

Here is a diagram to better understand how it works:

Single Page Application

In theory, SPAs are faster, scale better, and have lots of other benefits compared to server-side rendering.

That’s why Javascript frameworks were created. There are lots of different Javascript frameworks: Angular, React, Vue.js, and many others.

These frameworks are often used to create so-called “Single Page Applications”. There are lots of differences between them, but it is outside this book’s scope to dive into that.

It can be challenging to scrape these SPAs because there are often lots of AJAX calls and websocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code manually, meaning inspecting all the network calls with your browser inspector and replicating the AJAX calls containing interesting data.
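To make this concrete, here is a minimal sketch of replicating such an AJAX call in plain Java, reusing the stock API endpoint from the earlier Javascript example (for your own target, the endpoint and JSON fields are whatever you spot in the browser's Network tab):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URL;

public class ReplayAjaxCall {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // the endpoint the page's Javascript calls, found in the Network tab
        JsonNode root = mapper.readTree(new URL(
            "https://api.iextrading.com/1.0/stock/aapl/batch?types=quote&range=1m&last=10"));
        // navigate the JSON exactly like the page's Javascript does
        System.out.println("Latest price: $" + root.get("quote").get("latestPrice").asText());
    }
}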

So depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser, capable of interpreting and executing all the Javascript code; that is what the next part is about.

Headless Chrome

We are going to introduce a new feature of Chrome: the headless mode. There was a rumor going around that Google used a special version of Chrome for their crawling needs. I don’t know if this is true, but Google launched the headless mode for Chrome with Chrome 59 several months ago.

PhantomJS was the leader in this space; it was (and still is) heavily used for browser automation and testing. After hearing the news about Headless Chrome, the PhantomJS maintainer said that he was stepping down as maintainer, because, I quote, “Google Chrome is faster and more stable than PhantomJS […]”. It looks like Chrome headless is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.

HtmlUnit, PhantomJS, and the other headless browsers are very useful tools; the problem is they are not as stable as Chrome, and sometimes you will encounter Javascript errors that would not have happened with Chrome.

Prerequisites

  • Google Chrome > 59
  • Chromedriver
  • Selenium
  • In your pom.xml add a recent version of Selenium:

<dependency>
  <groupId>org.seleniumhq.selenium</groupId>
  <artifactId>selenium-java</artifactId>
  <version>3.8.1</version>
</dependency>

If you don’t have Google Chrome installed, you can download it here. To install Chromedriver you can use brew on macOS:

brew install chromedriver

You can also install Chrome driver with npm:

npm install chromedriver

Or download it using the link below. There are a lot of versions; I suggest you use the latest version of Chrome and Chromedriver.

Let’s take a screenshot of a real SPA

We are going to take a screenshot of the Coinbase website, a cryptocurrency exchange made with the React framework and full of API calls and websockets!

Coinbase screenshot

We are going to manipulate Chrome in headless mode using the Selenium API. The first thing we have to do is create a WebDriver object, whose role is similar to the WebClient object in HtmlUnit, set the chromedriver path, and pass some arguments:

// Init chromedriver
String chromeDriverPath = "/Path/To/Chromedriver";
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200","--ignore-certificate-errors");
WebDriver driver = new ChromeDriver(options);

The --disable-gpu option is needed on Windows systems, according to the documentation. Chromedriver should automatically find the Google Chrome executable path; if you have a special installation, or if you want to use a different version of Chrome, you can set it with:

options.setBinary("/Path/to/specific/version/of/Google Chrome");

If you want to learn more about the different options, here is the Chromedriver documentation.

The next step is to perform a GET request to the Coinbase website, wait for the page to load and then take a screenshot.

We have done this in a previous article; here is the full code:

public class ChromeHeadlessTest {

    public static void main(String[] args) throws IOException, InterruptedException {
        String chromeDriverPath = "/path/to/chromedriver";
        System.setProperty("webdriver.chrome.driver", chromeDriverPath);
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200",
            "--ignore-certificate-errors", "--silent");
        WebDriver driver = new ChromeDriver(options);

        // Load the page and wait for it to render
        driver.get("https://pro.coinbase.com/trade/BTC-USD");
        Thread.sleep(10000);

        // Take a screenshot of the current page
        File screenshot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        FileUtils.copyFile(screenshot, new File("screenshot.png"));

        driver.close();
        driver.quit();
    }
}

You should now have a nice screenshot of the Coinbase homepage.

Several things are going on here. The Thread.sleep(10000) line gives the browser time to load the entire page. This is not necessarily the best method, because we may be waiting too long or not long enough depending on multiple factors (your own internet connection, the target website’s speed, etc.).

This is a common problem when scraping SPAs, and one way I like to solve this is by using the WebDriverWait object:

WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.
    presenceOfElementLocated(By.xpath("/path/to/element")));

There are lots of different ExpectedConditions; you can find the documentation here. I often use ExpectedConditions.visibilityOfAllElementsLocatedBy(locator) because an element can be present in the DOM but hidden until the asynchronous HTTP call is completed.
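For example, here is a short sketch that waits until the rows of a results table are actually visible, not merely present in the DOM (the XPath is illustrative):

List<WebElement> rows = wait.until(
    ExpectedConditions.visibilityOfAllElementsLocatedBy(By.xpath("//table//tr")));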

This was a brief introduction to headless Chrome and Selenium; now let’s see some common and useful Selenium objects and methods!

Selenium API

In the Selenium API, almost everything is based around two interfaces:

  • WebDriver which is the HTTP client
  • WebElement which represents a DOM object

The WebDriver can be initialized with almost every browser, and with different options (and of course, browser-specific options) such as the window size, the log file’s path, etc.

Here are some useful methods:

  • driver.get(URL): performs a GET request to the specified URL
  • driver.getCurrentUrl(): returns the current URL
  • driver.getPageSource(): returns the full HTML code of the current page
  • driver.navigate().back(): navigates one step back in the history (works with forward() too)
  • driver.switchTo().frame(frameElement): switches to the specified iframe
  • driver.manage().getCookies(): returns all cookies (lots of other cookie-related methods exist)
  • driver.quit(): quits the driver and closes all associated windows
  • driver.findElement(by): returns a WebElement located by the specified locator

The findElement() method is one of the most interesting for our scraping needs.

You can locate elements in different ways:

  • findElement(By.xpath("/xpath/expression"))
  • findElement(By.className(className))
  • findElement(By.cssSelector(selector))

Once you have a WebElement object, there are several useful methods you can use:

  • findElement(By): can be called on a WebElement too, using a relative locator
  • click(): clicks on the element (a button, for example)
  • getText(): returns the inner text (meaning the text that is inside the element)
  • sendKeys("some string"): types the given text into an input field
  • getAttribute("href"): returns the attribute’s value (in this example, the href attribute)
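To tie these together, here is a hedged sketch that fills a hypothetical search field and reads the first result's link (the selectors are illustrative, not from a real site):

// type a query into a search input and submit
WebElement searchInput = driver.findElement(By.cssSelector("input[name='q']"));
searchInput.sendKeys("web scraping");
driver.findElement(By.xpath("//button[@type='submit']")).click();

// grab the first result link and print its text and URL
WebElement firstResult = driver.findElement(By.cssSelector("a.result"));
System.out.println(firstResult.getText() + " -> " + firstResult.getAttribute("href"));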

Infinite scroll

Infinite scroll is heavily used in social websites, news websites, or when dealing with a lot of information. We are going to see three different ways to scrape infinite scroll.

I’ve set up a basic infinite scroll here: Infinite Scroll. Basically, each time you scroll near the bottom of the page, an AJAX call is made to an API and more elements are added to the table.

Infinite table

Scrolling to the bottom

The first way of scraping this page is to make our headless browser scroll to the bottom of the page. There is a nice method we can use on the window object, called scrollTo(). It is really simple to use: you give it an X and a Y coordinate, and it will scroll to that location.

In order to execute this Javascript code, we are going to use a Javascript executor. It allows us to execute any Javascript code in the context of the current web page (or more specifically, the current tab). This means we have access to every Javascript function and variable defined in the current page.

In this example, note that the webpage shows a fixed 20 rows in the table on the first load. So if our browser window is too big, we won’t be able to scroll. This “mistake” was made on purpose. To deal with this, we must tell our headless Chrome instance to open with a small window size!

String chromeDriverPath = "/path/to/chromedriver";
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--ignore-certificate-errors", "--silent");
// REALLY important option here: you must specify a small window size to be able to scroll
options.addArguments("window-size=600,400");

WebDriver driver = new ChromeDriver(options);
JavascriptExecutor js = (JavascriptExecutor) driver;
int pageNumber = 5;

driver.get("https://www.javawebscrapingsandbox.com/product/infinite_scroll");
for(int i = 0; i < pageNumber; i++){
    js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
    // There are better ways to wait, like using the WebDriverWait object
    Thread.sleep(1200);
}
List<WebElement> rows = driver.findElements(By.xpath("//tr"));

// do something with the row list
processLines(rows);

driver.quit();
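The processLines method is left to the reader; a trivial hedged placeholder could simply print each row's text:

// hypothetical helper: print the text of every scraped row
private static void processLines(List<WebElement> rows) {
    for (WebElement row : rows) {
        System.out.println(row.getText());
    }
}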

Executing a Javascript function

The second way of doing this is inspecting the Javascript code to understand how the infinite scroll is built. To do this, as usual, right click + inspect to open the Chrome Dev tools, and find the
