HackerNewsScraper.java
HtmlPage page = client.getPage(baseUrl);
List<HtmlElement> itemList = page.getByXPath("//tr[@class='athing']");
if(itemList.isEmpty()){
System.out.println("No item found");
}else{
// loop over each item and print it (title, url, author, score, position and id are extracted here)
for(HtmlElement htmlItem : itemList){
System.out.println(htmlItem.asText());
}
}
Printing the result in your IDE is cool, but exporting it to JSON or another well-formatted/reusable format is better. We will use the Jackson library to map our items to JSON.
First we need a POJO (plain old java object) to represent the Hacker News items:
HackerNewsItem.java
public class HackerNewsItem {
private String title;
private String url;
private String author;
private int score;
private int position;
private int id;
public HackerNewsItem(String title, String url, String author, int score, int position, int id) {
this.title = title;
this.url = url;
this.author = author;
this.score = score;
this.position = position;
this.id = id;
}
//getters and setters
}
Then add the Jackson dependency to your pom.xml:
pom.xml
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.7.0</version>
</dependency>
Now all we have to do is create a HackerNewsItem, set its attributes, and convert it to a JSON string (or a file …). Replace the old System.out.println() with this:
HackerNewsScraper.java
HackerNewsItem hnItem = new HackerNewsItem(title, url, author, score, position, id);
ObjectMapper mapper = new ObjectMapper();
String jsonString = mapper.writeValueAsString(hnItem) ;
// print or save to a file
System.out.println(jsonString);
And that’s it. You should have a nice list of JSON formatted items.
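If you prefer a file over the console, Jackson can write directly to one. Here is a minimal sketch (the items.json file name is just an example):
// Sketch: write the item directly to a file with Jackson (file name is hypothetical)
ObjectMapper mapper = new ObjectMapper();
mapper.writerWithDefaultPrettyPrinter().writeValue(new File("items.json"), hnItem);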
Go further
You can find the full code in this Github repository.
3. Handling forms
In this chapter, we are going to see how to handle forms on the web. Knowing how to submit forms can be critical for extracting information behind a login form, or for performing actions that require authentication. Here are some examples of actions that require submitting a form:
- Create an account
- Authentication
- Post a comment on a blog
- Upload an image or a file
- Search and Filtering on a website
- Collecting a user email
- Collecting payment information from a user
- Any user-generated content !
Form Theory

There are two parts to a functional HTML form: the user interface (defined by its HTML and CSS) with its different inputs, and the backend code, which processes the values the user entered, for example by storing them in a database, or by charging the credit card in the case of a payment form.
Form tag

HTML forms begin with a <form> tag. It has many attributes; the most important ones are the action and method attributes.
The action attribute represents the URL where the HTTP request will be sent, and the method attribute specifies which HTTP method to use.
Generally, POST methods are used when you create or modify something, for example:
- Login forms
- Account creation
- Add a comment to a blog
Form inputs
In order to collect user inputs, the <input> element is used. It is this element that makes the text field appear. The <input> element has different attributes:
- type: email, text, radio, file, date…
- name: the name associated with the value that will be sent
- many more
Let’s take an example of a typical login form :

And here is the corresponding HTML code (CSS code is not included):
<form action="login" method="POST">
  <div class="imgcontainer">
    <img src="img_avatar2.png" alt="Avatar" class="avatar">
  </div>
  <div class="container">
    <label for="uname"><b>Username</b></label>
    <input type="text" placeholder="Enter Username" name="uname" required>
    <label for="psw"><b>Password</b></label>
    <input type="password" placeholder="Enter Password" name="psw" required>
    <button type="submit">Login</button>
  </div>
</form>
When a user fills in the form with his credentials, let's say username and my_great_password, and clicks the submit button, the request sent by the browser will look like this:
POST /login HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
uname=username&psw=my_great_password
Cookies
After the POST request is made, if the credentials are valid the server will generally set cookies in the response headers, to allow the user to navigate.
This cookie is often named (the name depends on the technology/framework used by the website’s backend):
- session_id
- session
- JSESSIONID
- PHPSESSID
This cookie will be sent by the browser for each subsequent request, and the website's backend will check its presence and validity to authorize requests. Cookies are not only used for login, but for lots of different use cases:
- Shopping carts
- User preferences
- Tracking user behavior
Cookies are small key/value pairs stored in the browser, or in an HTTP client, that look like this:
cookie_name=cookie_value
An HTTP response that sets a cookie looks like this:
HTTP/1.0 200 OK
Content-type: text/html
Set-Cookie: cookie_name=cookie_value
An HTTP request with a cookie looks like this:
GET /sample_page.html HTTP/1.1
Host: www.example.org
Cookie: cookie_name=cookie_value
A cookie can have different attributes (see the sketch after this list):
- Expires: Expiration date; by default (no Expires attribute), the cookie is a session cookie and expires when the browser is closed.
- Secure: only sent to HTTPS URLs
- HttpOnly: Inaccessible to Javascript (Document.cookie), to prevent session hijacking and XSS attacks
- Domain: Specifies which host is allowed to receive the cookie
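With HtmlUnit you can inspect these attributes on the cookies the client has stored. Here is a minimal sketch, assuming a WebClient named client has already fetched a page:
// Sketch: print the attributes of every cookie HtmlUnit stored for this client
for (Cookie cookie : client.getCookieManager().getCookies()) {
System.out.println(String.format("%s=%s domain=%s secure=%s httpOnly=%s expires=%s",
cookie.getName(), cookie.getValue(), cookie.getDomain(),
cookie.isSecure(), cookie.isHttpOnly(), cookie.getExpires()));
}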
Login forms
To study login forms, let me introduce the website I made to illustrate the examples in this book: https://www.javawebscrapingsandbox.com
This website will serve for the rest of the book for lots of different examples, starting with the authentication example. Let’s take a look at the login form HTML :

Basically, our scraper needs to :
- Get to the login page
- Fill the inputs with the right credentials
- Submit the form
- Check if there is an error message or if we are logged in.
There are two "difficult" things here: the XPath expressions to select the different inputs, and how to submit the form.
Selecting the email input is quite simple: we have to select the first input inside a form whose name attribute is equal to email, so this XPath expression should be ok: //form//input[@name="email"].
Same for the password input : //form//input[@name="password"]
To submit the form, HtmlUnit provides a great method to select a form : HtmlForm loginForm = input.getEnclosingForm().
Once you have the form object, you can generate the POST request for this form using loginForm.getWebRequest(null); that's all you have to do 🙂
Let’s take a look at the full code:
public class Authentication {
static final String baseUrl = "https://www.javawebscrapingsandbox.com/" ;
static final String loginUrl = "account/login" ;
static final String email = "test@test.com" ;
static final String password = "test" ;
public static void main(String[] args) throws FailingHttpStatusCodeException,
MalformedURLException, IOException, InterruptedException {
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setCssEnabled(false);
client.getOptions().setUseInsecureSSL(true);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
// Get the login page
HtmlPage page = client.getPage(String.
format("%s%s", baseUrl, loginUrl)) ;
// Select the email input
HtmlInput inputEmail = page.getFirstByXPath(
"//form//input[@name='email']");
// Select the password input
HtmlInput inputPassword = page.getFirstByXPath(
"//form//input[@name='password']");
// Set the value for both inputs
inputEmail.setValueAttribute(email);
inputPassword.setValueAttribute(password);
// Select the form
HtmlForm loginForm = inputPassword.getEnclosingForm() ;
// Generate the POST request with the form
page = client.getPage(loginForm.getWebRequest(null));
if(!page.asText().contains("You are now logged in")){
System.err.println("Error: Authentication failed");
}else{
System.out.println("Success ! Logged in");
}
}
}
This method works for almost every website. Sometimes, if the website uses a Javascript framework, HtmlUnit will not be able to execute the Javascript code (even with setJavaScriptEnabled(true)) and you will have to either inspect the HTTP POST request in Chrome Dev Tools and recreate it, or use headless Chrome, which I will cover in the next chapter.
Let's take a look at the POST request created by HtmlUnit when we call loginForm.getWebRequest(null). To view this, launch the main method in debug mode, and inspect the content (Ctrl/Cmd + Shift + D in Eclipse):
WebRequest[]
We have a lot going on here. You can see that instead of just having two parameters sent to the server (email and password), we also have a csrf_token parameter, and its value changes every time we submit the form. This parameter is hidden, as you can see in the form's HTML:

CSRF stands for Cross Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks. You can learn more about CSRF attacks here. Now let's create our own POST request with HtmlUnit.
The first thing we need is to create a WebRequest object. Then we need to set the URL, the HTTP method, headers, and parameters. Adding a request header to a WebRequest object is quite simple: all you need to do is call the setAdditionalHeader method. Adding parameters to your request must be done with the setRequestParameters method, which takes a list of NameValuePair. As discussed earlier, we have to add the csrf_token to the parameters, which can be selected easily with this XPath expression: //form//input[@name="csrf_token"]
HtmlInput csrfToken = page.getFirstByXPath("//form//input[@name='csrf_token']") ;
WebRequest request = new WebRequest(
new URL("http://www.javawebscrapingsandbox.com/account/login"), HttpMethod.POST);
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new NameValuePair("csrf_token", csrfToken.getValueAttribute()));
params.add(new NameValuePair("email", email));
params.add(new NameValuePair("password", password));
request.setRequestParameters(params);
request.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded");
request.setAdditionalHeader("Accept-Encoding", "gzip, deflate");
page = client.getPage(request);
Case study: Hacker News authentication
Let’s say you want to create a bot that logs into a website (to submit a link or perform an action that requires being authenticated) :
Here is the login form and the associated DOM :

Now we can implement the login algorithm
public static WebClient autoLogin(String loginUrl, String login, String password)
throws FailingHttpStatusCodeException, MalformedURLException, IOException{
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
HtmlPage page = client.getPage(loginUrl);
HtmlInput inputPassword = page.getFirstByXPath("//input[@type='password']");
// The first preceding input that is not hidden
HtmlInput inputLogin = inputPassword.getFirstByXPath(".//preceding::input[not(@type='hidden')]");
inputLogin.setValueAttribute(login);
inputPassword.setValueAttribute(password);
//get the enclosing form
HtmlForm loginForm = inputPassword.getEnclosingForm() ;
//submit the form
page = client.getPage(loginForm.getWebRequest(null));
//returns the cookie filled client :)
return client;
}
Then the main method, which:
- calls autoLogin with the right parameters
- goes to https://news.ycombinator.com
- checks the logout link presence to verify we're logged in
- prints the cookies to the console
public static void main(String[] args) {
String baseUrl = "https://news.ycombinator.com" ;
String loginUrl = baseUrl + "/login?goto=news" ;
String login = "login";
String password = "password" ;
try {
System.out.println("Starting autoLogin on " + loginUrl);
WebClient client = autoLogin(loginUrl, login, password);
HtmlPage page = client.getPage(baseUrl) ;
HtmlAnchor logoutLink = page
.getFirstByXPath(String.format(
"//a[@href='user?id=%s']", login)) ;
if(logoutLink != null ){
System.out.println("Successfuly logged in !");
// printing the cookies
for(Cookie cookie : client.
getCookieManager().getCookies()){
System.out.println(cookie.toString());
}
}else{
System.err.println("Wrong credentials");
}
} catch (Exception e) {
e.printStackTrace();
}
}
You can find the code in this Github repo
Go further
There are many cases where this method will not work: Amazon, DropBox… and all other two-step/captcha-protected login forms.
Things that can be improved with this code :
File Upload
File upload is not something often used in web scraping. But it can be interesting to know how to upload files, for example if you want to test your own website or to automate some tasks on websites.
There is nothing complicated, here is a little form on the sandbox website (you need to be authenticated):

Here is the HTML code for the form :
<div class="ui text container">
  <h1>Upload Your Files Bro</h1>
  <form action="/upload_file" method="POST" enctype="multipart/form-data">
    <label for="user_file">Upload Your File</label>
    <br></br>
    <input type="file" name="user_file">
    <br></br>
    <button type="submit">Upload</button>
  </form>
</div>
As usual, the goal here is to select the form. If it had a name attribute we could use the getFormByName() method, but in this case it doesn't, so we will use a good old XPath expression. Then we have to select the file input and set our file name on it. Note that you have to be authenticated to post this form.
String fileName = "file.png";
page = client.getPage(baseUrl + "upload_file") ;
HtmlForm uploadFileForm = page.getFirstByXPath("//form[@action='/upload_file']");
HtmlFileInput fileInput = uploadFileForm.getInputByName("user_file");
fileInput.setValueAttribute(fileName);
fileInput.setContentType("image/png");
HtmlElement button = page.getFirstByXPath("//button");
page = button.click();
if(page.asText().contains("Your file was successful uploaded")){
System.out.println("File successfully uploaded");
}else{
System.out.println("Error uploading the file");
}
Other forms
Search Forms
Another common need when doing web scraping is to submit search forms. Websites with large databases, like marketplaces, often provide a search form to look for a specific set of items.
There are generally three different ways search forms are implemented:
- When you submit the form, a POST request is sent to the server
- A GET request is sent with query parameters
- An AJAX call is made to the server
As an example, I’ve set up a search form on the sandbox website :

It is a simple form, there is nothing complicated. As usual, we have to select the input fields, fill them with the values we want, and submit the form. We could also reproduce the POST request manually, as we saw at the beginning of the chapter. When the server sends the response back, I chose to loop over the results and print them in the console (the whole code is available in the repo, as usual).
HtmlPage page = client.getPage(baseUrl + "product/search");
HtmlInput minPrice = page.getHtmlElementById("min_price");
HtmlInput maxPrice = page.getHtmlElementById("max_price");
// set the min/max values
minPrice.setValueAttribute(MINPRICE);
maxPrice.setValueAttribute(MAXPRICE);
HtmlForm form = minPrice.getEnclosingForm();
page = client.getPage(form.getWebRequest(null));
HtmlTable table = page.getFirstByXPath("//table");
for(HtmlTableRow elem : table.getBodies().get(0).getRows()){
System.out.println(String.format("Name : %s Price: %s", elem.getCell(0).asText(), elem.getCell(2).asText()));
}
And here is the result:
Name : ClosetMaid 1937440 SuiteS Price: 319.89 $
Name : RWS Model 34 .22 Caliber Price: 314.97 $
Name : Neato Botvac D5 Connected Price: 549.00 $
Name : Junghans Men's 'Max Bill' Price: 495.00 $
Basic Authentication
In the 90s, basic authentication was everywhere. Nowadays, it’s rare, but you can still find it on corporate websites. It’s one of the simplest forms of authentication. The server will check the credentials in the Authorization header sent by the client, or issue a prompt in case of a web browser.
If the credentials are not correct, the server will respond with a 401 (Unauthorized) response status.

Here is the URL on the sandbox website : https://www.javawebscrapingsandbox.com/basic_auth
The Username is : basic
The password is : auth
Using basic auth with HtmlUnit is really simple: all you have to do is format your URL with this pattern: https://username:password@www.example.com
HtmlPage page = client.getPage(String.format("https://%s:%s@www.javawebscrapingsandbox.com/basic_auth", username, password));
System.out.println(page.asText());
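If you prefer not to put the credentials in the URL, here is a minimal sketch of an alternative: build the Authorization header yourself with Java's Base64 encoder (this assumes Java 8+ and the same username/password variables as above):
// Alternative sketch: send the Basic auth header manually instead of embedding credentials in the URL
String credentials = java.util.Base64.getEncoder()
.encodeToString((username + ":" + password).getBytes(java.nio.charset.StandardCharsets.UTF_8));
client.addRequestHeader("Authorization", "Basic " + credentials);
HtmlPage securedPage = client.getPage("https://www.javawebscrapingsandbox.com/basic_auth");
System.out.println(securedPage.asText());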
4. Dealing with Javascript
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React, and Vue.js for their frontend. These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API, and HtmlUnit and other headless browsers do not commonly support these features.
So basically, the problem you will encounter is that your headless browser will download the HTML and Javascript code, but will not be able to execute all of the Javascript, so the webpage will not be fully rendered.
There are some solutions to these problems. The first one is to use a better headless browser. And the second one is to inspect the API calls that are made by the Javascript frontend and to reproduce them.
Javascript 101
Javascript is an interpreted scripting language. It’s more and more used to build “Web applications” and “Single Page Applications”.
The goal of this chapter is not to teach you Javascript, to be honest, I’m a terrible Javascript developer, but I want you to understand how it is used on the web, with some examples.
The Javascript syntax is similar to C or Java, supporting common data types, like Boolean, Number, String, Arrays, Object… Javascript is loosely typed, meaning there is no need to declare the data type explicitly.
Here is some code examples:
function plusOne(number) {
return number + 1 ;
}
var a = 4 ;
var b = plusOne(a) ;
console.log(b);
// will print 5 in the console
As we saw in chapter 2, Javascript is mainly used on the web to modify the DOM dynamically and perform HTTP requests. Here is some sample code that uses a stock API to retrieve the latest Apple stock price when a button is clicked:
<html>
<head>
<script>
function refreshAppleStock(){
  fetch("https://api.iextrading.com/1.0/stock/aapl/batch?types=quote,news,chart&range=1m&last=10")
  .then(function(response){
    return response.json();
  }).then(function(data){
    document.getElementById('my_cell').innerHTML = '$' + data.quote.latestPrice ;
  });
}
</script>
</head>
<body>
<div>
  <h2>Apple stock price:</h2>
  <div id="my_cell">
  </div>
  <button id="refresh" onclick="refreshAppleStock()">Refresh</button>
</div>
</body>
</html>
Jquery
jQuery is one of the most used Javascript libraries. It’s really old, the first version was written in 2006, and it is used for lots of things such as:
- DOM manipulation
- AJAX calls
- Event handling
- Animation
- Plugins (Datepicker etc.)
Here is a jQuery version of the same apple stock code (you can note that the jQuery version is not necessarily clearer than the vanilla Javascript one…) :
<html>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
<script>
function refreshAppleStock(){
  $.get("https://api.iextrading.com/1.0/stock/aapl/batch?types=quote,news,chart&range=1m&last=10", function(data, status) {
    $('#my_cell').html('$' + data.quote.latestPrice);
  });
}
$(document).ready(function(){
  $("#refresh").click(function(){
    refreshAppleStock();
  });
});
</script>
</head>
<body>
<div>
  <h2>Apple stock price:</h2>
  <div id="my_cell">
  </div>
  <button id="refresh">Refresh</button>
</div>
</body>
</html>
If you want to know more about Javascript, I suggest this excellent book: Eloquent Javascript
Modern Javascript frameworks
There are several problems with jQuery. It is extremely difficult to write clean/maintainable code with it as the Javascript application grows. Most of the time, the codebase becomes full of "glue code", and you have to be careful with each id or class name change. The other big concern is that it can be complicated to implement data-binding between Javascript models and the DOM.
The other problem with the traditional server-side rendering is that it can be inefficient. Let’s say you are browsing a table on an old website. When you request the next page, the server is going to render the entire HTML page, with all the assets and send it back to your browser. With an SPA, only one HTTP request would have been made, the server would have sent back a JSON containing the data, and the Javascript framework would have filled the HTML model it already has with the new values!
Here is a diagram to better understand how it works :

In theory, SPAs are faster, have better scalability and lots of other benefits compared to server-side rendering.
That's why Javascript frameworks were created. There are lots of different Javascript frameworks, such as Angular, React, and Vue.js. These frameworks are often used to create so-called "Single Page Applications". There are lots of differences between them, but it is outside the scope of this book to dive into that.
It can be challenging to scrape these SPAs because there are often lots of Ajax calls and websocket connections involved. If performance is an issue, you should always try to reproduce the Javascript code, meaning manually inspecting all the network calls with your browser inspector, and replicating the AJAX calls containing interesting data.
So depending on what you want to do, there are several ways to scrape these websites. For example, if you need to take a screenshot, you will need a real browser, capable of interpreting and executing all the Javascript code, that is what the next part is about.
Headless Chrome
We are going to introduce a new feature from Chrome, the headless mode. There was a rumor going around, that Google used a special version of Chrome for their crawling needs. I don’t know if this is true, but Google launched the headless mode for Chrome with Chrome 59 several months ago.
PhantomJS was the leader in this space: it was (and still is) heavily used for browser automation and testing. After hearing the news about headless Chrome, the PhantomJS maintainer said that he was stepping down as maintainer because, I quote, "Google Chrome is faster and more stable than PhantomJS […]". It looks like headless Chrome is becoming the way to go when it comes to browser automation and dealing with Javascript-heavy websites.
HtmlUnit, PhantomJS, and the other headless browsers are very useful tools, the problem is they are not as stable as Chrome, and sometimes you will encounter Javascript errors that would not have happened with Chrome.
Prerequisites
- Google Chrome > 59
- Chromedriver
- Selenium
- In your pom.xml add a recent version of Selenium :
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.8.1</version>
</dependency>
If you don't have Google Chrome installed, you can download it here. To install Chromedriver you can use brew on macOS:
brew install chromedriver
You can also install Chrome driver with npm:
npm install chromedriver
Or download it using the link below. There are a lot of versions; I suggest you use the latest version of Chrome and Chromedriver.
Let’s take a screenshot of a real SPA
We are going to take a screenshot of the Coinbase website, which is a cryptocurrency exchange made with the React framework, and full of API calls and websockets!

We are going to manipulate Chrome in headless mode using the Selenium API. The first thing we have to do is create a WebDriver object, whose role is similar to the WebClient object in HtmlUnit, and set the chromedriver path and some arguments:
// Init chromedriver
String chromeDriverPath = "/Path/To/Chromedriver" ;
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200","--ignore-certificate-errors");
WebDriver driver = new ChromeDriver(options);
The --disable-gpu option is needed on Windows systems, according to the documentation. Chromedriver should automatically find the Google Chrome executable path; if you have a special installation, or if you want to use a different version of Chrome, you can do it with:
options.setBinary("/Path/to/specific/version/of/Google Chrome");
If you want to learn more about the different options, here is the Chromedriver documentation
The next step is to perform a GET request to the Coinbase website, wait for the page to load and then take a screenshot.
We have done this in a previous article, here is the full code :
public class ChromeHeadlessTest {
private static String userName = "" ;
private static String password = "" ;
public static void main(String[] args) throws IOException, InterruptedException{
String chromeDriverPath = "/path/to/chromedriver" ;
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200","--ignore-certificate-errors", "--silent");
WebDriver driver = new ChromeDriver(options);
// Get the login page
driver.get("https://pro.coinbase.com/trade/BTC-USD");
Thread.sleep(10000);
// Take a screenshot of the current page
File screenshot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(screenshot, new File("screenshot.png"));
driver.close();
driver.quit();
}
}
You should now have a nice screenshot of the Coinbase homepage.
Several things are going on here. The line with the Thread.sleep(10000) allows the browser to wait for the entire page to load. This is not necessarily the best method, because maybe we are waiting too long, or too little depending on multiple factors (your own internet connection, the target website speed etc.).
This is a common problem when scraping SPAs, and one way I like to solve this is by using the WebDriverWait object:
WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.
presenceOfElementLocated(By.xpath("/path/to/element")));
There are lots of different ExpectedConditions; you can find the documentation here. I often use ExpectedConditions.visibilityOfAllElementsLocatedBy(locator), because the element can be present but hidden until the asynchronous HTTP call is completed.
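For example, here is a minimal sketch (the XPath locator is just an assumption) that waits until the table rows are visible before reading them:
// Sketch: wait up to 20 seconds until the rows are visible, then use them
WebDriverWait wait = new WebDriverWait(driver, 20);
List<WebElement> rows = wait.until(
ExpectedConditions.visibilityOfAllElementsLocatedBy(By.xpath("//table//tr")));
System.out.println("Rows rendered: " + rows.size());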
This was a brief introduction to headless Chrome and Selenium, now let’s see some common and useful Selenium objects and methods!
Selenium API
In the Selenium API, almost everything is based around two interfaces :
- WebDriver, which is the HTTP client
- WebElement, which represents a DOM object
The WebDriver can be initialized with almost every browser, and with different options (and of course, browser-specific options) such as the window size, the log file's path, etc.
Here are some useful methods :
| Method | Description |
|---|---|
| driver.get(URL) | performs a GET request to the specified URL |
| driver.getCurrentUrl() | returns the current URL |
| driver.getPageSource() | returns the full HTML code for the current page |
| driver.navigate().back() | navigates one step back in the history, works with forward too |
| driver.switchTo().frame(frameElement) | switches to the specified iFrame |
| driver.manage().getCookies() | returns all cookies, lots of other cookie-related methods exist |
| driver.quit() | quits the driver, and closes all associated windows |
| driver.findElement(by) | returns a WebElement located by the specified locator |
The findElement() method is one of the most interesting for our scraping needs.
You can locate elements in different ways:
- findElement(By.xpath("/xpath/expression"))
- findElement(By.className(className))
- findElement(By.cssSelector(selector))
Once you have a WebElement object, there are several useful methods you can use (a combined sketch follows the table):
| Method | Description |
|---|---|
| findElement(By) | you can again use this method, using a relative selector |
| click() | clicks on the element, like a button |
| getText() | returns the inner text (meaning the text that is inside the element) |
| sendKeys('some string') | enters some text in an input field |
| getAttribute('href') | returns the attribute's value (in this example, the href attribute) |
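Here is a minimal sketch combining these methods on the sandbox search form we used earlier (the XPath for the submit button and the values entered are assumptions):
// Sketch: fill the search form with Selenium and read the first result row
driver.get("https://www.javawebscrapingsandbox.com/product/search");
driver.findElement(By.id("min_price")).sendKeys("100");
driver.findElement(By.id("max_price")).sendKeys("600");
driver.findElement(By.xpath("//button[@type='submit']")).click();
WebElement firstRow = driver.findElement(By.xpath("//table//tr"));
System.out.println(firstRow.getText());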
Infinite scroll
Infinite scroll is heavily used in social websites, news websites, or when dealing with a lot of information. We are going to see three different ways to scrape infinite scroll.
I've set up a basic infinite scroll here: Infinite Scroll. Basically, each time you scroll near the bottom of the page, an AJAX call is made to an API and more elements are added to the table.

Scrolling to the bottom
The first way of scraping this page is to make our headless browser scroll to the bottom of the page. There is a nice method we can use on the Window object, called scrollTo(). It is really simple to use: you give it an X and Y coordinate, and it will scroll to that location.
In order to execute this Javascript code, we are going to use a Javascript executor. It allows us to execute any Javascript code in the context of the current web page (or more specifically, the current tab). It means we have access to every Javascript function and variables defined in the current page.
In this example, note that the webpage is showing a fixed 20 rows in the table on the first load. So if our browser window is too big, we won’t be able to scroll. This “mistake” was made on purpose. To deal with this, we must tell our headless Chrome instance to open with a small window size !
String chromeDriverPath = "/path/to/chromedriver" ;
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless" ,"--disable-gpu", "--ignore-certificate-errors", "--silent");
// REALLY important option here, you must specify a small window size to be able to scroll
options.addArguments("window-size=600,400");
WebDriver driver = new ChromeDriver(options);
JavascriptExecutor js = (JavascriptExecutor) driver;
int pageNumber = 5 ;
driver.get("https://www.javawebscrapingsandbox.com/product/infinite_scroll");
for(int i = 0; i < pageNumber; i++){
js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
// There are better ways to wait, like using the WebDriverWait object
Thread.sleep(1200);
}
List<WebElement> rows = driver.findElements(By.xpath("//tr"));
// do something with the row list
processLines(rows);
driver.quit();
Executing a Javascript function
The second way of doing this is to inspect the Javascript code to understand how the infinite scroll is built. To do this, as usual, right click + inspect to open the Chrome Dev Tools, and find the <script> tag that contains the Javascript code:
$(document).ready(function() {
var win = $(window);
var page = 1 ;
var apiUrl = '/product/api/' + page ;
// Each time the user scrolls
var updatePage = function(){
apiUrl = apiUrl.replace(String(page), "");
page = page + 1;
apiUrl = apiUrl + page;
}
var drawNextLines = function(url){
win.data('ajaxready', false);
$.ajax({
url: url,
dataType: 'json',
success: function(json) {
for(var i = 0; i < json.length; i++){
var tr = document.createElement('tr');
var tdName = document.createElement('td');
var tdUrl = document.createElement('td');
var tdPrice = document.createElement('td');
tdName.innerText = json[i].name;
tdUrl.innerText = json[i].url ;
tdPrice.innerText = json[i].price;
tr.appendChild(tdName);
tr.appendChild(tdUrl);
tr.appendChild(tdPrice);
var table = document.getElementById('table');
table.appendChild(tr);
}
win.data('ajaxready', true);
if(url !== '/product/api/1' && url !== '/product/api/2'){
updatePage();
}
$('#loading').hide();
}
});
}
drawNextLines('/product/api/1');
drawNextLines('/product/api/2');
page = 3 ;
apiUrl = '/product/api/3';
// need to update the "ajaxready" variable not to fire multiple ajax calls when scrolling like crazy
win.data('ajaxready', true).scroll(function() {
// End of the document reached?
if (win.data('ajaxready') == false) return;
// fire the ajax call when we are about to "touch" the bottom of the page
// no more data past 20 pages
if (win.scrollTop() + win.height() > $(document).height() - 100 && page < 20) {
$('#loading').show();
drawNextLines(apiUrl);
}
});
});
You don’t have to understand everything there, the only information that is interesting is that each time we scroll near the bottom of the page (100 pixels to be precise) the drawNextLines() function is called. It takes one argument, a URL with this pattern /product/api/:id which will return 10 more rows.
Let’s say we want 50 more rows on our table. Basically we only have to make a loop and call drawNextLines() five times. If you look closely at the Javascript code, when the AJAX call is loading, we set the variable ajaxready to false. So we could check the status of this variable, and wait until it is set to true.
JavascriptExecutor js = (JavascriptExecutor) driver;
int pageNumber = 5 ;
driver.get("https://www.javawebscrapingsandbox.com/product/infinite_scroll");
// we start at i=3 because on the first load, /product/api/1 and /product/api/2 have already been called.
for(int i = 3; i < pageNumber + 3; i++){
js.executeScript("drawNextLines('/product/api/" + i +"');");
while((Boolean)js.executeScript("return win.data('ajaxready');") == false){
Thread.sleep(100);
}
}
List<WebElement> rows = driver.findElements(By.xpath("//tr"));
// do something with the rows
processLines(rows);
The “best” way
My favorite way of scraping websites using AJAX is to make the HTTP calls to the REST API endpoint directly. In this case, it's pretty easy to understand what API to call, because the Javascript code is straightforward, but sometimes it can be more complicated. A good method is to open the Chrome Dev Tools and look at what's happening in the "network" tab.

We can clearly see the API URL being called, and what the response looks like. Then we can use HtmlUnit or any other HTTP client to perform the requests we want, and parse the JSON response with the Jackson library, for example.
Let's say we want the first 50 rows:
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(false);
client.getOptions().setCssEnabled(false);
client.getOptions().setUseInsecureSSL(true);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
for(int i = 1; i <= 5; i++){
Page json = client.getPage("https://www.javawebscrapingsandbox.com/product/api/" + i );
parseJson(json.getWebResponse().getContentAsString());
}
The API responds with a JSON array, like this one:
[
{
id: 31,
name: "Marmot Drop Line Men's Jacket, Lightweight 100-Weight Sweater Fleece",
price: "74.96",
url: "https://www.amazon.com/gp/product/B075LC96R2/ref=ox_sc_sfl_title_39?ie=UTF8"
},
{
id: 32,
name: "ASUS ZenPad 3S 10 9.7" (2048x1536), 4GB RAM, 64GB eMMC, 5MP Front / 8MP Rear Camera, Android 6.0, Tablet, Titanium Gray (Z500M-C1-GR)",
price: "296.07",
url: "https://www.amazon.com/dp/B01MATMXZV?tag=thewire06-20"
},
{
id: 33,
name: "LG Electronics OLED65C7P 65-Inch 4K Ultra HD Smart OLED TV (2017 Model)",
price: "2596.99",
url: "https://www.amazon.com/gp/product/B01NAYM1TP/ref=ox_sc_sfl_title_35?ie=UTF8"
},
...
]
Here is a simple way to parse this JSON array: loop over every element and print it to the console. In general, we don't just want to print it; maybe you want to export it to a CSV file, or save it into a database…
public static void parseJson(String jsonString) throws JsonProcessingException, IOException{
ObjectMapper mapper = new ObjectMapper();
JsonNode rootNode = mapper.readTree(jsonString);
Iterator<JsonNode> elements = rootNode.elements();
while(elements.hasNext()){
JsonNode node = elements.next();
Long id = node.get("id").asLong();
String name = node.get("name").asText();
String price = node.get("price").asText();
System.out.println(String.format("Id: %s - Name: %s - Price: %s", id, name, price));
}
}
Here are some tips when working with JS rendered web pages:
- Try to find the hidden API using the network pane in Chrome Dev Tools
- Try to disable Javascript in your web browser, some websites switch to a server-side rendering in this case.
- Look for a mobile version of the target website, the UI is generally easier to scrape. You can check this using your own phone. If it works without redirecting to a mobile URL (like https://m.example.com or https://mobile.example.com), try to spoof the "User-Agent" request header in your request (see the sketch after this list)
- If the UI is tough to scrape, with lots of edge cases, look for Javascript variable in the code, and access the data directly using the Selenium Javascript Executor to evaluate this variable, as we saw earlier.
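Here is a minimal sketch of the User-Agent spoofing tip with HtmlUnit (the UA string is just one example of a mobile user agent):
// Sketch: pretend to be an iPhone by overriding the User-Agent header
WebClient mobileClient = new WebClient();
mobileClient.addRequestHeader("User-Agent",
"Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1");
HtmlPage mobilePage = mobileClient.getPage("https://www.example.com/");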
5. Captcha solving, PDF parsing, and OCR
In this chapter we are going to see how you can solve captchas, and why you should probably avoid them in the first place.
Captcha solving
Completely Automated Public Turing test to tell Computers and Humans Apart is what captcha stands for. Captchas are used to prevent bots/scripts from accessing and performing actions on websites or applications. There are dozens of different captcha types, but you should have seen at least these two:

And this one:

The last one is the most used captcha mechanism, Google ReCaptcha v2. That’s why we are going to see how to “break” these captchas.
The only thing the user has to do is click inside the checkbox. The service will then analyze lots of factors to determine if it is a real user or a bot. We don't know exactly how it is done, Google didn't disclose this for obvious reasons, but a lot of speculation has been made:
- Clicking behavior analysis: where did the user click ? Cursor acceleration etc.
- Browser fingerprinting
- Click location history (do you always click straight on the center, or is it random, like a normal user)
- Browser history and cookies
For old captchas like the first one, Optical Character Recognition and recent machine-learning frameworks offer excellent solving accuracy (sometimes better than humans…), but for Recaptcha v2 the easiest and most accurate way is to use third-party services.
Many companies offer captcha-solving APIs that use real human operators to solve captchas. I don't recommend one in particular, but I have found 2captcha.com easy to use and reliable, but unfortunately expensive (it is $2.99 for 1000 captchas).
Under the hood, 2captcha and other similar APIs need the specific site-key and the target website URL, with this information they are able to get a human operator to solve the captcha.

Technically the Recaptcha challenge is an iFrame with some magical Javascript code and some hidden input. When you “solve” the challenge, by clicking or solving an image problem, the hidden input is filled with a valid token.

It is this token that interests us, and 2captcha API will send it back. Then we will need to fill the hidden input with this token and submit the form.
The first thing you will need to do is create an account on 2captcha.com and add some funds.
You will then find your API key on the main dashboard.
As usual, I have set up an example webpage with a simple form with one input and a Recaptcha to solve:

We are going to use Chrome in headless mode to post this form and HtmlUnit to make the API calls to 2captcha (we could use any other HTTP client for this). Now let’s code.
final String API_KEY = "YOUR_API_KEY" ;
final String API_BASE_URL = "http://2captcha.com/" ;
final String BASE_URL = "https://www.javawebscrapingsandbox.com/captcha";
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(false);
client.getOptions().setCssEnabled(false);
client.getOptions().setUseInsecureSSL(true);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
// replace with your own chromedriver path
final String chromeDriverPath = "/usr/local/bin/chromedriver" ;
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200","--ignore-certificate-errors", "--silent");
options.addArguments("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36");
WebDriver driver = new ChromeDriver(options);
driver.get(BASE_URL);
Here is some boilerplate code to instantiate both the WebDriver and the WebClient, along with the API URL and key. Then we have to call the 2captcha API with the site-key, your API key, and the website URL, as documented here. The API is supposed to respond with a strange format, like this one: OK|123456.
String siteId = "" ;
WebElement elem = driver.findElement(By.xpath("//div[@class='g-recaptcha']"));
try {
siteId = elem.getAttribute("data-sitekey");
} catch (Exception e) {
System.err.println("Captcha's div cannot be found or missing attribute data-sitekey");
e.printStackTrace();
}
String QUERY = String.format("%sin.php?key=%s&method=userrecaptcha&googlekey=%s&pageurl=%s&here=now",
API_BASE_URL, API_KEY, siteId, BASE_URL);
Page response = client.getPage(QUERY);
String stringResponse = response.getWebResponse().getContentAsString();
String jobId = "";
if(!stringResponse.contains("OK")){
throw new Exception("Error with 2captcha.com API, received : " + stringResponse);
}else{
jobId = stringResponse.split("\\|")[1];
}
Now that we have the job ID, we have to loop over another API route to know when the ReCaptcha is solved and get the token, as explained in the documentation. It returns CAPCHA_NOT_READY and still the weirdly formatted OK|TOKEN when it is ready:
boolean captchaSolved = false ;
while(!captchaSolved){
response = client
.getPage(String.format("%sres.php?key=%s&action=get&id=%s", API_BASE_URL, API_KEY, jobId));
if (response.getWebResponse()
.getContentAsString().contains("CAPCHA_NOT_READY")){
Thread.sleep(3000);
System.out.println("Waiting for 2Captcha.com ...");
}else{
captchaSolved = true ;
System.out.println("Captcha solved !");
}
}
String captchaToken = response.getWebResponse().getContentAsString().split("\\|")[1];
Note that it can take up to one minute based on my experience. It could be a good idea to implement a safeguard/timeout in the loop, because on rare occasions the captcha never gets solved. Now that we have the magic token, we just have to find the hidden input, fill it with the token, and submit the form. The Selenium API cannot fill hidden inputs, so we have to manipulate the DOM to make the input visible, fill it, and make it hidden again so that we can click on the submit button:
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.getElementById('g-recaptcha-response').style.display = 'block';");
WebElement textarea = driver.findElement(By.xpath("//textarea[@id='g-recaptcha-response']"));
textarea.sendKeys(captchaToken);
js.executeScript("document.getElementById('g-recaptcha-response').style.display = 'none';");
driver.findElement(By.id("name")).sendKeys("Kevin");
driver.getPageSource();
driver.findElement(By.id("submit")).click();
if(driver.getPageSource().contains("your captcha was successfully submitted")){
System.out.println("Captcha successfuly submitted !");
}else{
System.out.println("Error while submitting captcha");
}
And that's it 🙂 Generally, websites don't use ReCaptcha for each HTTP request, but only for suspicious ones, or for specific actions like account creation, etc. You should always try to figure out if the website is showing you a captcha / Recaptcha because you made too many requests with the same IP address or the same user-agent, or because you made too many requests per second.
As you can see, "Recaptcha solving" is really slow and expensive ($3 for 1000 requests), so the best way to "solve" this problem is by avoiding captchas in the first place! In order to do so, we wrote an article about how to scrape websites without getting blocked, check it out!
PDF parsing
Adobe created the Portable Document Format in the early 90s. It is still heavily used today for cross-platform document sharing. Lots of websites use PDF export for documents, bills, manuals… And maybe you are reading this eBook in the PDF format. It can be useful to know how to extract pieces of information from PDF files, and that is what we are going to see.
I made a simple page, with a link to a PDF invoice. The invoice looks like this:

We are going to see how to download this PDF and extract information from it.
Prerequisites
We will need HtmlUnit to get the webpage and download the PDF, and the PDFBox library to parse it.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.4</version>
</dependency>
Downloading the PDF
Downloading the PDF is simple, as usual:
- Go to the target URL
- Find the specific anchor
- Extract the download URL from the anchor
- Use the
Pageobject to get the PDF, since it is not an HTML page - Check the content type of what we just downloaded, to make sure that it is an
application/pdf - Copy the InputStream to a File
Here is the code:
HtmlPage html = client.getPage("https://www.javawebscrapingsandbox.com/pdf");
// selects the first anchor which contains "pdf"
HtmlAnchor anchor = html.getFirstByXPath("//a[contains(@href, 'pdf')]");
String pdfUrl = anchor.getHrefAttribute();
Page pdf = client.getPage(pdfUrl);
if(pdf.getWebResponse().getContentType().equals("application/pdf")){
System.out.println("Pdf downloaded");
IOUtils.copy(pdf.getWebResponse().getContentAsStream(),
new FileOutputStream("invoice.pdf"));
System.out.println("Pdf file created");
}
Parsing the PDF
Now that we have the PDF file on disk, we can load it into PDFBox to extract the content as a String. We are going to extract the price from this invoice.
Once we have the text content from the PDF, it is easy to extract anything from it, using a regular expression. The text looks like this:
Title
Company Name
4321 First Street
Anytown, State ZIP
Date: 22/06/2018
Project Title: Project Name
Project Description: Description Here
P.O. Number: 12345
Invoice Number: 67890
Terms: 30 Days
Thank you for your business. It’s a pleasure to work with you on your project.

Your next order will ship in 30 days.
Sincerely yours,
Urna Semper
Description Quantity Unit Price Cost
Item 1 55 € 100 € 5 500
Item 2 13 € 90 € 1 170
Item 3 25 € 50 € 1 250
Subtotal € 7 920
Tax 8,25 % € 653
Total € 8 573
!1
INVOICE
123-456-7890
no_reply@example.com
1234 Main Street
Anytown, State
ZIP
COMPANY NAME
We just have to loop over each line, and use a regular expression with a capturing group like this one: "Total\s+€\s+(.+)" to extract the total price. We could extract everything we want with another regex, like the email address, the postal address, invoice number…
Here is the full code:
PDDocument document = null;
try{
document = PDDocument.load(new File("invoice.pdf")) ;
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
PDFTextStripper tStripper = new PDFTextStripper();
String stringPdf = tStripper.getText(document);
String lines[] = stringPdf.split("\n");
String pattern = "Total\\s+€\\s+(.+)";
Pattern p = Pattern.compile(pattern);
String price = "";
for (String line : lines) {
Matcher m = p.matcher(line);
if(m.find()){
price = m.group(1);
}
}
if(!price.isEmpty()){
System.out.println("Price found: " + price);
}else{
System.out.println("Price not found");
}
}finally{
if(document != null){
document.close();
}
}
There are many methods in the PDFBox library: you can work with password-protected PDFs, extract a specific text area, and much more; here is the documentation.
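For instance, the PDFTextStripperByArea class that is instantiated (but unused) in the code above can restrict the extraction to a rectangle of the page. Here is a minimal sketch; the region name and coordinates are assumptions and would need to be adjusted to the actual invoice layout:
// Sketch: extract only a specific rectangular area of the first page
PDDocument doc = PDDocument.load(new File("invoice.pdf"));
PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();
// region name is arbitrary; coordinates are x, y, width, height in points
areaStripper.addRegion("totalBox", new java.awt.geom.Rectangle2D.Double(300, 500, 250, 100));
areaStripper.extractRegions(doc.getPage(0));
System.out.println(areaStripper.getTextForRegion("totalBox"));
doc.close();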
Optical Character Recognition
Now that we have seen how to deal with PDFs, we are going to see how to handle text inside images. Using text inside images is an obfuscation technique aimed at making extraction difficult for bots. You can often find these techniques on blogs or marketplaces to "hide" an email address or phone number.
Extracting text from an image is called "Optical Character Recognition" or OCR. There are many OCR libraries available, but we are going to use Tesseract, which is one of the best open source OCR libraries.
Installation
Installing Tesseract and all its dependencies is really easy; on Linux:
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
And on macOS:
brew install tesseract
More information about installing Tesseract with specific tags can be found here
Tesseract is written in C++, so we need some kind of Java bindings. We are going to use the http://bytedeco.org/ bindings:
<dependency>
    <groupId>org.bytedeco.javacpp-presets</groupId>
    <artifactId>tesseract-platform</artifactId>
    <version>3.05.01-1.4.1</version>
</dependency>
Tesseract example
I took a screenshot of the previous PDF:

Let’s say we want to extract the invoice number.
The first thing to do is locate your tessdata folder; it contains everything Tesseract needs to recognize language-specific characters. The location will vary depending on how you installed Tesseract.
final static String TESS_DATA_PATH = "/path/to/tessdata" ;
Here is the full code:
BytePointer outText;
TessBaseAPI api = new TessBaseAPI();
if (api.Init(TESS_DATA_PATH, "eng") != 0) {
System.err.println("Could not initialize tesseract.");
System.exit(1);
}
PIX image = lept.pixRead("ocr_exemple.jpg");
api.SetImage(image);
// Get OCR result
outText = api.GetUTF8Text();
String string = outText.getString();
String invoiceNumber = "" ;
for(String lines : string.split("\n")){
if(lines.contains("Invoice")){
invoiceNumber = lines.split("Invoice Number: ")[1];
System.out.println(String.format("Invoice number found : %s", invoiceNumber));
}
}
// Destroy used object and release memory
api.End();
outText.deallocate();
lept.pixDestroy(image);
This was just an example of how to use Tesseract for simple OCR. I'm not an expert on OCR and image processing, but here are some tips:
- Use api.SetVariable("tessedit_char_whitelist", "0123456789,") to only include numerical characters. This will avoid confusions like l instead of 1; see the documentation for more information about this.
6. Stay under cover
In this chapter, we are going to see how to make our bots look like humans. For various reasons, there are sometimes anti-bot mechanisms implemented on websites. The most obvious reason to protect sites from bots is to prevent heavy automated traffic from impacting a website's performance. Another reason is to stop bad behavior from bots, like spam.
There are various protection mechanisms. Sometimes your bot will be blocked if it makes too many requests per second/hour/day. Sometimes there is a rate limit on how many requests per IP address are allowed. The most difficult protection is when there is user behavior analysis. For example, the website could analyze the time between requests, or whether the same IP is making requests concurrently.
You won’t necessarily need all the advice in this chapter, but it might help you in case your bot is not working, or things don’t work in your Java code the same as it works with a real browser.
In Chapter 3 we introduced HTTP headers. Your browser systematically includes 6-7 headers, as you can see by inspecting a request in your browser's network inspector:

If you don’t send these headers in your requests, the target server can easily recognize that your request is not sent from a regular web browser. If the server has some kind of anti-bot mechanism, different things can happen:
- The HTTP response can change
- Your IP address could be blocked
- Captcha
- Rate limit on your requests
HtmlUnit provides a really simple way to customize our HTTP client's headers:
WebClient client = new WebClient();
client.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
client.addRequestHeader("Accept-Encoding", "gzip, deflate, br");
client.addRequestHeader("Accept-Language", "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,la;q=0.6");
client.addRequestHeader("Connection", "keep-alive");
client.addRequestHeader("Host", "ksah.in");
client.addRequestHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36");
client.addRequestHeader("Pragma", "no-cache");
We could go even further, and assign a random User-Agent to our WebClient. Randomizing user-agents will help a lot to hide our bot. A good solution is to create a list of common User-Agents and pick a random one.
You can find such a list here https://developers.whatismybrowser.com/useragents/explore/
We could create a file with a lot of different user agents:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1
Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13
Mozilla/5.0 (Windows NT 5.1; rv:5.0.1) Gecko/20100101 Firefox/5.0.1
Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0
Mozilla/5.0 (Windows NT 6.1; rv:2.0b7pre) Gecko/20100921 Firefox/4.0b7pre
Mozilla/5.0 (X11; U; Linux x86; fr-fr) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3 AppleWebKit/532.4 Safari/532.4
Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5
And then have a little helper method that reads this file, and returns a random user agent:
private static String getRandomUseragent(){
List<String> userAgents = new ArrayList<String>();
Random rand = new Random();
try (BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {
String sCurrentLine;
while ((sCurrentLine = br.readLine()) != null) {
userAgents.add(sCurrentLine);
}
} catch (IOException e) {
e.printStackTrace();
}
return userAgents.get(rand.nextInt(userAgents.size()));
}
We can then assign a random user agent to the WebClient instance:
client.addRequestHeader("User-Agent", getRandomUseragent());
Proxies
The easiest solution to hide our scrapers is to use proxies. In combination with random user-agents, using a proxy is a powerful method to hide our scrapers, and to scrape rate-limited web pages. Of course, it's better not to be blocked in the first place, but sometimes websites allow only a certain number of requests per day/hour.
In these cases, you should use a proxy. There are lots of free proxy lists; I don't recommend using these, because they are often slow and unreliable, and websites offering these lists are not always transparent about where the proxies are located. Sometimes the public proxy list is operated by a legit company offering premium proxies, and sometimes not… What I recommend is using a paid proxy service, or you could build your own.
Setting a proxy to HtmlUnit is easy:
ProxyConfig proxyConfig = new ProxyConfig("host", myPort);
client.getOptions().setProxyConfig(proxyConfig);
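Just like with user-agents, you could rotate proxies by picking a random one from a file. Here is a minimal sketch, assuming a hypothetical proxies.txt file with one host:port entry per line:
// Sketch: pick a random proxy from proxies.txt (hypothetical file, one "host:port" per line)
List<String> proxies = Files.readAllLines(Paths.get("proxies.txt"));
String[] parts = proxies.get(new Random().nextInt(proxies.size())).split(":");
ProxyConfig randomProxy = new ProxyConfig(parts[0], Integer.parseInt(parts[1]));
client.getOptions().setProxyConfig(randomProxy);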
Scrapoxy is a great open source API, allowing you to build a proxy API on top of different cloud providers.
https://scrapoxy.io/
Scrapoxy creates a proxy pool by creating instances on various cloud providers (AWS, OVH, Digital Ocean). Then you will configure HtmlUnit or any HTTP client with the Scrapoxy URL, and it will automatically assign a proxy inside the proxy pool.
You can configure Scrapoxy to fit your needs: set a minimum/maximum instance number and manage blacklisting, either within the configuration file (for example, you could blacklist any proxy receiving a 503 HTTP response) or programmatically with the REST API, in case the website blocks you with a captcha or a special web page.
TOR: The Onion Router
https://www.torproject.org/
TOR, also known as The Onion Router, is a worldwide computer network designed to route traffic through many different servers to hide its origin. TOR usage makes network surveillance / traffic analysis very difficult. There are a lot of use cases for TOR, such as privacy, freedom of speech, journalists in dictatorship regimes, and of course, illegal activities.
In the context of web scraping, TOR can hide your IP address, and change your bot’s IP address every 10 minutes. The TOR exit nodes IP addresses are public. Some websites block TOR traffic using a simple rule: if the server receives a request from one of TOR public exit node, it will block it. That’s why in many cases, TOR won’t help you, compared to classic proxies.
Using TOR is really easy: go to the download page, or use your package manager; on macOS:
brew install tor
Then you have to launch the TOR daemon, and set the proxy config for the WebClient:
WebClient webClient = new WebClient();
// 9150 is the SOCKS port exposed by the Tor Browser bundle; the standalone
// tor daemon listens on 9050 by default. The third argument enables SOCKS mode.
ProxyConfig prc = new ProxyConfig("localhost", 9150, true);
webClient.getOptions().setProxyConfig(prc);
Tips
Cookies
Cookies are used for lots of reasons, as discussed earlier. If you find that the target website responds differently to your bots, try to analyze the cookies that are set by client-side Javascript code and inject them manually. You could also use Chrome in headless mode for better cookie handling.
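With HtmlUnit, injecting a cookie you observed in a real browser session is straightforward. A minimal sketch; the domain, name and value below are placeholders:
// Inject a cookie observed in a real browser session before loading the page.
CookieManager cookieManager = client.getCookieManager();
cookieManager.setCookiesEnabled(true);
cookieManager.addCookie(new Cookie("example.com", "some_cookie_name", "some_value"));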
Timing
If you want to hide your scrapers, you have to behave like a human. Timing is key. Humans don't mass-click links 0.2 seconds after arriving on a web page, and they don't click a new link exactly every 5 seconds either. Add some random delay between your requests to hide your scrapers, as in the sketch below.
Fast scraping is not a good practice. You will get blocked, and on small websites it puts a lot of pressure on the servers; it can even be illegal in some cases, since it can be considered an attack.
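A simple way to add this delay is to sleep for a random duration between two requests; the bounds below are arbitrary:
// Wait between 2 and 6 seconds before the next request.
private static void randomPause() throws InterruptedException {
    int delayMs = 2000 + new Random().nextInt(4000);
    Thread.sleep(delayMs);
}
Call randomPause() between two calls to client.getPage() to space out your requests.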
Invisible elements
Invisible elements are a technique often used to detect bots crawling a website. Generally, one or more elements are hidden with CSS, and some code notifies the website's server if the element is clicked or a hidden link is requested. The server then blocks the bot's IP address.
A good way to avoid this trap is to use the isDisplayed() method with the Selenium API:
WebElement elem = driver.findElement(By.xpath("//div[@class='something']"));
if(elem.isDisplayed()){
// do something
}
Another technique is to include hidden inputs in a form. If you have problems submitting a form that contains hidden inputs, make sure you include those inputs in your request, and don’t modify their value.
<form>
<input type="hidden" name="itsatrap" value="value1"/>
<input type="text" name="email"/>
<input type="submit" value="Submit"/>
</form>
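With HtmlUnit, the safest approach is to fill only the visible fields and submit the form through its own submit button, so that hidden inputs are sent untouched. A minimal sketch based on the form above (the URL is a placeholder):
// Load the page, fill only the visible field, and submit through the form's
// own submit button so that hidden inputs keep their original values.
HtmlPage page = client.getPage("https://example.com/form-page");
HtmlForm form = page.getForms().get(0);
form.getInputByName("email").setValueAttribute("user@example.com");
HtmlSubmitInput submitButton = form.getInputByValue("Submit");
HtmlPage result = submitButton.click();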
7. Cloud scraping
Serverless
In this chapter, we are going to introduce serverless deployment for our bots. Serverless is a term referring to the execution of code inside ephemeral containers (Function as a Service, or FaaS). It is a hot topic in 2018: after the “micro-service” hype, here come the “nano-services”!
Cloud functions can be triggered by different things such as:
- An HTTP call to a REST API
- A job in a message queue
- A log
- An IoT event
Cloud functions are a really good fit for web scraping, for many reasons. Web scraping is I/O bound: most of the time is spent waiting for HTTP responses, so we don't need high-end CPUs. Cloud functions are cheap and easy to set up, and they are a good fit for parallel computing: we can create hundreds or thousands of functions at the same time for large-scale scraping.
Deploying an Azure function
https://azure.microsoft.com/
We are going to deploy a scraper as an Azure Cloud Function. I don't have a preferred vendor; AWS Lambda is a great platform too. Google Cloud doesn't support Java at the moment, only Node.js.
We are going to reuse the Hacker News scraper we built in chapter 3 and implement a little API on top of it, so that we can call this API with a page parameter, and the function will return a JSON array of the Hacker News items for this page number.
Prerequisites
You will need :
- A recent JDK and Maven
- The Azure CLI
- The Azure Functions Core Tools
- An Azure account
There are platform-specific installation instructions for each Azure component; I suggest you go through them carefully.
Once everything is installed on your system, make sure to log in with the Azure CLI:
az login
Creating, running and deploying a project
We are going to use a Maven archetype to create the project structure:
mvn archetype:generate \
    -DarchetypeGroupId=com.microsoft.azure \
    -DarchetypeArtifactId=azure-functions-archetype
Then Maven will ask you for details about the project. The generated code is concise and straightforward:
public class Function {
    /**
     * This function listens at endpoint "/api/hello". Two ways to invoke it using "curl" command in bash:
     * 1. curl -d "HTTP Body" {your host}/api/hello
     * 2. curl {your host}/api/hello?name=HTTP%20Query
     */
    @FunctionName("hello")
    public HttpResponseMessage<String> hello(
            @HttpTrigger(name = "req", methods = {"get"}, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<String>> request,
            final ExecutionContext context) {
        context.getLogger().info("Java HTTP trigger processed a request.");

        // Parse query parameter
        String query = request.getQueryParameters().get("name");
        String name = request.getBody().orElse(query);

        if (name == null) {
            return request.createResponse(400, "Please pass a name on the query string or in the request body");
        } else {
            return request.createResponse(200, "Hello, " + name);
        }
    }
}
The generated code does not protect the API. The AuthorizationLevel.ANONYMOUS means anyone can call the route. To implement an authorization mechanism in your function, read the Azure documentation on the subject.
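For example, switching the trigger to AuthorizationLevel.FUNCTION requires callers to provide a function key (via the code query parameter or the x-functions-key header). A minimal sketch of the same function with this change; everything else stays as generated:
@FunctionName("hello")
public HttpResponseMessage<String> hello(
        // FUNCTION level: the caller must provide a valid function key.
        @HttpTrigger(name = "req", methods = {"get"}, authLevel = AuthorizationLevel.FUNCTION) HttpRequestMessage<Optional<String>> request,
        final ExecutionContext context) {
    String name = request.getQueryParameters().get("name");
    return request.createResponse(200, "Hello, " + name);
}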
You can then test and run the generated code:
mvn clean package
mvn azure-functions:run
There might be some errors if you didn’t correctly install the previous requirements.
Deploying your Azure Function is as easy as:
mvn azure-functions:deploy
Azure will create a new URL for your function each time you deploy your app.
The first invocation will be very slow; it can sometimes take up to one minute. This “issue” is called cold start. The first time you invoke a function, or when you haven't called it for a “long” time (i.e. several minutes), Azure has to :
- spin up a server
- configure it
- load your function code and all the dependencies
and then it can run your code.
When the app is warm, it just has to run your code, and it will be much, much faster. If cold starts are an issue for you, you can use the dedicated mode.
More information about this subject can be found here.
You can see your function and the logs in your Azure Dashboard:
Updating the function
We are going to rename the function to hnitems. We can remove the POST method since we only need to handle GET requests. Then we need to check the page number parameter and handle the case where a non-numeric value is passed.
Basically, we just change the function name from hello to hnitems and the request parameter from name to pageNumber.
The HNScraper class is a slightly modified version of the one in chapter 3. The method scrape takes a pageNumber and returns a JSON Array of all hacker news items for this page. You can find the full code in the repository.
@FunctionName("hnitems")
public HttpResponseMessage<String> hnitems(
@HttpTrigger(name = "req", methods = {"get"}, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<String>> request,
final ExecutionContext context) {
context.getLogger().info("Java HTTP trigger processed a request.");
// Parse query parameter
String pageNumber = request.getQueryParameters().get("pageNumber");
if (pageNumber == null) {
return request.createResponse(400, "Please pass a pageNumber on the query string");
}else if(!StringUtils.isNumeric(pageNumber)) {
return request.createResponse(400, "Please pass a numeric pageNumber on the query string");
}else {
HNScraper scraper = new HNScraper();
String json;
try {
json = scraper.scrape(pageNumber);
} catch (JsonProcessingException e) {
e.printStackTrace();
return request.createResponse(500, "Internal Server Error while processing HN items: ");
}
return request.createResponse(200, json);
}
}
The full source code is available here: https://github.com/ksahin/java-scraping-azure-function/tree/master
You can now deploy the updated code using:
mvn clean package
mvn azure-functions:deploy
You should see your function URL in the logs. It's time to test our modified API (replace ${function_url} with your own URL):
curl https://${function_url}/api/hnitems?pageNumber=3
And it should respond with the corresponding JSON Array:
[
{
"title": "Nvidia Can Artificially Create Slow Motion That Is Better Than a 300K FPS Camera (vice.com)",
"url": "https://motherboard.vice.com/en_us/article/ywejmy/nvidia-ai-slow-motion-better-than-a-300000-fps-camera",
"author": "jedberg",
"score": 27,
"position": 121,
"id": 17597105
},
{
"title": "Why fundraising is a terrible experience for founders: Lessons learned (kapwing.com)",
"url": "https://www.kapwing.com/blog/the-terrible-truths-of-fundraising/",
"author": "jenthoven",
"score": 74,
"position": 122,
"id": 17594807
},
{
"title": "Why No HTTPS? (whynohttps.com)",
"url": "https://whynohttps.com",
"author": "iafrikan",
"score": 62,
"position": 123,
"id": 17599022
},
  //...
]
This is it. Instead of returning the JSON array, we could store it in one of the database systems supported by Azure.
I suggest you experiment, especially with messaging queues. An interesting architecture for your scraping project could be to send jobs to a message queue, have Azure Functions consume these jobs, and save the results into a database. You can read more about this subject here.
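As a starting point, here is a hypothetical sketch of a queue-triggered version, assuming a storage queue named hn-pages and the default AzureWebJobsStorage connection string; this is not part of the repository code:
// Hypothetical queue-triggered worker: each message contains a page number.
@FunctionName("hnqueueworker")
public void hnQueueWorker(
        @QueueTrigger(name = "message", queueName = "hn-pages", connection = "AzureWebJobsStorage") String pageNumber,
        final ExecutionContext context) {
    try {
        HNScraper scraper = new HNScraper();
        String json = scraper.scrape(pageNumber);
        context.getLogger().info("Scraped Hacker News page " + pageNumber);
        // Persist "json" to the database of your choice here.
    } catch (Exception e) {
        context.getLogger().severe("Scraping failed for page " + pageNumber + ": " + e.getMessage());
    }
}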
The possibilities offered by Azure and other cloud providers like Amazon Web Services are endless, and serverless architectures in particular are easy to implement. I really recommend you experiment with these tools.
Conclusion
This is the end of this guide. I hope you enjoyed it.
You should now be able to write your own scrapers, inspect the DOM and network requests, deal with Javascript, reproduce AJAX calls, beat Captchas and reCAPTCHA, hide your scrapers with different techniques, and deploy your code in the cloud!
If you want to scrape the web at scale, without getting blocked or having to deal with multiple headless browsers, don’t hesitate to try ScrapingBee.
We also have many other articles on our web scraping blog, especially about Python, so don't hesitate to take a look.
Happy Scraping!
Kevin
Kevin Sahin
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.