This tutorial explains how to use three technologies for web scraping with Scala. It first explains how to scrape a static HTML page with Scala using jsoup and Scala Scraper. Then, it explains how to scrape a dynamic HTML website with Scala using Selenium.

💡 Interested in web scraping with Java? Check out our guide to the best Java web scraping libraries.
Setting Up a Scala Project
The first step is to create a project in Scala. This tutorial uses Scala version 2.13.10 with sbt version 1.7.2. However, these examples also work with Scala 2.12 and 3.
Execute the following script to generate the necessary directories:
mkdir scala-web-scraping && cd $_
git init
echo '.bsp/
.idea/
target/' > .gitignore
mkdir -p src/{main,test}/{scala,resources} project
echo 'sbt.version = 1.7.2' > ./project/build.properties
echo 'ThisBuild / version := "0.1.0"
ThisBuild / scalaVersion := "2.13.10"
lazy val root = (project in file("."))' > build.sbt
Name your project dev.draft. Then, modify the build.sbt file to include the dependencies for jsoup 1.15.3, scala-scraper 3.0.0, and selenium-java 4.5.0:
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.13.10"
lazy val root = (project in file("."))
.settings(
name := "scala-web-scraping",
libraryDependencies ++= Seq(
"org.jsoup" % "jsoup" % "1.15.3",
"net.ruippeixotog" %% "scala-scraper" % "3.0.0",
"org.seleniumhq.selenium" % "selenium-java" % "4.5.0"
)
)
Basic Web Scraping with jsoup
In the directory src/main/scala/, make a new package called dev.draft, and inside that package, make a file called JsoupScraper.scala with the following contents:
package dev.draft
import org.jsoup._
import scala.jdk.CollectionConverters._
object JsoupScraper {
def main(args: Array[String]): Unit = {
val doc = Jsoup.connect("http://en.wikipedia.org/").get()
}
}
Following the jsoup documentation, this particular line calls the connect method of the org.jsoup.Jsoup class to download the web page you're scraping:
val doc = Jsoup.connect("http://en.wikipedia.org/").get()
The org.jsoup.Jsoup class is the main entry point to the library. You use the connect method here to download the entire body of the page. jsoup also provides a parse method, but it can only examine HTML that's already available, such as a string or a locally stored document. The main difference between the two methods is that connect downloads and parses, while parse simply parses without downloading.
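To make the distinction concrete, here's a minimal sketch using parse on an HTML string held in memory; the markup is invented for the example, and no network request is involved:

```scala
import org.jsoup.Jsoup

object ParseDemo {
  def main(args: Array[String]): Unit = {
    // parse works on HTML you already have; nothing is downloaded
    val html = "<html><head><title>Local page</title></head><body><p>Hello</p></body></html>"
    val doc = Jsoup.parse(html)
    println(doc.title()) // prints "Local page"
  }
}
```

This is handy for experimenting with selectors before pointing your scraper at a live site.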
The doc is a nodes.Document type that contains the following:
doc: nodes.Document = <!doctype html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";
To get the page title, use the following command:
val title = doc.title()
If you use println(title), the type and value of title should be displayed as follows:
title: String = "Wikipedia, the free encyclopedia"
For practical purposes, in this tutorial, you’ll only use selection (select) and extraction (text and attr) methods. However, jsoup has many other functions aside from performing queries and modifying HTML documents. For example, it can also be used to perform unit tests on generated HTML code.
Selecting with jsoup
In this tutorial, you’ll select from three sections on the
Wikipedia home page
:
- In the news
- On this day
- Did you know
While in your web browser on the Wikipedia page, right-click the In the news section. In the context menu, select Inspect in Firefox or View page source in Chrome. The relevant source code is contained in an element whose id has the value mp-itn, so use that id to obtain the contents of the section:
val inTheNews = doc.select("#mp-itn b a")
If you use println(inTheNews), the resulting type and values should look similar to the following:
inTheNews: select.Elements = <a href="/wiki/AnnieErnaux" title="Annie Ernaux">Annie Ernaux</a>
<a href="/wiki/2022_Nong_Bua_Lamphu_attack" title="2022 Nong Bua Lamphu attack">An attack</a>
<a href="/wiki/Svante_P%C3%A4%C3%A4bo" title="Svante Paabo">Svante Paabo</a>
<a href="/wiki/2022_London_Marathon" title="2022 London Marathon">the London Marathon</a>
<a href="/wiki/Portal:Current_events" title="Portal:Current events">Ongoing</a>
<a href="/wiki/Deaths_in_2022" title="Deaths in 2022">Recent deaths</a>
<a href="/wiki/Wikipedia:In_the_news/Candidates" title="Wikipedia:In the news/Candidates">Nominate an article</a>
Follow the same steps as before to view the source code and get the contents of the On this day section, and you should find an id with the value mp-otd, which you can use to obtain this section's elements:
val onThisDay = doc.select("#mp-otd b a")
The resulting type and values should look similar to the following:
onThisDay: select.Elements = <a href="/wiki/October_10" title="October 10">October 10</a>
<a href="/wiki/Thanksgiving_(Canada)" title="Thanksgiving (Canada)">Thanksgiving</a>
<a href="/wiki/Battle_of_Karbala" title="Battle of Karbala">Battle of Karbala</a>
<a href="/wiki/Ndyuka_people" title="Ndyuka people">Ndyuka people</a>
<a href="/wiki/Triton_(moon)" title="Triton (moon)">Triton</a>
<a href="/wiki/Spiro_Agnew" title="Spiro Agnew">Spiro Agnew</a>
<a href="/wiki/Vidyasagar_Setu" title="Vidyasagar Setu">Vidyasagar Setu</a>
Once again, follow the same steps to view the source code and get the contents of the Did you know section, and you should get an id with the value mp-dyk, which you can use to obtain this section's elements:
val didYouKnow = doc.select("#mp-dyk b a")
The resulting type and values should look similar to the following:
didYouKnow: select.Elements =
<a href="/wiki/Ranjit_Vilas_Palace_(Wankaner)" title="Ranjit Vilas Palace (Wankaner)">Ranjit Vilas Palace</a>
<a href="/wiki/Tova_Friedman" title="Tova Friedman">Tova Friedman</a>
<a href="/wiki/Ampullae_of_Lorenzini" title="Ampullae of Lorenzini">ampullae of Lorenzini</a>
<a href="/wiki/Gilbert_Bundy" title="Gilbert Bundy">Gilbert Bundy</a>
<a href="/wiki/Hours_of_Charles_the_Noble" title="Hours of Charles the Noble">Hours of Charles the Noble</a>
<a href="/wiki/Cleo_Damianakes" title="Cleo Damianakes">Cleo Damianakes</a>
To grab data within the HTML document for each section above, you use the select method, which takes a string that represents a CSS selector. You use CSS selector syntax to extract elements from the document that meet the specified search criteria.
The selector criteria are as follows:
- bar extracts all elements (tags) with that name, for example <bar>.
- As you saw before, #bar extracts all elements with that id, for example <baz id="bar">.
- Selectors can be combined to extract elements that meet multiple criteria. For example, bar#baz.foo would match an element <bar> with id="baz" and class="foo".
- Note that if there are any blank spaces between selectors, they'll combine to get elements that match the leftmost selector and any descendant elements that meet the remaining criteria. For example, bar #baz .foo would match any element with class="foo" nested at any depth inside an element with id="baz", itself inside a <bar> element.
- Using the > character, for example in bar > #baz > .foo, selects only the direct children. It ignores members nested more deeply, such as grandchildren.
In the three examples above, you combined selectors with spaces, for example #mp-otd b a. This notation matches each article link (the inner <a> tag) inside bold text (the outer <b> tag) within the element whose id is mp-otd.
In addition to the select method, other methods of iterating through the elements of a selection include next, nextAll, nextSibling, and nextElementSibling.
Now that you have the required elements, the next step is to obtain the data inside each element. HTML elements have three parts, each of which has a corresponding method of retrieval in jsoup:
- The children method is used to obtain child elements.
- The text method is used to extract text content. For example, it extracts the string No more pre-text from an element like <p>No more pre-text</p>.
- The attr method extracts attribute values: .attr("bar") extracts the foo value from bar="foo".
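Putting select, text, and attr together, here's a small self-contained sketch that runs against an invented HTML snippet (mimicking the structure of a Wikipedia section) instead of the live page:

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object SelectDemo {
  def main(args: Array[String]): Unit = {
    // Invented markup: two bold article links inside a section div
    val html =
      """<div id="mp-demo">
        |  <b><a href="/wiki/First" title="First">First</a></b>
        |  <b><a href="/wiki/Second" title="Second">Second</a></b>
        |</div>""".stripMargin
    val doc = Jsoup.parse(html)
    // "#mp-demo b a": <a> tags inside <b> tags inside the element with id mp-demo
    val links = doc.select("#mp-demo b a")
    for (link <- links.asScala)
      println(s"${link.text} -> ${link.attr("href")}")
    // First -> /wiki/First
    // Second -> /wiki/Second
  }
}
```

Because the input is fixed, the output is predictable, which makes this a convenient pattern for testing selectors.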
For example, the following command obtains the title and the link href of each element:
val otds = for (otd <- onThisDay.asScala) yield (otd.attr("title"), otd.attr("href"))
The type and values are as follows:
otds: collection.mutable.Buffer[(String, String)] = ArrayBuffer(("October 10", "/wiki/October_10"), ("Thanksgiving (Canada)", "/wiki/Thanksgiving_(Canada)"), ("Battle of Karbala", "/wiki/Battle_of_Karbala"), ("Ndyuka people", "/wiki/Ndyuka_people"), ("Triton (moon)", "/wiki/Triton_(moon)"), ("Spiro Agnew", "/wiki/Spiro_Agnew"), ("Vidyasagar Setu", "/wiki/Vidyasagar_Setu"))
The following command retrieves only the headlines:
val headers = for (otd <- onThisDay.asScala) yield otd.text
The type and values are as follows:
headers: collection.mutable.Buffer[String] = ArrayBuffer("October 10", "Thanksgiving", "Battle of Karbala", "Ndyuka people", "Triton", "Spiro Agnew", "Vidyasagar Setu")
Web Scraping with Scala Scraper
Inside the directory src/main/scala/dev/draft, make a file called ScalaScraper.scala with the following contents:
package dev.draft
import net.ruippeixotog.scalascraper.browser._
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
object ScalaScraper {
def main(args: Array[String]): Unit = {
val browser = JsoupBrowser()
}
}
Following the Scala Scraper documentation, the first step is to call the constructor JsoupBrowser(). As the name suggests, this generates a web browser implementation based on jsoup. However, unlike other browsers, JsoupBrowser doesn't run JavaScript and only works with HTML. In the above code, you call JsoupBrowser() using the following command:
val browser = JsoupBrowser()
You'll then use the get method of the JsoupBrowser class to download the web page you're going to scrape:
val doc = browser.get("http://en.wikipedia.org/")
You use the get method here to download the entire body of the page. Although parseFile is another possible method, it can only examine documents stored locally. The main difference between the two methods is that get downloads and parses, while parseFile just parses without downloading.
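As a quick, self-contained illustration of parsing without downloading, here's a sketch using JsoupBrowser's parseString method on an invented HTML string:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

object ParseStringDemo {
  def main(args: Array[String]): Unit = {
    val browser = JsoupBrowser()
    // parseString works on HTML held in memory; parseFile does the same
    // for a local file. Neither makes a network request.
    val doc = browser.parseString(
      "<html><head><title>Local page</title></head><body></body></html>")
    println(doc.title) // prints "Local page"
  }
}
```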
The doc is a JsoupDocument type that contains the following:
JsoupDocument(<!doctype html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js"
To get the page title, use the following command:
val title = doc.title
The type and value of the title are as follows:
title: String = "Wikipedia, the free encyclopedia"
For practical purposes, as with the jsoup examples, this tutorial only looks at the selection (the operator >>) and extraction (text and attr) methods. However, Scala Scraper has many other functions. It can perform queries and modifications on HTML documents and also perform unit tests on generated HTML code.
Selecting with Scala Scraper
The following code obtains the contents of the In the news section on the Wikipedia home page with Scala Scraper:
val inTheNews = doc >> elementList("#mp-itn b a")
If you use println(inTheNews), the resulting type and values should look similar to the following:
List(JsoupElement(<a href="/wiki/AnnieErnaux" title="Annie Ernaux">Annie Ernaux</a>), JsoupElement(<a href="/wiki/2022_Nong_Bua_Lamphu_attack" title="2022 Nong Bua Lamphu attack">An attack</a>), JsoupElement(<a href="/wiki/Svante_P%C3%A4%C3%A4bo" title="Svante Paabo">Svante Paabo</a>), JsoupElement(<a href="/wiki/2022_London_Marathon" title="2022 London Marathon">the London Marathon</a>), JsoupElement(<a href="/wiki/Portal:Current_events" title="Portal:Current events">Ongoing</a>), JsoupElement(<a href="/wiki/Deaths_in_2022" title="Deaths in 2022">Recent deaths</a>))
To view the contents of the On this day section, use the id with the value mp-otd to obtain its elements:
val onThisDay = doc >> elementList("#mp-otd b a")
The resulting type and values should look similar to the following:
List(JsoupElement(<a href="/wiki/October_11" title="October 11">October 11</a>), JsoupElement(<a href="/wiki/Mawlid" title="Mawlid">Mawlid</a>), JsoupElement(<a href="/wiki/James_the_Deacon" title="James the Deacon">Saint James the Deacon</a>), JsoupElement(<a href="/wiki/National_Coming_Out_Day" title="National Coming Out Day">National Coming Out Day</a>), JsoupElement(<a href="/wiki/Jin%E2%80%93Song_Wars" title="Jin–Song Wars">Jin–Song Wars</a>), JsoupElement(<a href="/wiki/Ordinances_of_1311" title="Ordinances of 1311">Ordinances of 1311</a>), JsoupElement(<a href="/wiki/Battle_of_Camperdown" title="Battle of Camperdown">Battle of Camperdown</a>))
Likewise, to get the contents of the Did you know section, use the id with the value mp-dyk to obtain its elements:
val didYouKnow = doc >> elementList("#mp-dyk b a")
The resulting type and values should look similar to the following:
List(JsoupElement(<a href="/wiki/East_African_Mounted_Rifles" title="East African Mounted Rifles">East African Mounted Rifles</a>), JsoupElement(<a href="/wiki/Kiriko_(Overwatch)" title="Kiriko (Overwatch)">Kiriko</a>), JsoupElement(<a href="/wiki/Doctor_Who_(season_2)" title="Doctor Who (season 2)">the second season</a>), JsoupElement(<a href="/wiki/First_National_Bank_Tower" title="First National Bank Tower">First National Bank Tower</a>), JsoupElement(<a href="/wiki/Roger_Robinson_(academic)" title="Roger Robinson (academic)">Roger Robinson</a>), JsoupElement(<a href="/wiki/M_Club_banner" title="M Club banner">Michigan banner</a>))
In the three examples above, you combined the selectors with spaces, for example #mp-otd b a. This notation matches each article link (the inner <a> tag) inside bold text (the outer <b> tag) within the element whose id is mp-otd.
As with the jsoup example, the next step is to obtain the data inside each element. Scala Scraper's corresponding methods for the three different parts of HTML elements are as follows:
- The children method is used to extract child elements.
- The text method is used to extract text content. For example, it extracts the string No more pre-text from an element like <p>No more pre-text</p>.
- The attr method extracts attributes. For example, you'd use .attr("bar") to get the foo value from bar="foo".
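As a self-contained illustration of elementList together with the attr and text extractors, here's a sketch against an invented HTML snippet, using parseString so no download is needed:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object ExtractDemo {
  def main(args: Array[String]): Unit = {
    // Invented markup mimicking the structure of a Wikipedia section
    val doc = JsoupBrowser().parseString(
      """<div id="mp-demo">
        |  <b><a href="/wiki/First" title="First">First</a></b>
        |  <b><a href="/wiki/Second" title="Second">Second</a></b>
        |</div>""".stripMargin)
    // elementList selects all matches; >> applies an extractor to each
    val links = doc >> elementList("#mp-demo b a")
    for (link <- links)
      println((link >> attr("title"), link >> attr("href"), link >> text))
  }
}
```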
For example, the following command obtains the title and the link href of each element:
val otds = for (otd <- onThisDay) yield (otd >> attr("title"), otd >> attr("href"))
The type and values are as follows:
List((October 11,/wiki/October_11), (Mawlid,/wiki/Mawlid), (James the Deacon,/wiki/James_the_Deacon), (National Coming Out Day,/wiki/National_Coming_Out_Day), (Jin–Song Wars,/wiki/Jin%E2%80%93Song_Wars), (Ordinances of 1311,/wiki/Ordinances_of_1311), (Battle of Camperdown,/wiki/Battle_of_Camperdown))
The following instruction obtains just the headlines:
val headers = for (otd <- onThisDay) yield otd >> text
The type and values are as follows:
List(October 11, Mawlid, Saint James the Deacon, National Coming Out Day, Jin–Song Wars, Ordinances of 1311, Battle of Camperdown)
Limitations of These Methods
One limitation of jsoup and Scala Scraper is that dynamic websites and single-page applications (SPAs) can't be scraped. As mentioned before, JsoupBrowser just scrapes HTML documents. If you want to scrape a dynamic website or interact with JavaScript code, you'll need to use a headless browser like Selenium.
Advanced Web Scraping with Selenium
Selenium is a tool that can be used to build bots and automate unit tests in addition to being used for scraping. Below, you'll use Selenium to run the same examples you executed with jsoup and Scala Scraper.
First, you need to download a WebDriver client. Note that the instructions for downloading and installing the client differ for Firefox and Chrome. This tutorial uses Firefox, so download the latest geckodriver release and ensure it can be found on your system PATH.
In the directory src/main/scala/dev/draft, make a file called SeleniumScraper.scala with the following contents:
package dev.draft
import java.time.Duration
import org.openqa.selenium.By
import org.openqa.selenium.firefox.FirefoxDriver
object SeleniumScraper {
def main(args: Array[String]): Unit = {
System.setProperty("webdriver.gecko.driver", "/usr/local/bin/geckodriver")
val driver = new FirefoxDriver
driver.manage.window.maximize()
driver.manage.deleteAllCookies()
driver.manage.timeouts.pageLoadTimeout(Duration.ofSeconds(40))
driver.manage.timeouts.implicitlyWait(Duration.ofSeconds(30))
driver.get("http://en.wikipedia.org/")
val inTheNews = driver.findElement(By.cssSelector("#mp-itn b a"))
println(inTheNews.getText)
val onThisDay = driver.findElement(By.cssSelector("#mp-otd b a"))
println(onThisDay.getText)
val didYouKnow = driver.findElement(By.cssSelector("#mp-dyk b a"))
println(didYouKnow.getText)
driver.quit()
}
}
In the code above, you obtain the same three sections of the Wikipedia home page. Note the use of By.cssSelector rather than By.id: a string like #mp-itn b a is a CSS selector, not a plain id, so By.id wouldn't match anything here.
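One caveat: findElement returns only the first matching element. To collect every link in a section, as the jsoup and Scala Scraper examples did, a sketch using findElements (which returns all matches as a Java list) might look like this; it assumes the same geckodriver setup as above:

```scala
import org.openqa.selenium.By
import org.openqa.selenium.firefox.FirefoxDriver
import scala.jdk.CollectionConverters._

object SeleniumListDemo {
  def main(args: Array[String]): Unit = {
    val driver = new FirefoxDriver
    try {
      driver.get("http://en.wikipedia.org/")
      // findElements returns every match, not just the first one
      val links = driver.findElements(By.cssSelector("#mp-itn b a")).asScala
      links.foreach(link => println(link.getText))
    } finally {
      driver.quit() // always release the browser, even if scraping fails
    }
  }
}
```

Wrapping the work in try/finally ensures the browser process is shut down even when a selector throws.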
As mentioned, Selenium can also scrape dynamic web pages. For example, on the Related Words web page, you can type a word to retrieve all related words and their respective links. The following code will retrieve all the dynamically generated words related to the word Draft:
package dev.draft
import java.time.Duration
import org.openqa.selenium.By
import org.openqa.selenium.firefox.FirefoxDriver
object SeleniumScraper {
def main(args: Array[String]): Unit = {
System.setProperty("webdriver.gecko.driver", "/usr/local/bin/geckodriver")
val driver = new FirefoxDriver
driver.manage.window.maximize()
driver.manage.deleteAllCookies()
driver.manage.timeouts.pageLoadTimeout(Duration.ofSeconds(40))
driver.manage.timeouts.implicitlyWait(Duration.ofSeconds(30))
driver.get("https://relatedwords.org/relatedto/" + "Draft")
val relatedWords = driver.findElement(By.className("words"))
println(relatedWords.getText)
driver.quit()
}
}
Conclusion
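The implicit wait above applies globally to every lookup. For dynamically generated content, Selenium's explicit waits are often more precise: the following sketch (same geckodriver setup assumed as above) waits up to ten seconds until the words element is actually visible before reading it:

```scala
import java.time.Duration
import org.openqa.selenium.By
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.support.ui.{ExpectedConditions, WebDriverWait}

object SeleniumWaitDemo {
  def main(args: Array[String]): Unit = {
    val driver = new FirefoxDriver
    try {
      driver.get("https://relatedwords.org/relatedto/Draft")
      // Explicitly wait until the JavaScript-rendered list is visible,
      // rather than relying on a global implicit wait
      val wait = new WebDriverWait(driver, Duration.ofSeconds(10))
      val words = wait.until(
        ExpectedConditions.visibilityOfElementLocated(By.className("words")))
      println(words.getText)
    } finally {
      driver.quit()
    }
  }
}
```

Explicit waits fail fast with a clear timeout when the element never appears, which makes flaky dynamic pages easier to debug.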
Scala Scraper and jsoup are sufficient when you have to parse a static HTML web page or validate generated HTML code. However, when you need to validate dynamic web pages or JavaScript code, you need to use tools like Selenium.
In this tutorial, you learned how to set up a Scala project and use jsoup and Scala Scraper to load and parse HTML. You were also introduced to some web scraping techniques. Finally, you saw how a headless browser library like Selenium can be used to scrape a dynamic website.
If you're a JVM fan, don't hesitate to take a look at our guide about web scraping with Kotlin.
If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, please check out the web scraping API from ScrapingBee. Did you know that the first 1,000 calls are on us?

