Web Scraping with Scala | ScrapingBee

This tutorial explains how to use three technologies for web scraping with Scala. The article first explains how to scrape a static HTML page with Scala using jsoup and Scala Scraper. Then, it explains how to scrape a dynamic HTML website with Scala using Selenium.



💡 Interested in web scraping with Java? Check out our guide to the best Java web scraping libraries.

Setting Up a Scala Project

The first step is to create a project in Scala. This tutorial uses Scala version 2.13.10 with sbt version 1.7.2. However, these examples also work with Scala 2.12 and 3.

Execute the following script to generate the necessary directories:

mkdir scala-web-scraping && cd $_
git init
echo '.bsp/
.idea/
target/' > .gitignore
mkdir -p src/{main,test}/{scala,resources} project
echo 'sbt.version = 1.7.2' > ./project/build.properties
echo 'ThisBuild / version := "0.1.0"
ThisBuild / scalaVersion := "2.13.10"
lazy val root = (project in file("."))' > build.sbt

This tutorial uses dev.draft as the package name for its source files. Modify the build.sbt file to set the project name and include the dependencies for jsoup 1.15.3, scala-scraper 3.0.0, and selenium-java 4.5.0:

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.13.10"

lazy val root = (project in file("."))
 .settings(
   name := "scala-web-scraping",
   libraryDependencies ++= Seq(
     "org.jsoup" % "jsoup" % "1.15.3",
     "net.ruippeixotog" %% "scala-scraper" % "3.0.0",
     "org.seleniumhq.selenium" % "selenium-java" % "4.5.0"
   )
 )

Basic Web Scraping with jsoup

In the directory src/main/scala/, make a new package called dev.draft, and inside that package, make a file called JsoupScraper.scala with the following contents:

package dev.draft

import org.jsoup._
import scala.jdk.CollectionConverters._

object JsoupScraper {

  def main(args: Array[String]): Unit = {
    val doc = Jsoup.connect("http://en.wikipedia.org/").get()
  }
}

Following the jsoup documentation, this particular line calls the connect method of the org.jsoup.Jsoup class to download the web page you’re scraping:

val doc = Jsoup.connect("http://en.wikipedia.org/").get()

The org.jsoup.Jsoup class is the main entry point to the library, and most of the other jsoup classes live in subpackages of org.jsoup, such as org.jsoup.nodes and org.jsoup.select. The connect method creates a connection to the URL, and get() downloads the page and parses it into a document. The parse method has similar syntax, but it is intended for HTML you already have, such as a string or a local file. The main difference between the two is that connect followed by get() downloads and parses, while parse only parses.
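The difference between the two entry points can be sketched as follows. This is a minimal example, not part of the original project: the object name ConnectVsParse, the inline sampleHtml page, the user agent string, and the 10-second timeout are all assumptions chosen for illustration. The main method only calls parse, so it runs without network access.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

object ConnectVsParse {

  // Hypothetical inline page used instead of a real download.
  val sampleHtml: String =
    """<html><head><title>Sample page</title></head>
      |<body><p>Hello, jsoup</p></body></html>""".stripMargin

  // parse: no network access, the HTML is already in memory.
  def parseLocal(html: String): Document = Jsoup.parse(html)

  // connect: builds an HTTP request; get() performs it and parses the response.
  // The user agent and timeout (in milliseconds) are illustrative values.
  def fetchRemote(url: String): Document =
    Jsoup.connect(url)
      .userAgent("scala-web-scraping-tutorial")
      .timeout(10000)
      .get()

  def main(args: Array[String]): Unit = {
    val doc = parseLocal(sampleHtml)
    println(doc.title())            // Sample page
    println(doc.select("p").text()) // Hello, jsoup
  }
}
```

Both functions return the same Document type, so code written against a parsed string can later be pointed at a downloaded page without changes.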

The doc is a nodes.Document type that contains the following:

doc: nodes.Document = <!doctype html>
<html class="client-nojs" lang="en" dir="ltr">
   <head>
     <meta charset="UTF-8">
       <title>Wikipedia, the free encyclopedia</title>
       <script>document.documentElement.className="client-js";

To get the page title, use the following command:

val title = doc.title()

println(title) prints only the value; in a Scala worksheet or REPL, the type and value of title are displayed as follows:

title: String = "Wikipedia, the free encyclopedia"

For practical purposes, in this tutorial, you’ll only use selection (select) and extraction (text and attr) methods. However, jsoup has many other functions aside from performing queries and modifying HTML documents. For example, it can also be used to perform unit tests on generated HTML code.
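The three methods mentioned above can be tried on a small inline document before running them against a live page. The sketch below assumes a hypothetical HTML fragment and a made-up id (news); only select, text, and attr are jsoup calls.

```scala
import org.jsoup.Jsoup

object SelectAndExtract {

  // Hypothetical fragment, loosely modeled on a list of bold links.
  val html: String =
    """<div id="news">
      |  <ul>
      |    <li><b><a href="/wiki/First" title="First article">First</a></b> item</li>
      |    <li><b><a href="/wiki/Second" title="Second article">Second</a></b> item</li>
      |  </ul>
      |</div>""".stripMargin

  def main(args: Array[String]): Unit = {
    val doc = Jsoup.parse(html)
    val links = doc.select("#news b a") // selection with a CSS selector
    val first = links.first()
    println(first.text())       // extraction of the visible text: First
    println(first.attr("href")) // extraction of an attribute: /wiki/First
    println(links.size())       // 2
  }
}
```

Because the input is a fixed string, the same pattern can serve as a simple check on HTML that your own code generates.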

Selecting with jsoup

In this tutorial, you’ll select from three sections on the Wikipedia home page:

  • In the news
  • On this day
  • Did you know

While in your web browser on the Wikipedia page, right-click the In the news section. In the context menu, select Inspect in Firefox or View page source in Chrome. Since the relevant source code is contained in an element whose id is mp-itn, you'll use that id to obtain the contents of the section:

val inTheNews = doc.select("#mp-itn b a")

println(inTheNews) prints the matching elements; in a Scala worksheet or REPL, the resulting type and values should look similar to the following:

inTheNews: select.Elements = <a href="/wiki/AnnieErnaux" title="Annie Ernaux">Annie Ernaux</a>
<a href="/wiki/2022_Nong_Bua_Lamphu_attack" title="2022 Nong Bua Lamphu attack">An attack</a>
<a href="/wiki/Svante_P%C3%A4%C3%A4bo" title="Svante Paabo">Svante Paabo</a>
<a href="/wiki/2022_London_Marathon" title="2022 London Marathon">the London Marathon</a>
<a href="/wiki/Portal:Current_events" title="Portal:Current events">Ongoing</a>
<a href="/wiki/Deaths_in_2022" title="Deaths in 2022">Recent deaths</a>
<a href="/wiki/Wikipedia:In_the_news/Candidates" title="Wikipedia:In the news/Candidates">Nominate an article</a>
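The selected elements are a Java collection, so one way to work with them in Scala is to convert them with asScala and map each link to a plain value. The sketch below is an illustration under assumptions: the Link case class, the NewsLinks object, and the shortened inline markup are hypothetical, and the base URI is passed to parse so that absUrl can resolve the relative href values.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object NewsLinks {

  // Hypothetical stand-in for the "In the news" markup.
  val html: String =
    """<div id="mp-itn">
      |  <ul>
      |    <li><b><a href="/wiki/2022_London_Marathon" title="2022 London Marathon">the London Marathon</a></b></li>
      |    <li><b><a href="/wiki/Deaths_in_2022" title="Deaths in 2022">Recent deaths</a></b></li>
      |  </ul>
      |</div>""".stripMargin

  // Case class introduced for this example only.
  final case class Link(text: String, title: String, url: String)

  def extractLinks(html: String, baseUri: String): List[Link] = {
    // The base URI lets jsoup turn relative hrefs into absolute URLs.
    val doc = Jsoup.parse(html, baseUri)
    doc.select("#mp-itn b a").asScala.toList.map { a =>
      Link(a.text(), a.attr("title"), a.absUrl("href"))
    }
  }

  def main(args: Array[String]): Unit =
    extractLinks(html, "https://en.wikipedia.org/").foreach(println)
}
```

When the document comes from Jsoup.connect(...).get(), the base URI is already set from the request URL, so absUrl works without the extra argument.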

Follow the same steps as before to view the source code and get the contents of the On this day section, and you should find an id with the value mp-otd, which you can use to obtain this section's elements:

val onThisDay = doc.select("#mp-otd b a")

The resulting type and values should look similar to the following:

onThisDay: select.Elements = <a href="/wiki/October_10" title="October 10">October 10</a>
<a href="/wiki/Thanksgiving_(Canada)" title="Thanksgiving (Canada)">Thanksgiving</a>
<a href="/wiki/Battle_of_Karbala" title="Battle of Karbala">Battle of Karbala</a>
<a href="/wiki/Ndyuka_people" title="Ndyuka people">Ndyuka people</a>
<a href="/wiki/Triton_(moon)" title="Triton (moon)">Triton</a>
<a href="/wiki/Spiro_Agnew" title="Spiro Agnew">Spiro Agnew</a>
<a href="/wiki/Vidyasagar_Setu" title="Vidyasagar Setu">Vidyasagar Setu</a>

Once again, follow the same steps to view the source code and get the contents of the Did you know section, and you should get an id with the value mp-dyk, which you can use to obtain this section's elements:

val didYouKnow = doc.select("#mp-dyk b a")

The resulting type and values should look similar to the following:

didYouKnow: select.Elements =
 <a href="/wiki/Ranjit_Vilas_Palace_(Wankaner)" title="Ranjit Vilas Palace (Wankaner)">Ranjit Vilas Palace</a>
 <a href="/wiki/Tova_Friedman" title="Tova Friedman">Tova Friedman</a>
 <a href="/wiki/Ampullae_of_Lorenzini" title="Ampullae of Lorenzini">ampullae of Lorenzini</a>
 <a href="/wiki/Gilbert_Bundy" title="Gilbert Bundy">Gilbert Bundy</a>
 <a href="/wiki/Hours_of_Charles_the_Noble" title="Hours of Charles the Noble">Hours of Charles the Noble</a>
 <a href="/wiki/Cleo_Damianakes" title="Cleo Damianakes">Cleo Damianakes</a>
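Since the three queries differ only in the id, they could be folded into one helper. The following is a sketch under assumptions rather than the tutorial's own code: the SectionLinks object, the boldLinkTexts helper, and the miniature sampleHtml page are hypothetical, while the three ids come from the text above.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import scala.jdk.CollectionConverters._

object SectionLinks {

  // Section ids taken from the tutorial text.
  val sectionIds: List[String] = List("mp-itn", "mp-otd", "mp-dyk")

  // Hypothetical miniature page; the real home page is much larger.
  val sampleHtml: String =
    """<div id="mp-itn"><b><a href="/wiki/Deaths_in_2022">Recent deaths</a></b></div>
      |<div id="mp-otd"><b><a href="/wiki/October_10">October 10</a></b></div>
      |<div id="mp-dyk"><b><a href="/wiki/Tova_Friedman">Tova Friedman</a></b></div>
      |""".stripMargin

  // Link texts of the bold links inside the element with the given id.
  def boldLinkTexts(doc: Document, id: String): List[String] =
    doc.select(s"#$id b a").asScala.toList.map(_.text())

  def main(args: Array[String]): Unit = {
    val doc = Jsoup.parse(sampleHtml)
    sectionIds.foreach { id =>
      println(s"$id: ${boldLinkTexts(doc, id).mkString(", ")}")
    }
  }
}
```

An id that does not exist in the document yields an empty list, because select returns an empty Elements collection instead of throwing.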

To grab data within the HTML document for each section above, you use the select method, which takes a string that represents a CSS selector. You use CSS selector syntax to extract elements from the document that meet the specified search criteria.

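The selector criteria in #mp-itn b a are as follows: #mp-itn matches the element whose id is mp-itn, b matches any bold element nested inside it, and a matches any link nested inside that bold element. The same pattern applies to #mp-otd b a and #mp-dyk b a.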
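A few common selector forms can be compared by counting their matches in a small inline document. The sketch below is illustrative only: the SelectorCriteria object, the count helper, and the markup are assumptions, and the selectors shown are a small subset of what the select method accepts.

```scala
import org.jsoup.Jsoup

object SelectorCriteria {

  // Hypothetical markup with three links in different positions.
  val html: String =
    """<div id="main" class="box">
      |  <p class="intro">Intro <a href="https://example.com/a">A</a></p>
      |  <p>Body <b><a href="/b">B</a></b> <a name="anchor">no href</a></p>
      |</div>""".stripMargin

  // Number of elements in the sample document that match the selector.
  def count(selector: String): Int = Jsoup.parse(html).select(selector).size()

  def main(args: Array[String]): Unit = {
    println(count("a"))         // 3: every element with the tag name a
    println(count("#main"))     // 1: the element whose id is main
    println(count(".intro"))    // 1: elements with the class intro
    println(count("a[href]"))   // 2: links that carry an href attribute
    println(count("#main b a")) // 1: links inside a b inside #main
    println(count("p > a"))     // 2: links that are direct children of a p
  }
}
```

A space between selector parts matches descendants at any depth, while > restricts the match to direct children, which is why the link wrapped in b is excluded from the last count.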

Raul Estrada

Raul is a serial entrepreneur who loves functional programming languages like Scala, Clojure, and Elixir. He's written several books on massive data processing, and he always has a story to tell.