In a previous blog, we evaluated popular browser automation frameworks and the patches developed for them to bypass CreepJS, a browser fingerprinting tool that can detect headless browsers and stealth plugins. Of all the tools we tried, we found that Camoufox scored the best, being indistinguishable from a real, human-operated browser. In this blog, we’ll see what it is, how it works, and try using it for some web scraping tasks.
We’ll be using Python for all the examples in this blog. Camoufox is also fully compatible with the Playwright API, so the code will be similar to any Playwright code that you already have, with only a change in the way the browser is initialized. We’ll also be using Camoufox in headless mode for all examples unless otherwise mentioned.
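To make that compatibility concrete, here’s a minimal sketch of the same task written both ways. Only the way the browser is launched differs; everything from `new_page()` onwards is plain Playwright API. (The imports are kept inside the functions so each variant stands alone.)

```python
def title_with_playwright(url: str) -> str:
    # Vanilla Playwright: launch Firefox through the Playwright driver
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title

def title_with_camoufox(url: str) -> str:
    # Camoufox: the context manager replaces the launch call;
    # the rest of the code is unchanged Playwright API
    from camoufox.sync_api import Camoufox
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.title()
```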
What Is Camoufox?
Camoufox’s GitHub package describes it as a “stealthy, minimalistic, custom build of Firefox for web scraping.” It uses Firefox rather than the usual Chromium because most anti-bot systems detect Chromium much more easily than they do Firefox. Firefox is also simple to patch and easily lends itself to fingerprint rotation. While most of Camoufox is open-source, some components, such as fingerprint spoofing, are kept closed-source with the sole intention of preventing anti-bot service providers from reverse-engineering them.
Key Features Of Camoufox
Camoufox bundles the following core features for stealthy web scraping:
- Fingerprint Spoofing: Camoufox can spoof a comprehensive list of browser properties used to build fingerprints, including navigator properties, screen properties, WebGL and canvas capabilities, audio, video, webcams, geolocation, and the Battery API.
- Stealth Patches: Camoufox fixes various leaks that can be used to detect that the browser is automated and running in headless mode. It avoids executing JavaScript in the main world and runs it in a sandbox instead. It also fixes other minor leaks such as navigator.webdriver detection, headless Firefox detection, and so on, and it passes popular stealth tests such as CreepJS undetected.
- Anti Font Fingerprinting: Some fingerprinting tools compare the fonts present on a device with the expected default font set for that OS. Camoufox can spoof the list of available fonts based on the chosen OS.
- Optimizations: The Firefox build is also optimized for speed and performance by removing some Mozilla services and including fixes from other projects such as LibreWolf, Ghostery, and PeskyFox. It also includes minor enhancements such as removing themes and telemetry.
- Addons: It bundles addons such as uBlock Origin for blocking ads, and it also supports custom addons, pinning them to the toolbar and running them in Private Browsing mode.
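For instance, the bundled uBlock Origin can be excluded, or custom unpacked addons loaded, at launch time. A sketch based on the Camoufox Python options (the addon path is a placeholder; double-check the parameter names against your installed version):

```python
def launch_with_custom_addons():
    from camoufox import DefaultAddons
    from camoufox.sync_api import Camoufox
    # Drop the bundled uBlock Origin and load our own unpacked addon;
    # the addon path below is a placeholder
    return Camoufox(
        exclude_addons=[DefaultAddons.UBO],
        addons=['/path/to/unpacked/addon'],
    )
```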
Camoufox Python Interface
The Camoufox Python interface offers the following additional capabilities:
- GeoIP and Proxy Support: We can define proxies for the browser to use when initializing it. Camoufox then uses the geoip module to adjust the browser’s timezone, language, and so on based on the location detected from the proxy IP.
- Main World Execution: By default, JavaScript executes in an isolated environment to avoid being detected; however, this means the DOM is read-only and cannot be modified. If we need to modify the DOM, Camoufox can execute JavaScript in the main world, at the cost of potentially being detected.
- Remote Server Mode: Camoufox can be run as a remote server, which enables it to be used from other languages using the Playwright API.
- Virtual Display: Camoufox provides a ‘virtual’ headless mode that actually runs the browser in headful mode using a virtual display, without the need for a physical monitor. This makes it even harder to detect as an automated browser, while still being deployable on a cloud server.
- BrowserForge Fingerprints: Camoufox can forge fingerprints using BrowserForge to spoof as real browsers based on specified OS and screen size.
- Custom Config Data: The default Camoufox configs can be overridden with custom configs.
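As a sketch of how several of these options combine at launch: the proxy endpoint and credentials below are placeholders, geoip=True requires the geoip extra, and headless='virtual' needs a virtual display such as Xvfb on Linux.

```python
def open_via_proxy(url: str) -> str:
    from camoufox.sync_api import Camoufox
    # Placeholder proxy details -- substitute your provider's endpoint
    proxy = {
        'server': 'http://proxy.example.com:8080',
        'username': 'proxy-user',
        'password': 'proxy-pass',
    }
    with Camoufox(
        headless='virtual',  # headful browser on a virtual display
        proxy=proxy,
        geoip=True,          # match timezone/locale to the proxy's exit IP
    ) as browser:
        page = browser.new_page()
        page.goto(url)
        return page.title()
```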
Installing Camoufox
To install Camoufox, we’ll first install the camoufox Python package using pip:
$ pip install -U camoufox
If we plan to use proxies, it is recommended that we install it with the geoip extra:
$ pip install -U camoufox[geoip]
We can then download the browser using the fetch command:
$ camoufox fetch
This command will download the custom Firefox build, and once it is complete, we’re ready to go!
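Once the fetch completes, a quick sanity check doesn’t hurt. This sketch (a function, so nothing launches on import) opens a page headlessly and confirms a title comes back; https://example.com is just a convenient test URL:

```python
def smoke_test() -> bool:
    """Launch Camoufox headlessly and confirm a page loads."""
    from camoufox.sync_api import Camoufox
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto('https://example.com')
        ok = 'Example' in page.title()
        page.close()
    return ok
```

Call `smoke_test()` after the download finishes; if it returns True, the build is working.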
Crunchbase Scraping With Camoufox
To get our feet wet, let’s write a basic scraper that gets data from ScrapingBee’s Crunchbase profile. Let’s start by initializing Camoufox and visiting the web page:
from camoufox.sync_api import Camoufox

# Camoufox also has an Async API
with Camoufox(headless=False) as browser:
    page = browser.new_page()
    page.goto("https://www.crunchbase.com/organization/scrapingbee")
    page.wait_for_timeout(20000)
    page.close()
The code above is pretty self-explanatory: we initialized the browser, opened a new page, and visited ScrapingBee’s Crunchbase page. We also included a 20-second timeout and ran the browser in headful mode to see what was happening:
Turns out, a Cloudflare Turnstile box has appeared. This could have been avoided with a good proxy, but now that it has shown up, let’s try to get past it. The thing about the Cloudflare Turnstile is that once we click the button and verify on a normal browser, it stores a verification cookie and stops bugging us, at least for a while. In Camoufox, we can use a persistent context to store cookies between sessions.
To reduce the chance of triggering the Turnstile again, we’ve also added user-agent settings in the config to disguise our browser, and an i_know_what_im_doing flag to silence warnings.
So, let’s get through the turnstile in headful mode and persist this for future runs:
from camoufox.sync_api import Camoufox

config = {
    'window.outerHeight': 1056,
    'window.outerWidth': 1920,
    'window.innerHeight': 1008,
    'window.innerWidth': 1920,
    'window.history.length': 4,
    'navigator.userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
    'navigator.appCodeName': 'Mozilla',
    'navigator.appName': 'Netscape',
    'navigator.appVersion': '5.0 (Windows)',
    'navigator.oscpu': 'Windows NT 10.0; Win64; x64',
    'navigator.language': 'en-US',
    'navigator.languages': ['en-US'],
    'navigator.platform': 'Win32',
    'navigator.hardwareConcurrency': 12,
    'navigator.product': 'Gecko',
    'navigator.productSub': '20030107',
    'navigator.maxTouchPoints': 10,
}

with Camoufox(
    headless=False,
    persistent_context=True,
    user_data_dir='user-data-dir',
    os='windows',
    config=config,
    i_know_what_im_doing=True
) as browser:
    # Open the page
    page = browser.new_page()
    page.goto("https://www.crunchbase.com/organization/scrapingbee")
    # Allow time for the Turnstile to load and manually solve it
    page.wait_for_timeout(30000)
    # Close the page
    page.close()
Running the above code, we opened the page in headful mode and manually solved the Cloudflare Turnstile; the cookies should now be persisted in the user-data-dir directory that we specified. We can check this by inspecting the cookies.sqlite file in that directory:
$ sqlite3 user-data-dir/cookies.sqlite 'SELECT * from moz_cookies;'
2||cf_clearance|N....Q|.crunchbase.com|/|1772087016|1740551017518401|1740551017518401|1|1|0|0|0|2|1
3||cb_analytics_consent|granted|www.crunchbase.com|/|1775111017|1740551017519007|1740551017519007|0|0|0|1|0|2|0
4||cid|C...=|.crunchbase.com|/|1775111017|1740551017519338|1740551017519338|0|0|0|1|0|2|0
5||__cf_bm|Hc....ZQ|.crunchbase.com|/|1740552817|1740551017519780|1740550992648041|1|1|0|0|0|2|0
6||__cflb|0...i|www.crunchbase.com|/|1740633817|1740551017520281|1740551017520281|0|1|0|1|1|2|0
...
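As an aside, the same check can be done from Python using the standard-library sqlite3 module; this small helper only reads the host and name columns, which are part of Firefox’s standard cookies.sqlite schema:

```python
import sqlite3

def list_cookies(db_path: str) -> list[tuple[str, str]]:
    """Return (host, name) pairs from a Firefox cookies.sqlite file."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            'SELECT host, name FROM moz_cookies ORDER BY host, name'
        ).fetchall()
    finally:
        con.close()
    return rows

# Example: scan for the Cloudflare clearance cookie
# for host, name in list_cookies('user-data-dir/cookies.sqlite'):
#     print(host, name)
```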
We can see that it has some Cloudflare (cf) cookies for .crunchbase.com. Now we’re ready to return to the original task we set out to do: scraping our Crunchbase profile. Let’s see the code for this:
import json
from camoufox.sync_api import Camoufox

config = {
    'window.outerHeight': 1056,
    'window.outerWidth': 1920,
    'window.innerHeight': 1008,
    'window.innerWidth': 1920,
    'window.history.length': 4,
    'navigator.userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
    'navigator.appCodeName': 'Mozilla',
    'navigator.appName': 'Netscape',
    'navigator.appVersion': '5.0 (Windows)',
    'navigator.oscpu': 'Windows NT 10.0; Win64; x64',
    'navigator.language': 'en-US',
    'navigator.languages': ['en-US'],
    'navigator.platform': 'Win32',
    'navigator.hardwareConcurrency': 12,
    'navigator.product': 'Gecko',
    'navigator.productSub': '20030107',
    'navigator.maxTouchPoints': 10,
}

with Camoufox(
    headless=False,
    persistent_context=True,
    user_data_dir='user-data-dir',
    os='windows',
    config=config,
    i_know_what_im_doing=True
) as browser:
    # Open the page
    page = browser.new_page()
    page.goto("https://www.crunchbase.com/organization/scrapingbee")
    # Wait for network idle
    page.wait_for_load_state('networkidle')
    # Wait more, just in case
    page.wait_for_timeout(10000)
    # Get the required data
    data = {
        'Name': page.locator('span.entity-name.ng-star-inserted').inner_text(),
        'Description': page.locator('span.expanded-only-content.ng-star-inserted').inner_text()
    }
    score_els = page.locator('.top-row-left-groups score-and-trend').all()
    for el in score_els:
        key_name = el.locator('span.label').inner_text()
        value = int(el.locator('div.chip-text').inner_text())
        data[key_name] = value
    data['Overview'] = []
    overview_els = page.locator('.overview-row label-with-icon').all()
    for el in overview_els:
        value = el.locator('.component--field-formatter').inner_text()
        data['Overview'].append(value)
    # Print the scraped data and close the page
    print(json.dumps(data, indent=2))
    page.close()
The above code prints an output with the company details extracted from the page:
{
"Name": "ScrapingBee",
"Description": "ScrapingBee is a software company that offers a web scraping API that handles headless browsers.",
"Growth Score": 89,
"CB Rank": 203243,
"Heat Score": 82,
"Overview": [
"Jan 1, 2019",
"Private",
"Pre-Seed",
"Paris, Ile-de-France, France",
"1-10",
"scrapingbee.com"
]
}
🔥 If you’re still having trouble scraping without getting blocked, check out our expert-level guide to Web Scraping Without Getting Blocked.
Handling Login Sessions With Camoufox
As we saw in the previous section, Camoufox can persist sessions in a directory that we define. We can use this to log in to websites and store our session for future use, or even move it between different systems. With Crunchbase as an example, let’s see how we can log in to the service by visiting the login page:
from camoufox.sync_api import Camoufox

config = {
    'window.outerHeight': 1056,
    'window.outerWidth': 1920,
    'window.innerHeight': 1008,
    'window.innerWidth': 1920,
    'window.history.length': 4,
    'navigator.userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
    'navigator.appCodeName': 'Mozilla',
    'navigator.appName': 'Netscape',
    'navigator.appVersion': '5.0 (Windows)',
    'navigator.oscpu': 'Windows NT 10.0; Win64; x64',
    'navigator.language': 'en-US',
    'navigator.languages': ['en-US'],
    'navigator.platform': 'Win32',
    'navigator.hardwareConcurrency': 12,
    'navigator.product': 'Gecko',
    'navigator.productSub': '20030107',
    'navigator.maxTouchPoints': 10,
}

with Camoufox(
    headless=False,
    persistent_context=True,
    user_data_dir='user-data-dir',
    os='windows',
    config=config,
    i_know_what_im_doing=True
) as browser:
    # Open the page
    page = browser.new_page()
    page.goto("https://www.crunchbase.com/login")
    # Wait for a while
    page.wait_for_load_state('networkidle')
    page.wait_for_timeout(10000)
    # Fill the fields
    page.locator('form input[type=email]').fill('')     # your email
    page.locator('form input[type=password]').fill('')  # your password
    # Wait and click the login button
    page.wait_for_timeout(10000)
    page.locator('form button.login').click()
    # Wait and close the page
    page.wait_for_timeout(15000)
    page.close()
The above code opens the Crunchbase login page, fills in the email and password provided, and clicks the login button. Once the login succeeds, the session is persisted. We can verify this by visiting the home page and taking a screenshot, in a separate script:
from camoufox.sync_api import Camoufox

config = {
    'window.outerHeight': 1056,
    'window.outerWidth': 1920,
    'window.innerHeight': 1008,
    'window.innerWidth': 1920,
    'window.history.length': 4,
    'navigator.userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
    'navigator.appCodeName': 'Mozilla',
    'navigator.appName': 'Netscape',
    'navigator.appVersion': '5.0 (Windows)',
    'navigator.oscpu': 'Windows NT 10.0; Win64; x64',
    'navigator.language': 'en-US',
    'navigator.languages': ['en-US'],
    'navigator.platform': 'Win32',
    'navigator.hardwareConcurrency': 12,
    'navigator.product': 'Gecko',
    'navigator.productSub': '20030107',
    'navigator.maxTouchPoints': 10,
}

with Camoufox(
    headless=True,
    persistent_context=True,
    user_data_dir='user-data-dir',
    os='windows',
    config=config,
    i_know_what_im_doing=True
) as browser:
    # Open the page
    page = browser.new_page()
    page.goto("https://www.crunchbase.com/home")
    page.wait_for_load_state('networkidle')
    page.wait_for_timeout(10000)
    page.screenshot(path='crunchbase-home.png')
    page.close()
As you may have noticed, in this snippet we’ve set headless=True, while the previous one ran in headful mode. Headful mode was just to watch the login happen; once we’re logged in, headless mode works fine. Let’s see what we have in the screenshot:
We can see that we’re already logged in and Crunchbase shows us a greeting welcoming us back!
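And since the whole session lives in the user-data-dir directory, moving it to another system is just a matter of copying that directory. A minimal sketch with the standard library:

```python
import shutil

def export_profile(profile_dir: str, archive_name: str) -> str:
    """Zip up a Camoufox user data dir; returns the archive path."""
    return shutil.make_archive(archive_name, 'zip', profile_dir)

def import_profile(archive_path: str, profile_dir: str) -> None:
    """Unpack a previously exported profile on another machine."""
    shutil.unpack_archive(archive_path, profile_dir, 'zip')

# e.g. export_profile('user-data-dir', 'crunchbase-session')
#      import_profile('crunchbase-session.zip', 'user-data-dir')
```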
Scraping Google Maps Listings Using Camoufox
Google Maps is one of the most JavaScript-heavy interactive websites containing valuable information, which makes it a worthy target for demonstrating data extraction with Camoufox. Right away, let’s see if we can get a list of restaurants in New York, starting from a manual search URL.
The above is a screenshot of a Google Maps search for “restaurants” in New York. What we’ll be scraping for this exercise are the listings on the left side. Let’s see the code:
from camoufox.sync_api import Camoufox
import json
import re

INITIAL_URL = 'https://www.google.com/maps/search/restaurants/@40.7500474,-74.0132272,12z/data=!4m2!2m1!6e5'

config = {
    'window.outerHeight': 1056,
    'window.outerWidth': 1920,
    'window.innerHeight': 1008,
    'window.innerWidth': 1920,
    'window.history.length': 4,
    'navigator.userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
    'navigator.appCodeName': 'Mozilla',
    'navigator.appName': 'Netscape',
    'navigator.appVersion': '5.0 (Windows)',
    'navigator.oscpu': 'Windows NT 10.0; Win64; x64',
    'navigator.language': 'en-US',
    'navigator.languages': ['en-US'],
    'navigator.platform': 'Win32',
    'navigator.hardwareConcurrency': 12,
    'navigator.product': 'Gecko',
    'navigator.productSub': '20030107',
    'navigator.maxTouchPoints': 10,
}

with Camoufox(
    headless=True,
    os='windows',
    config=config,
    i_know_what_im_doing=True
) as browser:
    # Open the URL
    page = browser.new_page()
    page.goto(INITIAL_URL)
    # Wait for the list to load
    page.wait_for_selector('div[role=feed]')
    # Get the list of restaurants
    divs = list(page.locator('div[role=feed] div.Nv2PK').all())
    # Extract the required information into another list
    results = []
    for div in divs:
        results.append({
            'name': div.locator('a.hfpxzc').get_attribute('aria-label'),
            'link': div.locator('a.hfpxzc').get_attribute('href'),
            'rating': float(div.locator('span.MW4etd').inner_text()),
            'reviews': int(re.sub(
                r'[(),]', '',
                div.locator('span.UY7F9').inner_text(),
            )),
        })
    # Save the results
    with open('results.json', 'w') as f:
        f.write(json.dumps(results, indent=2))
    # Close the page
    page.close()
The above code visits the INITIAL_URL that we defined, extracts the list of div elements containing restaurant listings, parses some fields from each, and saves them to a JSON file. Let’s see what we have in the file:
[
{
"name": "Mojo",
"link": "https://www.google.com/maps/place/Mojo/data=!4m7!3m6!1s0x89c25f58368ed953:0x5184e1a5b510a6fe!8m2!3d40.7205717!4d-73.8464128!16s%2Fg%2F11fkptd63p!19sChIJU9mONlhfwokR_qYQtaXhhFE?authuser=0&hl=en&rclk=1",
"rating": 4.6,
"reviews": 3556
},
{
"name": "Upland",
"link": "https://www.google.com/maps/place/Upland/data=!4m7!3m6!1s0x89c259a715fb5059:0xe5543b76e952fab3!8m2!3d40.7419313!4d-73.984644!16s%2Fg%2F11btmqg_59!19sChIJWVD7FadZwokRs_pS6XY7VOU?authuser=0&hl=en&rclk=1",
"rating": 4.5,
"reviews": 2217
},
{
"name": "Salinas Restaurant",
"link": "https://www.google.com/maps/place/Salinas+Restaurant/data=!4m7!3m6!1s0x89c259b92fa3c71d:0xb8dbc9965cf2d536!8m2!3d40.7436822!4d-74.0030697!16s%2Fg%2F1tdp1css!19sChIJHcejL7lZwokRNtXyXJbJ27g?authuser=0&hl=en&rclk=1",
"rating": 4.7,
"reviews": 1647
},
{
"name": "The Avenue Restaurant & Bar",
"link": "https://www.google.com/maps/place/The+Avenue+Restaurant+%26+Bar/data=!4m7!3m6!1s0x89c25e78690140df:0x807458e6fb09894d!8m2!3d40.7019674!4d-73.8792549!16s%2Fg%2F1x5fc8ng!19sChIJ30ABaXhewokRTYkJ--ZYdIA?authuser=0&hl=en&rclk=1",
"rating": 4.4,
"reviews": 367
},
{
"name": "Buona Notte",
"link": "https://www.google.com/maps/place/Buona+Notte/data=!4m7!3m6!1s0x89c25989d432ef1f:0x90b13821b661c011!8m2!3d40.7177496!4d-73.9979524!16s%2Fg%2F1vhlz2zr!19sChIJH-8y1IlZwokREcBhtiE4sZA?authuser=0&hl=en&rclk=1",
"rating": 4.5,
"reviews": 1609
}
]
In the file, we have names, links, ratings, and review counts for the 5 restaurants the URL initially loads. You are welcome to extend the scraper to extract more fields such as images, timings, and descriptions. Next, let’s extend this scraper to get more results.
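As an aside, the review counts above start life as text like "(3,556)"; the re.sub in the scraper strips the parentheses and commas before the int() conversion. That step in isolation:

```python
import re

def parse_review_count(text: str) -> int:
    """Turn Google Maps review text like '(3,556)' into an int."""
    return int(re.sub(r'[(),]', '', text))

print(parse_review_count('(3,556)'))  # 3556
```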
With the previous method, one obvious way to get more results is to scroll down the list; on Google Maps, this keeps loading more results. We’ve covered handling infinite scroll using Playwright for Python in a previous blog, and whatever works with Playwright should technically work with Camoufox too. For this tutorial, we’ll do something even more fun: it’s Google Maps, so instead of scrolling down the list, we’ll scroll around the New York City map itself. Doesn’t that sound fun?
Essentially, the code will be similar to the previous section, except we’ll click a checkbox on the Google Maps UI called “Update results when map moves”, and then simulate keyboard arrow presses to move the map around. When the map moves, the results will be updated and we’ll extract the results on each iteration. Let’s see the code:
from camoufox.sync_api import Camoufox
import json
import re

INITIAL_URL = 'https://www.google.com/maps/search/restaurants/@40.7500474,-74.0132272,12z/data=!4m2!2m1!6e5'

config = {
    'window.outerHeight': 1056,
    'window.outerWidth': 1920,
    'window.innerHeight': 1008,
    'window.innerWidth': 1920,
    'window.history.length': 4,
    'navigator.userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0',
    'navigator.appCodeName': 'Mozilla',
    'navigator.appName': 'Netscape',
    'navigator.appVersion': '5.0 (Windows)',
    'navigator.oscpu': 'Windows NT 10.0; Win64; x64',
    'navigator.language': 'en-US',
    'navigator.languages': ['en-US'],
    'navigator.platform': 'Win32',
    'navigator.hardwareConcurrency': 12,
    'navigator.product': 'Gecko',
    'navigator.productSub': '20030107',
    'navigator.maxTouchPoints': 10,
}

with Camoufox(
    headless=True,
    os='windows',
    config=config,
    i_know_what_im_doing=True
) as browser:
    # Open the URL
    page = browser.new_page()
    page.goto(INITIAL_URL)
    # Wait for the list to load
    page.wait_for_selector('div[role=feed]')
    # Click "Update results when map moves"
    page.locator('button.D6NGZc[role=checkbox]').click()
    page.wait_for_timeout(5000)
    # List to store extracted info
    results = []
    # Get the initial list of restaurants
    divs = list(page.locator('div[role=feed] div.Nv2PK').all())
    # Start the scroll loop
    while True:
        # Process the current list
        for div in divs:
            results.append({
                'name': div.locator('a.hfpxzc').get_attribute('aria-label'),
                'link': div.locator('a.hfpxzc').get_attribute('href'),
                'rating': float(div.locator('span.MW4etd').inner_text()),
                'reviews': int(re.sub(
                    r'[(),]', '',
                    div.locator('span.UY7F9').inner_text(),
                )),
            })
        # Break if we have enough results
        if len(results) >= 15:
            break
        # Else scroll and get more divs
        page.locator('div.widget-scene').focus()
        # Press the arrow twice to move far enough
        page.keyboard.press('ArrowUp')
        page.keyboard.press('ArrowUp')
        page.wait_for_timeout(10000)
        divs = list(page.locator('div[role=feed] div.Nv2PK').all())
    # Save the results
    with open('results.json', 'w') as f:
        f.write(json.dumps(results, indent=2))
    # Close the page
    page.close()
The above code gives us 15 results instead of 5, getting the additional results by scrolling up on the city map. The JSON file looks similar:
[
{
"name": "Mojo",
"link": "https://www.google.com/maps/place/Mojo/data=!4m7!3m6!1s0x89c25f58368ed953:0x5184e1a5b510a6fe!8m2!3d40.7205717!4d-73.8464128!16s%2Fg%2F11fkptd63p!19sChIJU9mONlhfwokR_qYQtaXhhFE?authuser=0&hl=en&rclk=1",
"rating": 4.6,
"reviews": 3556
},
{
"name": "Connolly's",
"link": "https://www.google.com/maps/place/Connolly%27s/data=!4m7!3m6!1s0x89c2585579862e01:0xe21f27dec63cf83d!8m2!3d40.7573681!4d-73.9835798!16s%2Fg%2F1vxzbjrt!19sChIJAS6GeVVYwokRPfg8xt4nH-I?authuser=0&hl=en&rclk=1",
"rating": 4.4,
"reviews": 4516
},
{
"name": "V{IV}",
"link": "https://www.google.com/maps/place/V%7BIV%7D/data=!4m7!3m6!1s0x89c2585130118027:0x237f3e220f422247!8m2!3d40.7627843!4d-73.9897032!16s%2Fg%2F1hc9wxh16!19sChIJJ4ARMFFYwokRRyJCDyI-fyM?authuser=0&hl=en&rclk=1",
"rating": 4.7,
"reviews": 4275
},
...11 results hidden here (total 15 results)...
{
"name": "The Inn by Fumo",
"link": "https://www.google.com/maps/place/The+Inn+by+Fumo/data=!4m7!3m6!1s0x89c2f7d883798bab:0x60a3d35f4da362e3!8m2!3d40.8252495!4d-73.9509651!16s%2Fg%2F11qpzydmt9!19sChIJq4t5g9j3wokR42KjTV_To2A?authuser=0&hl=en&rclk=1",
"rating": 4.4,
"reviews": 353
}
]
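One wrinkle with the map-scrolling approach: as the viewport moves, listings can reappear, so the same restaurant may be appended more than once. A small helper to de-duplicate the final list, assuming the link uniquely identifies a place:

```python
def dedupe_by_link(results: list[dict]) -> list[dict]:
    """Drop repeated listings, keeping the first occurrence of each link."""
    seen = set()
    unique = []
    for item in results:
        if item['link'] not in seen:
            seen.add(item['link'])
            unique.append(item)
    return unique
```

Run the results through this before writing the JSON file to guarantee each place appears once.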
If you’re curious, here’s a screen recording of what’s going on in the browser (obtained with headless=False, of course):
Conclusion
In this blog, we looked at what Camoufox is and how we can use it for some web scraping tasks. Camoufox provides several stealth patches and other scraping features out of the box, and we were able to use it on Crunchbase and Google Maps straight away.
Further, all the code we wrote is similar to Playwright code. Camoufox Python is meant to be fully compatible with Playwright, so any working Playwright code that you already have can be ported to Camoufox with minimal changes.
At the time of writing, Camoufox only provides Python bindings, but it can be used from other programming languages that support the Playwright API via its remote server mode.
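For instance, the Python package can start such a server, and any Playwright client can then connect to the WebSocket endpoint it prints. A sketch (launch_server and the connect call follow the Camoufox and Playwright docs respectively; verify both against your installed versions):

```python
def start_server() -> None:
    # Starts a Camoufox server and prints a WebSocket endpoint to connect to
    from camoufox.server import launch_server
    launch_server(headless=True)

def connect_from_playwright(ws_endpoint: str) -> str:
    # Any Playwright-speaking client can drive the remote browser
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.firefox.connect(ws_endpoint)
        page = browser.new_page()
        page.goto('https://example.com')
        return page.title()
```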