2dehands.be: Authenticating and Crawling with Scrapy and Playwright


πŸ“– Introduction

Welcome to my GitHub repository, where we blend the strengths of Scrapy with the speed and simplicity of Playwright for advanced web scraping. My journey began with an interest in combining Selenium and Scrapy to tackle JavaScript-heavy websites. As a test case, the program logs in to my 2dehands.be account and scrapes my saved searches (you can have a peek in the screenshot below). Before starting the project, I wandered around GitHub a bit and discovered Playwright, which has skyrocketed in popularity over the last four years. In the "Why Playwright? Is Selenium left behind?" section I go into more detail about this.

πŸ“¦ Repo structure

.
├── data
│   └── data.jsonl           # the result of the spider
├── log
│   └── screenshots
│       └── ...              # screenshots of the crawled pages
├── README.md
├── requirements.txt
├── scrapy.cfg
└── tweedehands
    ├── __init__.py
    ├── items.py             # the data model
    ├── middlewares.py
    ├── pipelines.py         # the pipeline that saves the data
    ├── settings.py          # settings for the spider
    └── spiders
        ├── __init__.py
        └── tweedehands.py   # the spider

πŸš€ To start crawling the pages

Install the requirements

pip install -r requirements.txt
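
Playwright also needs its browser binaries; if you have never used Playwright on this machine, you will likely need to fetch them once (Chromium shown here, on the assumption that the spider launches Chromium):

playwright install chromium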

Add some saved searches

Go to 2dehands.be, log in, and save some searches. You can find them in the "Mijn Zoekopdrachten" (my searches) section.

Configure the environment variables

For Linux

Add the environment variables to /etc/environment and restart your computer

# /etc/environment
TWEEDHANDS_USERNAME=**your_tweedehands_username**
TWEEDHANDS_PASSWORD=**your_tweedehands_password**

For Windows

This project has not been tested on Windows, but you can try the following: add the environment variables to the system environment variables and restart your computer. Search for "Edit environment variables" in the Start Menu, open it, then add or modify the variables.
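
However you set them, the spider is expected to read the credentials from the environment. A minimal sketch of how that typically looks, reusing the variable names from the snippet above (the actual code lives in tweedehands/spiders/tweedehands.py):

import os

# Sketch: pull the credentials the spider logs in with (names assumed from above).
username = os.environ["TWEEDHANDS_USERNAME"]
password = os.environ["TWEEDHANDS_PASSWORD"]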

Launch the spider

scrapy crawl tweedehands
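
For reference, scrapy-playwright is normally wired into a Scrapy project through settings.py; the settings in this repo should already contain the equivalent of the standard setup sketched below (taken from scrapy-playwright's usual configuration, not copied from this repo):

# settings.py - standard scrapy-playwright wiring (sketch)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Individual requests then opt in to the Playwright handler with meta={"playwright": True}.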

Result

You can find the result in data/data.jsonl. For a preview of the output format, have a look at example_data.jsonl.
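
Since the feed is plain JSON Lines, loading it back is straightforward; for example:

import json

# Read every scraped item from the JSON Lines feed.
with open("data/data.jsonl") as f:
    items = [json.loads(line) for line in f]
print(len(items), "items")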

Optional

If you are also a Geuze fan and looking for Geuze, you can activate the GueuzeOnlyFilter in the FEEDS section of settings.py, or go ahead and create your own filter (see the sketch below).
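
For context, Scrapy's feed exports accept a custom item filter through the item_filter key of a FEEDS entry. A minimal sketch of what a Geuze-only filter could look like, assuming the item exposes a title field (the field name is an assumption, not taken from this repo's data model):

from scrapy.extensions.feedexport import ItemFilter


class GueuzeOnlyFilter(ItemFilter):
    # Keep only items whose (assumed) title mentions geuze/gueuze.
    def accepts(self, item):
        title = (item.get("title") or "").lower()
        return "geuze" in title or "gueuze" in title


# In settings.py, referenced from the FEEDS section:
FEEDS = {
    "data/data.jsonl": {
        "format": "jsonlines",
        "item_filter": GueuzeOnlyFilter,
    },
}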

Screenshot


Why Playwright? Is Selenium left behind?

GitHub stars don't say everything, but they do give an indication of a project's popularity. Below you can see the star history of Scrapy, Playwright, and one of their siblings, Selenium. In my limited experience, Playwright is very easy to install and use: there are no more webdrivers to manage or install, it supports async, which is faster, and the code is more readable and less tedious because elements are waited for automatically. Below are two sample scripts that log in to Reddit, one in Selenium and one in Playwright.

Selenium code example

# selenium example - illustrative dummy code
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Selenium needs a webdriver binary; point it to yours (Selenium 4 API).
driver = webdriver.Chrome(service=Service('/path/to/your/webdriver'))
driver.get('https://www.reddit.com/login/')

# Every interaction needs an explicit wait before the element can be used.
username_field = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "loginUsername"))
)
username_field.send_keys('your_username')
password_field = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "loginPassword"))
)
password_field.send_keys('your_password')
login_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Log In")]'))
)
login_button.click()
driver.quit()

Playwright code example

# playwright example - illustrative dummy code
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        # No webdriver to manage: Playwright ships its own browser builds.
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # No explicit waits needed: Playwright auto-waits for each element.
        await page.goto('https://www.reddit.com/login/')
        await page.fill('input#loginUsername', 'your_username')
        await page.fill('input#loginPassword', 'your_password')
        await page.click('button[type="submit"]')
        await browser.close()

asyncio.run(main())

Github star history

[Star history chart: Scrapy, Playwright, and Selenium]

Conclusion

When it comes to web crawling, Playwright truly "plays right" into developers' hands with its simplicity and power.

Is Selenium left behind, then? In a lot of ways, yes. But Selenium still has its strengths, especially for software testing, where it has broader support and a solid community and foundation. Also, a lot of projects are already built on Selenium, and migrating a project that is head-deep in Selenium will probably not be worth the benefits of Playwright. If you are into web scraping, though, I would highly recommend playing around with Playwright.

⏱️ Timeline

In the course of developing this project, I dedicated two full days not just to building, but also to researching and exploring libraries.

πŸ“Œ Personal Situation

My enthusiasm for starting a project with Scrapy and Selenium was so intense that I skipped the crucial preparation step: do your research first! I dived into this project head first, and because of that I pivoted halfway through from Selenium to Playwright. Note to self: always do your research first, no matter how excited you are about a project!

🚫 Disclaimer

The actions performed by the program do not abide by the robots.txt of 2dehands.be; this project is meant for educational purposes only. Run at your own risk.

⚑ Possible issues

The first time I logged in to 2dehands.be I had to pass two-factor authentication. Once the site knows my IP, 2FA is no longer needed and Playwright can log in in a single step. If you think you are experiencing issues, check the log/screenshots folder: it contains screenshots taken before and after the login.
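
For reference, this is roughly how such screenshots can be captured with scrapy-playwright's PageMethod; the spider name, URL, and path below are illustrative, not the repo's actual code:

import scrapy
from scrapy_playwright.page import PageMethod


class ScreenshotSketchSpider(scrapy.Spider):
    # Hypothetical spider, only to show the screenshot technique.
    name = "screenshot_sketch"

    def start_requests(self):
        # Take a full-page screenshot once the page has finished loading.
        yield scrapy.Request(
            "https://www.2dehands.be/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("screenshot",
                               path="log/screenshots/example.png",
                               full_page=True),
                ],
            },
        )

    def parse(self, response):
        # Nothing to extract in this sketch; the screenshot is the point.
        pass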

🀝 Connect with me!

LinkedIn Stack Overflow Ask Ubuntu

πŸ”— Links

The Python Scrapy Playbook | ScrapeOps: I found lots of great info here.
