scrapy-deltafetch

This is a Scrapy spider middleware to ignore requests to pages seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new requests.

This also speeds up the crawl by reducing the number of requests that need to be fetched and processed (typically, item requests are the most CPU-intensive).

The DeltaFetch middleware uses Python's dbm package to store request fingerprints.
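
To illustrate the general idea of a dbm-backed "seen" store, here is a minimal standalone sketch. It is not the middleware's actual code: the fingerprint() and filter_unseen() helpers are hypothetical, the SHA-1 of the URL is only a stand-in for Scrapy's request fingerprint, and storing a timestamp as the value is an assumption made for the example.

```python
import dbm
import hashlib
import os
import tempfile
import time


def fingerprint(url: str) -> bytes:
    """Hypothetical stand-in for a request fingerprint: SHA-1 hex digest of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest().encode("ascii")


def filter_unseen(db_path: str, urls: list[str]) -> list[str]:
    """Return the URLs not yet recorded in the dbm file, recording them as seen."""
    unseen = []
    # Flag "c" opens the database for reading and writing, creating it if missing.
    with dbm.open(db_path, "c") as db:
        for url in urls:
            key = fingerprint(url)
            if key in db:
                continue  # recorded by an earlier run (or earlier in this list): skip
            db[key] = str(time.time()).encode("ascii")  # assumed value: time first seen
            unseen.append(url)
    return unseen


if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "seen")
    print(filter_unseen(state_file, ["https://example.com/a", "https://example.com/b"]))
    # A second "crawl" only returns URLs that were not recorded by the first one.
    print(filter_unseen(state_file, ["https://example.com/a", "https://example.com/c"]))
```

Because the state lives in a file on disk, it survives between runs, which is what makes a delta crawl possible.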

Installation

Install scrapy-deltafetch using pip:

$ pip install scrapy-deltafetch

Configuration

Add DeltaFetch middleware by including it in SPIDER_MIDDLEWARES in your settings.py file:
```
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
```
Here, priority 100 is just an example. Set its value depending on other middlewares you may have enabled already.
Enable the middleware using DELTAFETCH_ENABLED in your settings.py:
```
DELTAFETCH_ENABLED = True
```

Usage

The following options control the DeltaFetch middleware's behavior.

Supported Scrapy settings

DELTAFETCH_ENABLED — to enable (or disable) this extension
DELTAFETCH_DIR — directory where to store state
DELTAFETCH_RESET — reset the state, clearing out all seen requests

These usually go in your Scrapy project's settings.py.
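
As a sketch, a settings.py fragment combining the three settings might look like the following. The directory name is a hypothetical value chosen for illustration, and how a relative path is resolved is not stated here, so check the behavior in your own project.

```python
# settings.py (fragment)

# Turn the middleware on; it must also be listed in SPIDER_MIDDLEWARES.
DELTAFETCH_ENABLED = True

# Hypothetical directory name for the stored state.
DELTAFETCH_DIR = "deltafetch_state"

# Leave this off for normal delta crawls; set it to True to clear
# all previously seen requests.
DELTAFETCH_RESET = False
```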

Supported Scrapy spider arguments

deltafetch_reset — same effect as DELTAFETCH_RESET setting

Example:

$ scrapy crawl example -a deltafetch_reset=1

Supported Scrapy request meta keys

deltafetch_key - defines the lookup key for that request. By default the key is the request fingerprint computed by Scrapy's default fingerprint function, but it can be changed to contain an item ID, for example. This requires support from the spider, but makes the middleware more efficient for sites that have many URLs for the same item.
deltafetch_enabled - if set to False, disables DeltaFetch for that specific request
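
The meta keys above could be populated with a small helper like the one sketched below. The deltafetch_meta() function is hypothetical, and it assumes a site where the item ID appears in an `id` query parameter, so that several URLs for the same item map to the same key.

```python
from urllib.parse import parse_qs, urlparse


def deltafetch_meta(url: str, enabled: bool = True) -> dict:
    """Build a request meta dict carrying the DeltaFetch keys for a URL.

    Assumption for this example: the item ID is the `id` query parameter.
    """
    if not enabled:
        # Opt this one request out of DeltaFetch.
        return {"deltafetch_enabled": False}
    ids = parse_qs(urlparse(url).query).get("id")
    if ids:
        # Use the item ID as the lookup key instead of the request fingerprint.
        return {"deltafetch_key": ids[0]}
    # No ID found: leave meta empty so the default key is used.
    return {}


if __name__ == "__main__":
    print(deltafetch_meta("https://example.com/item?id=42&ref=home"))
    print(deltafetch_meta("https://example.com/item?ref=ad&id=42"))
    print(deltafetch_meta("https://example.com/listing", enabled=False))
```

In a spider callback, the result would be passed along when creating the request, for example `scrapy.Request(url, meta=deltafetch_meta(url), callback=self.parse_item)`, where `parse_item` is a callback name assumed for illustration.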
