Commit: document pagination

jamesturk committed Jun 6, 2023
1 parent 7dfcae2 commit 26ab9ec
Showing 4 changed files with 56 additions and 19 deletions.
13 changes: 13 additions & 0 deletions docs/examples/yoyodyne.py
@@ -0,0 +1,13 @@
import json
from scrapeghost.scrapers import PaginatedSchemaScraper


schema = {"first_name": "str", "last_name": "str", "position": "str", "url": "url"}
url = "https://scrapple.fly.dev/staff"

scraper = PaginatedSchemaScraper(schema)
resp = scraper.scrape(url)

# the resulting response is a ScrapeResponse object just like any other
# all the results are gathered in resp.data
with open("yoyodyne.json", "w") as f:
    json.dump(resp.data, f, indent=2)
43 changes: 42 additions & 1 deletion docs/usage.md
@@ -138,4 +138,45 @@ If you want to validate that the returned data isn't just JSON, but data in the
--8<-- "docs/examples/pydantic_example.log"
```

This works by converting the `pydantic` model to a schema and registering a `PydanticPostprocessor` to validate the results automatically.

## Pagination

The `PaginatedSchemaScraper` class provides one technique for handling pagination.

This class takes a schema that describes a single result and wraps it in a schema that describes a list of results along with a link to the next page.

For example:

```python
{"first_name": "str", "last_name": "str"}
```

Automatically becomes:

```python
{"next_page": "url", "results": [{"first_name": "str", "last_name": "str"}]}
```
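
Conceptually the transformation just wraps the user's schema in a page-level dictionary. A minimal sketch (the helper name `wrap_schema` is illustrative, not part of the library's API):

```python
def wrap_schema(schema: dict) -> dict:
    # wrap a single-record schema in a page-level schema that also
    # asks for the URL of the next page
    return {"next_page": "url", "results": [schema]}
```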

The `PaginatedSchemaScraper` class then takes care of following the `next_page` link until there are no more pages.
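
Under the hood the control flow is a simple loop. Here is a minimal sketch (not the library's actual implementation), where `scrape_page` stands in for any single-page scrape that returns data in the wrapped shape shown above:

```python
from typing import Callable


def scrape_all_pages(scrape_page: Callable[[str], dict], url: str) -> list:
    """Follow next_page links until there are none left.

    scrape_page is any callable returning
    {"next_page": <url or None>, "results": [...]}.
    """
    results = []
    while url:
        page = scrape_page(url)          # extract one page of data
        results.extend(page["results"])  # accumulate this page's records
        url = page.get("next_page")      # None or "" ends the loop
    return results
```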

!!! note

    Right now, in keeping with the library's stance that request customization is best handled by "just using your own HTTP library," the `PaginatedSchemaScraper` class does not provide a means to customize the HTTP request used to retrieve the next page.

    If you need a more complicated approach, it is recommended that you implement your own pagination logic for now; <https://github.com/jamesturk/scrapeghost/blob/main/src/scrapeghost/scrapers.py#L238> may be a good starting point. A rough sketch of this approach follows this note.

    If you have strong opinions here, please open an issue to discuss.
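
For example, here is a rough sketch of DIY pagination that keeps full control of the HTTP request. It assumes `SchemaScraper.scrape` accepts pre-fetched HTML as well as a URL (an assumption about the current API), and the `User-Agent` header is purely illustrative:

```python
import requests

from scrapeghost import SchemaScraper

# wrapped schema: one page of results plus a pointer to the next page
schema = {"next_page": "url", "results": [{"first_name": "str", "last_name": "str"}]}
scraper = SchemaScraper(schema)

url = "https://scrapple.fly.dev/staff"
all_results = []
while url:
    # fetching the page yourself means you control headers, auth, retries, etc.
    html = requests.get(url, headers={"User-Agent": "my-scraper/1.0"}).text
    resp = scraper.scrape(html)
    all_results.extend(resp.data["results"])
    url = resp.data.get("next_page")  # falsy when there are no more pages
```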

The `PaginatedSchemaScraper` then takes the combined `results` from every page and returns them to the user.

Here's a functional example that scrapes several pages of employees:

```python
--8<-- "docs/examples/yoyodyne.py"
```

!!! warning

    One caveat of the current approach: the `url` attribute on the `ScrapeResponse` returned by a `PaginatedSchemaScraper` is a semicolon-delimited list of all the URLs that were scraped to produce that result.
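
Given the `resp` from the example above, recovering the individual page URLs is a one-liner (a sketch; the exact separator format is an assumption based on the description in the warning):

```python
# split the combined value back into the individual page URLs
page_urls = [u.strip() for u in resp.url.split(";") if u.strip()]
```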
18 changes: 0 additions & 18 deletions examples/yoyodyne.py

This file was deleted.

1 change: 1 addition & 0 deletions src/scrapeghost/__init__.py
@@ -1,6 +1,7 @@
# ruff: noqa
from .scrapers import (
SchemaScraper,
PaginatedSchemaScraper,
)
from .utils import cost_estimate
from .preprocessors import CSS, XPath
