Commit: document pagination

jamesturk committed Jun 6, 2023
1 parent 7dfcae2 commit 26ab9ec
Showing 4 changed files with 56 additions and 19 deletions.
13 changes: 13 additions & 0 deletions docs/examples/yoyodyne.py
@@ -0,0 +1,13 @@
import json
from scrapeghost.scrapers import PaginatedSchemaScraper


schema = {"first_name": "str", "last_name": "str", "position": "str", "url": "url"}
url = "https://scrapple.fly.dev/staff"

scraper = PaginatedSchemaScraper(schema)
resp = scraper.scrape(url)

# the resulting response is a ScrapeResponse object just like any other
# all the results are gathered in resp.data
with open("yoyodyne.json", "w") as f:
    json.dump(resp.data, f, indent=2)
43 changes: 42 additions & 1 deletion docs/usage.md
@@ -138,4 +138,45 @@ If you want to validate that the returned data isn't just JSON, but data in the
--8<-- "docs/examples/pydantic_example.log"
```

This works by converting the `pydantic` model to a schema and registering a `PydanticPostprocessor` to validate the results automatically.

## Pagination

The `PaginatedSchemaScraper` class provides one technique for handling pagination.

This class takes a schema that describes a single result and wraps it in a schema that describes a list of results along with a link to the next page.

For example:

```python
{"first_name": "str", "last_name": "str"}
```

Automatically becomes:

```python
{"next_page": "url", "results": [{"first_name": "str", "last_name": "str"}]}
```
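
Conceptually the transformation just wraps the user's schema in a page-level dictionary. A minimal sketch (the helper name `wrap_schema` is illustrative, not part of the library's API):

```python
def wrap_schema(schema: dict) -> dict:
    # wrap a single-record schema in a page-level schema that also
    # asks for the URL of the next page
    return {"next_page": "url", "results": [schema]}
```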

The `PaginatedSchemaScraper` class then takes care of following the `next_page` link until there are no more pages.
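
Under the hood the control flow is a simple loop. Here is a minimal sketch (not the library's actual implementation), where `scrape_page` stands in for any single-page scrape that returns data in the wrapped shape shown above:

```python
from typing import Callable


def scrape_all_pages(scrape_page: Callable[[str], dict], url: str) -> list:
    """Follow next_page links until there are none left.

    scrape_page is any callable returning
    {"next_page": <url or None>, "results": [...]}.
    """
    results = []
    while url:
        page = scrape_page(url)          # extract one page of data
        results.extend(page["results"])  # accumulate this page's records
        url = page.get("next_page")      # None or "" ends the loop
    return results
```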

!!! note

    Right now, in keeping with the library's stance that request customization is best handled by "just using your own HTTP library," the `PaginatedSchemaScraper` class does not provide a means to customize the HTTP request used to retrieve the next page.

    If you need a more complicated approach, it is recommended that you implement your own pagination logic for now; <https://github.com/jamesturk/scrapeghost/blob/main/src/scrapeghost/scrapers.py#L238> may be a good starting point. A rough sketch of this approach follows this note.

    If you have strong opinions here, please open an issue to discuss.
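
For example, here is a rough sketch of DIY pagination that keeps full control of the HTTP request. It assumes `SchemaScraper.scrape` accepts pre-fetched HTML as well as a URL (an assumption about the current API), and the `User-Agent` header is purely illustrative:

```python
import requests

from scrapeghost import SchemaScraper

# wrapped schema: one page of results plus a pointer to the next page
schema = {"next_page": "url", "results": [{"first_name": "str", "last_name": "str"}]}
scraper = SchemaScraper(schema)

url = "https://scrapple.fly.dev/staff"
all_results = []
while url:
    # fetching the page yourself means you control headers, auth, retries, etc.
    html = requests.get(url, headers={"User-Agent": "my-scraper/1.0"}).text
    resp = scraper.scrape(html)
    all_results.extend(resp.data["results"])
    url = resp.data.get("next_page")  # falsy when there are no more pages
```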

The `PaginatedSchemaScraper` then takes the combined `results` from every page and returns them to the user.

Here's a functional example that scrapes several pages of employees:

```python
--8<-- "docs/examples/yoyodyne.py"
```

!!! warning

    One caveat of the current approach: the `url` attribute on the `ScrapeResponse` returned by a `PaginatedSchemaScraper` is a semicolon-delimited list of all the URLs that were scraped to produce that result.
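
Given the `resp` from the example above, recovering the individual page URLs is a one-liner (a sketch; the exact separator format is an assumption based on the description in the warning):

```python
# split the combined value back into the individual page URLs
page_urls = [u.strip() for u in resp.url.split(";") if u.strip()]
```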
18 changes: 0 additions & 18 deletions examples/yoyodyne.py

This file was deleted.

1 change: 1 addition & 0 deletions src/scrapeghost/__init__.py
@@ -1,6 +1,7 @@
# ruff: noqa
from .scrapers import (
SchemaScraper,
PaginatedSchemaScraper,
)
from .utils import cost_estimate
from .preprocessors import CSS, XPath
