This module is designed to run as a prerender client that caches to S3. It renders webpages either locally or in Docker, then posts the rendered static HTML pages to S3. The idea is to give bots a place to scan static HTML pages.
- Create an S3 Bucket.
- Have a domain with a robots.txt file (e.g. https://example.com/robots.txt); see the example below.
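The robots.txt is what the client reads for sitemap info (see the `robots_url` option below). A minimal sketch, assuming your sitemap lives at https://example.com/sitemap.xml:

```
# Example robots.txt (all URLs are placeholders)
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```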
If developing, install the dependencies from requirements.txt:
pip install -r requirements.txt
docker build -t prerender .
docker run -e AWS_ACCESS_KEY_ID=AWSKEY -e AWS_SECRET_ACCESS_KEY=AWSSECRET -it prerender python -c "from prerender.prerender import Prerender; Prerender(#Options).capture()"
python scraper/setup.py install
python prerender/setup.py install
from prerender.prerender import Prerender
pre = Prerender(
# Options
)
| Required | Variable | Info |
|---|---|---|
| True | `robots_url` | Path to your root robots.txt file. This contains the sitemap info. |
| True | `s3_bucket` | Name of the S3 bucket used as the cache archive. |
| False | `auth` | Credentials used for basic authentication to the page. |
| False | `query_char_deliminator` | (recommended) Character used to replace the question mark in stored page keys. S3 cannot serve a static page whose key contains a `?`, so swapping it for another character fixes this. For example, with a `query_char_deliminator` of `'#'`, the page `/subpage?id=1` is stored as `/subpage#id=1`. |
| False | `allowed_domains` | List of domains to allow. If specified, all other domains are blocked during page capturing. |
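For reference, a filled-in constructor might look like the sketch below, assuming the options in the table map directly to keyword arguments. The bucket name and domain are placeholder values:

```python
from prerender.prerender import Prerender

# All values below are placeholders; swap in your own bucket and domain.
pre = Prerender(
    robots_url="https://example.com/robots.txt",  # root robots.txt that lists the sitemap
    s3_bucket="my-prerender-cache",               # S3 bucket used as the cache archive
    query_char_deliminator="#",                   # /subpage?id=1 is stored as /subpage#id=1
    allowed_domains=["example.com"],              # only capture pages on this domain
)
```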
To invalidate the existing cache:
pre.invalidate()
To capture the full domain:
pre.capture()
If you prefer to capture a single page rather than a full domain:
pre.capture_page_and_upload("https://example.com")
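To refresh a handful of specific pages, the same call can be looped over a list of URLs. A small sketch, reusing the `pre` instance constructed above (the URLs are placeholders):

```python
# Placeholder URLs; capture_page_and_upload is called once per page.
pages = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/subpage?id=1",
]

for url in pages:
    pre.capture_page_and_upload(url)
```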