This project will:
- scrape speakers' names from archived conference websites,
- use SexMachine to infer gender, and
- plot gender ratios for different conferences over time.
The Scrapy team have built a spider that scrapes information about speakers at Python conferences since 2011. See the Scrapy installation guide for instructions on installing Scrapy.
To get started with the sprint:
- Pick a currently active conference that hasn't yet been scraped. To see which conferences have already been scraped, run:
  scrapy list
- Create a Scrapy Spider for that conference in the pycon_speakers/spiders/ directory. It should crawl as many years of the conference as possible and extract Speaker items.
- Test your spider.
- Submit a pull request.
Other tasks:
- Improve the gender identification in pycon_speakers/pipelines.py
- Review crawled data and fix spiders when the data is incorrect
- Chart results
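For the gender-identification task, one possible improvement is to clean each scraped name before looking it up, since name-based detectors generally expect a given name rather than a full name with titles. The sketch below is only an illustration: the helper first_name and the TITLES list are hypothetical and are not taken from pycon_speakers/pipelines.py.

```python
import re

# Honorifics to skip before picking a given name.
# Illustrative and incomplete; extend as the crawled data requires.
TITLES = {"dr", "prof", "mr", "mrs", "ms", "sir"}


def first_name(full_name):
    """Return the likely given name from a full name, or '' if none is found."""
    for token in re.split(r"\s+", full_name.strip()):
        cleaned = token.strip(".,")
        if cleaned and cleaned.lower() not in TITLES:
            # Title-case so "ada" and "ADA" map to the same lookup key.
            return cleaned.title()
    return ""
```

The returned string could then be passed to whichever detector the pipeline uses. This assumes Western name order (given name first), which will not hold for every speaker.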
List available spiders:
  scrapy list
Run a spider:
  scrapy crawl us.pycon.org
Run all spiders and generate a data.csv file:
  run.sh
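Once data.csv exists, the ratios to chart could be tallied along these lines. This is a sketch under assumptions: the column names conference, year, and gender, and the gender values "female" and "male", are guesses about the file's layout, so check them against the actual header before use. Names with any other gender value are left out of the ratio.

```python
import csv
from collections import Counter, defaultdict


def gender_ratios(rows):
    """Map (conference, year) to the share of female speakers among
    those identified as female or male; None if no one was identified."""
    counts = defaultdict(Counter)
    for row in rows:
        # Column names are assumptions about data.csv.
        counts[(row["conference"], int(row["year"]))][row["gender"]] += 1

    ratios = {}
    for key, tally in counts.items():
        known = tally["female"] + tally["male"]
        ratios[key] = tally["female"] / known if known else None
    return ratios


# Example use against the generated file:
#
#     with open("data.csv", newline="") as f:
#         ratios = gender_ratios(csv.DictReader(f))
```

Keeping the tally separate from file reading means the same function can be fed rows from a test fixture or from the crawler output directly.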
See https://dash.scrapinghub.com/p/2878/
username: pycon2014
password: pycon2014