
Scraping dynamically generated information from the website onehash.com using Scrapy, Selenium, scrapy.Item, ItemLoader and XPath selectors


Austerius/OneHash-scraper


OneHash-scraper

onehash.py

Scrapy spider for scraping dynamically generated esports betting information from the website www.onehash.com.

This script was created for educational purposes, to demonstrate how to scrape data from a dynamically generated web page using Scrapy and the Selenium webdriver. It also contains examples of Scrapy's "Item" and "ItemLoader", of how to use XPath selectors in Scrapy, and of how to scroll a dynamically loading block of a web page with Selenium.

This script was written in Python 3.6 (for Scrapy 1.5). Before running it, you'll need to install:

  • Scrapy (on a Windows machine you'll need the appropriate C++ SDK to run Twisted; check their docs);
  • Selenium (with geckodriver for Windows machines);
  • Firefox browser.

After installing all requirements, create a Scrapy project and put this script into its "spiders" folder.

The "onehash.py" spider scrapes information about esports events that have not yet been played (or are in progress). The data it scrapes is listed below (the names in ' ' are also the keys of the Item container):

  • 'date' - date of a single event/game in datetime format, converted to UTC on a best-effort basis;
  • 'game' - name of the game (csgo, overwatch, dota2, etc.);
  • 'player1' - name of the first participant (or team name, such as "Misfits" or "SK Gaming");
  • 'player2' - name of the second participant;
  • 'odds1' - betting odds on the first player (float value, such as 1.345);
  • 'odds2' - betting odds on the second player (float value).
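To make the shape of one scraped record concrete, here is a minimal sketch that builds a plain dict with the same six keys. It is not the spider's own code: `build_event` is a hypothetical helper, the sample values are made up, and the real script fills a Scrapy Item through an ItemLoader instead of a dict.

```python
from datetime import datetime


def build_event(date, game, player1, player2, odds1, odds2):
    """Hypothetical helper: return one record with the keys listed above."""
    return {
        "date": date,                # datetime of the event (UTC where possible)
        "game": game.strip(),        # e.g. "csgo"
        "player1": player1.strip(),  # first participant or team
        "player2": player2.strip(),  # second participant or team
        "odds1": float(odds1),       # odds arrive as text, stored as float
        "odds2": float(odds2),
    }


# Made-up sample values, for illustration only.
event = build_event(
    datetime(2018, 3, 10, 18, 0), " csgo ", "Misfits", "SK Gaming", "1.345", "2.9"
)
print(event["game"], event["odds1"], event["odds2"])
```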

For convenience, the script expects you to set the variable TIME_DIFFERENCE inside the script to your own value.

  • TIME_DIFFERENCE - the difference between the event time shown on the website and UTC time (see the comments in the script). You can set it to "0"; in that case the dates of all events that start 24+ hours from the current time are saved in the site's time format (not UTC).
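The conversion this setting implies can be sketched as follows. The sign convention and the unit are assumptions, not taken from the script: here TIME_DIFFERENCE is treated as the number of hours by which the site's displayed time is ahead of UTC, and `site_time_to_utc` is a hypothetical helper name. Check the script comments for the convention it actually uses.

```python
from datetime import datetime, timedelta

# Assumption: hours by which the time shown on the site is ahead of UTC.
TIME_DIFFERENCE = 2


def site_time_to_utc(site_time, time_difference=TIME_DIFFERENCE):
    """Shift a naive datetime from site time to UTC (hypothetical helper)."""
    return site_time - timedelta(hours=time_difference)


shown_on_site = datetime(2018, 3, 10, 1, 30)
print(site_time_to_utc(shown_on_site))     # 2018-03-09 23:30:00
print(site_time_to_utc(shown_on_site, 0))  # unchanged when the offset is 0
```

With an offset of 0 the function returns its input unchanged, which matches the note above that dates then stay in site time.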

Also, if the script runs too slowly or the scraped information is incomplete, you can try adjusting the parameters 'sleep_time' and 'loop_timer' inside the "parse" method of the OneHash class.
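The trade-off between these two parameters can be illustrated with a generic scroll-and-wait loop. This is a sketch under assumptions, not the code from "parse": it assumes 'sleep_time' is the pause in seconds after each scroll step and 'loop_timer' is an overall time limit in seconds, and `scroll_until_stable`, `scroll_step` and `count_items` are hypothetical names.

```python
import time


def scroll_until_stable(scroll_step, count_items, sleep_time=1.0, loop_timer=30.0):
    """Scroll until no new items appear or the time limit runs out.

    scroll_step -- callable that scrolls the block once
    count_items -- callable that returns how many items are loaded so far
    sleep_time  -- assumed meaning: pause after each scroll, in seconds
    loop_timer  -- assumed meaning: overall time limit, in seconds
    """
    deadline = time.monotonic() + loop_timer
    seen = count_items()
    while time.monotonic() < deadline:
        scroll_step()
        time.sleep(sleep_time)   # give the page time to load more items
        current = count_items()
        if current == seen:      # nothing new appeared, so stop scrolling
            break
        seen = current
    return seen


# Stand-in for a page that loads two more items per scroll, up to five.
page = {"items": 1}


def fake_scroll():
    page["items"] = min(page["items"] + 2, 5)


print(scroll_until_stable(fake_scroll, lambda: page["items"], sleep_time=0, loop_timer=5))
```

With Selenium, `scroll_step` would typically wrap a `driver.execute_script(...)` call that scrolls the block, and `count_items` would count the matching elements. In this sketch, a pause that is too short can end the loop before new items have loaded, while a longer pause or time limit slows the whole run.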

To run the spider, change to the Scrapy project folder in your terminal and type:
scrapy crawl onehash
To save the data to a file (for example, a .json file), type:
scrapy crawl onehash -o yourfile.json
