In this project we implemented crawlers that extract business data from business websites. To share the workload, we created five individual crawlers.
First Crawler
The first crawler, contained in demo_first_crawler.ipynb, takes the business website as input and extracts the following elements:
- Social network names
- Social network URLs
- Whether the website provides a multi-language option (0 or 1)
- Whether the website provides a newsletter (0 or 1)
- Whether the website provides a search option (0 or 1)
- Whether the website provides a blog (0 or 1)
- Whether the website provides a mobile application (0 or 1)
- Whether the website has an e-shop (0 or 1)
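As a rough illustration of how a few of these elements could be detected, the sketch below parses a page's HTML with the standard library only. It is not the code from demo_first_crawler.ipynb: the `SOCIAL_DOMAINS` table, the `LinkCollector` class, the `extract_site_features` function, and the keyword heuristics are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed mapping of domains to network names (hypothetical, not from the notebook).
SOCIAL_DOMAINS = {
    "facebook.com": "Facebook",
    "twitter.com": "Twitter",
    "linkedin.com": "LinkedIn",
    "instagram.com": "Instagram",
    "youtube.com": "YouTube",
}


class LinkCollector(HTMLParser):
    """Collects link targets and <input> types from an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.input_types = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        elif tag == "input":
            self.input_types.append((attrs.get("type") or "").lower())


def extract_site_features(html):
    """Return social links and a few 0/1 flags using simple heuristics."""
    parser = LinkCollector()
    parser.feed(html)

    social = {}
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_DOMAINS:
            social[SOCIAL_DOMAINS[host]] = href

    names = sorted(social)
    return {
        "social_names": names,
        "social_urls": [social[name] for name in names],
        # Heuristic: a search box is an <input type="search">.
        "search": int("search" in parser.input_types),
        # Heuristic: the word "newsletter" appears anywhere in the markup.
        "newsletter": int("newsletter" in html.lower()),
        # Heuristic: at least one link target contains "blog".
        "blog": int(any("blog" in h.lower() for h in parser.hrefs)),
    }
```

Heuristics like these are cheap but produce false positives and negatives, so a real crawler would likely need more checks per flag.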
The following image demonstrates the use of the First Crawler
Second Crawler
The second crawler, contained in demo_second_crawler.ipynb, takes the business website as input and extracts the following elements:
- Business phone contacts
- Business email contacts
- Business name
- Business quality certifications
- Countries in which the business operates
- Business scope of activities
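Contact details such as emails and phone numbers are commonly pulled from page text with regular expressions. The sketch below shows one way this might look; the patterns and the `extract_contacts` function are assumptions for illustration and are deliberately loose, not the expressions used in demo_second_crawler.ipynb.

```python
import re

# Loose, assumed patterns; real-world formats vary and need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text):
    """Return de-duplicated emails and normalized phone numbers found in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    # Keep only digits and a leading "+" so that differently formatted
    # copies of the same number collapse into one entry.
    phones = sorted({re.sub(r"[^\d+]", "", m) for m in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}
```

For example, `extract_contacts("Call +30 210 1234567 or write to info@example.com.")` yields one phone number and one email address.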
The following image demonstrates the use of the Second Crawler
Third Crawler
The third crawler, contained in demo_third_crawler.ipynb, takes the business website as input and extracts the following elements:
- Business website last modified date (Source: Internet Archive)
- Business website development quality (Source: Google Insights API)
- Total visits/year (Source: StatsShow.com)
- Unique visits/year (Source: StatsShow.com)
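One possible way to obtain a date from the Internet Archive is its Wayback Machine availability endpoint, which returns the closest archived snapshot of a URL as JSON. The sketch below assumes that endpoint is used and treats the snapshot timestamp as a proxy for the last modified date; the notebook may take a different approach. The helper names `build_query`, `parse_closest_snapshot`, and `latest_snapshot` are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def build_query(site):
    """Build the availability query URL for a site."""
    return WAYBACK_ENDPOINT + "?" + urlencode({"url": site})


def parse_closest_snapshot(payload):
    """Return the snapshot time as a datetime, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Wayback timestamps use the form YYYYMMDDhhmmss.
    return datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")


def latest_snapshot(site, timeout=10):
    """Query the endpoint over the network and parse the response."""
    with urlopen(build_query(site), timeout=timeout) as resp:
        return parse_closest_snapshot(json.load(resp))
```

Keeping the parsing separate from the network call makes the date handling easy to check against a saved response.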
The following image demonstrates the use of the Third Crawler
Fourth Crawler
The fourth crawler, contained in demo_fourth_crawler.ipynb, takes the business website as input and extracts:
- Business street address
- Business geographical coordinates
- Business zip code
The following image demonstrates the use of the Fourth Crawler
Fifth Crawler
The fifth crawler, contained in demo_fifth_crawler.ipynb, takes the business website as input and checks whether the following terms are mentioned on it:
- Whether "Corporate Social Responsibility" is mentioned on the business website (0 or 1)
- Whether "exports" is mentioned on the business website (0 or 1)
- Whether "imports" is mentioned on the business website (0 or 1)
- Whether "customer support" is mentioned on the business website (0 or 1)
- Whether "representation" is mentioned on the business website (0 or 1)
- Whether "private facilities" is mentioned on the business website (0 or 1)
- Whether "awards" is mentioned on the business website (0 or 1)
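A term check of this kind can be sketched as a case-insensitive phrase search over the page text. The example below is a minimal illustration, assuming the text has already been extracted from the HTML; the `KEYWORDS` list mirrors the terms above, while the `keyword_flags` function and its exact matching rules are assumptions rather than the notebook's implementation.

```python
import re

KEYWORDS = [
    "corporate social responsibility",
    "exports",
    "imports",
    "customer support",
    "representation",
    "private facilities",
    "awards",
]


def keyword_flags(text):
    """Map each keyword to 1 if it occurs as a whole phrase in text, else 0."""
    # Lowercase and collapse whitespace so phrases split across lines still match.
    normalized = re.sub(r"\s+", " ", text.lower())
    return {
        kw: int(re.search(r"\b" + re.escape(kw) + r"\b", normalized) is not None)
        for kw in KEYWORDS
    }
```

Whole-phrase matching avoids counting substrings of longer words, but it also misses inflected forms such as "exporting", so the keyword list may need variants.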
The following image demonstrates the use of the Fifth Crawler