
Dark-Web-Spiders

This repository contains the web crawlers used to mine the dark web; they were built as part of a larger project on future crime prediction and analysis for the Madhya Pradesh Police Department. We used Scrapy to scrape data from the web. Although some scripts use BeautifulSoup as well, the major part was handled by Scrapy itself.

What differentiates this from normal scrapers?

On the dark web, CAPTCHAs pose a problem for spiders. This was handled by solving the CAPTCHAs manually in a browser and then feeding the resulting session cookies to the spider.
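
A minimal sketch of how manually obtained session cookies might be attached to a spider's requests; the cookie names, values, selector, and the 'sample.website' address below are placeholders, not the ones used in this repository:

```python
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title_spider"

    # Cookies copied from the browser session after solving the CAPTCHA manually.
    # Names and values here are placeholders.
    session_cookies = {
        "session_id": "<value copied from browser>",
        "captcha_token": "<value copied from browser>",
    }

    def start_requests(self):
        # Attach the authenticated session cookies to every request.
        yield scrapy.Request(
            url="http://sample.website/forum",
            cookies=self.session_cookies,
            callback=self.parse,
        )

    def parse(self, response):
        # Selectors must be adapted to the target site's markup.
        for title in response.css("a.thread-title::text").getall():
            yield {"title": title.strip()}
```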

How to use these files?

To use these files:

  1. Start a new Scrapy project.
  2. Overwrite the default settings by referring to settings.py (an illustrative sketch of such overrides is shown after this list).
  3. Run the title scraper first. Verify that the selectors work for your target website or write your own, replace 'sample.website' with the real address, and supply valid cookies.
  4. Using the data scraped in step 3, do the same for the post scraper.
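
Step 2 refers to the settings.py in this repository. As an illustration only, overrides of the following kind are typical for this sort of crawl; these are standard Scrapy settings, but the exact values are assumptions, not necessarily what this repo uses:

```python
# settings.py -- illustrative overrides only; adapt to the values in this repo's settings.py
BOT_NAME = "darkweb_spiders"

# Dark-web forums rarely serve a robots.txt, and obeying one would block the crawl.
ROBOTSTXT_OBEY = False

# Keep cookies enabled so the manually obtained CAPTCHA session survives across requests.
COOKIES_ENABLED = True

# Crawl slowly to avoid tripping rate limits or anti-bot checks.
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 4

# Hidden services respond slowly, so retry generously and allow long timeouts.
RETRY_TIMES = 5
DOWNLOAD_TIMEOUT = 120
```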

Comparison of various scraping frameworks

  • Scrapy - Scrapy is a web crawling framework with a large number of tools for web crawling. It is built on top of Twisted, an asynchronous networking framework that uses non-blocking I/O calls to servers. Because it is asynchronous and non-blocking, it performs very well and is very fast. It has been built to consume little memory and minimal CPU resources; some benchmarks have claimed that Scrapy is up to 20 times faster than comparable tools. It is portable, and its functionality can be extended.

One advantage of Scrapy is that it comes with modules both for sending requests and for parsing responses. The major drawback is that it is not a beginner-friendly tool.

  • BeautifulSoup - BeautifulSoup is an open-source tool used for web scraping. Unlike Scrapy, it is not a web crawling and scraping framework; it is a module for pulling data out of HTML and XML documents. It is beginner-friendly, so a newcomer can hit the ground running with it, thanks to very good documentation and a friendly user community. Most web scrapers have used BeautifulSoup before moving on to Scrapy. The tool is not complex and makes it easy to traverse an HTML document and pick out the required data.
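
For illustration, a minimal BeautifulSoup snippet that pulls thread titles out of a saved HTML page; the file name and CSS class are placeholders and not part of this repository's scripts:

```python
from bs4 import BeautifulSoup

# Parse a page that was saved to disk (e.g. fetched earlier through the Tor session).
with open("forum_page.html", encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "html.parser")

# Pick out every thread title; the class name depends on the target forum's markup.
titles = [a.get_text(strip=True) for a in soup.select("a.thread-title")]
print(titles)
```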

  • Selenium - Selenium can send web requests and also comes with a parser. With Selenium, you can pull data out of an HTML document much as you would with the JavaScript DOM API. Its major advantage is that it loads JavaScript, so it can reach data that only appears after scripts run, without you having to send the additional requests yourself. This makes Selenium useful not only on its own but also alongside the other tools: web scrapers that use Scrapy or BeautifulSoup turn to Selenium when the data they need only becomes available once JavaScript has loaded.
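
A small Selenium sketch showing how JavaScript-rendered content can be read after a page finishes loading; the URL and selector are placeholders, and this is not part of the spiders in this repo:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
try:
    driver.get("http://sample.website/forum")
    # Elements rendered by JavaScript are available once the page has loaded.
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "a.thread-title")]
    print(titles)
finally:
    driver.quit()
```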

Scraped files

The files that we scraped were collected in a controlled environment; do not attempt the same without proper safety precautions. Also, due to confidentiality, I have not uploaded the complete database file. Contact me at [email protected] to request access. The data is available for viewing purposes through the linked Drive folder.

The Drive folder also contains soft copies of the books Sion Retzkin - Hands-On Dark Web Analysis: Learn what goes on in the Dark Web, and how to work with it and Dark Web: Exploring and Data Mining the Dark Side of the Web.

There are 4 different databases that we were able to scrape:

  • IronMarch Neo-Nazi Hackforum
  • Nulled.io hackforum
  • Indian markets on the dark web
  • Agora database

Given below is a snapshot of the Agora database:

[Agora database snapshot image]

Sqlite-Python

The scraped files were assembled into an SQL database for personal reasons. The following link gives a guide on how to connect a Python session to an SQL server: https://github.com/Jash-2000/Dark-Web-analysis/tree/master/SQL2Python.
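
The linked guide covers the full setup; as a minimal, self-contained illustration, Python's standard-library sqlite3 module is enough to query such a database. The file name and table schema below are placeholders, not the repository's actual database:

```python
import sqlite3

# Open the scraped-data database; the file name here is a placeholder.
conn = sqlite3.connect("darkweb_scrape.db")
cur = conn.cursor()

# Query a hypothetical table of scraped posts.
cur.execute("SELECT title, author FROM posts LIMIT 10")
for title, author in cur.fetchall():
    print(title, "-", author)

conn.close()
```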
