
Top-1M-jarm

This repo computes the JARM values of the top 1 million websites.
More info on JARM.

Output file template

| alexa rank | domain | ip | JARM hash |
| --- | --- | --- | --- |
| 1 | google.com | 216.58.213.78 | 29d3fd00029d29d21c42d43d00041df48f145f65c66577d0b01ecea881c1ba |
| 2 | youtube.com | 172.217.18.206 | 29d3fd00029d29d21c42d43d00041df48f145f65c66577d0b01ecea881c1ba |

Output file from the February 2023 scan: result.csv (the Alexa rank column has been removed).
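For downstream analysis, the output file can be parsed with the standard `csv` module. A minimal sketch; the comma delimiter and the three-column layout (domain, ip, JARM hash) are assumptions based on the template above, so adjust them to the actual file:

```python
import csv
import io

# Sample rows in the assumed result.csv layout: domain, ip, JARM hash.
sample = (
    "google.com,216.58.213.78,29d3fd00029d29d21c42d43d00041df48f145f65c66577d0b01ecea881c1ba\n"
    "youtube.com,172.217.18.206,29d3fd00029d29d21c42d43d00041df48f145f65c66577d0b01ecea881c1ba\n"
)

rows = list(csv.reader(io.StringIO(sample)))
for domain, ip, jarm in rows:
    print(domain, jarm[:10])

# Identical JARM values often indicate the same TLS stack or CDN front end.
distinct = len({jarm for _, _, jarm in rows})
print(f"{len(rows)} domains, {distinct} distinct JARM value(s)")
```

Replace `sample` with `open("result.csv")` to process the real file.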

Architecture

```mermaid
flowchart LR
   csv(CSV file with domains to scan and their rank) --> Scheduler --> domainQueue[/queue of domains/]
   domainQueue --> workerDNS1[Worker resolving the domain into IP]
   domainQueue --> workerDNS2[Worker resolving the domain into IP]
   domainQueue --> workerDNS3[Worker resolving the domain into IP]
   workerDNS1 --> ipQueue[/queue of IPs/]
   workerDNS2 --> ipQueue
   workerDNS3 --> ipQueue
   ipQueue --> workerJARM1[Worker performing the JARM scan on the IP]
   ipQueue --> workerJARM2[Worker performing the JARM scan on the IP]
   ipQueue --> workerJARM3[Worker performing the JARM scan on the IP]
   workerJARM1 --> scanResultQueue[/queue of JARM results/]
   workerJARM2 --> scanResultQueue
   workerJARM3 --> scanResultQueue
   scanResultQueue --> workerAggregation[Aggregates results in a single CSV]
```
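The same pipeline can be sketched in-process with standard-library queues. This is a simplified, single-threaded stand-in for the rq/Redis setup: the DNS resolver and the JARM scanner are stubbed out with hypothetical placeholders, and each `while` loop plays the role of a pool of workers:

```python
import queue

# Stub stages; the real workers do DNS resolution and a JARM TLS scan.
FAKE_DNS = {"google.com": "216.58.213.78", "youtube.com": "172.217.18.206"}
fake_jarm = lambda ip: f"jarm-of-{ip}"  # placeholder for the actual scan

domains, ips, results = queue.Queue(), queue.Queue(), []

# Scheduler: enqueue the domains to scan.
for d in FAKE_DNS:
    domains.put(d)

# DNS workers: domain -> (domain, IP).
while not domains.empty():
    d = domains.get()
    ips.put((d, FAKE_DNS[d]))

# JARM workers: IP -> fingerprint.
while not ips.empty():
    d, ip = ips.get()
    results.append((d, ip, fake_jarm(ip)))

# Aggregation worker: one CSV line per result.
for d, ip, jarm in results:
    print(f"{d},{ip},{jarm}")
```

In the real project the queues live in Redis, so each stage can be scaled to several worker containers independently.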

Note
Using rq and Docker Compose only really makes sense for CPU-bound tasks, which is not the case here.
Nonetheless, with one master and 3 workers (for a total of 5 GB of RAM and 8 vCPUs, so a very modest cluster), it took a day and a half to process ~600k scans.
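For scale, the figures above work out to a sustained rate of roughly 4–5 scans per second across the whole cluster:

```python
scans = 600_000
seconds = 1.5 * 24 * 3600  # a day and a half
rate = scans / seconds
print(round(rate, 1))  # ≈ 4.6 scans/s
```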

A batch of 1k domains being processed (RQ debug view)

As workers prioritize the ip queue, few jobs accumulate in it (asciicast)
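rq workers drain the queues they listen on in the order those queues are listed, so putting the ip queue first gives it priority. A toy simulation of that draining order (it does not use rq itself, and the queue names are assumed from the diagram):

```python
from collections import deque

# Two queues; the worker always serves the first non-empty one,
# mirroring an rq worker listening on "ips" before "domains".
ips = deque(["1.2.3.4"])
domains = deque(["example.com", "example.org"])

processed = []
while ips or domains:
    q, name = (ips, "ips") if ips else (domains, "domains")
    processed.append((name, q.popleft()))

print(processed)
# [('ips', '1.2.3.4'), ('domains', 'example.com'), ('domains', 'example.org')]
```

This is why the ip queue stays short in the asciicast: any resolved IP is scanned before the worker goes back to resolving more domains.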

Set up for development

Run `poetry install` to install dependencies.
This project uses PyO3 to bind Rust code; to build it, run `maturin develop --locked --release`.
To build the local Docker image, run `docker build -t top-1m-jarm:latest .`

Running

This project uses Docker Swarm (might require `docker swarm init`).
One node has to be marked as a coordinator with `docker node update --label-add coordinator=1 $(docker node inspect --format '{{ .ID }}' self)`.
It will be responsible for input/output files.
`result.csv` must also be created via `touch` (by default `touch ./data/result.csv`).

```shell
docker stack deploy --compose-file docker-compose.yml top1MjarmStack
docker stack ls
docker service ls
docker service logs top1MjarmStack_scheduler -f
```

To monitor the queue:

```shell
docker exec -it $(docker ps -qf "name=top1MjarmStack_csv_writer" | head -n 1) poetry run rq info default domains ips jarm_result --url redis://:XXX_SET_REDIS_PASS_XXX@redis_queue:6379 -i 1
```

To remove the running containers:

```shell
docker stack rm top1MjarmStack
docker stack ls
```

Push to docker hub

```shell
docker build -t hugocker/top-1m-jarm --pull --no-cache .
docker push hugocker/top-1m-jarm
```