⚠ This project has been superseded by the Maven crawler plugin that is part of the data-processing project.

Incremental Maven Crawler

This application crawls the Maven Central Incremental Index Repository at a configurable interval. It follows this repository and outputs the unique artifacts released on Maven Central. Currently, Maven Central publishes a new (incremental) index every week.

Several outputs are supported, including Kafka and HTTP. In addition, a checkpointing mechanism provides persistence across restarts. More specifically, the checkpoint directory stores an INDEX.index file, where INDEX is the next index to crawl. E.g., when 800.index is stored, the crawler will start crawling from index 800 (inclusive).
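As a rough sketch of this checkpointing convention (not the project's actual code; the class and method names below are made up for illustration), reading and writing such a checkpoint file could look like this:

// Illustrative sketch of the INDEX.index checkpoint convention described above.
// Class, method names and defaults are assumptions, not the crawler's implementation.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CheckpointSketch {

    private final Path checkpointDir;

    public CheckpointSketch(Path checkpointDir) {
        this.checkpointDir = checkpointDir;
    }

    // Returns the next index to crawl, e.g. 800 if "800.index" is present.
    public int readNextIndex(int defaultIndex) throws IOException {
        if (!Files.isDirectory(checkpointDir)) {
            return defaultIndex;
        }
        try (Stream<Path> files = Files.list(checkpointDir)) {
            return files
                .map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(".index"))
                .map(name -> name.substring(0, name.length() - ".index".length()))
                .mapToInt(Integer::parseInt)
                .max()
                .orElse(defaultIndex);
        }
    }

    // After index N has been crawled, store "(N + 1).index" so a restart resumes correctly.
    public void storeNextIndex(int crawledIndex) throws IOException {
        Files.createDirectories(checkpointDir);
        Path next = checkpointDir.resolve((crawledIndex + 1) + ".index");
        if (!Files.exists(next)) {
            Files.createFile(next);
        }
    }
}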

Usage

usage: IncrementalMavenCrawler
 -i,--interval <hours>            Time to wait between crawl attempts (in
                                  hours). Defaults to 1 hour.
 -o,--output <[std|kafka|rest]>   Output to send the crawled artifacts to.
                                  Defaults to std.
 -si,--start_index <index>        Index to start crawling from (inclusive).
                                  Required.
 -bs,--batch_size <amount>        Size of batches to send to output.
                                  Defaults to 50.
 -cd,--checkpoint_dir <dir>       Directory to checkpoint/store latest
                                  crawled index. Used for recovery on
                                  crash or restart. Optional.
 -kb,--kafka_brokers <brokers>    Kafka brokers to connect with. I.e.
                                  broker1:port,broker2:port,...
                                  Required for Kafka output.
 -kt,--kafka_topic <topic>        Kafka topic to produce to.
                                  Required for Kafka output.
 -re,--rest_endpoint <url>        HTTP endpoint to post crawled batches to.
                                  Required for Rest output.

Outputs

An example JSON output message:

{
   "artifactId":"config",
   "groupId":"software.amazon.awssdk",
   "version":"2.15.58",
   "date":1609791717000,
   "artifactRepository":"https://repo.maven.apache.org/maven2/"
}
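On the consumer side, such a message can be mapped to a small Java class. The sketch below assumes Jackson for JSON deserialization; neither the library choice nor the class name is prescribed by the crawler.

// Hypothetical consumer-side model for the message above.
// Jackson is an assumed choice here, not mandated by the crawler.
import com.fasterxml.jackson.databind.ObjectMapper;

public class MavenArtifact {
    public String artifactId;
    public String groupId;
    public String version;
    public long date;                 // epoch milliseconds
    public String artifactRepository;

    public static void main(String[] args) throws Exception {
        String json = "{\"artifactId\":\"config\",\"groupId\":\"software.amazon.awssdk\","
            + "\"version\":\"2.15.58\",\"date\":1609791717000,"
            + "\"artifactRepository\":\"https://repo.maven.apache.org/maven2/\"}";
        MavenArtifact artifact = new ObjectMapper().readValue(json, MavenArtifact.class);
        System.out.println(artifact.groupId + ":" + artifact.artifactId + ":" + artifact.version);
    }
}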

StdOutput:
Outputs to the console using System.out.println in JSON format.

KafkaOutput:
Outputs to a Kafka topic.
Requires the arguments:

  • --output kafka; switch output mode to Kafka.
  • --kafka_topic TOPIC; the Kafka topic to send to.
  • --kafka_brokers BROKER1,BROKER2; the brokers to connect to.
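
Messages produced by the Kafka output can be read with a standard Kafka consumer. The sketch below is illustrative only: the broker address, group id, and topic name are placeholders, not values required by the crawler.

// Minimal, illustrative Kafka consumer for crawled artifacts.
// Broker, group id and topic below are placeholder values.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CrawledArtifactConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "crawler-demo");               // placeholder group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("maven.artifacts"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());       // the JSON message shown above
                }
            }
        }
    }
}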

Deployment

To build the image:

mvn clean package
docker build . -t crawler

Alternatively, pull the image from GitHub Packages. This requires authenticating to the GitHub Packages registry with a personal access token.

docker pull docker.pkg.github.com/fasten-project/incremental-maven-crawler/crawler:latest
docker tag docker.pkg.github.com/fasten-project/incremental-maven-crawler/crawler:latest crawler

To run:

docker run crawler [arguments]

For example:

docker run crawler --start_index 682 --output std
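
Or, to produce to a Kafka topic (the broker and topic values here are placeholders):

docker run crawler --start_index 682 --output kafka --kafka_brokers localhost:9092 --kafka_topic maven.artifacts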