Install Go and configure your GOPATH.
Download this repo and its dependencies with:
go get -v -t github.com/mammothbane/wikite_go
This repo can then be found at $GOPATH/src/github.com/mammothbane/wikite_go.
Generate the reference index refidx.json (expects the refdata/ and out/ directories to exist from the Python steps):
go run cmd/refidx/refidx.go
Download references:
go run cmd/refdl/refdl.go
If you want to parallelize more heavily, split your index index.txt into parts in the indices/ directory, then run run_parts.py, and a separate Go process will download all the parts.
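The splitting step is left to you. A minimal sketch of one way to do it follows, assuming that index.txt holds one entry per line and that round-robin distribution is acceptable. The part count (4) and the part_N.txt file names are hypothetical, so check what run_parts.py actually expects before relying on them.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// splitLines distributes lines round-robin across n parts.
// Values of n below 1 are treated as 1.
func splitLines(lines []string, n int) [][]string {
	if n < 1 {
		n = 1
	}
	parts := make([][]string, n)
	for i, line := range lines {
		parts[i%n] = append(parts[i%n], line)
	}
	return parts
}

// run reads index.txt and writes the parts into indices/.
func run(numParts int) error {
	f, err := os.Open("index.txt")
	if err != nil {
		return err
	}
	defer f.Close()

	// Assumption: one index entry per line.
	var lines []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		return err
	}

	if err := os.MkdirAll("indices", 0o755); err != nil {
		return err
	}
	for i, part := range splitLines(lines, numParts) {
		// Hypothetical naming scheme; adjust to what run_parts.py reads.
		name := filepath.Join("indices", fmt.Sprintf("part_%d.txt", i))
		data := strings.Join(part, "\n") + "\n"
		if err := os.WriteFile(name, []byte(data), 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical part count; pick one that suits your machine.
	if err := run(4); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Round-robin keeps the parts within one entry of each other in size, which helps the parallel downloads finish at roughly the same time.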
This was not part of the initial process and is provided for compatibility with the published data format. If you have a .jsonl file and you're trying to download the evidence data, this is all you need to do.
go run cmd/dl_jsonl/dl_jsonl.go [-inputFile <filename>]
By default, the input is assumed to be named input.jsonl in the root of this directory. This process writes its output to the evidence/ directory.
Warning: this will take a very long time (days) and will produce dozens of gigabytes of data.
All Go binaries accept flags for configuration. To see the available flags, invoke a binary with the -h flag.
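For readers unfamiliar with Go's standard flag package, the sketch below shows how a flag such as -inputFile with a default of input.jsonl can be declared, and why -h works without extra code. It is an illustration only and is not taken from this repo's sources: the parseArgs helper and the help text are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseArgs is a hypothetical helper that parses command-line arguments
// and returns the chosen input file path.
func parseArgs(args []string) (string, error) {
	fs := flag.NewFlagSet("dl_jsonl", flag.ContinueOnError)
	// The flag name and default mirror the usage described above.
	inputFile := fs.String("inputFile", "input.jsonl", "path to the .jsonl input file")
	// With -h, Parse prints the usage text and returns flag.ErrHelp.
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *inputFile, nil
}

func main() {
	path, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("input file:", path)
}
```

Run without arguments, this falls back to input.jsonl. Passing -inputFile other.jsonl overrides the default, and -h lists every declared flag along with its default value.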