repository housing web-crawling and scraping code for WikiFactCheck-en evidence

wikifactcheck-english/wfc-en-crawl


Install

Install Go and configure your GOPATH.

Download this repo and its dependencies with:

go get -v -t github.com/mammothbane/wikite_go

This repo can then be found at $GOPATH/src/github.com/mammothbane/wikite_go.

Run

Generate reference index refidx.json (expects refdata/ and out/ directories to exist from Python steps):

go run cmd/refidx/refidx.go

Download references:

go run cmd/refdl/refdl.go

To parallelize more heavily, split your index index.txt into parts in the indices/ directory, then run run_parts.py, and a separate Go process will download all the parts.

Run the jsonl downloader

This was not part of the initial process and is provided for compatibility with the published data format. If you have a .jsonl file and you're trying to download the evidence data, this is all you need to do.

go run cmd/dl_jsonl/dl_jsonl.go [-inputFile <filename>]

The input is assumed by default to be named input.jsonl in the root of this directory. This process produces output in the evidence/ directory.

Warning: this will take a very long time (days) and will produce dozens of gigabytes of data.

Configuration

All Go binaries accept command-line flags for configuration. To list the flags a binary supports, invoke it with the -h flag.
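For readers unfamiliar with Go's flag handling, the sketch below shows how a flag such as -inputFile with a default of input.jsonl could be declared using the standard flag package. It is an illustration, not the code in cmd/dl_jsonl: the function name parseArgs and the help text are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseArgs parses command-line arguments and returns the input file path.
// With ContinueOnError, passing -h prints usage and returns flag.ErrHelp.
func parseArgs(args []string) (string, error) {
	fs := flag.NewFlagSet("dl_jsonl", flag.ContinueOnError)
	inputFile := fs.String("inputFile", "input.jsonl", "path to the .jsonl input file")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *inputFile, nil
}

func main() {
	path, err := parseArgs(os.Args[1:])
	if err != nil {
		// The flag package has already printed the error or usage text.
		os.Exit(2)
	}
	fmt.Println("reading from", path)
}
```

Any flag declared this way shows up in the -h output together with its default value and description.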
