Parallel NDJSON Reader

Purpose

This script reads and processes newline-delimited data in parallel. On NDJSON files, my 12-core Xeon decoded (json.loads) 90,000 Twitter objects per second. Throughput is limited mainly by the number of CPU cores you have and the speed of your I/O subsystem.
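For orientation, the per-object work measured above is one json.loads call per line. The following is a minimal single-process sketch of that step, not code from parallelReader.py; the helper names `decode_ndjson` and `objects_per_second` are hypothetical, and the rate it prints depends entirely on your hardware and data.

```python
import io
import json
import time


def decode_ndjson(stream):
    """Yield one decoded object per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def objects_per_second(stream):
    """Decode every line and return (count, rate). Illustrative helper only."""
    start = time.perf_counter()
    count = sum(1 for _ in decode_ndjson(stream))
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


if __name__ == "__main__":
    # Tiny in-memory sample; a real run would pass an open file instead.
    sample = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
    count, rate = objects_per_second(sample)
    print(f"decoded {count} objects at {rate:.0f} objects/s")
```

A single-process number like this gives a baseline to compare against when judging how much the parallel reader gains on a given machine.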

Features

Select the number of cores used by setting the n_chunks variable.
If the file is too small to split into N pieces, the script scales down to the maximum number of chunks possible. This script is not meant for small files, since there is some startup overhead. It is meant to tear through big data (gigabytes / terabytes / petabytes).
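The chunking behavior described above can be sketched as follows. This is an assumption about one reasonable way to do it, not the code in parallelReader.py: the file is cut into at most n_chunks byte ranges, each range is extended to the next newline so no line is split, and a small file naturally yields fewer ranges. The function names `chunk_ranges`, `count_objects`, and `parallel_count` are hypothetical.

```python
import json
import os
from multiprocessing import Pool


def chunk_ranges(path, n_chunks):
    """Split a file into at most n_chunks byte ranges that end on line breaks."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = -(-size // n_chunks)  # ceiling division, so at most n_chunks ranges
    ranges = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            end = min(start + step, size)
            if end < size:
                f.seek(end)
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges


def count_objects(args):
    """Decode every line inside one byte range and return how many were parsed."""
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if line.strip():
                json.loads(line)  # the per-object work; replace with real processing
                count += 1
    return count


def parallel_count(path, n_chunks):
    """Process each byte range in its own worker and sum the results."""
    ranges = chunk_ranges(path, n_chunks)
    if not ranges:
        return 0
    with Pool(processes=len(ranges)) as pool:
        return sum(pool.map(count_objects, [(path, s, e) for s, e in ranges]))
```

With these definitions, `parallel_count(path, n_chunks=12)` would start up to 12 workers for a large file and fewer for a file with only a handful of lines. On platforms that spawn rather than fork worker processes, the call needs to sit under an `if __name__ == "__main__":` guard.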

[email protected]

https://pushshift.io/donations

pushshift/Parallel-NDJSON-Reader
