
BWBImportBot Pipeline Epic #97

Open
mekarpeles opened this issue May 4, 2021 · 1 comment

Comments

@mekarpeles
Member

Some ideas we discussed:

  1. Use SQLite instead of books.jsonl, appending each month's results in batches
  2. Replace import.log with SQLite as well
  3. Possibly explore GNU parallel

cc: @BharatKalluri

@BharatKalluri
Collaborator

BharatKalluri commented May 5, 2021

Bot @ https://gist.github.com/BharatKalluri/1b3c7fd88a780a9cdd99063715a5baa1

The script has two modes.

  • ./bwb-import-bot.py setup_db ./bwb.csv : Parses and cleans all the data, then inserts it into a database called bwb-import-state.db. Every entry starts with a status of TO_BE_IMPORTED and null in a column called comment. (In theory this should not take more than a few minutes even for files > 1GB, but this still needs testing.) This runs every time OL receives data, which is most probably once a month.
  • ./bwb-import-bot.py process : Reads a batch (currently 10000) of records whose status is TO_BE_IMPORTED from the DB and tries to import them into OL. If the request succeeds, the row's status changes to SUCCESS; otherwise it changes to ERROR, with the error stored in the comment column. This process keeps running in the background and stops when no rows with status TO_BE_IMPORTED remain.
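
To make the two modes concrete, here is a minimal sketch of the state machine described above, using only the standard library. It is not the code from the gist: the table name `import_state`, the `data` column, and the `import_record` callable (standing in for the actual OL import request) are assumptions made for illustration. Only the status values, the `comment` column, and the batch size come from the description.

```python
import sqlite3
from typing import Callable, Iterable

BATCH_SIZE = 10_000  # batch size quoted in the description above


def setup_db(conn: sqlite3.Connection, records: Iterable[str]) -> None:
    """Insert cleaned records; each starts as TO_BE_IMPORTED with a NULL comment."""
    # Table and column names other than status/comment are hypothetical.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS import_state ("
        " id INTEGER PRIMARY KEY,"
        " data TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'TO_BE_IMPORTED',"
        " comment TEXT)"
    )
    conn.executemany(
        "INSERT INTO import_state (data) VALUES (?)",
        ((record,) for record in records),
    )
    conn.commit()


def process(
    conn: sqlite3.Connection,
    import_record: Callable[[str], None],  # hypothetical stand-in for the OL import call
    batch_size: int = BATCH_SIZE,
) -> None:
    """Import pending rows batch by batch until none are left."""
    while True:
        rows = conn.execute(
            "SELECT id, data FROM import_state"
            " WHERE status = 'TO_BE_IMPORTED' LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break  # nothing left to import
        for row_id, data in rows:
            try:
                import_record(data)
            except Exception as exc:
                # Keep the failure reason next to the row instead of in a log file.
                conn.execute(
                    "UPDATE import_state SET status = 'ERROR', comment = ?"
                    " WHERE id = ?",
                    (str(exc), row_id),
                )
            else:
                conn.execute(
                    "UPDATE import_state SET status = 'SUCCESS' WHERE id = ?",
                    (row_id,),
                )
        conn.commit()  # persist progress once per batch
```

Because every processed row leaves the TO_BE_IMPORTED state, whether it succeeded or failed, the loop always terminates, and an interrupted run can pick up from the last committed batch.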

Some more thoughts: we could technically use all the cores on the system (using pandarallel) to parallelize the process step and make many import calls in parallel. @mekarpeles raised two very good reasons not to do this for now:

  • Importing is a very intensive process, and queuing many imports may impact the server's stability
  • OL rate-limits to one request per second, so we cannot make multiple calls in parallel

This optimization is something we might explore later, but for now everything will be synchronous.
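
As a rough illustration of what the synchronous approach could look like under the one-request-per-second limit, the sketch below spaces calls at least `min_interval` seconds apart. The function name, the `import_record` callable, and the injectable `clock` and `sleep` parameters are all assumptions for illustration, not part of the gist.

```python
import time
from typing import Callable, Iterable


def import_sequentially(
    records: Iterable[str],
    import_record: Callable[[str], None],  # hypothetical stand-in for the OL import call
    min_interval: float = 1.0,  # one request per second, per the limit noted above
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call import_record once per record, never faster than min_interval apart."""
    last_call = None
    count = 0
    for record in records:
        if last_call is not None:
            # Wait out whatever is left of the interval since the previous call started.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        import_record(record)
        count += 1
    return count
```

The interval is measured from the start of the previous call, so a slow import that already takes longer than a second adds no extra delay.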
