[commands] add replay GitHub csv commits option #210
This adds a command to "replay" the API CSVs stored in GitHub as data updates, so that the changes are stored in the DB as batches.
The command takes an input file with a list of commits and goes over them in order. For each commit it fetches the raw CSV content from GitHub (not through the GitHub API, which hits its rate limit pretty fast) and sends the commit as a new published batch.
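A minimal sketch of the fetch step, not the code in this PR. It assumes the input file has one commit SHA per line (anything after the SHA is ignored), and the helper names `read_commits`, `raw_url` and `fetch_csv` are hypothetical. The only real thing it relies on is the `raw.githubusercontent.com/<owner>/<repo>/<sha>/<path>` URL layout, which serves file content without going through the rate-limited API.

```python
import urllib.request

RAW_BASE = "https://raw.githubusercontent.com"


def read_commits(lines):
    """Return commit SHAs in file order, skipping blank lines and '#' comments.

    Assumed format: the SHA is the first whitespace-separated token on a line.
    """
    commits = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            commits.append(line.split()[0])
    return commits


def raw_url(owner, repo, sha, path):
    """Build the raw-content URL for one file at one commit."""
    return f"{RAW_BASE}/{owner}/{repo}/{sha}/{path}"


def fetch_csv(owner, repo, sha, path):
    """Download the CSV text for a commit (performs a network request)."""
    with urllib.request.urlopen(raw_url(owner, repo, sha, path)) as resp:
        return resp.read().decode("utf-8")
```

The caller would loop over `read_commits(open(commits_file))` and call `fetch_csv` once per commit, keeping the previous commit's text around for the diff.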
A few things happen locally (in the command) to reduce the size of the updates and to calculate the batch message, the changed fields, etc.
The heuristics are:
Some heuristics for date and time formatting
Some commits with bad data are completely skipped
The diff between two consecutive commits is computed locally in the command, and only the rows that changed or were added are submitted. It's faster this way.
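The local diff can be sketched roughly as follows. This is an illustration under assumptions, not the command's actual implementation: rows are matched on a caller-supplied tuple of key columns (the `state` and `date` names in the usage note are hypothetical), values are compared as raw strings, and rows that disappeared between commits are ignored, since only changed or added rows are submitted.

```python
import csv
import io


def load_rows(csv_text, key_fields):
    """Index the CSV rows by a tuple of the key columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in key_fields): row for row in reader}


def diff_rows(prev_text, curr_text, key_fields):
    """Return (rows to submit, sorted names of fields that changed).

    A row is submitted if its key is new, or if any column value
    differs from the same row in the previous commit.
    """
    prev = load_rows(prev_text, key_fields)
    curr = load_rows(curr_text, key_fields)
    rows_to_submit = []
    changed_fields = set()
    for key, row in curr.items():
        old = prev.get(key)
        if old is None:
            rows_to_submit.append(row)  # added row
            continue
        fields = {f for f in row if row[f] != old.get(f)}
        if fields:
            rows_to_submit.append(row)  # changed row
            changed_fields |= fields
    return rows_to_submit, sorted(changed_fields)
```

For example, `diff_rows(prev_csv, curr_csv, ("state", "date"))` yields the rows for the batch, and the list of changed fields could feed the batch message.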
I changed some logging -- this is minor.
I commented out the requirement to submit `states` as part of the batch, because I didn't cross-reference the history of states_info with the commits.