Sierra MARC full-exports

Code to export marc data from the III Sierra ILS.

On this page...

  • A description of the code-flow
  • Possible optimizations

Code flow...

The full set of marc records is exported from Sierra twice a week. There are four parts to this code & process...

  1. Determine the last bib

    Currently lib/last_bib.py is called by a cron script a few times a day. This produces a web-accessible json-file whose id field contains the last-bib.

  2. Set up the tracker

    The tracker is a web-accessible json file. Just before the main processing code runs, a separate cron-job deletes the previous tracker. When the main processing code then runs via its own cron-job, the tracker is checked.

    • The first check is to see if it exists. If it doesn't exist, it's created.
    • The second check is to see if it contains a last bib. If it doesn't, the last bib is grabbed from the web-accessible last_bib.json url described above.
    • The third check is to see if batches have been created. If they haven't been, the last-bib is used to determine the full range of bibs, which is then split into batches of bib sub-ranges respecting the api's 2000-bib range limit (see the tracker sketch after this list).
  3. Query the api

    • The 'next-batch' bib-range is grabbed from the tracker.
    • The api is queried on the bib-range.
    • The api returns a file-url for the specified bib-range.
    • The file-url is accessed and the file is saved to the target directory with a unique name.
    • The tracker is updated to indicate that the batch is complete.
    • The script gets the 'next-batch' bib-range from the tracker, and the cycle continues (see the query-cycle sketch after this list).
  4. Validate the marc files

    Some of the downloaded *.mrc files aren't actually marc, but instead contain status information from the api. This step checks each downloaded file and renames any invalid *.mrc file to a *.txt file (the validity check appears in the query-cycle sketch after this list).
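
The tracker-handling of steps 1 and 2 boils down to three idempotent checks. Here's a minimal sketch — not the repo's actual code — assuming a local tracker path, an example last_bib.json url, and invented field names (last_bib, batches, done):

```python
import json
import pathlib
import urllib.request

TRACKER_PATH = pathlib.Path("tracker.json")           # assumed location
LAST_BIB_URL = "https://example.edu/last_bib.json"    # assumed url
BATCH_SIZE = 2000                                     # the api's bib-range limit

def load_or_create_tracker() -> dict:
    """First check: if the tracker doesn't exist, create it."""
    if TRACKER_PATH.exists():
        return json.loads(TRACKER_PATH.read_text())
    return {"last_bib": None, "batches": []}

def ensure_last_bib(tracker: dict) -> None:
    """Second check: if there's no last bib, grab it from last_bib.json."""
    if tracker["last_bib"] is None:
        with urllib.request.urlopen(LAST_BIB_URL) as resp:
            tracker["last_bib"] = int(json.load(resp)["id"])

def ensure_batches(tracker: dict) -> None:
    """Third check: if no batches exist, carve the full bib range
    into 2000-bib sub-ranges."""
    if not tracker["batches"]:
        last = tracker["last_bib"]
        tracker["batches"] = [
            {"start": s, "end": min(s + BATCH_SIZE - 1, last), "done": False}
            for s in range(1, last + 1, BATCH_SIZE)
        ]

tracker = load_or_create_tracker()
ensure_last_bib(tracker)
ensure_batches(tracker)
TRACKER_PATH.write_text(json.dumps(tracker, indent=2))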

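The per-batch query cycle of step 3, plus step 4's validity check, then looks roughly like the sketch below, which continues the tracker sketch above. The endpoint path and the file_url reply field are invented for illustration; the leader check works because a binary MARC file begins with a five-digit record length.

```python
import json
import pathlib
import urllib.request

MARC_DIR = pathlib.Path("downloads")                        # assumed target directory
API_ROOT = "https://sierra.example.edu/iii/sierra-api/v6"   # assumed api root

def query_batch(start: int, end: int) -> dict:
    """Ask the api for a file-url covering one bib sub-range (endpoint is assumed)."""
    with urllib.request.urlopen(f"{API_ROOT}/bibs/marc?id={start}-{end}") as resp:
        return json.load(resp)

def download_batch(file_url: str, start: int, end: int) -> pathlib.Path:
    """Save the file at the returned file-url under a unique name."""
    dest = MARC_DIR / f"bibs_{start:07d}_{end:07d}.mrc"
    urllib.request.urlretrieve(file_url, dest)
    return dest

def looks_like_marc(path: pathlib.Path) -> bool:
    """Step 4's check: a real binary MARC file opens with a 5-digit record length."""
    return path.read_bytes()[:5].isdigit()

MARC_DIR.mkdir(exist_ok=True)
batch = next(b for b in tracker["batches"] if not b["done"])   # the 'next-batch'
reply = query_batch(batch["start"], batch["end"])
saved = download_batch(reply["file_url"], batch["start"], batch["end"])
if not looks_like_marc(saved):                  # not marc -- api status info instead
    saved.rename(saved.with_suffix(".txt"))
batch["done"] = True                            # mark the batch complete
TRACKER_PATH.write_text(json.dumps(tracker, indent=2))
```
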
Notes

  • The output of this code is used by...
    • Josiah
      • A fifth step occurs: the processing of the marc-files to extract updates. This is currently accomplished via old code in a private repository. Those 'update-marc-files' are saved into a directory where a final, sixth step occurs: Ruby traject code processes each of the update-marc-files, flowing extracted data into solr.
    • tech-services reports (code)
      • A cron script triggers code that runs through these marc-files and updates db tables for the web-app.
    • new-titles
      • A cron script triggers code that runs through 'updates' from these marc-files and updates db tables for the web-app.

Back to this code...

  • The Sierra api has built-in rate-limiting. When rate-limiting is in effect, instead of the response providing a file-url, it provides a message that rate limiting is in effect with an estimated number of minutes to wait before making the next api call.

  • The net effect of this is that after a bunch of files are created/downloaded, the script can run for about five minutes before rate-limiting kicks in.

  • The low-tech solution to this, which is working in production, is to have cron trigger the script every 10 minutes during the expected marc-export time-frame, and have each triggered process run until either rate-limiting kicks in or five minutes pass, whichever occurs first (see the sketch below).

  • The tracker-handling is thus useful for two reasons:

    • For development, or for possible troubleshooting, processing can easily pick up where it left off.
    • It handles the auto-stopping and auto-starting of the script seamlessly.
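
Continuing the sketches above, the stop/start pattern might look like this; retry_after_minutes is an invented name for the rate-limit reply field:

```python
import json
import time

MAX_RUN_SECONDS = 5 * 60       # each cron-triggered run stops after ~five minutes

def run_once(tracker: dict) -> None:
    started = time.time()
    for batch in tracker["batches"]:
        if batch["done"]:
            continue
        if time.time() - started > MAX_RUN_SECONDS:
            return                             # time's up; cron fires again in 10 minutes
        reply = query_batch(batch["start"], batch["end"])
        if "retry_after_minutes" in reply:     # invented rate-limit field
            return                             # rate-limited; the tracker holds our place
        saved = download_batch(reply["file_url"], batch["start"], batch["end"])
        if not looks_like_marc(saved):
            saved.rename(saved.with_suffix(".txt"))
        batch["done"] = True
        TRACKER_PATH.write_text(json.dumps(tracker, indent=2))
```
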

Possible optimizations

These are in no particular order; the purpose is to capture ideas that have come up in discussions and brainstorms.

  • H.C. is exploring using other features of the Sierra api to get json data-elements instead of marc records, and acting directly on that json (utilizing the traject massage-logic) to solrize updates.

    • If this is not implemented, BJD will create code that uses the api to directly extract updates on which the ruby traject code can operate.
  • It is possible for there to be no records returned for a given range of 2000 bibs, because many bibs have been deleted. This explains why the bib range spans roughly 8 million bibs while we actually have about 4 million. We've been told that once a bib is deleted, it will not return (will not be "undeleted"). Given that, if we were to track deleted bibs, we should be able to significantly reduce the number of queries and increase their efficiency.

  • Currently the api provides a file-url even when the number of records to be returned is zero. We could simply skip downloading that non-useful file.

  • Currently each range-query saves to a separate download file with a unique name. These numerous download files could be combined for more efficient subsequent processing (see the sketch below).
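
One reason combining is cheap: each binary MARC record carries its own length in its leader, so valid *.mrc files can be concatenated byte-for-byte. A minimal sketch, with assumed directory and file names:

```python
import pathlib

def combine_marc_files(src_dir: pathlib.Path, dest: pathlib.Path) -> None:
    """Concatenate the per-batch *.mrc downloads into a single file."""
    with dest.open("wb") as out:
        for path in sorted(src_dir.glob("*.mrc")):
            out.write(path.read_bytes())

combine_marc_files(pathlib.Path("downloads"), pathlib.Path("full_export.mrc"))
```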