Add ISBN IA extractor bot #340

Open: wants to merge 1 commit into base: master
13 changes: 13 additions & 0 deletions isbnfromiabot/README.md
@@ -0,0 +1,13 @@
A set of scripts that add `isbn_13` values to editions whose IA (`ocaid`) source records contain an ISBN 13.
### How To Use
```bash
# Find editions that have an IA ISBN source record but no ISBN 13
./find_editions_with_isbn_ia_not_13.sh /path/to/ol_dump.txt.gz /path/to/filtered_dump.txt.gz
# Add ISBN 13s extracted from the IA ocaid source
python isbn_ia_to_13.py --dump_path=/path/to/filtered_dump.txt.gz --dry_run=<bool> --limit=<int>
```
If `dry_run` is `True`, the script runs as normal but saves no changes to Open Library.
This is for debugging purposes. By default, `dry_run` is `True`.
`limit` is the maximum number of changes to Open Library that the script makes before it quits.
By default, `limit` is `1`. Setting `limit` to `0` allows unlimited edits.
A log is automatically generated whenever `isbn_ia_to_13.py` executes.
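
The interaction between `dry_run` and `limit` can be pictured with a minimal sketch. This is a hypothetical illustration, not the actual `olclient` implementation: the function name `apply_edits` and the choice to count dry-run edits toward the limit are assumptions made for the example.

```python
from typing import Callable, Iterable


def apply_edits(
    edits: Iterable[Callable[[], None]],
    dry_run: bool = True,
    limit: int = 1,
) -> int:
    """Hypothetical helper: run save callbacks until `limit` is reached.

    A `limit` of 0 means "no limit". In dry-run mode the callback is
    skipped, but the edit is still counted (an assumption of this sketch).
    """
    changed = 0
    for save in edits:
        if limit and changed >= limit:
            break  # stop once the configured number of edits is reached
        if not dry_run:
            save()  # only touch Open Library when not in dry-run mode
        changed += 1
    return changed
```

With the defaults (`dry_run=True`, `limit=1`) this processes a single edit and saves nothing, which matches the cautious defaults described above.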
22 changes: 22 additions & 0 deletions isbnfromiabot/find_editions_with_isbn_ia_not_13.sh
@@ -0,0 +1,22 @@
#!/bin/bash

if [[ -z $1 ]]
then
    echo "No dump file provided"
    exit 1
fi
if [[ -z $2 ]]
then
    echo "No output file provided"
    exit 1
fi

OL_DUMP=$1
OUTPUT=$2

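# Keep editions with an "ia:isbn_<13 digits>" source record and no ISBN fields.
# Note: grep -E does not support \d, so the digit class is spelled out.
zgrep '^/type/edition' "$OL_DUMP" |
grep -E '"ia:isbn_[0-9]{13}"' |
grep -v -E '"isbn_13":' |
grep -v -E '"isbn_10"' |
pv |
gzip > "$OUTPUT"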
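
For readers who want to check the filter without `zgrep` and `pv`, the same selection can be sketched in Python. This is only an illustration of the pipeline's logic under the assumption that each dump line is a tab-separated row beginning with its type; `filter_dump_lines` is a hypothetical name, not part of this PR.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the grep step: a quoted "ia:isbn_" source record
# followed by exactly 13 digits.
IA_ISBN = re.compile(r'"ia:isbn_[0-9]{13}"')


def filter_dump_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield edition rows with an IA ISBN record and no isbn_13/isbn_10."""
    for line in lines:
        if not line.startswith("/type/edition"):
            continue  # mirrors: zgrep '^/type/edition'
        if not IA_ISBN.search(line):
            continue  # mirrors: grep -E '"ia:isbn_[0-9]{13}"'
        if '"isbn_13":' in line or '"isbn_10"' in line:
            continue  # mirrors the two grep -v steps
        yield line
```

To run it against a real dump, the lines could be read with `gzip.open(path, "rt")` and the results written back out with `gzip.open(out_path, "wt")`.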
70 changes: 70 additions & 0 deletions isbnfromiabot/isbn_ia_to_13.py
@@ -0,0 +1,70 @@
"""
IA isbn ref to isbn 13
NOTE: This script ideally works on an Open Library Dump that only contains editions with an IA isbn ref and no isbn_13
"""
import gzip
import json
import re

import isbnlib
import olclient


class ConvertISBNiato13Job(olclient.AbstractBotJob):
    def run(self) -> None:
        """Looks for any IA ISBN to convert to 13"""
        self.write_changes_declaration()
        # Column positions in a tab-separated Open Library dump row
        header = {"type": 0, "key": 1, "revision": 2, "last_modified": 3, "JSON": 4}
        comment = "extract ISBN 13 from IA source_record"
        # Source records of the form "ia:isbn_" followed by exactly 13 digits
        regex = "ia:isbn_[0-9]{13}"
        with gzip.open(self.args.file, "rb") as fin:
            for row_num, row in enumerate(fin):
                row = row.decode().split("\t")
                _json = json.loads(row[header["JSON"]])
                if _json["type"]["key"] != "/type/edition":
                    continue

                # _json is a dict, so test for the key (hasattr would always be False)
                if "isbn_13" in _json:
                    # we only update editions with no existing isbn 13s (for now at least)
                    continue

                source_records = _json.get("source_records")
                if not source_records:
                    continue

                isbn_13 = None
                for source_record in source_records:
                    if re.fullmatch(regex, source_record):
                        # strip the 8-character "ia:isbn_" prefix
                        isbn_13 = source_record[8:]
                        break

                if not isbn_13:
                    continue

                if not isbnlib.is_isbn13(isbn_13):
                    continue

                # Re-fetch the live edition, since the dump may be out of date
                olid = _json["key"].split("/")[-1]
                edition = self.ol.Edition.get(olid)
                if edition.type["key"] != "/type/edition":
                    continue

                if hasattr(edition, "isbn_13"):
                    # don't update editions that already have an isbn 13
                    continue

                isbns_13 = [isbn_13]

                setattr(edition, "isbn_13", isbns_13)
                self.logger.info("\t".join([olid, source_record, str(isbns_13)]))
                self.save(lambda: edition.save(comment=comment))


if __name__ == "__main__":
    job = ConvertISBNiato13Job()

    try:
        job.run()
    except Exception as e:
        job.logger.exception(e)
        raise
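
The extraction step in `run()` can be exercised on its own, without a dump or an Open Library connection. The sketch below is a standalone approximation: `extract_isbn13` and `is_valid_isbn13` are hypothetical helpers, and the checksum function is a plain-stdlib stand-in for `isbnlib.is_isbn13`, which may apply further checks.

```python
import re
from typing import Iterable, Optional

IA_ISBN_13 = re.compile(r"ia:isbn_([0-9]{13})")


def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if not re.fullmatch(r"[0-9]{13}", isbn):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(isbn))
    return total % 10 == 0


def extract_isbn13(source_records: Iterable[str]) -> Optional[str]:
    """Return the ISBN 13 from the first "ia:isbn_..." record, if valid.

    Like the bot, this stops at the first matching record and gives up
    if that record's ISBN fails validation.
    """
    for record in source_records:
        match = IA_ISBN_13.fullmatch(record)
        if match:
            isbn = match.group(1)
            return isbn if is_valid_isbn13(isbn) else None
    return None
```

Pulling the logic into small pure functions like these would also make it straightforward to unit test the bot's matching rules separately from the edit loop.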
2 changes: 2 additions & 0 deletions isbnfromiabot/requirements.txt
@@ -0,0 +1,2 @@
openlibrary-client==0.0.30
isbnlib==3.10.14