Allow librarians to import MARC data from other libraries #8360

Open
onnotasler opened this issue Oct 3, 2023 · 7 comments
Labels
Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Theme: MARC records Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@onnotasler
Collaborator

When entering new books or editing existing ones, I often have to manually copy data from libraries that offer a MARC record for download. It would be great if I could import this data directly instead of having to type it.

As an example, take Das Postwesen im Postamtbezirk Buxtehude.
This book exists as a very low-quality import on Open Library at OL26425107W.
The Deutsche Nationalbibliothek offers most of the lacking information on their website. It offers downloads as MARC21-XML and RDF (Turtle).

The DNB is not the only national library offering this, although the formats differ between libraries. The Bibliothèque nationale de France offers Intermarc and Unimarc instead, for instance. LIBRIS (National Library of Sweden) offers MARC21.

It would save me time and prevent spelling errors if I could import those datasets.
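For illustration, here is a minimal sketch of reading the fields I usually retype from a MARC21-XML download, using only the Python standard library. The sample record is hand-written and hypothetical (it is not actual DNB output), and the chosen tags (100, 245, 264) are just the common MARC21 places for author, title and publication data; real records may use others.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARC21-XML ("MARC21 slim") documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical, hand-written sample record; not actual library output.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 c 4500</leader>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Example, Author</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
    <subfield code="b">an example subtitle</subfield>
  </datafield>
  <datafield tag="264" ind1=" " ind2="1">
    <subfield code="b">Example Verlag</subfield>
    <subfield code="c">1999</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return every value of subfield `code` inside datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in record.findall(path, MARC_NS) if el.text]


def summarize(xml_text):
    """Extract a few common fields from a single MARC21-XML record."""
    root = ET.fromstring(xml_text)
    # Downloads may wrap the record in a <collection> element.
    if root.tag != "{http://www.loc.gov/MARC21/slim}record":
        root = root.find("marc:record", MARC_NS)
        if root is None:
            raise ValueError("no MARC21-XML record found")
    return {
        "title": subfields(root, "245", "a"),
        "subtitle": subfields(root, "245", "b"),
        "authors": subfields(root, "100", "a"),
        "publisher": subfields(root, "264", "b"),
        "publish_date": subfields(root, "264", "c"),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

The dictionary keys are illustrative only and are not meant to match Open Library's edition schema.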

Describe the problem that you'd like solved

A way to import MARC records from National Libraries, to at least improve existing records, but ideally also to create new books.

Proposal & Constraints

As far as I understand, Open Library already imports MARC records from some libraries. At least I often read "imported by MARC record from library of ..." at the bottom of editions.

The import should not be more annoying than typing the data in manually. Also, there seem to be many technical differences between the MARC versions. I probably won't be able to get up to speed on all of them, so this would have to be handled automatically.

Additional context

Stakeholders

@hornc

@onnotasler onnotasler added Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] labels Oct 3, 2023
@LeadSongDog

LeadSongDog commented Oct 3, 2023

Really, this should have been addressed long ago. Once a unique external ID such as ISBN or OCLCn has been furnished, the ImportBot ought not settle for just one repository’s record, but either select the most complete one available from a reliable library, or even better, fuse them together to fill in any blank fields. Certainly not a good plan to be stuck indefinitely with whatever little bit AMZ or BWB furnished.
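To make the "fuse them together" idea concrete, here is a minimal sketch under stated assumptions: each source record has already been reduced to a plain dictionary, the list is ordered from most to least trusted, and a field counts as blank when it is `None` or an empty string, list or dict. The function name and field names are hypothetical and do not describe how ImportBot works today.

```python
def is_blank(value):
    """Treat None and empty strings/lists/dicts as missing data."""
    return value is None or value == "" or value == [] or value == {}


def fuse_records(records):
    """Merge records in priority order.

    Earlier records win; later records only fill fields that are still blank.
    """
    fused = {}
    for record in records:
        for field, value in record.items():
            if is_blank(value):
                continue
            if is_blank(fused.get(field)):
                fused[field] = value
    return fused


if __name__ == "__main__":
    # Hypothetical records for the same edition from two sources.
    national_library = {"title": "Example title", "publisher": "", "pages": 120}
    bookseller = {"title": "EXAMPLE TITLE", "publisher": "Example Verlag"}
    print(fuse_records([national_library, bookseller]))
```

Selecting "the most complete record" instead would be a one-line variation: pick the record with the most non-blank fields rather than merging.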

@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] Theme: MARC records Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Oct 16, 2023
@Koenisegg484

Hi @hornc
I would like to work on this issue.
Could I get some pointers on how to start? This would be my first contribution.

@mekarpeles mekarpeles added Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Priority: 3 Issues that we can consider at our leisure. [managed] and removed Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] Priority: 2 Important, as time permits. [managed] labels Nov 6, 2023
@mekarpeles
Member

It seems like the ask is:
Ability to upload/submit a MARC record to Open Library

We have a pipeline for importing MARCs to Open Library, backed by Archive.org items which is described here:
https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#MARC-Records

Also, there is a MARC option in the openlibrary.org/api/import path...
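As a rough sketch of what a client call to that path might look like, the snippet below only builds the request and does not send it. The endpoint takes the record as the raw POST body and rejects callers without write permission; the `Content-Type` value and the way credentials are passed in are assumptions on my part, not documented behaviour, and `build_import_request` is a hypothetical helper.

```python
import urllib.request

IMPORT_URL = "https://openlibrary.org/api/import"


def build_import_request(record_bytes, extra_headers=None):
    """Build (but do not send) a POST request carrying one raw record.

    Assumption: the record is sent as the unmodified request body.
    Authentication headers or cookies must be supplied by the caller.
    """
    headers = {"Content-Type": "application/octet-stream"}  # assumed value
    headers.update(extra_headers or {})
    return urllib.request.Request(
        IMPORT_URL, data=record_bytes, headers=headers, method="POST"
    )


if __name__ == "__main__":
    request = build_import_request(b"<record>...</record>")
    print(request.get_method(), request.full_url)
    # Sending would be urllib.request.urlopen(request), which needs an
    # account with write permission; without one the server answers 403.
```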

This doesn't seem like a fantastic match for a first project by a community contributor. If someone did want to work through this, the solution would likely be...

To create a librarian-only UI where a contributor in the librarian permission group can upload a MARC record, which then gets submitted to our import process using the MARC format of parse_data:

class importapi:
    """/api/import endpoint for general data formats."""

    def error(self, error_code, error='Invalid item', **kwargs):
        content = {'success': False, 'error_code': error_code, 'error': error}
        content.update(kwargs)
        raise web.HTTPError('400 Bad Request', data=json.dumps(content))

    def POST(self):
        web.header('Content-Type', 'application/json')
        if not can_write():
            raise web.HTTPError('403 Forbidden')
        data = web.data()
        try:
            edition, format = parse_data(data)
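The excerpt stops before showing how the payload format is decided. Purely as an illustration, and not as a description of what `parse_data` actually does, a format check for an uploaded file could look like the sketch below. It relies on one fact about binary MARC (ISO 2709): the first five bytes of the record are its total length in ASCII digits. The function name and the returned labels are hypothetical.

```python
import json


def sniff_format(data: bytes) -> str:
    """Guess whether an uploaded payload is MARC XML, JSON or binary MARC."""
    stripped = data.lstrip()
    if stripped.startswith(b"<"):
        return "marcxml"
    if stripped.startswith(b"{"):
        json.loads(stripped)  # raises ValueError if it is not valid JSON
        return "json"
    # Binary MARC: a 24-byte leader whose first five bytes give the record length.
    if len(data) >= 24 and data[:5].isdigit():
        if int(data[:5].decode("ascii")) == len(data):
            return "marc_binary"
    raise ValueError("unrecognized import payload")


if __name__ == "__main__":
    fake_record = b"00026" + b"x" * 20 + b"\x1d"  # 26 bytes, not a real record
    print(sniff_format(fake_record))
    print(sniff_format(b'{"title": "Example title"}'))
```

A librarian upload form could run a check like this before submitting, so that an unsupported file is rejected with a clear message instead of a generic 400.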

@hornc
Collaborator

hornc commented Nov 6, 2023

I agree with @mekarpeles that this is probably a bit tricky for a first time contributor.

I had been meaning to respond with a summary of the two options mentioned above where we do have MARC imports already.

The bulk import process could be used to import a single record, but that's a bit fiddly and involves creating a new archive.org item. Depending on the source though, if MARC records are available publicly, there might be a way to import an entire collection rather than a few books one by one. Is that a possibility here?

The API should work to import a single record in one go, but I have not looked at this in a while. I don't think the single import API will store the MARC record anywhere, which is less useful than it could be. Open Library does not store MARC records; they are all on archive.org, either as single records stored on a scanned item or as part of a larger bulk-data MARC collection. Single MARC records without corresponding scans are not handled well, or at all (if I remember correctly).

The workaround has been to only import bulk collections, which gives many new books, and records the source.

Three options:

  1. Use the existing bulk import API because we can get more records from this source (I don't know if that's possible or better than the original request)
  2. Figure out how to use the existing API in a way that satisfies the request. The API is there, but is mostly unused.
  3. Implement a new librarian UI for the existing APIs, if the first two options aren't sufficient as is.

@onnotasler
Collaborator Author

> Depending on the source though, if MARC records are available publicly, there might be a way to import an entire collection rather than a few books one by one. Is that a possibility here?

The free MARC records I found were all limited to a single edition of a single work. With the tools and knowledge I have, I can only download and process one edition at a time. If it is possible to import the whole catalogue at once, that would definitely be better.

At least the Deutsche Nationalbibliothek has a "Bezugswege und Exportformate" (roughly: access channels and export formats) entry on their homepage, and they seem to offer their whole catalogue in several different file formats:

  • MARC21
    There are several files; to get the whole snapshot, one needs to download every file that ends with mrc.gz.
  • RDF
    There are many, many files in there; the one needed is "Stabiler Link auf den aktuellen Gesamtabzug" (roughly: permalink to the catalogue snapshot) in one of the formats offered.

They also offer a long list of formats and APIs, but I lack the technical expertise to comment on them.
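For whoever picks up the bulk route, here is a minimal sketch of walking through one of the mrc.gz files with the Python standard library. It assumes the file is a plain concatenation of binary MARC (ISO 2709) records, each starting with its own length as five ASCII digits; I have not verified this against the actual DNB files, and both function names are hypothetical.

```python
import gzip
import io


def iter_marc_records(stream):
    """Yield raw records from a binary stream of concatenated MARC records."""
    while True:
        head = stream.read(5)
        if not head:
            return  # clean end of file
        if len(head) < 5 or not head.isdigit():
            raise ValueError("record does not start with a 5-digit length")
        length = int(head.decode("ascii"))
        rest = stream.read(length - 5)
        if len(rest) != length - 5:
            raise ValueError("truncated record")
        yield head + rest


def iter_gzipped_marc(path_or_file):
    """Open a .mrc.gz file (path or file object) and yield its raw records."""
    with gzip.open(path_or_file, "rb") as handle:
        yield from iter_marc_records(handle)


if __name__ == "__main__":
    # Two fake 10- and 8-byte "records" stand in for real data.
    blob = gzip.compress(b"00010abcd\x1d" + b"00008xy\x1d")
    for raw in iter_gzipped_marc(io.BytesIO(blob)):
        print(len(raw), raw)
```

Reading record by record keeps memory use flat, which matters for a full catalogue snapshot.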

@hornc
Collaborator

hornc commented Nov 12, 2023

@onnotasler There's an issue for DNB data here: internetarchive/openlibrary-bots#29. I have prepared the data and made a start on importing. I stopped because of the various discussions about import data quality, and have not yet resumed importing. This is something I can turn back on again if there is demand.

@onnotasler
Collaborator Author

I do not insist on a MARC importer if I can instead get the books imported in bulk, but in that case we should implement a way to suggest sources for bulk data instead.
