
Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Open
tjouneau opened this issue Apr 5, 2022 · 18 comments · May be fixed by #10578
Labels
Feature: Harvesting · GREI 3 Search and Browse · NIH OTA DC Grant · NIH OTA: 1.4.1 · pm.epic.nih_harvesting · pm.GREI-d-1.4.1 · pm.GREI-d-1.4.2 · pm.GREI-d-2.4.1B · Size: 30 · Type: Bug

Comments

@tjouneau

tjouneau commented Apr 5, 2022

After version 5.4, things have improved regarding language mapping problems.
Some codes are still not handled; in the cases we encountered, frm (Middle French) and fro (Old French).
Would it be possible to include all codes in the Dataverse source?

What steps does it take to reproduce the issue?
Try to harvest from https://repository.ortolang.fr/api/oai/?verb=ListRecords&set=producer:atilf&metadataPrefix=oai_dc
6 datasets are not harvested, 4 due to language mapping issues.

What happens?
Mapping errors documented in the harvest log:
Exception processing getRecord(), oaiUrl=https://repository.ortolang.fr/api/oai, identifier=oai:ortolang.fr:0c2017f1-7c3b-473a-b75d-ad97b4e09bd0, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'fro' does not exist in type 'language')"
I'm attaching the server.log relevant extract and the harvest log.

harvest_ortolang3_2022-04-04T15-34-00.log
server.log

Which version of Dataverse are you using?
5.10

Any related open or closed issues to this bug report?

@landreev
Contributor

These language codes are part of the Citation metadata block, defined as valid controlled vocabulary values for the field "language". So strictly speaking, these values are not in the source code. If this is really urgent, you could fix it in your installation yourself, by adding the lines for "fro" and "frm" etc. to the standard citation.tsv, then update the metadata block (with curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv). But yes, we should go ahead and add these values to citation.tsv for everybody in the next release. We have other open issues where people are requesting more alternative ISO language codes to be added as valid values (such as "en" and "fr", in addition to "eng" and "fre", etc.). It would make sense to handle them all at once.
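For illustration only, the appended rows in the #controlledVocabulary section of citation.tsv might look something like the sketch below. The column layout shown (DatasetField, Value, identifier, displayOrder, then alternate values) follows the pattern of the existing block, but the identifiers and displayOrder numbers here are placeholders to adapt to your local file:

```tsv
#controlledVocabulary	DatasetField	Value	identifier	displayOrder
	language	Old French (842-ca. 1400)	fro	190	fro
	language	Middle French (ca. 1400-1600)	frm	191	frm
```

After appending rows like these, reloading the block with the curl command above makes the new codes valid values for the language field.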

@pdurbin
Member

pdurbin commented Apr 21, 2022

@mreekie added the pm.epic.nih_harvesting and NIH OTA DC Grant labels May 9, 2022
@mreekie updated the NIH OTA: 1.4.1 label Oct 25, 2022
@mreekie removed the NIH OTA: 1.4.1 label Nov 2, 2022
@mreekie

mreekie commented Dec 5, 2022

reference

People are requesting extra ISO language codes to be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates; different things are being requested in the issues below, but it makes sense to get all 3 out of the way at the same time:

Added back the label: NIH OTA: 1.4.1

Need to touch base with Leonid on this.

@mreekie added the NIH OTA: 1.4.1 label Dec 5, 2022
@qqmyers
Member

qqmyers commented Dec 5, 2022

I looked at this a while ago and am not sure I understand it all. However, FWIW: we have < 200 language codes today (ISO 639-2), and for ISO 639-3, 'As of 18 February 2021, the standard contains 7,893 entries'. If we simply cut/paste the new values, we will be making the list users have to scroll through ~40 times bigger. Further, 'ISO 639-3 is not a superset of ISO 639-2', and some languages will have both 639-2 and 639-3 codes. 639-3 also has some hierarchy, with macrolanguages that include sub-languages. We also may have to understand how to handle a mix of 639-2 and 639-3 codes for export (what do you do when a language has both codes?) and import. For import, I think our code will already look for aliases of a term, so I think we could accept imports in either standard without more work (it should be tested, though).
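The alias-based import matching described here can be illustrated with a small sketch (hypothetical names and data, not Dataverse's actual Java implementation): each controlled vocabulary value carries alternate codes, so an incoming code in either standard resolves to the same canonical value.

```python
# Hypothetical sketch of alias-based import matching; Dataverse's actual
# logic lives in its Java controlled-vocabulary handling.

# A tiny excerpt of a controlled vocabulary: canonical display value
# mapped to its known alternate codes (ISO 639-1 / 639-2 / 639-3).
VOCAB = {
    "French": {"fr", "fre", "fra"},
    "Old French (842-ca. 1400)": {"fro"},
    "Middle French (ca. 1400-1600)": {"frm"},
}

def resolve_language(code_or_value):
    """Return the canonical vocabulary value for a code or display value,
    or None if nothing matches (Dataverse would raise a
    ControlledVocabularyException in that case)."""
    needle = code_or_value.strip().lower()
    for value, aliases in VOCAB.items():
        if needle == value.lower() or needle in aliases:
            return value
    return None
```

With alternates like these in place, a harvested record declaring `fro` would map cleanly instead of failing with a ControlledVocabularyException.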

@mreekie

mreekie commented Jan 9, 2023

Review with Leonid

  • good candidate
  • Get this estimated and prioritized

@mreekie added the pm.GREI-d-1.4.1 and pm.GREI-d-1.4.2 labels Mar 20, 2023
@cmbz added the pm.GREI-d-2.4.1B label Jun 2, 2023
@pdurbin added the Type: Bug label Oct 9, 2023
@landreev
Contributor

I'm not sure what to do with this one.
Note the comment by Jim from a year ago, I'm not sure if the questions raised have been answered.
Also, there's a chance that closing #8243 (already prioritized and sized) may address everything here as well?
All in all, this one may need more of a review/discussion with the requestor.

@cmbz

cmbz commented Dec 19, 2023

2023/12/19: Requires additional conversation with @DS-INRA and @tjouneau to determine next steps. Note that this is primarily a metadata issue rather than a harvesting issue.

@DS-INRA
Member

DS-INRA commented Dec 20, 2023

Thanks for the ping and relaunching the discussion.
We opened this draft PR addressing #8243 with updated values, to be able to discuss it with the proposal at hand:

@pdurbin
Member

pdurbin commented Jan 5, 2024

This issue (#8578) is sprint ready but before anyone picks it up I think we should:

@landreev added the Size: 30 label Feb 12, 2024
@landreev
Contributor

The amount of work here will depend on assessing how many problems/issues there are that are not addressed by the combination of #8243 and #9992.
Put 33 on it just in case.

@stevenwinship
Contributor

@landreev Would this issue be solved with #10323?

@landreev
Contributor

landreev commented May 7, 2024

I'm going to change the title of the issue, since we've been de-facto planning to use this issue to figure out if we are going to, or how to offer support for the full ISO 639-3 list in general, and not just within the context of import, or specifically harvesting.

There are apparently real life instances where users do want to have the full ~8K extended list, as an actual controlled vocabulary (so, no, the option added in #10323 - allowing an instance to harvest non-CVV conforming values from other sources - while useful to some instances, is not going to solve the issue for everybody). Case in point: see the comments from a user in #10481.

There are good arguments against adding the full list to the metadata block that we distribute for everyone (see Jim's comment above). An external CV could be a solution. Or perhaps a standard mechanism for an optional CV "expansion pack" that an instance can choose to install.

@stevenwinship stevenwinship self-assigned this May 14, 2024
@stevenwinship
Contributor

I'm working on a solution that allows an admin to download the full ISO 639-3 list and directly load it into Dataverse via the same API that loads the TSV files. It merges the languages into the CV. I haven't seen any lag in the UI with the addition of 7,615 languages. I'd still like to test the loading of both 639-2 and 639-3, but if this solution is not acceptable then I won't waste the time, and will start looking at other options.
@landreev @pdurbin @qqmyers Any opinions?
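As a rough sketch of this kind of admin tooling (the SIL iso-639-3.tab column layout and the Dataverse-style output row shape are assumptions here, not the actual PR's code), the conversion step might look like:

```python
import csv
import io

# Hypothetical conversion sketch: turn rows of SIL's iso-639-3.tab download
# (assumed columns: Id, Part2B, Part2T, Part1, Scope, Language_Type,
# Ref_Name, Comment; header row already skipped) into controlled-vocabulary
# TSV rows of the shape the /api/admin/datasetfield/load endpoint ingests.
# The output row shape and displayOrder numbering are assumptions.

SAMPLE = (
    "fro\tfro\tfro\t\tI\tH\tOld French (842-ca. 1400)\t\n"
    "frm\tfrm\tfrm\t\tI\tH\tMiddle French (ca. 1400-1600)\t\n"
)

def to_cvv_rows(tab_text, field="language", start_order=200):
    """Build [blank, field, Value, identifier, displayOrder, *alternates] rows."""
    rows = []
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    for i, rec in enumerate(reader):
        iso3, part2b, part2t, part1 = rec[0], rec[1], rec[2], rec[3]
        name = rec[6]
        # every non-empty code becomes an alternate value for import matching
        alternates = sorted({c for c in (iso3, part2b, part2t, part1) if c})
        rows.append(["", field, name, iso3, str(start_order + i)] + alternates)
    return rows
```

A merge step would then have to reconcile these rows with the existing 639-2 entries so that languages present in both lists keep a single canonical value with all codes as alternates.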

@pdurbin
Member

pdurbin commented May 15, 2024

I don't have a strong opinion but I think @landreev was concerned about ~8k language entries in the database. If it's performant, maybe it's ok? 🤷

Using an external controlled vocabulary service might be an option as well, assuming it exists.

@landreev
Contributor

@stevenwinship @qqmyers @pdurbin
It wasn't me who first argued against adding all 8K values to the CVV list in the block. It may have been Jim (?), but I was convinced by the rationale presented back then.

However, I don't think this (straightforward and simple) solution should be considered completely off the table. If your experiments with the UI suggest that the performance, and the look-and-feel for the user, are not atrocious, reconsidering it should be up for debate.

It wasn't just the size though, I would suggest taking a close look at Jim's comment from 1.5 years ago and see if all the questions there have been answered.

Having taken a quick look, one serious unknown there is the hierarchy (the "macrolanguages" defined in ISO 639-3). But I'm wondering if the solution is... to just not worry about it, and handle them all as a flat list?
The fact that many languages have multiple codes (2- and 3-letter codes from the earlier 639-* parts; some have multiples of each) I don't think is a problem: these codes are added as "alternate values", and each CVV can have an unlimited number of those.
Similarly, I don't think metadata export is a problem either, because it is the main value ("English") that is used there, and however many extra alternate codes are defined ("en", "eng", ...) makes no difference.

@landreev landreev changed the title Remaining mapping problems when harvesting from a repository using ISO 639-3 language codes Figure out whether, or how to support the extended ISO 639-3 list of languages May 15, 2024
@landreev
Contributor

@stevenwinship @qqmyers @pdurbin
2 more things:

  • It is super important that however we modify this CV, we don't break anything compatibility-wise for any existing CVVs from the currently supported list; so let's make sure to have PR #10481 ("8243 improve language controlled vocab") merged before we do anything here.
  • If it turns out that adding the full list to our standard citation.tsv block is a practical option after all, I'll only be happy. But if we end up concluding that there are problems with that approach, I would consider the idea of a CV "extension pack": a version of the controlled vocabulary that's still "official" and maintained by the Dataverse project, but is optional to install, something that an instance admin can choose to do if that's something their users need.

@landreev
Contributor

@stevenwinship
I read your comment in a hurry, and actually missed the part where you seem to already be doing this as an "extension pack" of sorts, using the same API, but outside of the main Citation block update (?).

But, everything I said earlier still stands, I believe. That could be a potential model for distributing the CVV. Or, if we play with it and conclude that the UI is working fine with that full list - then we may just shove it into the distributed citation.tsv.

But I should also put it on record that this is the kind of issue that may not be super challenging technically, but will need more people involved to finalize any decisions. I can think of Julian, since metadata is his thing. I personally don't necessarily have a stake in it or any super strong opinions; I just got to work on #8243 recently, and ended up learning a lot about the ISO codes.

@qqmyers
Member

qqmyers commented May 15, 2024

I'd suggest not having a one-off mechanism for language. If the UI works with 8K items, adding them to the block seems OK. If we need to deal with the hierarchy in the UI, probably the easiest and most SPA-ready option would be to use the external vocabulary mechanism and JavaScript. It would be nice to allow people who only want to use the -2 list to do so; I'm not sure how to do that with one block, but the external mechanism could be configurable (either a flag in the script or two scripts). If there isn't an online service to ping, the external mechanism can just have the script include the static list.

I definitely second having UX discussion on this, especially if there won't be a choice to stay with the shorter list. I think it is also worth investigating whether the use of aliases is enough to make export and harvesting work as expected. I.e. are -3 values OK in DDI, DataCite, etc. for the people who use those today. Do we need to use the -2 version where possible for some users? As @landreev said, these are more UX/user questions than technical ones.

@cmbz cmbz added the GREI 3 Search and Browse label May 22, 2024
@stevenwinship stevenwinship linked a pull request May 22, 2024 that will close this issue
@stevenwinship stevenwinship removed their assignment May 22, 2024