
Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Open
tjouneau opened this issue Apr 5, 2022 · 18 comments · May be fixed by #10578
Labels
Feature: Harvesting · GREI 3 Search and Browse · NIH OTA DC Grant · NIH OTA: 1.4.1 · pm.epic.nih_harvesting · pm.GREI-d-1.4.1 · pm.GREI-d-1.4.2 · pm.GREI-d-2.4.1B · Size: 30 · Type: Bug

Comments

@tjouneau

tjouneau commented Apr 5, 2022

After version 5.4, things have improved regarding language mapping problems.
Some codes are still not handled; in the cases we encountered, frm (Middle French) and fro (Old French).
Would it be possible to include all codes in the Dataverse source?

What steps does it take to reproduce the issue?
Try to harvest from https://repository.ortolang.fr/api/oai/?verb=ListRecords&set=producer:atilf&metadataPrefix=oai_dc
6 datasets are not harvested, 4 due to language mapping issues.

What happens?
Mapping errors documented in the harvest log:
Exception processing getRecord(), oaiUrl=https://repository.ortolang.fr/api/oai, identifier=oai:ortolang.fr:0c2017f1-7c3b-473a-b75d-ad97b4e09bd0, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'fro' does not exist in type 'language')"
I'm attaching the server.log relevant extract and the harvest log.

harvest_ortolang3_2022-04-04T15-34-00.log
server.log

Which version of Dataverse are you using?
5.10

Any related open or closed issues to this bug report?

@landreev
Contributor

These language codes are part of the Citation metadata block, defined as valid controlled vocabulary values for the field "language". So strictly speaking, these values are not in the source code. If this is really urgent, you could fix it in your installation yourself, by adding the lines for "fro" and "frm" etc. to the standard citation.tsv, then update the metadata block (with curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv). But yes, we should go ahead and add these values to citation.tsv for everybody in the next release. We have other open issues where people are requesting more alternative ISO language codes to be added as valid values (such as "en" and "fr", in addition to "eng" and "fre", etc.). It would make sense to handle them all at once.
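For illustration only, the appended rows in the #controlledVocabulary section of citation.tsv might look something like the sketch below. The column layout shown (DatasetField, Value, identifier, displayOrder, then alternate values) follows the pattern of the existing block, but the identifiers and displayOrder numbers here are placeholders to adapt to your local file:

```tsv
#controlledVocabulary	DatasetField	Value	identifier	displayOrder
	language	Old French (842-ca. 1400)	fro	190	fro
	language	Middle French (ca. 1400-1600)	frm	191	frm
```

After appending rows like these, reloading the block with the curl command above makes the new codes valid values for the language field.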

@pdurbin
Member

pdurbin commented Apr 21, 2022

@mreekie added the pm.epic.nih_harvesting and NIH OTA DC Grant labels May 9, 2022
@mreekie updated the NIH OTA: 1.4.1 label Oct 25, 2022
@mreekie removed the NIH OTA: 1.4.1 label Nov 2, 2022
@mreekie

mreekie commented Dec 5, 2022

reference

People are requesting extra ISO language codes to be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates; different things are being requested in the issues below, but it makes sense to get all 3 out of the way at the same time:

Added back the label: NIH OTA: 1.4.1

Need to touch base with Leonid on this.

@mreekie added the NIH OTA: 1.4.1 label Dec 5, 2022
@qqmyers
Member

qqmyers commented Dec 5, 2022

I looked at this a while ago and am not sure I understand it all. However, FWIW: we have < 200 language codes today (ISO 639-2), and for ISO 639-3, 'As of 18 February 2021, the standard contains 7,893 entries'. If we simply cut/paste the new values, we will be making the list users have to scroll through ~40 times bigger. Further, 'ISO 639-3 is not a superset of ISO 639-2', and some languages will have both 639-2 and 639-3 codes. 639-3 also has some hierarchy, with macrolanguages that include sub-languages. We also may have to understand how to handle a mix of 639-2 and 639-3 codes for export (what do you do when a language has both codes?) and import. For import, I think our code will already look for aliases of a term, so I think we could accept imports in either standard without more work (it should be tested, though).
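The alias-based import matching described here can be illustrated with a small sketch (hypothetical names and data, not Dataverse's actual Java implementation): each controlled vocabulary value carries alternate codes, so an incoming code in either standard resolves to the same canonical value.

```python
# Hypothetical sketch of alias-based import matching; Dataverse's actual
# logic lives in its Java controlled-vocabulary handling.

# A tiny excerpt of a controlled vocabulary: canonical display value
# mapped to its known alternate codes (ISO 639-1 / 639-2 / 639-3).
VOCAB = {
    "French": {"fr", "fre", "fra"},
    "Old French (842-ca. 1400)": {"fro"},
    "Middle French (ca. 1400-1600)": {"frm"},
}

def resolve_language(code_or_value):
    """Return the canonical vocabulary value for a code or display value,
    or None if nothing matches (Dataverse would raise a
    ControlledVocabularyException in that case)."""
    needle = code_or_value.strip().lower()
    for value, aliases in VOCAB.items():
        if needle == value.lower() or needle in aliases:
            return value
    return None
```

With alternates like these in place, a harvested record declaring `fro` would map cleanly instead of failing with a ControlledVocabularyException.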

@mreekie

mreekie commented Jan 9, 2023

Review with Leonid

  • good candidate
  • Get this estimated and prioritized

@mreekie added the pm.GREI-d-1.4.1 and pm.GREI-d-1.4.2 labels Mar 20, 2023
@cmbz added the pm.GREI-d-2.4.1B label Jun 2, 2023
@pdurbin added the Type: Bug label Oct 9, 2023
@landreev
Contributor

I'm not sure what to do with this one.
Note the comment by Jim from a year ago, I'm not sure if the questions raised have been answered.
Also, there's a chance that closing #8243 (already prioritized and sized) may address everything here as well?
All in all, this one may need more of a review/discussion with the requestor.

@cmbz

cmbz commented Dec 19, 2023

2023/12/19: Requires additional conversation with @DS-INRA and @tjouneau to determine next steps. Note that this is primarily a metadata issue rather than a harvesting issue.

@DS-INRA
Member

DS-INRA commented Dec 20, 2023

Thanks for the ping and relaunching the discussion.
We opened this draft PR addressing #8243 with updated values, to be able to discuss it with the proposal at hand:

@pdurbin
Member

pdurbin commented Jan 5, 2024

This issue (#8578) is sprint ready but before anyone picks it up I think we should:

@landreev added the Size: 30 label Feb 12, 2024
@landreev
Contributor

The amount of work here will depend on assessing how many problems/issues there are that are not addressed by the combination of #8243 and #9992.
Put 33 on it just in case.

@stevenwinship
Contributor

@landreev Would this issue be solved with #10323?

@landreev
Contributor

landreev commented May 7, 2024

I'm going to change the title of the issue, since we've been de-facto planning to use this issue to figure out if we are going to, or how to offer support for the full ISO 639-3 list in general, and not just within the context of import, or specifically harvesting.

There are apparently real life instances where users do want to have the full ~8K extended list, as an actual controlled vocabulary (so, no, the option added in #10323 - allowing an instance to harvest non-CVV conforming values from other sources - while useful to some instances, is not going to solve the issue for everybody). Case in point: see the comments from a user in #10481.

There are good arguments against adding the full list to the metadata block that we distribute for everyone (see Jim's comment above). An external CV could be a solution. Or perhaps a standard mechanism for an optional CV "expansion pack" that an instance can choose to install.

@stevenwinship stevenwinship self-assigned this May 14, 2024
@stevenwinship
Contributor

I'm working on a solution that allows an admin to download the full ISO 639-3 list and directly load it into Dataverse via the same API that loads the TSV files. It merges the languages into the CV. I haven't seen any lag in the UI with the addition of 7,615 languages. I'd still like to test the loading of both 639-2 and 639-3, but if this solution is not acceptable then I won't waste the time, and will start looking at other options.
@landreev @pdurbin @qqmyers Any opinions?
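As a rough sketch of this kind of admin tooling (the SIL iso-639-3.tab column layout and the Dataverse-style output row shape are assumptions here, not the actual PR's code), the conversion step might look like:

```python
import csv
import io

# Hypothetical conversion sketch: turn rows of SIL's iso-639-3.tab download
# (assumed columns: Id, Part2B, Part2T, Part1, Scope, Language_Type,
# Ref_Name, Comment; header row already skipped) into controlled-vocabulary
# TSV rows of the shape the /api/admin/datasetfield/load endpoint ingests.
# The output row shape and displayOrder numbering are assumptions.

SAMPLE = (
    "fro\tfro\tfro\t\tI\tH\tOld French (842-ca. 1400)\t\n"
    "frm\tfrm\tfrm\t\tI\tH\tMiddle French (ca. 1400-1600)\t\n"
)

def to_cvv_rows(tab_text, field="language", start_order=200):
    """Build [blank, field, Value, identifier, displayOrder, *alternates] rows."""
    rows = []
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    for i, rec in enumerate(reader):
        iso3, part2b, part2t, part1 = rec[0], rec[1], rec[2], rec[3]
        name = rec[6]
        # every non-empty code becomes an alternate value for import matching
        alternates = sorted({c for c in (iso3, part2b, part2t, part1) if c})
        rows.append(["", field, name, iso3, str(start_order + i)] + alternates)
    return rows
```

A merge step would then have to reconcile these rows with the existing 639-2 entries so that languages present in both lists keep a single canonical value with all codes as alternates.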

@pdurbin
Member

pdurbin commented May 15, 2024

I don't have a strong opinion but I think @landreev was concerned about ~8k language entries in the database. If it's performant, maybe it's ok? 🤷

Using an external controlled vocabulary service might be an option as well, assuming it exists.

@landreev
Contributor

@stevenwinship @qqmyers @pdurbin
It wasn't me who first argued against adding all 8K values to the CVV list in the block. It may have been Jim (?), but I was convinced by the rationale presented back then.

However, I don't think this (straightforward and simple) solution should be considered completely off the table. If your experiments with the UI suggest that the performance, and the look-and-feel for the user, are not atrocious, reconsidering it should be up for debate.

It wasn't just the size though, I would suggest taking a close look at Jim's comment from 1.5 years ago and see if all the questions there have been answered.

Having taken a quick look, one serious unknown there is the hierarchy (the "macrolanguages" defined in ISO 639-3). But I'm wondering if the solution is... to just not worry about it, and handle them all as a flat list?
The fact that many languages have multiple codes (2- and 3-letter codes from the earlier 639-* parts; some have multiples of each) I don't think is a problem: these codes are added as "alternate values", and each CVV can have an unlimited number of those.
Similarly, I don't think metadata export is a problem either, because it is the main value ("English") that is used there, and however many extra alternate codes are defined ("en", "eng", ...) makes no difference.

@landreev landreev changed the title Remaining mapping problems when harvesting from a repository using ISO 639-3 language codes Figure out whether, or how to support the extended ISO 639-3 list of languages May 15, 2024
@landreev
Contributor

@stevenwinship @qqmyers @pdurbin
2 more things:

  • It is super important that however we modify this CV, we don't break anything compatibility-wise for any existing CVVs from the currently supported list; so let's make sure to have PR #10481 ("8243 improve language controlled vocab") merged before we do anything here.
  • If it turns out that adding the full list to our standard citation.tsv block is a practical option after all, I'll only be happy. But if we end up concluding that there are problems with that approach, I would consider the idea of a CV "extension pack": a version of the controlled vocabulary that's still "official" and maintained by the Dataverse project, but is optional to install, something that an instance admin can choose to do if that's something their users need.

@landreev
Contributor

@stevenwinship
I read your comment in a hurry, and actually missed the part where you seem to already be doing this as an "extension pack" of sorts, using the same API, but outside of the main Citation block update (?).

But, everything I said earlier still stands, I believe. That could be a potential model for distributing the CVV. Or, if we play with it and conclude that the UI is working fine with that full list - then we may just shove it into the distributed citation.tsv.

But I should also put it on record that this is the kind of issue that may not be super challenging technically, but will need more people involved to finalize any decisions. I can think of Julian, since metadata is his thing. I personally don't necessarily have a stake in it or any super strong opinions; I just got to work on #8243 recently, and ended up learning a lot about the ISO codes.

@qqmyers
Member

qqmyers commented May 15, 2024

I'd suggest not having a one-off mechanism for language. If the UI works with 8K items, adding them to the block seems OK. If we need to deal with the hierarchy in the UI, probably the easiest and most SPA-ready option would be to use the external vocabulary mechanism and JavaScript. It would be nice to allow people who only want to use the -2 list to do so; I'm not sure how to do that with one block, but the external mechanism could be configurable (either a flag in the script or two scripts). If there isn't an online service to ping, the external mechanism can just have the script include the static list.

I definitely second having UX discussion on this, especially if there won't be a choice to stay with the shorter list. I think it is also worth investigating whether the use of aliases is enough to make export and harvesting work as expected. I.e. are -3 values OK in DDI, DataCite, etc. for the people who use those today. Do we need to use the -2 version where possible for some users? As @landreev said, these are more UX/user questions than technical ones.

@cmbz cmbz added the GREI 3 Search and Browse label May 22, 2024
@stevenwinship stevenwinship linked a pull request May 22, 2024 that will close this issue
@stevenwinship stevenwinship removed their assignment May 22, 2024