
Support the full ISO 639-3 list of languages #10578

Closed
wants to merge 17 commits into develop from 8578-support-extended-iso-639-languages

Conversation

stevenwinship
Contributor

@stevenwinship stevenwinship commented May 22, 2024

What this PR does / why we need it: Some language codes are still not supported; the cases encountered were frm (Medieval French) and fro (Old French).

Which issue(s) this PR closes: #8578

Closes #8578

Special notes for your reviewer:

Suggestions on how to test this: See setup instructions in ISO639IT.java test

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No, only new languages in the list of languages.

Is there a release notes update needed for this change?: Yes, to be included in this PR.

Additional documentation:

@stevenwinship stevenwinship self-assigned this May 22, 2024
@stevenwinship stevenwinship added labels: Feature: Harvesting, pm.GREI-d-1.4.1, pm.GREI-d-1.4.2, pm.GREI-d-2.4.1B, GREI 3 Search and Browse, pm.epic.nih_harvesting, NIH OTA: 1.4.1, Size: 50, Type: Bug on May 22, 2024
@stevenwinship stevenwinship changed the title support ISO 639 languages Support ISO 639 languages May 22, 2024
@coveralls

coveralls commented May 22, 2024

Coverage Status

coverage: 20.726% (-0.02%) from 20.741% when pulling 820ff33 on 8578-support-extended-iso-639-languages into 0d27957 on develop.


@stevenwinship stevenwinship removed their assignment May 23, 2024


Member

@pdurbin pdurbin left a comment

Just some high-level, doc, and test questions.

@stevenwinship stevenwinship self-assigned this May 24, 2024
@stevenwinship stevenwinship force-pushed the 8578-support-extended-iso-639-languages branch from 0d8e44d to e3bfe6c on May 29, 2024 18:11


@@ -0,0 +1,12 @@
The Controlled Vocabulary Values list for the metadata field Language in the Citation block has been extended.
Roughly 300 ISO 639 languages added.
Member

Why only 300? Where did that subset come from? Are these 639-2 codes (guessing from zho which is in 639-2 and as far as I can see not in 639-3)?

Contributor Author

The list came from the Library of Congress. Yes, they are 639-2 codes. Once I merged all the codes and removed the duplicates, the list increased by ~300.

Contributor

The PR says that it "closes #8578". That issue is specifically for figuring out how to support the ISO 639-3 list (thousands of languages). All the recent discussion in that issue was centered around that - whether it was feasible to just add that many to the CVV, whether the pulldown menu was still going to be usable in the UI, and/or whether we wanted to consider using an external CV instead, etc.
I can understand it if we decide to handle this via incremental improvements, and first extend the list to the full 639-2 before proceeding with the 639-3. But I'm not seeing any discussion of this approach either in #8578 or in this PR. And, at the very least, I don't think this PR should be closing that issue.

Please note that there is real interest in the full ISO 639-3 support, see, for example, #10481, where a user is specifically requesting it.

Contributor Author

I can pull this feature back and load the full 639-3 languages. The file seems like a lot, but if I remember correctly it's one line per language and we combine the language codes, so the final list would be shorter. Let me take another look at this.
Moving back to in progress.
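
For context, here is a minimal, hypothetical sketch of what reading the LoC-distributed iso-639-3.tab file could look like, assuming the standard column layout (Id, Part2B, Part2T, Part1, Scope, Language_Type, Ref_Name, Comment). The class and record names below are illustrations, not the code in this PR.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reader for the LoC-distributed iso-639-3.tab file: one tab-separated
// line per language, with (assumed) columns Id, Part2B, Part2T, Part1, Scope,
// Language_Type, Ref_Name, Comment.
public class Iso639TabReader {

    public record LanguageEntry(String refName, Set<String> codes) {}

    public static List<LanguageEntry> read(Path tabFile) throws IOException {
        List<LanguageEntry> entries = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(tabFile)) {
            reader.readLine(); // skip the column header line
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t", -1); // keep empty trailing columns
                if (cols.length < 7) {
                    continue; // skip malformed lines
                }
                // Combine the 639-3, 639-2/B, 639-2/T and 639-1 codes into one set,
                // dropping blanks and duplicates, so each language becomes a single entry.
                Set<String> codes = new LinkedHashSet<>();
                for (String code : new String[] {cols[0], cols[1], cols[2], cols[3]}) {
                    if (!code.isBlank()) {
                        codes.add(code);
                    }
                }
                entries.add(new LanguageEntry(cols[6], codes));
            }
        }
        return entries;
    }
}

Combining the four code columns per line is what keeps the final list to one entry per language rather than one per code.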

@landreev landreev self-requested a review July 22, 2024 14:14
@stevenwinship stevenwinship removed their assignment Jul 30, 2024


@landreev landreev self-assigned this Jul 31, 2024

To be added to the 6.4 release instructions:

Update the Citation block, to incorporate the additional controlled vocabulary for languages:
Member

This should be in the guides somewhere as well, not just in a release note.

Contributor Author

added more


@landreev
Contributor

landreev commented Aug 2, 2024

@stevenwinship There are 3 extra .tab files in scripts/api/data/metadatablocks/iso-639-3_Code_Tables_20240415 (iso-639-3-macrolanguages.tab etc.). Are these checked in for reference purposes? (or, are they checked in on purpose?)

@stevenwinship
Contributor Author

@stevenwinship There are 3 extra .tab files in scripts/api/data/metadatablocks/iso-639-3_Code_Tables_20240415 (iso-639-3-macrolanguages.tab etc.). Are these checked in for reference purposes? (or, are they checked in on purpose?)

They were part of the original zip file. I don't think we need them, so I'll delete them.


github-actions bot commented Aug 2, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8578-support-extended-iso-639-languages
ghcr.io/gdcc/configbaker:8578-support-extended-iso-639-languages

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Contributor

@landreev landreev left a comment

Glad to hear that the extended list of languages does not affect the performance of the UI.
I am ok with/like the idea of not adding the entire CV to the citation block .tsv as distributed, and instead treating it as an "extension pack" that can be added optionally.
As for the implementation, specifically the fact that extending this CV is handled in a very unique and customized way (using the Lib. of Congress tabular file format instead of our normal .tsv format; and adding a dedicated API for extending just this specific CV), I am ok with it, I think. (I can actually see the advantage in being able to use the LoC-distributed list directly whenever they update it going forward). However, I do have a reputation within this team for being fairly cavalier in my willingness to adopt non-standard solutions and hacks. So I would like to ask for a second opinion. @qqmyers What do you think?

@landreev landreev requested a review from qqmyers August 5, 2024 14:37
@landreev landreev changed the title Support ISO 639 languages Support the full ISO 639-3 list of languages Aug 5, 2024
@qqmyers
Member

qqmyers commented Aug 5, 2024

Honestly, I'm not a fan of having a new mechanism and an API call dedicated to parsing this particular language-specific file format, particularly since it relies on one citation.properties file that has to cover both (plus the language-specific properties file variants). I also think the current code has problems with the implementation of the merge of existing and new entries (see comments). Some of those are probably fixable in code (not sure about all - see the Pular example), but they wouldn't exist if we didn't try an auto-merge.

It may also be a pain when the citation block is updated. If that happens and you reload the block (say due to some non-language change to some unrelated metadata field), reloading will restore just the alternates from the block and drop any from the ISO file. So unless we also add the step of using the new API to update from the ISO file to the release notes, we'd have unintended changes.

I do like the idea of this being optional/not a change to the citation block that everyone has to adopt, but I'm not sure doing it with a separate API and merging is worth it (versus, for example, a Javascript that might allow filtering to common/complete lists for selection).

Member

@qqmyers qqmyers left a comment

See inline comments.

cvv = datasetFieldService.findControlledVocabularyValueByDatasetFieldTypeAndStrValue(dsv, codesIterator.next(), true);
}

// if it is found we need to merge the alternate codes since the next step will delete all existing alternate codes before adding the new ones
Member

With entries in the citation block, the code automatically deletes old alternate values and just uses the new list. Is there a reason that this code should try to preserve old alternates? Is this to keep the additional non-code alternates we have in some cases (like Haitian and Haitian Creole that we have for hat, because we have the name "Haitian, Haitian Creole")? Looks like this could be a problem for cases like ful, where we have Pular as one of the alternates whereas Pular is a separate entry as fuf in the ISO file.

I'm also confused - it looks like this code would look for existing cvv entries that are alternates of an entry in the ISO file and then add its alternates as more alternates of the new ISO entry, w/o deleting the original cvv entry? Are there any such cases?

Contributor Author

The merge does not remove the original codes, since they may not be in the new file. This is why the original alts get saved and merged into the new list.

Member

OK - so this is a problem for Pular for example.
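
To make the Pular concern concrete, here is an illustrative-only sketch of the auto-merge in question, using plain strings instead of the actual ControlledVocabularyValue entities; the alternate lists shown are assumptions for illustration, not the exact citation.tsv contents.

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative-only: a blind union of existing and ISO-file alternates.
public class AlternateMergeExample {

    static Set<String> mergeAlternates(Set<String> existingAlternates, Set<String> isoAlternates) {
        Set<String> merged = new LinkedHashSet<>(existingAlternates);
        merged.addAll(isoAlternates);
        return merged;
    }

    public static void main(String[] args) {
        // Assumed alternates for the existing citation block entry ful, including "Pular".
        Set<String> existingFul = Set.of("ff", "Fulah", "Pular");
        // Assumed alternates derived from the ISO 639-3 line for ful.
        Set<String> isoFul = Set.of("ff", "Fulah");

        // The merge keeps "Pular" attached to ful...
        System.out.println("ful alternates after merge: " + mergeAlternates(existingFul, isoFul));

        // ...while the ISO file also lists Pular as a separate language, fuf,
        // so the same name now points at two different entries.
        Map<String, String> isoEntries = Map.of("ful", "Fulah", "fuf", "Pular");
        System.out.println("separate ISO entry: fuf -> " + isoEntries.get("fuf"));
    }
}

A blind union keeps "Pular" as an alternate of ful even though the ISO file introduces fuf (Pular) as its own entry, which is the collision noted above.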

String line = null;
String splitBy = "\t";
int lineNumber = 0;
int offset = 200; // number of existing languages // TODO: get the number from db
Member

There aren't 200 entries in the current develop branch citation.tsv.

Contributor Author

I think there are 187; this just gives wiggle room. Also, the display order is actually redundant if you always want to order the list alphabetically. A simple SQL script can be written to re-order the list.

Member

FWIW: Although I don't think it is necessary given the current code, we've so far tried to keep these in sequence w/o gaps. I still think the logic has a problem though - if the ISO file adds a line in the middle, you'll try to add the new entry at offset + line number, and there will already be an entry with that displayOrder number which won't change, so two will have the same displayOrder.

// Now call parseControlledVocabulary to create/update the Controlled Vocabulary Language
// values: unused, type, displayName, identifier, display order, alt codes...
String displayName = cvv != null ? cvv.getStrValue() : name;
String displayOrder = String.valueOf(cvv != null ? cvv.getDisplayOrder(): lineNumber + offset);
Member

This doesn't seem to give a linear order from 0 to max. Suppose the first 200 match existing entries and get display order 0-199. The next entry gets a displayOrder of 200 (offset) + 200 (line number).

Also - how would this work with future changes? I.e., if the standard changes to split one language into two, the new entry will end up being last (because there are matching entries for everything else in the db) - is that desirable? (Not so sure displayOrder is so important with 8K entries, but it seems like the approach here will result in out-of-alphabetical-order entries over time.)

Contributor Author

It's hard to set the display order to remain alphabetical, especially if the Dataverse installation isn't using English. There are 180-ish languages in the original tsv file. An SQL script could be created to re-order them, forcing the "Not applicable" entry last. The ISO 639-3 file actually has a "zxx No linguistic content" entry at the end. It really isn't that noticeable in the UI since it's filtered on the partial string you enter. I did reorder the tab file from the LoC since that too was ordered by the code and not the display string. I thought it was best to not mess with the display order since that could be set by the admins of the installation. Maybe it would be better to bump up the offset to 1000.

Member

The ISO file itself seemed to be alphabetical. After one run of the code here, it looks like all of the current citation.tsv entries come first and then everything from the ISO file is in the order from that file. If there are changes in the ISO file, though, the current code won't maintain its internal order (either a duplicate displayOrder number as is, or the new entry will come after all existing entries if you update the code to query the db for the number of entries (or max displayOrder) as in the todo). Having the display order depend on when you run the API, or on whether you ran it for intermediate changes in the ISO file, doesn't seem optimal. (Versus, for example, having a separate script to generate an always-in-sequence version of what would go in the citation.tsv file, or even just using only the line number from the ISO file as the displayOrder (or do we have citation.tsv entries that are not in the ISO file?).)
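
As an illustration of the alternative mentioned in the TODO (querying the database rather than using a fixed offset), here is a hypothetical sketch in which existing entries keep their displayOrder and new entries are appended after the current maximum, so no two entries collide; the codes and numbers are made up for the example.

import java.util.List;

// Illustrative-only sketch: existing entries keep their current displayOrder,
// while genuinely new entries are appended after the current maximum instead
// of at a fixed offset + line number, so no two entries share a displayOrder.
public class DisplayOrderExample {

    record Entry(String code, Integer existingDisplayOrder) {}

    public static void main(String[] args) {
        int maxDisplayOrder = 186; // assumed: ~187 entries already in citation.tsv (0-186)
        List<Entry> isoEntries = List.of(
                new Entry("fra", 40),   // already present: keeps its order (40 is made up)
                new Entry("frm", null), // new: appended after the current max
                new Entry("fro", null));

        for (Entry e : isoEntries) {
            int order;
            if (e.existingDisplayOrder() != null) {
                order = e.existingDisplayOrder();
            } else {
                maxDisplayOrder = maxDisplayOrder + 1;
                order = maxDisplayOrder;
            }
            System.out.println(e.code() + " -> displayOrder " + order);
        }
    }
}

As noted above, this still leaves later additions at the end rather than in alphabetical position, which is the trade-off of any append-only displayOrder scheme.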

@landreev
Contributor

landreev commented Aug 7, 2024

In light of having confirmed that having the full list in the CV does not make the UI unusable, I would at least consider adding it in full to citation.tsv (and just giving up on the optional aspect of the expansion). My $0.02.

@qqmyers
Member

qqmyers commented Aug 14, 2024

This is being replaced by #10762, so this should close.

@qqmyers qqmyers closed this Aug 14, 2024
landreev added a commit that referenced this pull request Sep 4, 2024
@stevenwinship stevenwinship deleted the 8578-support-extended-iso-639-languages branch September 4, 2024 19:45