
Support the full ISO 639-3 list of languages #10578

Closed
wants to merge 17 commits into develop from 8578-support-extended-iso-639-languages

Conversation

stevenwinship
Contributor

@stevenwinship stevenwinship commented May 22, 2024

What this PR does / why we need it: Some language codes are still not supported; the cases encountered were frm (Medieval French) and fro (Old French).

Which issue(s) this PR closes: #8578

Closes #8578

Special notes for your reviewer:

Suggestions on how to test this: See setup instructions in ISO639IT.java test

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No, only new languages in the list of languages.

Is there a release notes update needed for this change?: Yes, to be included in this PR.

Additional documentation:

@stevenwinship stevenwinship self-assigned this May 22, 2024
@stevenwinship stevenwinship added labels: Feature: Harvesting, pm.GREI-d-1.4.1, pm.GREI-d-1.4.2, pm.GREI-d-2.4.1B, GREI 3 Search and Browse, pm.epic.nih_harvesting, NIH OTA: 1.4.1, Size: 50, Type: Bug on May 22, 2024
@stevenwinship stevenwinship changed the title support ISO 639 languages Support ISO 639 languages May 22, 2024
@coveralls

coveralls commented May 22, 2024

Coverage Status

coverage: 20.726% (-0.02%) from 20.741% when pulling 820ff33 on 8578-support-extended-iso-639-languages into 0d27957 on develop.


@stevenwinship stevenwinship removed their assignment May 23, 2024


Member

@pdurbin pdurbin left a comment

Just some high-level, doc, and test questions.

@stevenwinship stevenwinship self-assigned this May 24, 2024
@stevenwinship stevenwinship force-pushed the 8578-support-extended-iso-639-languages branch from 0d8e44d to e3bfe6c on May 29, 2024 18:11


@@ -0,0 +1,12 @@
The Controlled Vocabulary Values list for the metadata field Language in the Citation block has been extended.
Roughly 300 ISO 639 languages added.
Member

Why only 300? Where did that subset come from? Are these 639-2 codes (guessing from zho which is in 639-2 and as far as I can see not in 639-3)?

Contributor Author

The list came from the Library of Congress. Yes, they are 639-2 codes. Once I merged all the codes and removed the duplicates, the list increased by ~300.

Contributor

The PR says that it "closes #8578". That issue is specifically for figuring out how to support the ISO 639-3 list (thousands of languages). All the recent discussion in that issue was centered around that - whether it was feasible to just add that many to the CVV, whether the pulldown menu was still going to be usable in the UI, and/or whether we wanted to consider using an external CV instead, etc.
I can understand it if we decide to handle this via incremental improvements, and first extend the list to the full 639-2 before proceeding with the 639-3. But I'm not seeing any discussion of this approach either in #8578 or in this PR. And, at the very least, I don't think this PR should be closing that issue.

Please note that there is real interest in the full ISO 639-3 support, see, for example, #10481, where a user is specifically requesting it.

Contributor Author

I can pull this feature back and load the full 639-3 languages. The file seems like a lot, but if I remember correctly it's one line per language and we combine the language codes, so the final list would be shorter. Let me take another look at this.
Moving back to in progress.
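
For context, here is a minimal, hypothetical sketch of what reading the LoC-distributed iso-639-3.tab file could look like, assuming the standard column layout (Id, Part2B, Part2T, Part1, Scope, Language_Type, Ref_Name, Comment). The class and record names below are illustrations, not the code in this PR.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reader for the LoC-distributed iso-639-3.tab file: one tab-separated
// line per language, with (assumed) columns Id, Part2B, Part2T, Part1, Scope,
// Language_Type, Ref_Name, Comment.
public class Iso639TabReader {

    public record LanguageEntry(String refName, Set<String> codes) {}

    public static List<LanguageEntry> read(Path tabFile) throws IOException {
        List<LanguageEntry> entries = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(tabFile)) {
            reader.readLine(); // skip the column header line
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t", -1); // keep empty trailing columns
                if (cols.length < 7) {
                    continue; // skip malformed lines
                }
                // Combine the 639-3, 639-2/B, 639-2/T and 639-1 codes into one set,
                // dropping blanks and duplicates, so each language becomes a single entry.
                Set<String> codes = new LinkedHashSet<>();
                for (String code : new String[] {cols[0], cols[1], cols[2], cols[3]}) {
                    if (!code.isBlank()) {
                        codes.add(code);
                    }
                }
                entries.add(new LanguageEntry(cols[6], codes));
            }
        }
        return entries;
    }
}

Combining the four code columns per line is what keeps the final list to one entry per language rather than one per code.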

@landreev landreev self-requested a review July 22, 2024 14:14
@stevenwinship stevenwinship removed their assignment Jul 30, 2024


@landreev landreev self-assigned this Jul 31, 2024

To be added to the 6.4 release instructions:

Update the Citation block, to incorporate the additional controlled vocabulary for languages:
Member

This should be in the guides somewhere as well, not just in a release note.

Contributor Author

added more


@landreev
Contributor

landreev commented Aug 2, 2024

@stevenwinship There are 3 extra .tab files in scripts/api/data/metadatablocks/iso-639-3_Code_Tables_20240415 (iso-639-3-macrolanguages.tab etc.). Are these checked in for reference purposes? (or, are they checked in on purpose?)

@stevenwinship
Contributor Author

@stevenwinship There are 3 extra .tab files in scripts/api/data/metadatablocks/iso-639-3_Code_Tables_20240415 (iso-639-3-macrolanguages.tab etc.). Are these checked in for reference purposes? (or, are they checked in on purpose?)

They were part of the original zip file. I don't think we need them, so I'll delete them.


github-actions bot commented Aug 2, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8578-support-extended-iso-639-languages
ghcr.io/gdcc/configbaker:8578-support-extended-iso-639-languages

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Contributor

@landreev landreev left a comment

Glad to hear that the extended list of languages does not affect the performance of the UI.
I am ok with/like the idea of not adding the entire CV to the citation block .tsv as distributed, and instead treating it as an "extension pack" that can be added optionally.
As for the implementation, specifically the fact that extending this CV is handled in a very unique and customized way (using the Lib. of Congress tabular file format instead of our normal .tsv format; and adding a dedicated API for extending just this specific CV), I am ok with it, I think. (I can actually see the advantage in being able to use the LoC-distributed list directly whenever they update it going forward). However, I do have a reputation within this team for being fairly cavalier in my willingness to adopt non-standard solutions and hacks. So I would like to ask for a second opinion. @qqmyers What do you think?

@landreev landreev requested a review from qqmyers August 5, 2024 14:37
@landreev landreev changed the title Support ISO 639 languages Support the full ISO 639-3 list of languages Aug 5, 2024
@qqmyers
Member

qqmyers commented Aug 5, 2024

Honestly, I'm not a fan of having a new mechanism and an API call dedicated to parsing this particular language-specific file format, particularly since it relies on one citation.properties file that has to cover both (plus the language-specific properties file variants). I also think the current code has problems with the implementation of the merge of existing and new entries (see comments). Some of those are probably fixable in code (not sure about all - see the Pular example), but they wouldn't exist if we didn't try an auto-merge.

It may also be a pain when the citation block is updated. If that happens and you reload the block (say due to some non-language change to some unrelated metadata field), reloading will restore just the alternates from the block and drop any from the ISO file. So unless we also add the step of using the new API to update from the ISO file to the release notes, we'd have unintended changes.

I do like the idea of this being optional/not a change to the citation block that everyone has to adopt, but I'm not sure doing it with a separate API and merging is worth it (versus, for example, a Javascript that might allow filtering to common/complete lists for selection).

Member

@qqmyers qqmyers left a comment

See inline comments.

cvv = datasetFieldService.findControlledVocabularyValueByDatasetFieldTypeAndStrValue(dsv, codesIterator.next(), true);
}

// if it is found we need to merge the alternate codes since the next step will delete all existing alternate codes before adding the new ones
Member

With entries in the citation block, the code automatically deletes old alternate values and just uses the new list. Is there a reason that this code should try to preserve old alternates? Is this to keep the additional non-code alternates we have in some cases (like Haitian and Haitian Creole that we have for hat, because we have the name "Haitian, Haitian Creole")? Looks like this could be a problem for cases like ful, where we have Pular as one of the alternates whereas Pular is a separate entry as fuf in the ISO file.

I'm also confused - it looks like this code would look for existing cvv entries that are alternates of an entry in the ISO file and then add its alternates as more alternates of the new ISO entry, w/o deleting the original cvv entry? Are there any such cases?

Contributor Author

The merge does not remove the original codes, since they may not be in the new file. This is why the original alts get saved and merged into the new list.

Member

OK - so this is a problem for Pular for example.
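
To make the Pular concern concrete, here is an illustrative-only sketch of the auto-merge in question, using plain strings instead of the actual ControlledVocabularyValue entities; the alternate lists shown are assumptions for illustration, not the exact citation.tsv contents.

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative-only: a blind union of existing and ISO-file alternates.
public class AlternateMergeExample {

    static Set<String> mergeAlternates(Set<String> existingAlternates, Set<String> isoAlternates) {
        Set<String> merged = new LinkedHashSet<>(existingAlternates);
        merged.addAll(isoAlternates);
        return merged;
    }

    public static void main(String[] args) {
        // Assumed alternates for the existing citation block entry ful, including "Pular".
        Set<String> existingFul = Set.of("ff", "Fulah", "Pular");
        // Assumed alternates derived from the ISO 639-3 line for ful.
        Set<String> isoFul = Set.of("ff", "Fulah");

        // The merge keeps "Pular" attached to ful...
        System.out.println("ful alternates after merge: " + mergeAlternates(existingFul, isoFul));

        // ...while the ISO file also lists Pular as a separate language, fuf,
        // so the same name now points at two different entries.
        Map<String, String> isoEntries = Map.of("ful", "Fulah", "fuf", "Pular");
        System.out.println("separate ISO entry: fuf -> " + isoEntries.get("fuf"));
    }
}

A blind union keeps "Pular" as an alternate of ful even though the ISO file introduces fuf (Pular) as its own entry, which is the collision noted above.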

String line = null;
String splitBy = "\t";
int lineNumber = 0;
int offset = 200; // number of existing languages // TODO: get the number from db
Member

There aren't 200 entries in the current develop branch citation.tsv.

Contributor Author

I think there are 187; this just gives wiggle room. Also, the display order is actually redundant if you always want to order the list alphabetically. A simple SQL script can be written to re-order the list.

Member

FWIW: Although I don't think it is necessary given the current code, we've so far tried to keep these in sequence w/o gaps. I still think the logic has a problem though - if the ISO file adds a line in the middle, you'll try to add the new entry at offset + line number, and there will already be an entry with that displayOrder number which won't change, so two will have the same displayOrder.

// Now call parseControlledVocabulary to create/update the Controlled Vocabulary Language
// values: unused, type, displayName, identifier, display order, alt codes...
String displayName = cvv != null ? cvv.getStrValue() : name;
String displayOrder = String.valueOf(cvv != null ? cvv.getDisplayOrder(): lineNumber + offset);
Member

This doesn't seem to give a linear order from 0 to max. Suppose the first 200 match existing entries and get display order 0-199. The next entry gets a displayOrder of 200 (offset) + 200 (line number).

Also - how would this work with future changes? I.e., if the standard changes to split one language into two, the new entry will end up being last (because there are matching entries for everything else in the db) - is that desirable? (Not so sure displayOrder is so important with 8K entries, but it seems like the approach here will result in out-of-alphabetical-order entries over time.)

Contributor Author

It's hard to set the display order to remain alphabetical, especially if the Dataverse installation isn't using English. There are 180-ish languages in the original tsv file. An SQL script could be created to re-order them, forcing the "Not applicable" entry last. The ISO 639-3 file actually has a "zxx No linguistic content" entry at the end. It really isn't that noticeable in the UI since it's filtered on the partial string you enter. I did reorder the tab file from the LoC since that too was ordered by the code and not the display string. I thought it was best to not mess with the display order since that could be set by the admins of the installation. Maybe it would be better to bump up the offset to 1000.

Member

The ISO file itself seemed to be alphabetical. After one run of the code here, it looks like all of the current citation.tsv entries come first and then everything from the ISO file is in the order from that file. If there are changes in the ISO file, though, the current code won't maintain its internal order (either a duplicate displayOrder number as is, or the new entry will come after all existing entries if you update the code to query the db for the number of entries (or max displayOrder) as in the todo). Having the display order depend on when you run the API, or on whether you ran it for intermediate changes in the ISO file, doesn't seem optimal. (Versus, for example, having a separate script to generate an always-in-sequence version of what would go in the citation.tsv file, or even just using only the line number from the ISO file as the displayOrder (or do we have citation.tsv entries that are not in the ISO file?).)
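
As an illustration of the alternative mentioned in the TODO (querying the database rather than using a fixed offset), here is a hypothetical sketch in which existing entries keep their displayOrder and new entries are appended after the current maximum, so no two entries collide; the codes and numbers are made up for the example.

import java.util.List;

// Illustrative-only sketch: existing entries keep their current displayOrder,
// while genuinely new entries are appended after the current maximum instead
// of at a fixed offset + line number, so no two entries share a displayOrder.
public class DisplayOrderExample {

    record Entry(String code, Integer existingDisplayOrder) {}

    public static void main(String[] args) {
        int maxDisplayOrder = 186; // assumed: ~187 entries already in citation.tsv (0-186)
        List<Entry> isoEntries = List.of(
                new Entry("fra", 40),   // already present: keeps its order (40 is made up)
                new Entry("frm", null), // new: appended after the current max
                new Entry("fro", null));

        for (Entry e : isoEntries) {
            int order;
            if (e.existingDisplayOrder() != null) {
                order = e.existingDisplayOrder();
            } else {
                maxDisplayOrder = maxDisplayOrder + 1;
                order = maxDisplayOrder;
            }
            System.out.println(e.code() + " -> displayOrder " + order);
        }
    }
}

As noted above, this still leaves later additions at the end rather than in alphabetical position, which is the trade-off of any append-only displayOrder scheme.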

@landreev
Contributor

landreev commented Aug 7, 2024

In light of having confirmed that having the full list in the CV does not make the UI unusable, I would at least consider adding it in full to citation.tsv (and just giving up on the optional aspect of the expansion). My $0.02.

@qqmyers
Member

qqmyers commented Aug 14, 2024

This is being replaced by #10762, so this should close.

@qqmyers qqmyers closed this Aug 14, 2024
landreev added a commit that referenced this pull request Sep 4, 2024
@stevenwinship stevenwinship deleted the 8578-support-extended-iso-639-languages branch September 4, 2024 19:45