Added polish diacritics in non_ascii_equivalents.py #386

Merged: 5 commits, Nov 13, 2024


Contributor

@finem4n finem4n commented Oct 31, 2024

As in the title, I've added some Polish diacritics and also extended the filter tags list.

Member

@phw phw left a comment

Thanks. This is somewhat similar to #387

Extending the tag list that way makes sense to me. The new tags seem to be in line with how the plugin was originally conceived. It would probably be even better to have some configuration for this, but in the absence of that, I think the extensions as presented here are useful.

What also applies here is my comment on #387 about using Picard's picard.util.textencoding.unaccent function (please see my detailed comment there). This would make it possible to get rid of the explicit mapping of most accented characters. As I see it, the first mapping section can then be removed completely, except for the two letters "Ł" and "ł" (which could then be placed under "Misc letters").

In the future, this will also avoid the need to add further accented characters; there are likely a few that we still miss.
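
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `strip_accents` is a hypothetical stand-in for an unaccent-style helper built on the standard `unicodedata` module (Picard's own `unaccent` may be implemented differently), and the `CHAR_TABLE` contents shown here are reduced to the two letters mentioned in the comment.

```python
import unicodedata

# Hypothetical explicit table: only characters without a Unicode
# decomposition need an entry here (contents are illustrative).
CHAR_TABLE = {
    "Ł": "L",
    "ł": "l",
}


def strip_accents(text: str) -> str:
    """Stand-in for an unaccent-style helper (assumption, not Picard's code).

    Decomposes each character and drops the combining marks.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii(text: str) -> str:
    """Apply the explicit table first, then strip the remaining accents."""
    mapped = "".join(CHAR_TABLE.get(ch, ch) for ch in text)
    return strip_accents(mapped)


print(to_ascii("Łódź, żółć"))  # prints: Lodz, zolc
```

"Ł" and "ł" need explicit entries because they are standalone letters with no base-letter-plus-mark decomposition, so a normalization-based helper leaves them unchanged.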

@finem4n
Contributor Author

finem4n commented Nov 7, 2024

Hi. I implemented picard.util.textencoding.unaccent. I did some testing, and at first glance it handled more than letters, e.g. ≠ and the L-shaped ones, but the results were disappointing:
≠ changed to =
the L-shaped ones changed back to 「 instead of |-
So I left them in CHAR_TABLE as they were before.
As per your suggestion in #387, I renamed the function ascii to to_ascii. I couldn't find a better name.
I also bumped the version and appended my name to the authors section, if you don't mind.
If I have more free time, I'd be willing to step in and add scripting functionality.
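
One plausible explanation for the ≠ result, assuming the helper works by Unicode decomposition: ≠ (U+2260) decomposes into "=" followed by a combining overlay, so dropping combining marks leaves a plain "=". The sketch below uses only the standard `unicodedata` module; `decomposition_report` and `strip_accents` are hypothetical helper names, not part of Picard.

```python
import unicodedata


def decomposition_report(char: str) -> list[str]:
    """List the code points a character expands to under NFKD."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", char)]


def strip_accents(text: str) -> str:
    """Stand-in for an unaccent-style helper (assumption, not Picard's code)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(decomposition_report("≠"))  # ['U+003D', 'U+0338']
print(strip_accents("≠"))         # =
```

Characters whose stripped form is not the desired replacement are the ones worth keeping in an explicit table such as CHAR_TABLE.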

@Sophist-UK
Contributor

In #387, Echelon666 said:

> You have to finish it yourself.

@finem4n Konrad, would you be willing to include the extra characters from the other PR, taking into account @phw Philipp's comments?

@finem4n
Contributor Author

finem4n commented Nov 8, 2024

@Sophist-UK Yeah, sure.
According to the wiki, one of the transliterations of þ (thorn) is th, not p, so I've changed that. I also went with ASCII representations of ♥ → ・ instead of minus.

Member

@phw phw left a comment

Thanks for the update, this looks good to me.

@phw phw requested a review from zas November 13, 2024 15:08
@Echelon666

Echelon666 commented Nov 13, 2024

What about this:

"č": "c",
"š": "s",
"ș": "s",

unaccent performs this?

@zas
Collaborator

zas commented Nov 13, 2024

> What about this:
>
> "č": "c", "š": "s", "ș": "s",
>
> unaccent performs this?

Yes.

>>> unaccent("čšș")
'css'
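
To find characters that would still need an explicit entry, a check along these lines may help. It is a sketch only: `strip_accents` is a hypothetical stand-in based on the standard `unicodedata` module and may not match Picard's `unaccent` exactly, and `still_non_ascii` is an illustrative name.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Stand-in for an unaccent-style helper (assumption, not Picard's code)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def still_non_ascii(chars: str) -> list[str]:
    """Return the characters that remain non-ASCII after accent stripping."""
    return [ch for ch in chars if not strip_accents(ch).isascii()]


print(still_non_ascii("čšșŁłþ"))  # ['Ł', 'ł', 'þ']
```

Under this stand-in, "č", "š" and "ș" reduce to plain letters, while "Ł", "ł" and "þ" pass through unchanged and so remain candidates for the explicit mapping.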

Member

@phw phw left a comment

Thanks a lot

@phw phw merged commit 277aa4d into metabrainz:2.0 Nov 13, 2024
7 checks passed
5 participants