Refactor shaping and encoding checks structure #175

Open
MrBrezina opened this issue Aug 27, 2024 · 3 comments

@MrBrezina
Member

MrBrezina commented Aug 27, 2024

No description provided.

@MrBrezina MrBrezina added the enhancement New feature or request label Aug 27, 2024
@MrBrezina MrBrezina changed the title Hyperglot (shaping and encoding) tests Shaping and encoding tests Aug 27, 2024
@MrBrezina MrBrezina changed the title Shaping and encoding tests Shaping and encoding checks Aug 27, 2024
@MrBrezina
Member Author

We would like to move to a more general model for checks that would entail:

  • encoding checks (e.g. are all base characters for English included?),
  • shaping checks (e.g. does the joining behaviour in Arabic work?).

We prefer to run these checks automatically based on the language definitions rather than listing them for each language. The advantages are better scalability and resistance to human error (we welcome non-technical contributors).

@kontur
Contributor

kontur commented Aug 28, 2024

My idea for this was to implement different kinds of checks as Python classes with a common signature. Each check would have access to the Shaper, so it can implement whatever logic it needs when passed a set of characters.

To avoid over-explicit definitions of which checks should run for which languages, I was considering a kind of matching mechanism where each check would define a set of conditions. When a check's conditions are met for a language, it runs. Such conditions could be:

  • presence of an orthography attribute
  • script
  • presence of a particular Unicode code point

So, for example, a general encoding check would trigger whenever base is present. For Arabic shaping, the condition could be the script being Arabic. And for yet-to-be-implemented shaping checks for Brahmic script conjuncts, the presence of the specific orthography attribute under which they are stored would opt in to those checks.
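A minimal sketch of how such condition matching could look, purely for illustration. Every name here (`Orthography`, `Check`, `applies_to`, `run`, `select_checks`, the `auxiliary` attribute used in the usage note) is hypothetical and not taken from the actual codebase, and the shaping check is left as a placeholder because the Shaper's interface is not described in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Orthography:
    """Hypothetical stand-in for one orthography of a language definition."""
    script: str
    attributes: dict = field(default_factory=dict)  # e.g. {"base": "A B C"}


class Check:
    """Common signature for all checks; the method names are assumptions."""

    def applies_to(self, orthography):
        raise NotImplementedError

    def run(self, orthography, font_chars):
        raise NotImplementedError


class BaseEncodingCheck(Check):
    # Condition: presence of an orthography attribute ("base").
    def applies_to(self, orthography):
        return "base" in orthography.attributes

    def run(self, orthography, font_chars):
        required = set(orthography.attributes["base"].split())
        return required <= set(font_chars)


class ArabicJoiningCheck(Check):
    # Condition: the script.
    def applies_to(self, orthography):
        return orthography.script == "Arabic"

    def run(self, orthography, font_chars):
        # Placeholder only: a real check would query the Shaper here.
        raise NotImplementedError("shaping logic is out of scope for this sketch")


class CodePointCheck(Check):
    # Condition: presence of a particular Unicode code point in any attribute.
    def __init__(self, code_point):
        self.char = chr(code_point)

    def applies_to(self, orthography):
        return any(self.char in value for value in orthography.attributes.values())

    def run(self, orthography, font_chars):
        return self.char in set(font_chars)


# Registry of available checks; the order here determines the run order.
ALL_CHECKS = [BaseEncodingCheck(), ArabicJoiningCheck(), CodePointCheck(0x200C)]


def select_checks(orthography, checks=ALL_CHECKS):
    """Return the checks whose conditions are met for this orthography."""
    return [check for check in checks if check.applies_to(orthography)]
```

With this shape, `select_checks(Orthography("Latin", {"base": "A B C"}))` would pick only the encoding check, while an Arabic orthography would additionally pick the joining check, without either language listing its checks explicitly.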

One additional thought: checks could return more nuanced verdicts than just pass/fail, or return pass/fail plus textual information. E.g. for conjunct checks a pass may be based on a threshold, so informing the user about those details might be necessary.
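One way a "pass/fail plus textual information" verdict could be modelled, as a rough sketch. `CheckResult`, `check_conjuncts`, and the default threshold of 0.8 are all assumptions made for this example; the thread does not specify how the threshold would be chosen or how shaping support would be determined.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical verdict: a boolean plus a human-readable explanation."""
    passed: bool
    message: str = ""


def check_conjuncts(required, supported, threshold=0.8):
    """Pass if at least `threshold` of the required conjuncts are supported.

    `required` is a list of conjunct strings; `supported` is the set of
    those the font is assumed to shape correctly (determined elsewhere).
    """
    if not required:
        return CheckResult(True, "no conjuncts required")

    missing = [c for c in required if c not in supported]
    found = len(required) - len(missing)
    ratio = found / len(required)

    message = f"{found}/{len(required)} conjuncts supported (threshold {threshold:.0%})"
    if missing:
        message += "; missing: " + ", ".join(missing)
    return CheckResult(ratio >= threshold, message)
```

A caller could then report `result.message` to the user even when `result.passed` is true, which covers the case where a pass hides a few unsupported conjuncts.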

@kontur kontur added this to the 0.8.0 milestone Aug 28, 2024
@kontur kontur changed the title Shaping and encoding checks Refactor shaping and encoding checks structure Aug 28, 2024
@kontur
Contributor

kontur commented Sep 6, 2024

This is implemented but WIP in this branch.
