

Automatise handling of diacritics #26

Open
2 tasks
snomos opened this issue Jan 24, 2023 · 7 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@snomos (Member) commented Jan 24, 2023

This covers two distinct cases:

  • automatise making Unicode NFD (the default is NFC) an optional alternative for all non-ASCII characters with diacritics in descriptive analysers, so that we can analyse PDF files without worries: PDF stores all text in NFD
  • automatise splitting combining-diacritic letters (i.e. those that do not exist as precomposed NFC characters in Unicode) into sequences of base letter + diacritics as individual states in the FST (as opposed to a single, multichar-symbol state). This should be done for tokenisers only; everywhere else they should stay single-state multichar symbols. It makes character-level input tokenisation reliable in hfst-tokenise, without resorting to problematic Unicode hacks, but it tends to be broken because people forget to make such filters themselves, causing hard-to-debug errors

In the first case, the pseudocode could go something like this:

extract all letter symbols (i.e. non-multichar symbols)
remove ASCII letters, digits and punctuation
convert the NFC list to NFD (e.g. with uconv)
paste the NFC and NFD lists into two columns
remove lines with identical columns # e.g. ø and ŋ can't be decomposed
turn the result into an XFST regex file for an optional NFC-to-NFD change
apply the compiled regex to descriptive analysers on the __surface__ side
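The steps above could be sketched as follows in Python, using the standard-library unicodedata module in place of uconv and paste; the XFST-flavoured optional-replace output of to_xfst_regex is an illustrative assumption, not the actual build rule:

```python
import string
import unicodedata

def nfd_pairs(symbols):
    """Collect (NFC, NFD) pairs for symbols that actually decompose."""
    pairs = []
    for sym in symbols:
        # Skip plain ASCII letters, digits and punctuation.
        if all(c in string.printable for c in sym):
            continue
        nfc = unicodedata.normalize("NFC", sym)
        nfd = unicodedata.normalize("NFD", sym)
        if nfc != nfd:  # e.g. ø and ŋ don't decompose and are dropped here
            pairs.append((nfc, nfd))
    return pairs

def to_xfst_regex(pairs):
    """Join the pairs into one parallel optional-replace expression.
    The ",,"-joined "(->)" syntax is a sketch of XFST-style parallel
    optional replace rules; adjust to the real toolchain."""
    return " ,, ".join(f"{nfc} (->) {{{nfd}}}" for nfc, nfd in pairs) + " ;"

# "á" decomposes to "a" + U+0301; "ø" has no decomposition; "a" is ASCII.
pairs = nfd_pairs(["á", "ø", "a"])
```

Composing the compiled expression on the surface side then lets the analyser accept both the precomposed and the decomposed spelling of every such letter.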

In the second case, the pseudocode could be something like the following:

extract all multichar symbols from the FST
get rid of everything that looks like tags, flag diacritics and internal symbols
make a regex to mandatorily turn a multichar base letter + (one or more) combining \
    diacritics into a sequence of single symbols
apply that regex to tokeniser FSTs on the __surface__ side
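The second routine could be sketched similarly; the prefix-based filter for tags and flag diacritics below is only a guess at the symbol conventions, not an actual GiellaLT filter:

```python
import unicodedata

def split_combining(symbols):
    """Map each multichar letter symbol (base + combining marks) to the
    per-codepoint sequence the tokeniser FST should use instead."""
    rules = {}
    for sym in symbols:
        # Heuristic skip for tags, flag diacritics and internal symbols
        # (assumed conventions: tags start with "+", flags with "@",
        # internal symbols with "<").
        if sym.startswith(("+", "@", "<")):
            continue
        if len(sym) > 1 and any(unicodedata.combining(c) for c in sym):
            rules[sym] = list(sym)  # one single-codepoint symbol per FST state
    return rules

# "n" + U+0302 has no precomposed NFC form, so it must be split for the
# tokeniser; tags and flag diacritics are left untouched.
rules = split_combining(["n\u0302", "+N", "@P.A.B@", "ie"])
```

Each entry in the resulting map corresponds to one mandatory rewrite rule in the generated regex.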

With routines like the above integrated into the build system, no-one should ever have to worry about these issues anymore 🙂

@snomos added the bug and enhancement labels on Jan 24, 2023
@snomos (Member, Author) commented Jan 24, 2023

More about the second case:

The reason we want such a base letter + combining diacritic as a multichar symbol in all other cases is that it makes life easier: things that look like single letters are treated as single symbols, even when the underlying Unicode is not a single code point. Only hfst-tokenise has issues with this, because of the task at hand.

@snomos (Member, Author) commented Feb 17, 2023

Excellent work in commit 573af75. Only problem is: it fails on macOS, probably due to a different version of awk or sed.

@snomos (Member, Author) commented Feb 17, 2023

Another comment: would it be possible to filter out all non-diacritic characters, to avoid both noise and extra CPU time when compiling and composing the generated regex?
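One way to implement such a filter is to keep only symbols that gain combining marks under NFD; a minimal Python sketch (the helper name is hypothetical):

```python
import unicodedata

def has_decomposable_diacritic(sym):
    """True iff the NFD form of the symbol contains a combining mark."""
    nfd = unicodedata.normalize("NFD", sym)
    return any(unicodedata.combining(c) for c in nfd)
```

Filtering the extracted symbol list with this predicate before building the regex keeps only the characters that can actually differ between NFC and NFD.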

@Trondtr (Contributor) commented Feb 19, 2023

... it fails on macOS, probably due to a different version of awk or sed.
If this turns out to be a problem, it reminds me of a similar situation, which I solved by installing gsed:

/usr/bin/sed
/usr/local/bin/gsed

@flammie (Contributor) commented Feb 20, 2023

I changed awk to gawk, but not the sed command yet; I think GNU sed is also used. We have a configure script in langs that checks for this, which could be useful here, but I am not sure everyone runs configure in core.

@snomos (Member, Author) commented Feb 20, 2023

One has to run ./configure in core to get the correct version info. But the tool settings in core don't automatically carry over to each language, so we need to run the same check there.

@flammie (Contributor) commented Feb 20, 2023

Mmh, I have now added some checks in core for GNU sed and gawk, and the Unicode filter scripts use the configured programs.


3 participants