More consistent line breaks in generated po files #5251

Cerno-b · 2024-05-11T10:24:01Z

It seems that po file line breaks have some weird inconsistencies in them that make them hard to compare (e.g. using Beyond Compare) to the output of common translation tools like poedit.

I described my observations on the Weblate issue tracker here: WeblateOrg/weblate#11615 before being pointed to this repo.

Could anyone shed some light on why it is behaving like this and maybe point to the place in the code where line breaks are determined?

It would be nice to create a format that is in line with what common po editors create to make handling and comparing file changes more natural.

nijel · 2024-05-13T06:51:31Z

There is a custom TextWrapper class since #4511.

Its code seems wrong for emojis (what you observe) and generally the algorithm is different from gettext one.

Maybe it's time to reimplement gettext one in Python instead of tweaking TextWrapper. The tricky part is that it is 630 lines of code: https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gettext-tools/src/write-po.c;h=1c03b2932f5a15828a5db62ea9c82de5db0d5fdb;hb=HEAD#l652

Cerno-b · 2024-05-14T05:27:00Z

@nijel I could try my hand at this, if it's just a matter of porting to python. How likely is it that a port would be merged? I'm hesitant spending time on this unless it's likely to be used.

nijel · 2024-05-14T08:35:21Z

We definitely want to be consistent with gettext here, so it's likely to be merged unless there is some major issue with the code.

Cerno-b · 2024-05-14T22:26:32Z

I looked a bit into this.

It seems to be a bit more involved than merely porting, a few functions, right? So the existing code overrides python's textwrap.TextWrapper which makes the wrapper object pretty opaque, which is also used in multiple places, so I fear without some better understanding of the code base, I'm a bit out of my depths.

The C-code you linked as 1700 lines, so I assume not all of it would need to be ported, since you mentioned 630 loc?

I could still try to help out here, but I think I would need some support in what exactly needs to be replaced and what parts of the C code are actually needed.

nijel · 2024-05-15T04:01:54Z

I might have miscalculated the lines in gettext code, anyway the point was that it's a lot of code.

The wrapper is used in PO serializer only, it uses TextWrapper because it was existing solution. Later support for CJK was added. But having completely custom wrapper would be okay as well.

Cerno-b · 2024-05-20T00:24:14Z

Sorry but I'm probably out.

Tried to wrap my head around what needs to be done and came to the conclusion that the codebase is too large for me to dive in without a deeper understanding and I wouldn't be sure enough that any changes I make are sufficiently free from side effects.

See translate#5251 and WeblateOrg/weblate#11615

nijel · 2024-05-20T06:51:14Z

The only thing which is needed is a class that will wrap text for the PO file. It has an existing interface and test-cases (which could be extended to cover issues you've discovered), so the risk of unwanted side effects is relatively low once the tests pass.

PS: I've added these tests as xfail in #5256. Those tests are executed with both gettext and Python implementations, so they provide a great way to catch different behavior of both implementations.

See #5251 and WeblateOrg/weblate#11615

This covers more use-cases than simple builtin cjklen which focused on CJK letters only. Issue translate#5251

nijel · 2024-07-02T08:28:33Z

#5282 addresses part of this issue (not handling all Unicode wide chars).

This covers more use-cases than simple builtin cjklen which focused on CJK letters only. Issue #5251

Cerno-b mentioned this issue May 11, 2024

More consistent line breaks in exported po files WeblateOrg/weblate#11615

Closed

nijel added a commit to nijel/translate that referenced this issue May 20, 2024

tests: add test for wrapping differences compared to gettext

29c50fb

See translate#5251 and WeblateOrg/weblate#11615

nijel mentioned this issue May 20, 2024

tests: add test for wrapping differences compared to gettext #5256

Merged

nijel added a commit that referenced this issue May 20, 2024

tests: add test for wrapping differences compared to gettext

1143aa3

See #5251 and WeblateOrg/weblate#11615

nijel added a commit to nijel/translate that referenced this issue Jul 2, 2024

fix(pypo): use wcwidth for calculating unicode string width

60d44e2

This covers more use-cases than simple builtin cjklen which focused on CJK letters only. Issue translate#5251

nijel mentioned this issue Jul 2, 2024

fix(pypo): use wcwidth for calculating unicode string width #5282

Merged

nijel added a commit that referenced this issue Jul 2, 2024

fix(pypo): use wcwidth for calculating unicode string width

80660bc

This covers more use-cases than simple builtin cjklen which focused on CJK letters only. Issue #5251

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More consistent line breaks in generated po files #5251

More consistent line breaks in generated po files #5251

Cerno-b commented May 11, 2024 •

edited

Loading

nijel commented May 13, 2024

Cerno-b commented May 14, 2024

nijel commented May 14, 2024

Cerno-b commented May 14, 2024

nijel commented May 15, 2024

Cerno-b commented May 20, 2024

nijel commented May 20, 2024

nijel commented Jul 2, 2024

More consistent line breaks in generated po files #5251

More consistent line breaks in generated po files #5251

Comments

Cerno-b commented May 11, 2024 • edited Loading

nijel commented May 13, 2024

Cerno-b commented May 14, 2024

nijel commented May 14, 2024

Cerno-b commented May 14, 2024

nijel commented May 15, 2024

Cerno-b commented May 20, 2024

nijel commented May 20, 2024

nijel commented Jul 2, 2024

Cerno-b commented May 11, 2024 •

edited

Loading