Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More consistent line breaks in generated po files #5251

Open
Cerno-b opened this issue May 11, 2024 · 8 comments
Open

More consistent line breaks in generated po files #5251

Cerno-b opened this issue May 11, 2024 · 8 comments

Comments

@Cerno-b
Copy link

Cerno-b commented May 11, 2024

It seems that po file line breaks have some weird inconsistencies in them that make them hard to compare (e.g. using Beyond Compare) to the output of common translation tools like poedit.

I described my observations on the Weblate issue tracker here: WeblateOrg/weblate#11615 before being pointed to this repo.

Could anyone shed some light on why it is behaving like this and maybe point to the place in the code where line breaks are determined?

It would be nice to create a format that is in line with what common po editors create to make handling and comparing file changes more natural.

@nijel
Copy link
Member

nijel commented May 13, 2024

There is a custom TextWrapper class since #4511.

Its code seems wrong for emojis (what you observe) and generally the algorithm is different from gettext one.

Maybe it's time to reimplement gettext one in Python instead of tweaking TextWrapper. The tricky part is that it is 630 lines of code: https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gettext-tools/src/write-po.c;h=1c03b2932f5a15828a5db62ea9c82de5db0d5fdb;hb=HEAD#l652

@Cerno-b
Copy link
Author

Cerno-b commented May 14, 2024

@nijel I could try my hand at this, if it's just a matter of porting to python. How likely is it that a port would be merged? I'm hesitant spending time on this unless it's likely to be used.

@nijel
Copy link
Member

nijel commented May 14, 2024

We definitely want to be consistent with gettext here, so it's likely to be merged unless there is some major issue with the code.

@Cerno-b
Copy link
Author

Cerno-b commented May 14, 2024

I looked a bit into this.

It seems to be a bit more involved than merely porting, a few functions, right? So the existing code overrides python's textwrap.TextWrapper which makes the wrapper object pretty opaque, which is also used in multiple places, so I fear without some better understanding of the code base, I'm a bit out of my depths.

The C-code you linked as 1700 lines, so I assume not all of it would need to be ported, since you mentioned 630 loc?

I could still try to help out here, but I think I would need some support in what exactly needs to be replaced and what parts of the C code are actually needed.

@nijel
Copy link
Member

nijel commented May 15, 2024

I might have miscalculated the lines in gettext code, anyway the point was that it's a lot of code.

The wrapper is used in PO serializer only, it uses TextWrapper because it was existing solution. Later support for CJK was added. But having completely custom wrapper would be okay as well.

@Cerno-b
Copy link
Author

Cerno-b commented May 20, 2024

Sorry but I'm probably out.

Tried to wrap my head around what needs to be done and came to the conclusion that the codebase is too large for me to dive in without a deeper understanding and I wouldn't be sure enough that any changes I make are sufficiently free from side effects.

@nijel
Copy link
Member

nijel commented May 20, 2024

The only thing which is needed is a class that will wrap text for the PO file. It has an existing interface and test-cases (which could be extended to cover issues you've discovered), so the risk of unwanted side effects is relatively low once the tests pass.

PS: I've added these tests as xfail in #5256. Those tests are executed with both gettext and Python implementations, so they provide a great way to catch different behavior of both implementations.

nijel added a commit to nijel/translate that referenced this issue Jul 2, 2024
This covers more use-cases than simple builtin cjklen which focused on
CJK letters only.

Issue translate#5251
@nijel
Copy link
Member

nijel commented Jul 2, 2024

#5282 addresses part of this issue (not handling all Unicode wide chars).

nijel added a commit that referenced this issue Jul 2, 2024
This covers more use-cases than simple builtin cjklen which focused on
CJK letters only.

Issue #5251
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants