Translation with glossary and target "EN-GB" loses some words #111

EnricoPicci · 2024-06-26T11:47:32Z

I have a text to translate from Italian to English:
text_to_translate = "| \\_VOEMI | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema. |"

I also have a glossary that I want to use:

entries = {"Fattore": "Variable", "Data emissione": "Issuance date"}
my_glossary = translator.create_glossary(
    "My glossary",
    source_lang="IT",
    target_lang="EN",
    entries=entries,
)

If I translate the text with target "EN-GB", I get this result:
| Issuance date | Must be greater than or equal to the policy issue date and less than or equal to the system date. |
The issue here is that the first cell, | \\_VOEMI, gets lost.

However, if I specify "EN-US" as the target language, I get this correct result:
| | \_VOEMI | Issuance date transaction | Must be greater than or equal to the policy issue date and less than or equal to the system date. |
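A check along these lines can flag this kind of loss automatically. It is only a sketch: the helper name `missing_tokens` and the pattern for tokens of the form `\_NAME` are assumptions made for illustration, not part of the DeepL library.

```python
import re


def missing_tokens(source: str, translated: str, pattern: str = r"\\_\w+") -> list[str]:
    """Return tokens matching `pattern` in `source` that do not appear in `translated`."""
    # Assumption: protected tokens look like a backslash, an underscore, then word characters.
    return [tok for tok in re.findall(pattern, source) if tok not in translated]


source = "| \\_VOEMI | Data emissione operazione |"
en_gb = "| Issuance date | Must be greater than or equal to the policy issue date. |"
en_us = "| | \\_VOEMI | Issuance date transaction |"

print(missing_tokens(source, en_gb))  # the EN-GB output dropped the token
print(missing_tokens(source, en_us))  # the EN-US output kept it
```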


JanEbbing · 2024-06-26T12:24:18Z

I'm not 100% sure what your use case is, but you will get the highest translation quality by parsing structured data like this before feeding it into the API. For example, in your case:

import deepl

text_to_translate = "| \\_VOEMI            | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema.     |"
special_tokens = ["\\_"]
delimiter = "|"
translator = deepl.Translator(...)

translated_texts = []
for text in text_to_translate.split(delimiter):
    # Pass empty cells and cells containing a special token through untranslated.
    if not text.strip() or any(tok in text for tok in special_tokens):
        translated_texts.append(text)
        continue
    # You might want to trim the whitespace here as well with text.strip(), and maybe
    # fill up the missing whitespace when appending to translated_texts, as this looks like a table.
    translated_texts.append(translator.translate_text(text, ...).text)
output = delimiter.join(translated_texts)
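
The whitespace handling mentioned in the comment could look roughly like the following. This is a sketch, not a definitive implementation: `translate_table_row` is a hypothetical helper, and the `translate` argument is any callable that maps a string to its translation, so the logic can be tried without an API key.

```python
from typing import Callable, Iterable


def translate_table_row(
    row: str,
    translate: Callable[[str], str],
    special_tokens: Iterable[str] = ("\\_",),
    delimiter: str = "|",
) -> str:
    """Translate each cell of a delimited row, keeping special cells and padding intact."""
    tokens = tuple(special_tokens)
    cells = []
    for cell in row.split(delimiter):
        core = cell.strip()
        # Empty cells and cells holding a special token are passed through unchanged.
        if not core or any(tok in core for tok in tokens):
            cells.append(cell)
            continue
        # Keep the original leading and trailing whitespace around the translated text.
        lead = cell[: len(cell) - len(cell.lstrip())]
        trail = cell[len(cell.rstrip()):]
        cells.append(lead + translate(core) + trail)
    return delimiter.join(cells)


# Stand-in translator for demonstration only: it just upper-cases the text.
print(translate_table_row("| \\_VOEMI | Data emissione | testo |", str.upper))
```

With the real client, `translate` could be something like `lambda s: translator.translate_text(s, source_lang="IT", target_lang="EN-GB", glossary=my_glossary).text`, assuming the translator and glossary from the original post.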

Due to the nature of ML models, we otherwise cannot guarantee that the output is stable/preserves these kinds of tokens. You can also take a look at ignore tags as another option.
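
For the ignore-tags route, one possible approach is to wrap each token in an XML tag before translating and strip the tag afterwards. The helpers below are hypothetical, the tag name `keep` is arbitrary, and the token pattern is an assumption based on the example. Text that already contains `<` or `&` would need escaping first, which this sketch does not handle.

```python
import re

# Assumption: tokens to protect look like a backslash, an underscore, then word characters.
TOKEN_RE = re.compile(r"\\_\w+")


def protect_tokens(text: str, tag: str = "keep") -> str:
    """Wrap every matching token in <tag>...</tag> so it can be excluded from translation."""
    return TOKEN_RE.sub(lambda m: f"<{tag}>{m.group()}</{tag}>", text)


def unprotect_tokens(text: str, tag: str = "keep") -> str:
    """Remove the wrapper tags again after translation."""
    return text.replace(f"<{tag}>", "").replace(f"</{tag}>", "")


protected = protect_tokens("| \\_VOEMI | Data emissione operazione |")
print(protected)
# The protected text would then be sent with XML tag handling enabled, e.g.
# translator.translate_text(protected, target_lang="EN-GB",
#                           tag_handling="xml", ignore_tags=["keep"])
print(unprotect_tokens(protected))
```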

EnricoPicci · 2024-06-26T12:45:33Z

Jan, thanks for your prompt response. I will implement your suggestions. That said, the difference in behaviour between "EN-GB" and "EN-US" is interesting.
