
Wrong encoding #138

Open
art-es opened this issue Jan 6, 2020 · 17 comments

@art-es commented Jan 6, 2020

How can I fix the country name so that these strange characters do not appear?

In code:
[screenshot]

In database:
[screenshot]

I changed the encoding of the table and columns to UTF-16LE:
[screenshots]

@VictorPulzz

I'm having this issue too!

@remif25 commented Apr 30, 2020

Same problem here. How can I fix it?

@Marivint

You can use utf8_decode($countrie["name_en"])

@klodoma commented Jan 29, 2021

Can it be that the encoding in the backend is wrong? Or double-encoded?

@devoncmather

Same issue here

@klodoma commented Feb 8, 2021

> You can utf8_decode($countrie["name_en"])

Yes, this works, but it doesn't make sense to me. It seems something is double-encoded; I haven't checked the sources yet, though.

@giannicic

I've checked the countries JSON file.

It seems that the double-encoded values are the translated ones (e.g. those in the "name_XX" fields).
For example, Österreich is encoded in name_de as "\u00c3\u0096sterreich",
and utf8_decode returns the correct value, "\u00d6sterreich",
which is the value under the "name->native->bar->common" field.
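The round trip described above can be reproduced in a few lines; a minimal sketch in Python (illustration only, not the package's code):

```python
# "name_de" holds "Österreich" that was UTF-8-encoded twice: the UTF-8
# bytes C3 96 of "Ö" were read back as the two Latin-1 codepoints
# U+00C3 and U+0096 and then re-encoded as UTF-8 (mojibake).
double_encoded = "\u00c3\u0096sterreich"   # the value found in name_de
once_decoded = double_encoded.encode("latin-1").decode("utf-8")  # undo one pass

print(once_decoded)                        # Österreich
print(once_decoded == "\u00d6sterreich")   # True: matches name->native
```

This is the same transformation PHP's utf8_decode performs (UTF-8 to Latin-1), viewed from the byte level.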

@klodoma commented Feb 11, 2021

> For example Österreich is encoded in name_de as "\u00c3\u0096sterreich"
> and an utf8_decode returns the correct value of "\u00d6sterreich"

Yes, exactly. utf8_decode fixes it for the moment.

We'll have to watch for when this gets fixed in the package, and remove our utf8_decode call once it is.

@ademtepe commented May 7, 2021

I used the solution suggested here with a Laravel Collection macro, and it worked:

use PragmaRX\Countries\Package\Countries as Country;

// Add a "decode" macro that maps utf8_decode over every value in the collection.
Collection::macro('decode', function () {
    return $this->map(function ($value) {
        return utf8_decode($value);
    });
});

return Country::all()->pluck('name_tr', 'cca3')->decode();

@antonioribeiro (Owner)

The reason it's not fixed yet is that it's not easy to decode/re-encode everything correctly. Something I always have to say: the data here was not produced by me; it's a collection from many other sources, and people just choose what they want/can use. I have zero control over this.

Unfortunately, utf8_decode() is not a solution either. While trying to insert all cities into a PostgreSQL database, I ran into this myself:

[screenshot: PostgreSQL error]

So if someone can come up with a robust solution for correctly encoding everything to UTF-8, I'm more than pleased to merge a PR.

Cheers!

@antonioribeiro (Owner)

This is working for me:

// Undo one encoding pass for strings that still detect as UTF-8;
// blank or non-UTF-8 input is returned untouched.
protected function decode(?string $name): ?string
{
    if (blank($name) || mb_detect_encoding($name) !== 'UTF-8') {
        return $name;
    }

    return utf8_decode($name);
}

But I'm unsure whether we should do this in the package. I can't check that ALL encodings are good, and probably not every single one will be fixed. It will also make generating all the files, which is already very slow, take a lot longer. Any thoughts?
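One reason detection alone can't settle this: a double-encoded string is itself perfectly valid UTF-8, so a byte-level detector reports UTF-8 for both the broken and the correct value. A small Python illustration (not from the package):

```python
correct  = "\u00d6sterreich".encode("utf-8")        # b'\xc3\x96sterreich'
mojibake = "\u00c3\u0096sterreich".encode("utf-8")  # b'\xc3\x83\xc2\x96sterreich'

# Both byte strings decode without error, so an encoding sniffer sees
# "valid UTF-8" in both cases; only knowledge of the expected text
# (or heuristics about likely mojibake sequences) can tell them apart.
print(correct.decode("utf-8"))   # Österreich
print(mojibake.decode("utf-8"))  # mojibake, but still valid UTF-8
```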

@antonioribeiro (Owner)

It didn't really fix them all; a lot of strings were still wrongly encoded. So I found the forceutf8 package, which solved it (not fully either, some are still wrong, but it's way better):

use ForceUTF8\Encoding;

// Convert non-UTF-8 input to UTF-8, and repair double-encoded UTF-8,
// using the neitanod/forceutf8 package.
protected function decode(?string $name): ?string
{
    if (blank($name)) {
        return $name;
    }

    if (mb_detect_encoding($name) !== 'UTF-8') {
        return Encoding::toUTF8($name);
    }

    return Encoding::fixUTF8($name);
}

@klodoma commented May 18, 2021

@antonioribeiro, where are you getting the countries data from, and how are you putting it together?
Regarding the data, I think it would make sense to fix it directly in the JSON files, even if it comes from many sources.

Not sure if it helps right now, but initially I thought I had a conversion issue, so I opened this thread on Stack Overflow:
https://stackoverflow.com/questions/65956182/php-unicode-to-character-conversion

@antonioribeiro (Owner)

@klodoma, here is the list of sources I'm using: https://github.com/antonioribeiro/countries#copyright. Sanitizing the encoding is not impossible, but it's a lot of data to sanitize, and the sources may require different strategies.

@lupinitylabs commented Sep 4, 2021

The issue is that part of the data uses Unicode codepoint notation ("common": "\u00d6sterreich"), while a few lines down the same name is stored as its UTF-8 bytes written out as codepoints ("name_de": "\u00c3\u0096sterreich"). I can't imagine how a decoder should know what to do here. The first string is translated correctly into Österreich, while the second is translated into the UTF-8 byte sequence of \u00d6sterreich (that's why utf8_decode works for us in that case).

So, should we go with utf8_decode? Yes, but be aware that if you use one of the columns that is encoded differently (like name->common or name->native), you will end up with a binary string representation:

utf8_decode(json_decode('"'. "\u00d6sterreich" . '"'))
=> b"Österreich"

No fun... I would suggest rebuilding all the JSON files with consistent encoding; for my part, I am going back to mledoze/countries, which was in better shape in that regard.
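Since the file mixes both notations, one hedged middle ground is to undo an encoding pass only when the round trip actually succeeds, and leave everything else alone; a heuristic sketch in Python (my own function name, not part of any package, loosely in the spirit of what forceutf8 attempts):

```python
def undo_double_utf8(s: str) -> str:
    """Undo one spurious UTF-8 encoding pass, but only when the string
    round-trips cleanly; otherwise return it unchanged. This is a
    heuristic: it can still misfire on text that is valid either way."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(undo_double_utf8("\u00c3\u0096sterreich"))  # Österreich (was double-encoded)
print(undo_double_utf8("\u00d6sterreich"))        # Österreich (already correct, left alone)
```

The second call is left alone because b'\xd6sterreich' is not valid UTF-8, so the decode raises and the original string is returned.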

@ftrudeau-pelcro commented Oct 1, 2021

Indeed, the encoding is inconsistent, and I believe @lupinitylabs's suggestion makes sense. Is there any solution brewing for this, @antonioribeiro? Regardless, I'd suggest adding technical information to the README to help developers handle these inconsistencies properly.

@ftrudeau-pelcro

Any update here, @antonioribeiro?
