
Wrong encoding #138

Open
art-es opened this issue Jan 6, 2020 · 17 comments

@art-es commented Jan 6, 2020

How can I fix the country name so that these strange characters do not appear?

In code:
[screenshot]

In database:
[screenshot]

I changed the encoding of the table and columns to UTF-16LE:
[screenshots]

@VictorPulzz

I'm having this issue too!

@remif25 commented Apr 30, 2020

Same problem here. How can I fix it?

@Marivint

You can use utf8_decode($countrie["name_en"])

@klodoma commented Jan 29, 2021

Can it be that the encoding in the backend is wrong? Or double-encoded?

@devoncmather

Same issue here

@klodoma commented Feb 8, 2021

> You can utf8_decode($countrie["name_en"])

Yes, this works, but it doesn't make sense to me. It seems something is double-encoded; I haven't checked the sources yet, though.

@giannicic

I've checked the countries JSON file.

It seems that the double-encoded values are the translated ones (e.g. those in the "name_XX" fields).
For example, Österreich is encoded in name_de as "\u00c3\u0096sterreich",
and utf8_decode returns the correct value, "\u00d6sterreich",
which is the value under the "name->native->bar->common" field.
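The round trip described above can be reproduced in a few lines; a minimal sketch in Python (illustration only, not the package's code):

```python
# "name_de" holds "Österreich" that was UTF-8-encoded twice: the UTF-8
# bytes C3 96 of "Ö" were read back as the two Latin-1 codepoints
# U+00C3 and U+0096 and then re-encoded as UTF-8 (mojibake).
double_encoded = "\u00c3\u0096sterreich"   # the value found in name_de
once_decoded = double_encoded.encode("latin-1").decode("utf-8")  # undo one pass

print(once_decoded)                        # Österreich
print(once_decoded == "\u00d6sterreich")   # True: matches name->native
```

This is the same transformation PHP's utf8_decode performs (UTF-8 to Latin-1), viewed from the byte level.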

@klodoma commented Feb 11, 2021

> For example Österreich is encoded in name_de as "\u00c3\u0096sterreich"
> and an utf8_decode returns the correct value of "\u00d6sterreich"

Yes, exactly. utf8_decode fixes it for the moment.

We'll have to watch for when this gets fixed in the package, and remove our utf8_decode call once it is.

@ademtepe commented May 7, 2021

I used the solution suggested here with a Laravel Collection macro, and it worked:

use PragmaRX\Countries\Package\Countries as Country;

// Add a "decode" macro that maps utf8_decode over every value in the collection.
Collection::macro('decode', function () {
    return $this->map(function ($value) {
        return utf8_decode($value);
    });
});

return Country::all()->pluck('name_tr', 'cca3')->decode();

@antonioribeiro (Owner)

The reason it's not fixed yet is that it's not easy to decode/re-encode everything correctly. Something I always have to say: the data here was not produced by me; it's a collection from many other sources, and people just choose what they want/can use. I have zero control over this.

Unfortunately, utf8_decode() is not a solution either. While trying to insert all cities into a PostgreSQL database, I ran into this myself:

[screenshot: PostgreSQL error]

So if someone can come up with a robust solution for correctly encoding everything to UTF-8, I'm more than pleased to merge a PR.

Cheers!

@antonioribeiro (Owner)

This is working for me:

// Undo one encoding pass for strings that still detect as UTF-8;
// blank or non-UTF-8 input is returned untouched.
protected function decode(?string $name): ?string
{
    if (blank($name) || mb_detect_encoding($name) !== 'UTF-8') {
        return $name;
    }

    return utf8_decode($name);
}

But I'm unsure whether we should do this in the package. I can't check that ALL encodings are good, and probably not every single one will be fixed. It will also make generating all the files, which is already very slow, take a lot longer. Any thoughts?
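One reason detection alone can't settle this: a double-encoded string is itself perfectly valid UTF-8, so a byte-level detector reports UTF-8 for both the broken and the correct value. A small Python illustration (not from the package):

```python
correct  = "\u00d6sterreich".encode("utf-8")        # b'\xc3\x96sterreich'
mojibake = "\u00c3\u0096sterreich".encode("utf-8")  # b'\xc3\x83\xc2\x96sterreich'

# Both byte strings decode without error, so an encoding sniffer sees
# "valid UTF-8" in both cases; only knowledge of the expected text
# (or heuristics about likely mojibake sequences) can tell them apart.
print(correct.decode("utf-8"))   # Österreich
print(mojibake.decode("utf-8"))  # mojibake, but still valid UTF-8
```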

@antonioribeiro (Owner)

It didn't really fix them all; a lot of strings were still wrongly encoded. So I found the forceutf8 package, which solved it (not fully either, some are still wrong, but it's way better):

use ForceUTF8\Encoding;

// Convert non-UTF-8 input to UTF-8, and repair double-encoded UTF-8,
// using the neitanod/forceutf8 package.
protected function decode(?string $name): ?string
{
    if (blank($name)) {
        return $name;
    }

    if (mb_detect_encoding($name) !== 'UTF-8') {
        return Encoding::toUTF8($name);
    }

    return Encoding::fixUTF8($name);
}

@klodoma commented May 18, 2021

@antonioribeiro, where are you getting the countries data from, and how are you putting it together?
Regarding the data, I think it would make sense to fix it directly in the JSON files, even if it comes from many sources.

Not sure if it helps right now, but initially I thought I had a conversion issue, so I opened this thread on Stack Overflow:
https://stackoverflow.com/questions/65956182/php-unicode-to-character-conversion

@antonioribeiro (Owner)

@klodoma, here is the list of sources I'm using: https://github.com/antonioribeiro/countries#copyright. Sanitizing the encoding is not impossible, but it's a lot of data to sanitize, and the sources may require different strategies.

@lupinitylabs commented Sep 4, 2021

The issue is that part of the data uses Unicode codepoint notation ("common": "\u00d6sterreich"), while a few lines down the same name is stored as its UTF-8 bytes written out as codepoints ("name_de": "\u00c3\u0096sterreich"). I can't imagine how a decoder should know what to do here. The first string is translated correctly into Österreich, while the second is translated into the UTF-8 byte sequence of \u00d6sterreich (that's why utf8_decode works for us in that case).

So, should we go with utf8_decode? Yes, but be aware that if you use one of the columns that is encoded differently (like name->common or name->native), you will end up with a binary string representation:

utf8_decode(json_decode('"'. "\u00d6sterreich" . '"'))
=> b"Österreich"

No fun... I would suggest rebuilding all the JSON files with consistent encoding; for my part, I am going back to mledoze/countries, which was in better shape in that regard.
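Since the file mixes both notations, one hedged middle ground is to undo an encoding pass only when the round trip actually succeeds, and leave everything else alone; a heuristic sketch in Python (my own function name, not part of any package, loosely in the spirit of what forceutf8 attempts):

```python
def undo_double_utf8(s: str) -> str:
    """Undo one spurious UTF-8 encoding pass, but only when the string
    round-trips cleanly; otherwise return it unchanged. This is a
    heuristic: it can still misfire on text that is valid either way."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(undo_double_utf8("\u00c3\u0096sterreich"))  # Österreich (was double-encoded)
print(undo_double_utf8("\u00d6sterreich"))        # Österreich (already correct, left alone)
```

The second call is left alone because b'\xd6sterreich' is not valid UTF-8, so the decode raises and the original string is returned.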

@ftrudeau-pelcro commented Oct 1, 2021

Indeed, the encoding is inconsistent, and I believe @lupinitylabs's suggestion makes sense. Is there any solution brewing for this, @antonioribeiro? Regardless, I'd suggest adding technical information to the README to help developers handle these inconsistencies properly.

@ftrudeau-pelcro

Any update here, @antonioribeiro?
