Issues with UTF-8 #895

remyabel2 · 2021-11-26T02:53:00Z

Using archivebox 0.6.2 and Python 3.10.

Some sites (annoyingly) use ’ (CP-1252 instead of UTF-8) instead of ', which causes â€™ to be spit out in the archived output, even if the original website is using UTF-8. This is not a bug in archivebox per se, but I'm wondering if there's a way to handle it: could archivebox detect this and convert it properly somehow, or should I write a script to fix it up manually?
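For the "fix it up manually" route, here is a minimal sketch of what such a script might do. It assumes the garbling comes from UTF-8 bytes having been decoded as CP-1252 somewhere along the way, which is the usual cause of â€™. The helper name `fix_mojibake` is made up for illustration and is not part of archivebox.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as CP-1252.

    Hypothetical helper for illustration; not an archivebox API.
    """
    try:
        # Re-encode to recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not match the assumed pattern; leave it untouched.
        return text


# "â€™" is what the UTF-8 bytes of "’" look like when read as CP-1252.
example = fix_mojibake("donâ€™t")  # "don’t"
```

Text that is already correct fails the round trip and comes back unchanged, so running this over clean text should be harmless. A few characters (for example ”, whose last UTF-8 byte 0x9D has no CP-1252 mapping in Python's codec) may not round-trip, and the sketch leaves those as they are.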

Another issue I sometimes run into is with the wget and readability extractors:

'utf-8' codec can't decode byte 0x94 in position 875: invalid start byte

The website in question is https://belaycpp.com/2021/11/24/is-my-cat-turing-complete/. Since I'm not sure if this is an archivebox bug or just a problem with the website, I didn't want to prematurely open an issue. Note that even though it says the wget extractor fails, I can still see the resulting HTML. Is there any workaround for this?
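One possible workaround when post-processing the saved HTML yourself, sketched under assumptions: byte 0x94 is the right curly double quote (”) in CP-1252, so the error suggests the page contains some CP-1252 bytes despite being labelled UTF-8. The helper below tries UTF-8 first and falls back to CP-1252. `decode_html` is a made-up name, and this is not how archivebox's extractors read files.

```python
def decode_html(raw: bytes) -> str:
    """Decode page bytes as UTF-8, falling back to CP-1252.

    Hypothetical helper for illustration; not an archivebox API.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes CP-1252 leaves undefined (such as 0x81) become U+FFFD
        # instead of raising a second error.
        return raw.decode("cp1252", errors="replace")


# A lone 0x94 is invalid UTF-8 but valid CP-1252.
quoted = decode_html(b"\x94hi\x94")  # "”hi”"
```

The fallback applies to the whole input, so a page that mixes real UTF-8 with stray CP-1252 bytes would have its UTF-8 parts garbled. For that case, decoding with `errors="replace"` under UTF-8 alone may be the lesser evil.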

pirate (Maintainer) · 2021-11-26T03:15:51Z

Check out #441
