Replies: 1 comment
- Check out #441
Using archivebox 0.6.2 and Python 3.10.

Some sites (annoyingly) use ’ (CP-1252 instead of UTF-8) instead of ', which is causing ’ to be spit out in the input, even if the original website is using UTF-8. This is not a bug in archivebox per se, but I'm wondering if there's a way to handle this. Should archivebox detect this and convert it properly somehow, or should I write a script to fix it up manually?

Another issue I sometimes run into is with the wget and readability extractors:

The website in question is https://belaycpp.com/2021/11/24/is-my-cat-turing-complete/. Since I'm not sure if this is an archivebox bug or just a problem with the website, I didn't want to prematurely open an issue. Note that even though it says the wget extractor fails, I can still see the resulting HTML. Is there any workaround for this?
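
For the "write a script to fix it up manually" option mentioned above, here is a minimal sketch, not anything archivebox provides. It assumes the garbling is the usual kind where UTF-8 bytes were decoded as CP-1252, and it only touches runs of non-ASCII characters that round-trip cleanly. The function names are made up for illustration.

```python
import re

# Runs of non-ASCII characters are the only places this kind of garbling can appear.
_NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def _repair_run(match):
    """Try to undo 'UTF-8 bytes decoded as CP-1252' for one non-ASCII run."""
    run = match.group(0)
    try:
        # Recover the original bytes, then decode them the way they were meant to be.
        return run.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or legitimately non-ASCII text): leave it alone.
        return run


def fix_mojibake(text):
    """Repair CP-1252/UTF-8 mix-ups in text, leaving everything else untouched."""
    return _NON_ASCII_RUN.sub(_repair_run, text)


if __name__ == "__main__":
    garbled = "it\u00e2\u20ac\u2122s"  # the three-character sequence from the post
    print(fix_mojibake(garbled) == "it\u2019s")
```

This will miss sequences containing bytes that CP-1252 leaves undefined (for example, some garbled closing double quotes), so treat it as a starting point. The ftfy package's fix_text function handles many more cases if an extra dependency is acceptable.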