Issues With Saving Chinese Language Website #103

Open
milliem-3923 opened this issue Jan 22, 2019 · 3 comments

milliem-3923 commented Jan 22, 2019

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

When opened in the Wayback viewer, the Chinese text of a certain website is muddled and unreadable, much as if I had saved a Chinese Excel file and then reopened it with the wrong encoding. Even if you can't read Chinese, you can see that the original and the saved version differ. As far as I can tell, only this webpage exhibits the issue.

STEPS: Save pages from the website below and then open them in the viewer.
http://kksk.org/youji/r_812_1.html

What is the expected behavior?

The pages should be readable and the same as the original.

What's your environment?

WAIL: 1.2.0-beta3.5
OS: 64-bit Windows 10 (Home) Version 10.0.17134 Build 17134
(presents the same in both Firefox and Chrome)

Thanks, hope this is in the right place. I spent a lot of time saving these pages, so it would also be nice to know whether the files can be salvaged.


machawk1 commented Feb 2, 2019

I was able to replicate this, though in a slightly different environment because of the releases provided.

I used the WAIL 1.2.0-beta3 binary for macOS (the latest listed in releases for the platform). I created a new collection and did a page-only archiving process of http://kksk.org/youji/r_812_1.html.

The characters on the archived page differ from those on the live Web, with some displayed as "unknown" characters:

[Screenshot of the archived page, 2019-02-02 10:16:41 AM]

Live Web:

[Screenshot of the live Web page]

Perhaps this is a change in encoding at replay time.
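To illustrate what an encoding change at replay time could look like, here is a minimal Python sketch. The sample string is made up for illustration and is not taken from the page; the point is only that GBK bytes decoded with the wrong charset turn into garbled or "unknown" characters while the underlying bytes stay the same.

```python
# Illustrative only: "sample" is a made-up string, not content from kksk.org.
sample = "游记"
raw = sample.encode("gbk")  # bytes as a GBK server would send them

# Decoding with the wrong charset yields garbled text.
wrong_utf8 = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
wrong_latin1 = raw.decode("latin-1")                # every byte maps to some character

# Decoding with the right charset recovers the original text.
right = raw.decode("gbk")

print(repr(wrong_utf8))
print(repr(wrong_latin1))
print(repr(right))
```

If this is what is happening, the stored bytes would still be fine and only the interpretation at replay would be off, but that is an assumption that would need checking against the WARC.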

WARC:
kksk.org!youji!r_812_1.html-default-1549120546141.warc.txt


machawk1 commented Feb 2, 2019

The Content-Type: text/html; charset=GBK header is consistent between the replayed memento and the live Web (GBK is an encoding for Simplified Chinese), as verified via curl -I against the respective URI-R and URI-M.
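For anyone who wants to script the same comparison, a rough sketch using only the Python standard library follows. The helper name charset_from_content_type is hypothetical, and the two header values are typed in by hand here rather than fetched; in practice they would come from the curl -I output for the live URI and the replayed one.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# Header values entered by hand for illustration (assumed to match curl -I output).
live_header = "text/html; charset=GBK"
memento_header = "text/html; charset=GBK"

live_charset = charset_from_content_type(live_header)
memento_charset = charset_from_content_type(memento_header)
print(live_charset, memento_charset, live_charset == memento_charset)
```

Matching headers only show that both responses declare the same charset. They do not by themselves show whether the body bytes were altered or how the viewer chose to decode them.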

milliem-3923 (Author) commented

Sorry, I don't understand your second answer. Is that confirmation that the issue is with encoding at replay time? If so, does that mean the original file is uncorrupted?
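One rough way to probe that question is to test whether the saved page bytes still decode strictly as GBK. The sketch below assumes the HTML payload has already been extracted from the WARC into a bytes object (extraction is not shown), and the helper name decodes_cleanly is hypothetical.

```python
def decodes_cleanly(data, encoding="gbk"):
    """Return True if the bytes decode under the given encoding without errors."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Made-up inputs for illustration, not bytes from the actual capture.
intact = "游记".encode("gbk")
damaged = b"\xff"  # 0xFF is not a valid GBK lead byte

print(decodes_cleanly(intact))
print(decodes_cleanly(damaged))
```

A failure would suggest the stored bytes are damaged or are not GBK. A success is only a weak signal, because bytes in another encoding can sometimes also decode as GBK without raising an error, so the decoded text would still need to be compared with the live page by eye.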
