Always getting binary diffs, even for html pages #98

Open
v-lopez opened this issue May 26, 2020 · 5 comments

@v-lopez

v-lopez commented May 26, 2020

I always get binary diffs, even for HTML pages.

Something similar to what is described here: https://github.com/evolvingweb/sitediff/blob/master/lib/sitediff.rb#L103

In case my setup matters: I'm running sitediff as part of a continuous integration system.
When a commit is made, I start two Python simple HTTP servers hosting the old and updated versions of the page, and run:

/sitediff/bin/sitediff init http://0.0.0.0:8311/ http://0.0.0.0:8312/ 
/sitediff/bin/sitediff diff --cached=none

I save the result and later check the generated report.html offline.
I also save full copies of the before and after pages, and they open fine in a web browser.

@cleaver
Contributor

cleaver commented May 27, 2020

It's hard for me to verify, but I'd start by looking at the headers returned by your HTTP servers. You may be able to resolve this by setting an appropriate header.
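
One way to inspect those headers is a short Python script. This is only a sketch: it assumes the header that matters is Content-Type and that the charset parameter is what is missing, which the thread suggests but does not confirm at this point. The function names are made up for illustration.

```python
from email.message import Message
from typing import Optional
from urllib.request import urlopen


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when missing.
    return msg.get_content_charset()


def fetch_charset(url: str) -> Optional[str]:
    """Request a URL and report the charset its server declares, if any."""
    with urlopen(url, timeout=10) as response:
        return charset_from_content_type(response.headers.get("Content-Type", ""))
```

Calling `fetch_charset("http://0.0.0.0:8311/")` and `fetch_charset("http://0.0.0.0:8312/")` against the two servers from the original report would show whether either one declares a charset. A result of `None` means the header carries no charset parameter.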

I haven't looked at this issue personally, but let's keep this open for now pending further testing.

@jigarius
Contributor

I've seen this one before. I think this happens when the library that reads a URL as HTML cannot determine the encoding for the page. That's when the Result class (if I remember correctly) treats the content as binary and compares hashes instead of actual content.
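
The behavior described above can be illustrated with a small Python sketch. This is not SiteDiff's code (SiteDiff is written in Ruby), and the function name, the SHA-256 digest, and the message text are all assumptions chosen for the example. It only shows why an unknown encoding leaves nothing but a "differs or not" answer.

```python
import difflib
import hashlib
from typing import List, Optional


def compare(before: bytes, after: bytes, encoding: Optional[str]) -> List[str]:
    """Return diff lines, or a single notice when the encoding is unknown."""
    if encoding is None:
        # Without an encoding the bytes cannot be decoded into text,
        # so only digests can be compared (hypothetical fallback).
        same = hashlib.sha256(before).digest() == hashlib.sha256(after).digest()
        return [] if same else ["Binary content differs"]
    old_lines = before.decode(encoding).splitlines()
    new_lines = after.decode(encoding).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))
```

With `encoding="utf-8"` the result is a line-by-line unified diff; with `encoding=None` the same input yields only the one-line notice.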

@DavidOliver

I get this with the default/initial (example?) HTML files, but not with my own site's HTML files.

Ubuntu 18.04, installed via Docker (1.0.0 and latest).

@vmganela

vmganela commented Jun 1, 2021

Hello, I think I have found a possible source of the problem.
First, some context:

I was trying to compare two web pages served by NGINX, with both the web pages and sitediff running inside Docker containers.

Both websites had the default content, except for a small change made so that there would be a difference to review.
SiteDiff detected the change but showed only a binary diff, and like everyone else here I want to see the actual differences.

In my case, neither NGINX server (container) specified a charset, and defining it in the index.html file did not work.

The solution

The solution in my case was to define the charset in the Nginx configuration file /etc/nginx/nginx.conf, inside the http block (the charset directive is also valid in server and location blocks):

. . .
http {
    charset       utf-8;
     . . .
}

After the change, you only need to reload nginx:

 $ /etc/init.d/nginx reload

So be careful with the configuration of your web server.
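
For the Python simple HTTP server setup in the original report, a similar effect might be achieved by making the server declare a charset for HTML files. The sketch below is untested against SiteDiff and assumes the pages really are UTF-8. `Utf8Handler` and `make_server` are hypothetical names; `extensions_map`, `SimpleHTTPRequestHandler`, and `ThreadingHTTPServer` come from the standard library (Python 3.7 or later).

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class Utf8Handler(SimpleHTTPRequestHandler):
    """Static file handler that adds a charset to the HTML Content-Type."""

    # extensions_map is consulted before the generic MIME type lookup,
    # so these entries override the plain "text/html" default.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".html": "text/html; charset=utf-8",
        ".htm": "text/html; charset=utf-8",
    }


def make_server(directory: str, port: int) -> ThreadingHTTPServer:
    """Build (but do not start) a server for the given directory and port."""
    handler = partial(Utf8Handler, directory=directory)
    return ThreadingHTTPServer(("0.0.0.0", port), handler)
```

For example, `make_server("old_site", 8311).serve_forever()` would serve a directory named `old_site` (a placeholder) on the port used in the original report.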

@kirk-brown-ew
Collaborator

If SiteDiff can't determine the character encoding of the files, it will revert to a "Binary" encoding. This is something we wish to improve.
