Long URLs get truncated by wget #13

Open
chosak opened this issue Nov 2, 2020 · 1 comment

chosak commented Nov 2, 2020

Wget truncates long URLs when storing them as HTML files on disk.

For example, this URL is 286 characters:

https://www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/

Saved as a file named index.html under the www.consumerfinance.gov folder, the path would be 288 characters:

www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/index.html

Wget truncates this to 236 characters, as seen in the log:

The name is too long, 288 chars total.
Trying to shorten...
New name is www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.

(I believe this is done because a filename can be at most 255 characters, and wget reserves a 19-character "chomp buffer" for appending things like .tmp, /index.html, etc., which leaves 255 - 19 = 236.)
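
The arithmetic behind that guess can be sketched as follows. This only models the numbers quoted in this issue and is not taken from wget's source; the constant and function names are made up for illustration.

```python
# Assumed values, taken from the guess in the comment above (not from wget's source).
MAX_NAME_CHARS = 255      # assumed maximum filename length
CHOMP_BUFFER_CHARS = 19   # assumed reserve for suffixes such as .tmp


def shortened_length(total_chars: int) -> int:
    """Return the length a name would be cut to under the assumed rule."""
    limit = MAX_NAME_CHARS - CHOMP_BUFFER_CHARS
    return min(total_chars, limit)


# The 288-character name from the log above.
print(shortened_length(288))  # prints 236
```

Names that already fit within the assumed limit are returned unchanged.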

Note that this new filename doesn't have an extension. Wget then tries to download to this location:

--2020-10-30 18:57:07--  https://www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-for-the-full-amount-what-should-i-do-en-835/
...
Saving to ‘www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.tmp’
...
Removing www.consumerfinance.gov/ask-cfpb/i-found-the-car-of-my-dreams-but-the-dealer-says-that-i-have-to-have-a-down-payment-when-i-said-i-didnt-have-the-money-the-dealer-said-that-if-i-add-a-gps-and-a-stereo-he-will-be-able-to-get-me-a-loan-fo.tmp since it should be rejected.

But it then removes the downloaded file because its name doesn't end in .html! (I believe wget appends the .tmp suffix while downloading and then correctly ignores it when checking the name against --accept.)

Current behavior

This truncating behavior causes a few problems:

  1. Because we only --accept html, files that get truncated to end with some other extension (or with no extension at all) are deleted and won't be tracked in this repository. (See also "Crawl seems to be missing some pages" #9 (comment); this probably needs to be fixed by instead using --reject on specific non-HTML extensions like .pdf, .jpg, etc.) (FIXED by "Reject non-HTML instead of accepting only HTML" #15)
  2. Any downstream code that depends on these files ending in .html won't work correctly. For example, generate_summary.sh currently deliberately diffs only HTML. Additionally, a file could (at least in theory) get truncated to end in an extension specified in our .gitignore!
  3. The diffs and output files in this repo become harder to use if some files don't end in .html. They become harder to edit locally and generally just less usable for downstream applications.

Expected behavior

It would be better if we could somehow ensure that all files are always saved consistently as .html, or at least make it easier to track when truncation happens. As far as I can tell we can't do this with wget itself.
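
As a rough way to track this after a crawl, something like the following could list saved files whose names don't end in .html. It is only a sketch: the function name is hypothetical, the demo directory and file names are made up, and it assumes the crawl output is expected to contain only HTML files.

```python
import tempfile
from pathlib import Path


def files_without_html_suffix(root):
    """List files under root (as relative POSIX paths) not ending in .html."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and not path.name.endswith(".html")
    )


# Demo on a throwaway directory; the site and file names are invented.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "www.example.com"
    (site / "ask").mkdir(parents=True)
    (site / "ask" / "index.html").write_text("<html></html>")
    (site / "ask" / "a-loan-fo").write_text("<html></html>")
    found = files_without_html_suffix(site)

print(found)  # prints ['ask/a-loan-fo']
```

In this repo the function would presumably be pointed at the www.consumerfinance.gov folder instead of a temporary directory.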

One idea would be to write a script that parses our wget.log file to generate a list of URLs and their truncated filenames. We could then have some other script that "corrects" those filenames.
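
A minimal sketch of the log-parsing half of that idea, assuming the log lines look like the excerpts quoted earlier in this issue ("New name is ...", the "--timestamp--  URL" line, and "Saving to ..."). The regexes and the function name are assumptions for illustration, and the sample lines are invented; a real wget.log may need adjustments.

```python
import re

# Patterns modelled on the log excerpts in this issue (assumed, not exhaustive).
NEW_NAME = re.compile(r"^New name is (.+)\.$")
URL_LINE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")
SAVE_LINE = re.compile(r"^Saving to:? [‘'\"](.+?)[’'\"]$")


def map_truncated(log_lines):
    """Map each URL to its shortened filename, based on the log text."""
    lines = [line.rstrip("\n") for line in log_lines]

    # First pass: collect every name that wget reported as shortened.
    shortened = set()
    for line in lines:
        match = NEW_NAME.match(line)
        if match:
            shortened.add(match.group(1))

    # Second pass: pair each "Saving to" path with the most recent URL line.
    result = {}
    current_url = None
    for line in lines:
        url_match = URL_LINE.match(line)
        if url_match:
            current_url = url_match.group(1)
            continue
        save_match = SAVE_LINE.match(line)
        if save_match and current_url is not None:
            path = save_match.group(1)
            if path.endswith(".tmp"):
                path = path[: -len(".tmp")]  # drop the temporary suffix
            if path in shortened:
                result[current_url] = path
    return result


# Invented sample lines in the same shape as the excerpts above.
sample = [
    "The name is too long, 288 chars total.",
    "Trying to shorten...",
    "New name is example.com/a-very-long-na.",
    "--2020-10-30 18:57:07--  https://example.com/a-very-long-name/",
    "Saving to ‘example.com/a-very-long-na.tmp’",
]
mapping = map_truncated(sample)
print(mapping)
```

Collecting the shortened names in a separate pass avoids relying on where the "New name is" lines fall relative to the URL line. A follow-up script could then use the resulting mapping to rename each truncated file.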


chosak commented Nov 5, 2020

Problem 1 above (non-HTML files getting rejected) was resolved by #15.

Problem 2 above (downstream code relying on .html) will be resolved by #19.

Problem 3 above (complexity of having truncated filenames) is still a potential issue. Let's wait to see what kinds of issues this creates before we decide if we need to fix this or what the best solution might be.
