Special characters in image file names #33

Open
GitHub-Mike opened this issue Dec 18, 2024 · 3 comments

Comments

@GitHub-Mike

Some images were not saved because their file names contain special characters. In my case these are spaces and German umlauts, but other special characters will very likely cause problems as well.

I would suggest using URL encoding (percent-encoding) according to RFC 3986 to store file names. This should cause the fewest problems across the most common operating systems and file systems.

What do you think?
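
For illustration, percent-encoding a single file name for storage could look like the Python sketch below. The helper names `encode_filename` and `decode_filename` are hypothetical and not part of the crawler, and UTF-8 is assumed as the byte encoding.

```python
from urllib.parse import quote, unquote


def encode_filename(name: str) -> str:
    """Percent-encode one file name (hypothetical helper, UTF-8 assumed)."""
    # safe="" also encodes "/", which must not appear inside a single
    # file name. Letters, digits and "-._~" are left untouched.
    return quote(name, safe="", encoding="utf-8")


def decode_filename(stored: str) -> str:
    """Reverse encode_filename to recover the original name."""
    return unquote(stored, encoding="utf-8")


if __name__ == "__main__":
    for original in ("verkauf-april-2021 verschoben.jpg", "verkauf-märz-2021.jpg"):
        stored = encode_filename(original)
        print(original, "->", stored)
        assert decode_filename(stored) == original
```

The encoded names contain only ASCII characters, and decoding restores the original name, so the mapping is reversible.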

@janreges
Owner

Hi @GitHub-Mike,

please provide sample URLs directly from the HTML code of the source website. I need to know if this is related to query strings and the recently added option not to hash query strings.

Alternatively, if you can, please provide the URL to the tested website. This would be very helpful for me to be able to debug this on a specific site. Thanks.

@GitHub-Mike
Author

> please provide sample URLs directly from the HTML code of the source website. I need to know if this is related to query strings and the recently added option not to hash query strings.

No, the problem has existed since my first crawl on 06.12.2024 and has nothing to do with Issue #30.

Here are 2 examples:

- "/images/2021/04/verkauf-april-2021 verschoben.jpg" --> /images/2021/04/verkauf-april-2021%20verschoben.jpg
- "/images/2021/03/verkauf-märz-2021.jpg" --> /images/2021/03/verkauf-m%C3%A4rz-2021.jpg
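
The two mappings above can be reproduced with a small Python sketch. The function name `normalize_path` is hypothetical, and UTF-8 is assumed. Decoding before encoding is one possible way to avoid double-encoding paths that arrive already percent-encoded; it is a simplification, since a literal `%` or an encoded `%2F` in a source path would be altered by it.

```python
from urllib.parse import quote, unquote


def normalize_path(path: str) -> str:
    """Return a percent-encoded URL path (hypothetical helper, UTF-8 assumed)."""
    # Decode first so "%20" in the input does not become "%2520".
    decoded = unquote(path)
    # Keep "/" as the segment separator and encode everything else that is
    # not an RFC 3986 unreserved character.
    return quote(decoded, safe="/")


if __name__ == "__main__":
    for raw in (
        "/images/2021/04/verkauf-april-2021 verschoben.jpg",
        "/images/2021/03/verkauf-märz-2021.jpg",
    ):
        print(raw, "-->", normalize_path(raw))
```

Applying the function to its own output returns the same string, so raw and already-encoded links end up with the same normalized path.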

> Alternatively, if you can, please provide the URL to the tested website. This would be very helpful for me to be able to debug this on a specific site. Thanks.

I don't want to publish the URL of the website here, but I can send it to you by e-mail.

@nilocky

nilocky commented Jan 21, 2025

I have a similar issue: the crawler treats links that contain spaces in an <a> tag as 404 URLs.

Crawler Version: 1.0.8.20240824

Example below:

URL parsed in the 404 URLs report
/download/announcement/22-3-2023

Actual link in the page under <a> tag
/download/announcement/22-3-2023 file has space in filename.pdf
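
The reported URL looks as if the link was cut at the first space, although the thread does not confirm the cause. As a sketch only, the Python below shows one way to read the full `href` value and encode it before requesting it. `LinkCollector`, `to_request_url`, and the base URL `https://example.com/` are hypothetical and do not reflect the crawler's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, keeping spaces inside the value."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Only surrounding whitespace is dropped, not inner spaces.
                    self.links.append(value.strip())


def to_request_url(base: str, href: str) -> str:
    """Resolve href against base and percent-encode the path."""
    parts = urlsplit(urljoin(base, href))
    # "%" stays in the safe set so existing escapes are not encoded twice.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))


if __name__ == "__main__":
    html = '<a href="/download/announcement/22-3-2023 file has space in filename.pdf">PDF</a>'
    collector = LinkCollector()
    collector.feed(html)
    for link in collector.links:
        print(to_request_url("https://example.com/", link))
```

With this approach the whole file name reaches the request, with each space sent as `%20`.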
