Query String Manipulation #30
Comments
Hi @GitHub-Mike, the tool was not created solely as an SEO tool, but to be usable for crawling, for analyses of all kinds, and at the same time for exporting a website into an offline form. Could you please be more specific about how it should work and what specifically the tool should do differently for your purpose (e.g. based on some new |
Thanks for the quick reply. I would like to relaunch a Joomla website with WordPress. Since a migration of the entire content is not possible without losses for various reasons, I would like to make a large part of the pages available as static pages. As these pages are in the Google index, I need to make them available again via a redirect (mod_rewrite). However, this only works if I have a pattern. It is, for example, about such paginated pages:
This currently becomes:
This is of no use to me as I have no pattern to rewrite the URL and the php.html ending is also unsightly. I would therefore like to have something like this:
There is already a little-documented --replace-content parameter. You could build on this. My suggestion would be something that also works with multiple key/value pairs: --rewrite-query-structure='/([^&]+)=([^&]*)(&|$)/' -> '$1-$2_' The last underscore should be removed at the end. To start with, it would also be sufficient if the special characters were simply overwritten statically. ? -> __ What do you think of this and is it feasible? |
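A minimal Python sketch of the rewrite proposed above. Note that `--rewrite-query-structure` is only the flag suggested in this comment, not an existing option, and `rewrite_query` is a hypothetical helper name. The sketch applies the suggested pattern to each key/value pair and then drops the one underscore left after the last pair:

```python
import re

# Pattern and replacement taken from the suggestion above
# (the PCRE-style "$1-$2_" is written "\1-\2_" in Python).
PAIR_PATTERN = re.compile(r"([^&]+)=([^&]*)(&|$)")


def rewrite_query(query: str) -> str:
    """Turn 'k1=v1&k2=v2' into 'k1-v1_k2-v2' (hypothetical helper)."""
    rewritten = PAIR_PATTERN.sub(r"\1-\2_", query)
    # Drop one trailing underscore, which the replacement
    # leaves behind after the last key/value pair.
    if rewritten.endswith("_"):
        rewritten = rewritten[:-1]
    return rewritten


print(rewrite_query("start=3"))           # start-3
print(rewrite_query("start=3&limit=10"))  # start-3_limit-10
```

Because the output follows a fixed `key-value` shape, a redirect rule can in principle be written against it, which is not possible with an opaque hash.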
@GitHub-Mike, I understand now. Thank you.

The main reason I decided to replace query parameters with a short but sufficiently unique hash is the limit on the length of a filename (or of the whole file path) on disk. A query string and the overall URL can, by default, be up to 2000 characters long, and in some browsers/technologies up to, for example, 32 767 characters/bytes. File names on Windows/Linux/macOS can only be 255 characters/bytes long. A full path on Windows can be only 260 characters; on Linux or macOS the limit is between 1024 and 4096 characters/bytes.

Another reason is that the characters and escaping supported in a query string vary widely, while the characters that can be used in file names across operating systems and filesystems are very limited. Creating a truly reliable set of replacement rules would be very time-consuming.

But to help you, I will introduce a new flag that will disable this hashing and will only do some basic replacing of

However, if you (or anyone who uses it) have slashes or other special characters in the query string that are not supported in file names on the given operating system/filesystem, this will cause problems when saving or when viewing pages offline. The same applies to URLs and query strings that are too long. I will also state in the documentation that this feature should be used with caution.

After some brief research, it seems that the characters below could be used; they work on all common platforms. The ones you suggested are all very commonly used characters in URLs or query strings.
|
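To illustrate the hashing approach described in the comment above, here is a minimal Python sketch. The hash function (SHA-256), the 10-character length, the `.html` suffix, and the name `hashed_filename` are assumptions for illustration only and do not reflect the crawler's actual implementation. The point is that the resulting name has a bounded length and a safe character set regardless of how long or unusual the query string is:

```python
import hashlib

MAX_FILENAME_BYTES = 255  # per-name limit mentioned in the comment above


def hashed_filename(basename: str, query: str, hash_len: int = 10) -> str:
    """Build '<basename>.<short hash>.html' from a query string (illustrative)."""
    if not query:
        return f"{basename}.html"
    # Hex digest contains only 0-9 and a-f, so it is safe on any filesystem.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:hash_len]
    name = f"{basename}.{digest}.html"
    if len(name.encode("utf-8")) > MAX_FILENAME_BYTES:
        raise ValueError("file name still exceeds the filesystem limit")
    return name


# A query string far longer than any file name could be.
long_query = "&".join(f"param{i}=value{i}" for i in range(500))
print(len(long_query) > MAX_FILENAME_BYTES)  # True
print(hashed_filename("news", long_query))   # short name of fixed length
```

The trade-off discussed in this thread is visible here: the name is always safe to write to disk, but it carries no pattern that a rewrite rule could match.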
I will also consider letting users define the replacements themselves, e.g. using:
For separating rules, a space is more convenient and less likely to conflict than, for example, a comma. This could be the most universal option. The documentation will mention the recommended example above, but users can then add replacements for other characters on their site that would otherwise cause problems when saving or browsing the export. What do you think? |
I read your message again and my last suggestion is very similar to yours: Your solution with regular expressions is even more versatile, but more complex for the amateur user. However, the possibility will be there and the amateur user will have to try harder, but the advanced user can implement more complex scenarios with the help of regexp. I therefore vote for the option you suggested, i.e. the Do you agree? |
! ( ) is not a good idea, because these are "Reserved Characters" and they need a percent-encoding. See: https://datatracker.ietf.org/doc/html/rfc3986#section-2.2
This is a very useful decision, because you can never foresee all eventualities.
Yes, that would work. |
Yes, that's how I would assess it too.
OK, this should also be included in the documentation with an example. Perhaps the --replace-content parameter could then also be explained. |
… to replace the default behavior where the query string is replaced by a short hash constructed from the query string in filenames, see issue #30
@GitHub-Mike, can you please try the current version from the main branch with this commit? For now, the description of the parameter can be found only in README.md. It will be available on the website after the next release. Usage examples (simple and regexp):
|
As already mentioned here janreges/siteone-crawler-gui#3, I made a pull request today and then carried out a crawl with one of the new parameters.
Result: /path/news?start=3 --> /path/news.start-3.php.html Even if ? has not been replaced by __, a pattern can be formed. I will test the regex tomorrow. However, there are a few other problems with incorrect paths that I will have to analyse tomorrow. |
In the course of implementation I realized that But I understand that if you want to define mod_rewrite rules, the |
I am of the opinion that you always have to find a compromise between flexibility and simplicity. My suggestion would therefore be to leave the functionality of the --replace-query-string parameter as it is and to address another point that I have already briefly mentioned above. My wish would be to create the option of omitting / configuring the file extensions. So instead of /path/news.start-3.php.html I would like to have /path/news.start-3 . But not only for URLs with query strings, but generally for all URLs the file extension should be omitted or made configurable. I assume that you split the URL anyway, then you could also reassemble it configurably with variables. However, adjusting the file extension would be enough for now. But, of course, you could also solve the problem of ? -> __ as well. Should I create a separate issue for this? During yesterday's run, I also noticed a few errors when rewriting the path names, which then caused 404 errors. What do you think about my comments? |
On the website crawler.siteone.io you will find, somewhere in the roadmap, that my goal is to make it possible (e.g. through some --flag) to generate the exported website in such a form that simple rules in Nginx or mod_rewrite in Apache are enough to serve it, ideally on the original URLs, as with the original site. Alternatively, the site could be run locally with a mini web server, such as binserve, spark, tiny-http, etc. My goal is also to be able to use this tool in CI/CD to run automatically maintained static copies of a site, in case the dynamically generated site fails. Please open a separate issue on this topic. I believe that together we will be able to implement it in the next few days. The way the export works now is really meant for browsing from the local disk without a web server. There the files must have the *.html extension, otherwise it is not possible. And if the site also uses URLs with query strings, that's another complication. |
OK, that would be a good application. In combination with a health check service, automatic switching could then also be implemented.
OK, I think if we keep at it now we can take the project to another level and also create a very useful tool for the creation of static websites. With the speed of the run yesterday, you could even use it for stress tests. :-)
OK, I hadn't realised until now that the focus was on "offline" and that the migration of dynamic websites to a static version was not the goal at all. But I could have guessed from the parameter names. :-)
Yes, but by integrating the query string into the file name, a simplification is realised. In the end, the page as a static version is a page like any other, which only needs its own file name. Of course, care must be taken to ensure that the links are correct. |
I want to create a static 1:1 copy of a Joomla website and manipulate the query strings so that I can redirect them correctly via the .htaccess.
example.com/?foo=bar -> example.com/foo_bar/
I know that this tool was primarily created for SEO purposes, but it would be nice if there was a solution for this.
Thanks again for making this tool available to the public, but I think it's a shame that there is no community support. Discord and Reddit are not usable and there have been no answers here for a long time.
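The mapping requested in the issue text above (`example.com/?foo=bar` -> `example.com/foo_bar/`) could be sketched in Python roughly as follows. The function name `static_path` and the choice to join several key/value pairs with `-` are assumptions; the issue only gives the single-pair example:

```python
from urllib.parse import parse_qsl, urlsplit


def static_path(url: str, pair_sep: str = "-") -> str:
    """Map 'host/path?key=value' to 'host/path/key_value/' (illustrative)."""
    # urlsplit only fills in netloc when a scheme or leading '//' is present.
    parts = urlsplit(url if "//" in url else f"//{url}")
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    path = parts.path.rstrip("/")
    if pairs:
        # Assumption: key and value are joined by '_', pairs by pair_sep.
        path += "/" + pair_sep.join(f"{key}_{value}" for key, value in pairs)
    return f"{parts.netloc}{path}/"


print(static_path("example.com/?foo=bar"))              # example.com/foo_bar/
print(static_path("https://example.com/news?start=3"))  # example.com/news/start_3/
```

Note that `parse_qsl` decodes percent-encoded values, so a value may contain characters such as `/` that would need further replacement before being used in a path.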