option to decode percent-encoded URLs #377

Open
pabs3 opened this issue Nov 29, 2024 · 5 comments

Comments

@pabs3
Contributor

pabs3 commented Nov 29, 2024

In my monitoring of ArchiveBot I often have to deal with URLs that consist of percent-encoded junk.

I would like to be able to decode the junk with trurl, since it is a convenient tool for wrangling URLs on the command-line.

There doesn't appear to be a way to get trurl to decode the full URL as percent-encoded data. It only seems to do that when extracting query parameters, or for the JSON output.

Here is an example that I had to deal with recently:

$ echo 'https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5D%0A%0A%5Bimg%5Dhttps:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg' |
  sed 's@^@https://foo.com/?url=@' |
  trurl -f - -g {query:url} | tee urls
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg img]

[img]https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg
$ cat urls | sed -E 's/ *\[?img\]? *//g' | trurl -f -
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg

I propose that the solution to this situation would be one of these two options:

Change the --get option to also URL decode the {url} component, as currently implied by the documentation ("The following component names are available (case sensitive): url, ... Components are shown URL decoded by default."), and require --urlencode to get the non-decoded version.

Update the --get documentation to mention that the {url} component is not URL decoded, and then add a --urldecode option to get the URL-decoded version of it. This could also be used without the --get option.
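For illustration, here is roughly what decoding the whole URL as data would yield for the example above. This is a minimal sketch using Python's standard library rather than trurl; it is not a preview of any trurl option, and the variable names are made up for the example.

```python
import re
from urllib.parse import unquote

# The percent-encoded input from the example above.
raw = (
    "https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
    "%20img%5D%0A%0A%5Bimg%5D"
    "https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
)

# Decode every percent-escape; the result is plain text, not a valid URL.
decoded = unquote(raw)

# Strip the "[img]" / "img]" markers, mirroring the sed expression above.
cleaned = re.sub(r" *\[?img\]? *", "", decoded)

# Each non-empty line is now a candidate URL that can be parsed on its own.
candidates = [line for line in cleaned.splitlines() if line]

for candidate in candidates:
    print(candidate)
```

The second candidate still starts with `https:/` (single slash); in the shell transcript above it is the final `trurl -f -` pass that normalizes it.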

@jacobmealey
Contributor

jacobmealey commented Nov 29, 2024

Trurl tries to ensure that if it outputs a whole URL, that URL is always valid. The percent-encoded characters in your URL are not valid when decoded. An example of trurl decoding a URL can be seen at the top of the man page, in the normalization section:

$ trurl 'http://ex%61mple:80/%62ath/a/../b?%2e%FF#tes%74'
http://example/bath/b?.%ff#test

If we try appending a valid percent-encoded value to your URL, for example %41 (capital 'A'), we get:

$ trurl 'https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5D%0A%0A%5Bimg%5Dhttps:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%41' 
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5d%0a%0a%5bimg%5dhttps%3a/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpgA

where you can see it's showing the decoded A at the end of the URL.
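The selective decoding described here can be approximated as follows. This is a rough sketch that assumes "valid when decoded" means the RFC 3986 unreserved set (letters, digits, `-`, `.`, `_`, `~`); it is not trurl's or libcurl's actual implementation, and `decode_unreserved` is a hypothetical helper name.

```python
import re
import string

# RFC 3986 "unreserved" characters: safe to show decoded anywhere in a URL.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(url: str) -> str:
    """Decode %XX escapes only when the result is an unreserved character.

    Every other escape is kept, with its hex digits lowercased.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return match.group(0).lower()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, url)


print(decode_unreserved("http://ex%61mple:80/%62ath/a/../b?%2e%FF#tes%74"))
print(decode_unreserved("c.jpg%20img%5D%0A%0A%5Bimg%5Dhttps:/c.jpg%41"))
```

Unlike trurl, this sketch does not drop the default port, resolve `..` segments, or re-encode characters such as the `:` that appears as `%3a` in the output above.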

A good solution for you may be to use the --json option, which gives the following output:

[
  {
    "url": "https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5d%0a%0a%5bimg%5dhttps%3a/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg",
    "parts": {
      "scheme": "https",
      "host": "live.staticflickr.com",
      "path": "/65535/49752865666_d5b24db0ed_c.jpg img]\n\n[img]https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
    }
  }
]

@pabs3
Contributor Author

pabs3 commented Nov 30, 2024

Thanks for the explanation.

In my case I want the invalid characters rather than their percent-encoded equivalent, so that I can more easily process the data as multiple URLs instead of just one.

As mentioned in the initial post, I know about the JSON option, but it is not as convenient as a single option could be, since I then need to write a jq program to join up the scheme, hostname and path as appropriate.
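The joining step mentioned here could look roughly like the sketch below, written in Python instead of jq. It assumes the JSON has exactly the shape shown earlier in the thread and only uses the `scheme`, `host` and `path` keys that appear there; a URL with a port, query or fragment would need more handling.

```python
import json

# The --json output quoted earlier in the thread, pasted as a raw string so
# that the "\n" escapes are interpreted by the JSON parser, not by Python.
trurl_json = r"""
[
  {
    "url": "https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5d%0a%0a%5bimg%5dhttps%3a/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg",
    "parts": {
      "scheme": "https",
      "host": "live.staticflickr.com",
      "path": "/65535/49752865666_d5b24db0ed_c.jpg img]\n\n[img]https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
    }
  }
]
"""

entries = json.loads(trurl_json)

# Join the decoded components back together; the result is data, not a URL.
joined = [
    f"{parts['scheme']}://{parts['host']}{parts['path']}"
    for parts in (entry["parts"] for entry in entries)
]

for text in joined:
    print(text)
```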

So maybe what I want is a --keep-invalid-characters feature?

@bagder
Member

bagder commented Dec 3, 2024

> In my case I want the invalid characters rather than their percent-encoded equivalent, so that I can more easily process the data as multiple URLs instead of just one.

If they were shown "decoded", the output would no longer be a URL, since it would contain illegal characters and could not be parsed properly. For example, %20 is a space and %0a is a newline, and other encoded characters are separators that cannot be encoded back correctly, such as %2f (slash), %40 (@) or %3a (colon) when used in the "wrong" place.

You can probably get (almost?) what you want with trurl $URL -g '{scheme}://{host}{path}'
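The point about encoded separators can be shown with a small sketch. It uses Python's `urllib.parse` and a made-up `example.com` URL rather than trurl, purely to illustrate how decoding the whole string changes what a parser sees.

```python
from urllib.parse import parse_qsl, unquote, urlsplit

# Hypothetical URL: one path segment containing an encoded slash, and one
# query parameter whose value contains an encoded "&" and "=".
encoded = "https://example.com/a%2Fb?q=1%26r%3D2"

before = urlsplit(encoded)
segments_before = before.path.split("/")[1:]
params_before = parse_qsl(before.query)

# Decode the whole URL, then parse the result as if it were still a URL.
decoded = unquote(encoded)
after = urlsplit(decoded)
segments_after = after.path.split("/")[1:]
params_after = parse_qsl(after.query)

print(segments_before, params_before)  # one segment, one parameter
print(segments_after, params_after)    # two segments, two parameters
```

The decoded string parses without error, but it describes a different path and a different query, and the original cannot be recovered from it.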

@pabs3
Contributor Author

pabs3 commented Dec 4, 2024

Yes, here I don't want a valid URL, I want data from it, just like when I pass -g {query:foo}, I want data from the URL, and it gets percent-decoded. So I just want a way for the full {url} to be percent-decoded to extract data from it.

@bagder
Member

bagder commented Dec 11, 2024

If you want to get data from it, you already can, as shown above. You just can't make it pretend it is still a URL when URL decoded. Since all the fields are accessible, there is no data you can't get this way. The only thing you don't get is the exact command syntax you ask for.

3 participants