Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

builtin: implement, document and test url-parse #1715

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

matheusmoreira
Copy link

@matheusmoreira matheusmoreira commented Apr 28, 2024

Git commands accept a wide variety of URLs syntaxes,
not just standard URLs. This can make parsing git URLs
difficult since standard URL parsers cannot be used.
Even if an external parser were implemented, it would
have to track git's development closely in case support
for any new URL schemes are added.

These patches introduce a new url-parse builtin command
that exposes git's native URL parsing algorithms as a
plumbing command, allowing other programs to then call
upon git itself to parse the git URLs and their components.

This should be quite useful for scripts. For example, a script
might want to add remotes to repositories, naming them according
to the domain name where the repository is hosted. This new builtin
allows it to parse the git URL and extract its host name which can
then be used as input for other operations. This would be difficult
to implement otherwise due to git's support for scp style URLs.

Signed-off-by: Matheus Afonso Martins Moreira [email protected]
cc: Torsten Bögershausen [email protected]
cc: Ghanshyam Thakkar [email protected]

It will be used in more places so it should be placed in url.h.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Copy link

Welcome to GitGitGadget

Hi @matheusmoreira, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that either:

  • Your Pull Request has a good description, if it consists of multiple commits, as it will be used as cover letter.
  • Your Pull Request description is empty, if it consists of a single commit, as the commit message should be descriptive enough by itself.

You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <[email protected]>, Ill Takalook <[email protected]>

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@Ikke
Copy link
Contributor

Ikke commented Apr 28, 2024

/allow

Copy link

User matheusmoreira is now allowed to use GitGitGadget.

Define general parsing function that supports all Git URLs
including scp style URLs such as hostname:~user/repo.
Has the same interface as the URL normalization function
and uses the same data structures, facilitating its use.
It's adapted from the algorithm used to process URLs in connect.c,
so it should support the same inputs.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Git commands can accept a rather wide variety of URLs syntaxes.
The range of accepted inputs might expand even more in the future.
This makes the parsing of URL components difficult since standard URL
parsers cannot be used. Extracting the components of a git URL would
require implementing all the schemes that git itself supports, not to
mention tracking its development continuously in case new URL schemes
are added.

The url-parse builtin command is designed to solve this problem
by exposing git's native URL parsing facilities as a plumbing command.
Other programs can then call upon git itself to parse the git URLs and
extract their components. This should be quite useful for scripts.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
This function either successfully parses an URL
or dies with an error message. Since this is a
plumbing command, the error message is not translated.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Create an enumeration containing all possible git URL components
which may be selected by the user. The URL_NONE component is used
when the user did not request the parsing of any component.
In this case, the command will return successfully if the URL parses.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
The extract function returns a newly allocated string
whose contents are the specified git URL component.
The string must be freed later.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Converts a git URL component name to its corresponding
enumeration value so that it can be conveniently used
internally by the url-parse command.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Create the data structures expected by the git option parser.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Prepare to handle input by parsing the command line options
and removing them from the arguments vector.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Parse all the git URLs given as input on the command line.
Die if an URL cannot be parsed.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Parse the specified git URL component from each of the given git URLs
and print them to standard output, one per line.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
The new url-parse builtin validates git URLs
and optionally extracts their components.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
Test git URL parsing, validation and component extraction
on all documented git URL schemes and syntaxes.

Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
@matheusmoreira
Copy link
Author

/submit

Copy link

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1715/matheusmoreira/url-parse-v1

To fetch this version to local tag pr-git-1715/matheusmoreira/url-parse-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1715/matheusmoreira/url-parse-v1

Copy link

On the Git mailing list, Torsten Bögershausen wrote (reply to this):

On Sun, Apr 28, 2024 at 10:30:48PM +0000, Matheus Moreira via GitGitGadget wrote:
> Git commands accept a wide variety of URLs syntaxes, not just standard URLs.
> This can make parsing git URLs difficult since standard URL parsers cannot
> be used. Even if an external parser were implemented, it would have to track
> git's development closely in case support for any new URL schemes are added.
>
> These patches introduce a new url-parse builtin command that exposes git's
> native URL parsing algorithms as a plumbing command, allowing other programs
> to then call upon git itself to parse the git URLs and their components.
>
> This should be quite useful for scripts. For example, a script might want to
> add remotes to repositories, naming them according to the domain name where
> the repository is hosted. This new builtin allows it to parse the git URL
> and extract its host name which can then be used as input for other
> operations. This would be difficult to implement otherwise due to git's
> support for scp style URLs.
>

All in all, having a URL parser as such is a good thing, thanks for working
on that.

There are, however, some notes and questions, up for discussion:

- are there any plans to integrate the parser into connect.c and fetch ?
  Speaking as a person, who manage to break the parsing of URLs once,
  with the good intention to improve things, I need to learn that
  test cases are important.
  Some work can be seen in t5601-clone.sh
  Especially, when dealing with literal IPv6 addresses, the ones with []
  and the simplified ssh syntax 'myhost:src' are interesting to test.
  Git itself strives to be RFC compliant when parsing URLs, but
  we do not fully guarantee to be "fully certified".
  And some features using the [] syntax to embedd a port number
  inside the simplified ssh syntax had not been documented,
  but used in practise, and are now part of the test suite.
  See "[myhost:123]:src" in t5601

- Or is this new tool just a helper, to verify "good" URL's,
  and not accepting our legacy parser quirks ?
  Then we still should see some IPv6 tests ?
  Or may be not, as we prefer hostnames these days ?

- One minor comment:
  in 02/13 we read:
        +enum protocol {
        +       PROTO_UNKNOWN = 0,
        +       PROTO_LOCAL,
        +       PROTO_FILE,
        +       PROTO_SSH,
        +       PROTO_GIT,
  The RFC 1738 uses the term "scheme" here, and using the very generic
  term "protocol" may lead to name clashes later.
  Would something like "git_scheme" or so be better ?

- One minor comment:
   In 13/13 we read:
        +       git url-parse "file:///" &&
        +       git url-parse "file://"

  I think that the "///" version is superflous, it should already
  be covered by the "//" version

Copy link

User Torsten Bögershausen <[email protected]> has been added to the cc: list.

Copy link

On the Git mailing list, Matheus Afonso Martins Moreira wrote (reply to this):

Thank you for your feedback.

> are there any plans to integrate the parser into connect.c and fetch ?

Yes.

That was my intention but I was not confident enough to touch connect.c
before getting feedback from the community, since it's critical code
and it is my first contribution.

I do want to merge all URL parsing in git into this one function though,
thereby creating a "single point of truth". This is so that if the algorithm
is modified the changes are visible to the URL parser builtin as well.

> Speaking as a person, who manage to break the parsing of URLs once,
> with the good intention to improve things, I need to learn that
> test cases are important.

Absolutely agree.

When adding test cases, I looked at the possibilities enumerated in urls.txt
and generated test cases based on those. I also looked at the urlmatch.h
test cases. However...

> Some work can be seen in t5601-clone.sh

... I did not think to check those.

> Especially, when dealing with literal IPv6 addresses,
> the ones with [] and the simplified ssh syntax 'myhost:src'
> are interesting to test.

You're right about that. I shall prepare an updated v2 patchset
with more test cases, and also any other changes/improvements
requested by maintainers.

> And some features using the [] syntax to embedd a port number
> inside the simplified ssh syntax had not been documented,
> but used in practise, and are now part of the test suite.
> See "[myhost:123]:src" in t5601

Indeed, I did not read anything of the sort when I checked it.
Would you like me to commit a note to this effect to urls.txt ?

> Or is this new tool just a helper, to verify "good" URL's,
> and not accepting our legacy parser quirks ?

It is my intention that this builtin be able to accept, parse
and decompose all types of URLs that git itself can accept.

> Then we still should see some IPv6 tests ?

I will add them!

> Or may be not, as we prefer hostnames these days ?

I would have to defer that choice to someone more experienced
with the codebase. Please advise on how to proceed.

> The RFC 1738 uses the term "scheme" here, and using the very generic
> term "protocol" may lead to name clashes later.
> Would something like "git_scheme" or so be better ?

Scheme does seem like a better word if it's the terminology used by RFCs.
I can change that in a new version if necessary.
That code is based on the existing connect.c parsing code though.

> I think that the "///" version is superflous, it should already
> be covered by the "//" version

I thought it was a good idea because of existing precedent:
my first approach to creating the test cases was to copy the
ones from t0110-urlmatch-normalization.sh which did have many
cases such as those. Then as I developed the code I came to
believe that it was not necessary: I call url_normalize
in the url_parse function and url_normalize is already being
tested. I think I just forgot to delete those lines.

Reading that file over once again, it does have IPv6 address
test cases. So I should probably go over it again.

Thanks again for the feedback,

  Matheus

Copy link

On the Git mailing list, Torsten Bögershausen wrote (reply to this):

On Mon, Apr 29, 2024 at 07:04:40PM -0300, Matheus Afonso Martins Moreira wrote:

> Thank you for your feedback.
>
> > are there any plans to integrate the parser into connect.c and fetch ?
>
> Yes.
>
> That was my intention but I was not confident enough to touch connect.c
> before getting feedback from the community, since it's critical code
> and it is my first contribution.

Welcome to the Git community.

I wasn't aware of t0110 as a test case...

>
> I do want to merge all URL parsing in git into this one function though,
> thereby creating a "single point of truth". This is so that if the algorithm
> is modified the changes are visible to the URL parser builtin as well.
>

That is a good thing to do. Be prepared for a longer journey, since we have
this legacy stuff to deal with. But I am happy to help with reviews, even
if that may take some days,

[]

> When adding test cases, I looked at the possibilities enumerated in urls.txt
> and generated test cases based on those. I also looked at the urlmatch.h
> test cases. However...
>
> > Some work can be seen in t5601-clone.sh
>
> ... I did not think to check those.
>
> > Especially, when dealing with literal IPv6 addresses,
> > the ones with [] and the simplified ssh syntax 'myhost:src'
> > are interesting to test.
>
> You're right about that. I shall prepare an updated v2 patchset
> with more test cases, and also any other changes/improvements
> requested by maintainers.
>
> > And some features using the [] syntax to embedd a port number
> > inside the simplified ssh syntax had not been documented,
> > but used in practise, and are now part of the test suite.
> > See "[myhost:123]:src" in t5601
>
> Indeed, I did not read anything of the sort when I checked it.
> Would you like me to commit a note to this effect to urls.txt ?

On short: please not.
This kind of syntax was never ment to be used.
The official "ssh://myhost:123/src" is recommended.
When IPv6 parsing was added, people discovered that it could be
used to "protect" the ':' from being a seperator between the hostname
and the path, and can be used to seperate the hostname from the port.
Once that was used in real live, it was too late to change it.
If we now get a better debug tool, it could mention that this is
a legacy feature, and recommend the longer "ssh://" syntax.

>
> > Or is this new tool just a helper, to verify "good" URL's,
> > and not accepting our legacy parser quirks ?
>
> It is my intention that this builtin be able to accept, parse
> and decompose all types of URLs that git itself can accept.
>
> > Then we still should see some IPv6 tests ?
>
> I will add them!
>
> > Or may be not, as we prefer hostnames these days ?
>
> I would have to defer that choice to someone more experienced
> with the codebase. Please advise on how to proceed.

Re-reading this email conversation,
I think that we should support (in the future),
what we support today.
Having a new parser tool means, that there is a chance to reject
those URLs with the note/hint, that they are depracted, and should
be replaced by a proper one.
From my point of view this means that all existing test case should pass
even with the new parser, as a general approach.
Deprecating things is hard, may take years, and may be done in a seperate
task/patch series. Or may be part of this one, in seperate commits.

>
> > The RFC 1738 uses the term "scheme" here, and using the very generic
> > term "protocol" may lead to name clashes later.
> > Would something like "git_scheme" or so be better ?
>
> Scheme does seem like a better word if it's the terminology used by RFCs.
> I can change that in a new version if necessary.
> That code is based on the existing connect.c parsing code though.
>
> > I think that the "///" version is superflous, it should already
> > be covered by the "//" version
>
> I thought it was a good idea because of existing precedent:
> my first approach to creating the test cases was to copy the
> ones from t0110-urlmatch-normalization.sh which did have many
> cases such as those. Then as I developed the code I came to
> believe that it was not necessary: I call url_normalize
> in the url_parse function and url_normalize is already being
> tested. I think I just forgot to delete those lines.
>
> Reading that file over once again, it does have IPv6 address
> test cases. So I should probably go over it again.
>
> Thanks again for the feedback,
>
>   Matheus
>

@@ -0,0 +1,59 @@
git-url-parse(1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ghanshyam Thakkar wrote (reply to this):

On Sun, 28 Apr 2024, Matheus Afonso Martins Moreira via GitGitGadget <[email protected]> wrote:
> +* Print the path:
> ++
> +------------
> +$ git url-parse --component path https://example.com/user/repo
> +/usr/repo

s/usr/user/

Thanks.

Copy link

User Ghanshyam Thakkar <[email protected]> has been added to the cc: list.

@@ -3,6 +3,7 @@
#include "hex-ll.h"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ghanshyam Thakkar wrote (reply to this):

On Sun, 28 Apr 2024, Matheus Afonso Martins Moreira via GitGitGadget <[email protected]> wrote:
> From: Matheus Afonso Martins Moreira <[email protected]>
> 
> Define general parsing function that supports all Git URLs
> including scp style URLs such as hostname:~user/repo.
> Has the same interface as the URL normalization function
> and uses the same data structures, facilitating its use.
> It's adapted from the algorithm used to process URLs in connect.c,
> so it should support the same inputs.
> 
> Signed-off-by: Matheus Afonso Martins Moreira <[email protected]>
> ---
>  urlmatch.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  urlmatch.h |  1 +
>  2 files changed, 91 insertions(+)
> 
> diff --git a/urlmatch.c b/urlmatch.c
> index 1d0254abacb..5a442e31fa2 100644
> --- a/urlmatch.c
> +++ b/urlmatch.c
> @@ -3,6 +3,7 @@
>  #include "hex-ll.h"
>  #include "strbuf.h"
>  #include "urlmatch.h"
> +#include "url.h"
>  
>  #define URL_ALPHA "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
>  #define URL_DIGIT "0123456789"
> @@ -438,6 +439,95 @@ char *url_normalize(const char *url, struct url_info *out_info)
>  	return url_normalize_1(url, out_info, 0);
>  }
>  
> +enum protocol {
> +	PROTO_UNKNOWN = 0,
> +	PROTO_LOCAL,
> +	PROTO_FILE,
> +	PROTO_SSH,
> +	PROTO_GIT,
> +};
> +
> +static enum protocol url_get_protocol(const char *name, size_t n)
> +{
> +	if (!strncmp(name, "ssh", n))
> +		return PROTO_SSH;
> +	if (!strncmp(name, "git", n))
> +		return PROTO_GIT;
> +	if (!strncmp(name, "git+ssh", n)) /* deprecated - do not use */
> +		return PROTO_SSH;
> +	if (!strncmp(name, "ssh+git", n)) /* deprecated - do not use */
> +		return PROTO_SSH;
> +	if (!strncmp(name, "file", n))
> +		return PROTO_FILE;
> +	return PROTO_UNKNOWN;
> +}
> +
> +char *url_parse(const char *url_orig, struct url_info *out_info)
> +{
> +	struct strbuf url;
> +	char *host, *separator;
> +	char *detached, *normalized;
> +	enum protocol protocol = PROTO_LOCAL;
> +	struct url_info local_info;
> +	struct url_info *info = out_info? out_info : &local_info;
> +	bool scp_syntax = false;
> +
> +	if (is_url(url_orig)) {
> +		url_orig = url_decode(url_orig);
> +	} else {
> +		url_orig = xstrdup(url_orig);
> +	}
> +
> +	strbuf_init(&url, strlen(url_orig) + sizeof("ssh://"));
> +	strbuf_addstr(&url, url_orig);
> +
> +	host = strstr(url.buf, "://");
> +	if (host) {
> +		protocol = url_get_protocol(url.buf, host - url.buf);
> +		host += 3;
> +	} else {
> +		if (!url_is_local_not_ssh(url.buf)) {
> +			scp_syntax = true;
> +			protocol = PROTO_SSH;
> +			strbuf_insertstr(&url, 0, "ssh://");
> +			host = url.buf + 6;
> +		}
> +	}
Interesting. 

    `
    $ ./git url-parse -c protocol file:/test/test
    ssh
    `

seems like only having a single slash after the 'protocol:' prints
'ssh' always (I think this may not even be a valid url). After this 'else'
block, the url turns into 'ssh://file/test/test'. Will examine the details
later. Not that it's your code's doing, and rather the result of
url_is_local_not_ssh(). But just wanted to point this out and ask if this
should error out or is this an intended behavior that I can't figure out. 

Thanks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Torsten Bögershausen wrote (reply to this):

[]
> Interesting.
>
>     `
>     $ ./git url-parse -c protocol file:/test/test
>     ssh
>     `
>
> seems like only having a single slash after the 'protocol:' prints
> 'ssh' always (I think this may not even be a valid url). After this 'else'
> block, the url turns into 'ssh://file/test/test'. Will examine the details
> later. Not that it's your code's doing, and rather the result of
> url_is_local_not_ssh(). But just wanted to point this out and ask if this
> should error out or is this an intended behavior that I can't figure out.

ssh is the correct answer, try something like

`git clone localhost:/home/myself/project/git.git`

It is the scp syntax, supported by Git as well.
From `man scp`

    scp copies files between hosts on a network.
    []
    The source and target may be specified as a local pathname,
    a remote host with optional path in the form
    [user@]host:[path],
    or a URI in the form scp://[user@]host[:port][/path].
    Local file names can be made explicit using absolute or relative pathnames
    to avoid scp treating file names containing ‘:’ as host specifiers.

So yes, they share similar problems
with the ':' that could mean different things when using the short form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants