URL starting with double-slash are misinterpreted #44

Closed
s-ferri-fortop opened this issue Oct 6, 2023 · 1 comment · Fixed by #45

Comments

@s-ferri-fortop
Contributor

When analyzing the following robots.txt, Protego parses the directive Disallow: //debug/* as if it were Disallow: /*

User-agent: *
Disallow: //debug/*

This is due to the following line of code:

parts = urlparse(pattern)

The problem is that urlparse does not treat the pattern as a path, as expected here. Because the pattern starts with "//", it takes "debug" as the hostname (netloc) and leaves only "/*" as the path:

from urllib.parse import urlparse
print(urlparse("//debug/*"))
# result: ParseResult(scheme='', netloc='debug', path='/*', params='', query='', fragment='')

According to Google's official documentation, the Allow and Disallow directives must be followed by relative paths starting with a / character.

Therefore, I see two possible solutions:

  1. avoid using urlparse on the directives' patterns
  2. replace leading double slashes with a single slash
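To make the difference between the two options concrete, here is a minimal standalone sketch using only the standard library. The three helper names are hypothetical and are not part of Protego; they only mimic how the path is extracted from the pattern, not the unquoting step that follows.

```python
import re
from urllib.parse import urlparse


def path_via_urlparse(pattern):
    """Behaviour described in this issue: keep only the parsed path."""
    return urlparse(pattern).path


def path_without_urlparse(pattern):
    """Option 1 (sketch): treat the whole directive value as the path."""
    return pattern


def path_with_collapsed_slashes(pattern):
    """Option 2 (sketch): collapse leading slashes to one, then parse."""
    return urlparse(re.sub(r"^/{2,}", "/", pattern)).path


pattern = "//debug/*"
print(path_via_urlparse(pattern))            # /*
print(path_without_urlparse(pattern))        # //debug/*
print(path_with_collapsed_slashes(pattern))  # /debug/*
```

Note that the two options are not equivalent: option 1 keeps the double slash in the pattern, while option 2 rewrites it to a single slash.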

Option 1
As is:

protego/src/protego.py

Lines 185 to 186 in 45e1948

parts = urlparse(pattern)
pattern = self._unquote(parts.path, ignore="/*$%")

To be:

pattern = self._unquote(pattern, ignore="/*$%")

Option 2
Add a re.sub at the beginning of the following method:

protego/src/protego.py

Lines 90 to 93 in 45e1948

def _prepare_pattern_for_regex(self, pattern):
    """Return equivalent regex pattern for the given URL pattern."""
    pattern = re.sub(r"\*+", "*", pattern)
    s = re.split(r"(\*|\$$)", pattern)

pattern = re.sub(r"^[/]{2,}", "/", pattern)
@s-ferri-fortop
Contributor Author

While working on a fork, I have found another solution:

If the pattern starts with "//", prepending two more slashes ensures that urlparse does not mistake the first path segment for the hostname (netloc). The resulting path keeps the original pattern intact:

def _quote_pattern(self, pattern):
    if pattern.startswith("https://") or pattern.startswith("http://"):
        pattern = "/" + pattern
    elif pattern.startswith("//"):
        pattern = "//" + pattern

urlparse then behaves as follows:

input pattern: //debug/*
modified pattern: ////debug/*
ParseResult(scheme='', netloc='', path='//debug/*', params='', query='', fragment='')
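The guard can be tried in isolation with the standard library. This is only a sketch: guard_leading_slashes is a hypothetical standalone function that mirrors the first lines of the snippet above, and example.com is a placeholder host. It is not the final code of the pull request.

```python
from urllib.parse import urlparse


def guard_leading_slashes(pattern):
    """Prefix the pattern so urlparse treats all of it as a path (sketch)."""
    if pattern.startswith(("https://", "http://")):
        # A leading "/" stops urlparse from reading a scheme and host.
        return "/" + pattern
    if pattern.startswith("//"):
        # "////debug/*" parses with an empty netloc and path "//debug/*".
        return "//" + pattern
    return pattern


for pattern in ("/debug/*", "//debug/*", "http://example.com/debug"):
    parts = urlparse(guard_leading_slashes(pattern))
    print(repr(pattern), "->", "netloc:", repr(parts.netloc), "path:", repr(parts.path))
```

In all three cases netloc stays empty, so nothing from the pattern is dropped before the unquoting step.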

I do not have experience with testing, so any help is appreciated, but I will keep working on the pull request :)

s-ferri-fortop added a commit to s-ferri-fortop/protego that referenced this issue Oct 6, 2023