Added tests and sub-domain validation #20
base: master
Conversation
Thanks for your contribution! I'd like to discuss the requested changes!
if string.startswith(to_remove):
    return string[len(to_remove):]
else:
    return string
It would be interesting to do it repeatedly:
while string.startswith(to_remove):
Inputs like https://https://google.com would then be handled correctly.
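A minimal sketch of that loop (remove_leader is a hypothetical name for the helper shown above):

def remove_leader(string, to_remove):
    # Strip the prefix as many times as it occurs, so
    # "https://https://google.com" reduces all the way to "google.com".
    while string.startswith(to_remove):
        string = string[len(to_remove):]
    return string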
That's not a valid URL to begin with; what do you think the function should do?
validate_url('an invalid url that ends with google.com', 'google.com')
I'm thinking that should return False.
It is not a valid URL, but since the user is the one providing the input, we have to consider all such cases. If they use https://https://bla.com and we remove only one of the https:// prefixes, we get https://bla.com, which breaks our logic. I think it breaks when the URL is split using / as the delimiter.
I'm not sure whether the right decision is accepting invalid inputs and normalising them (HTML does this a lot) or returning an error to the user. A regex following the RFC specification of how URLs should look would be interesting.
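As a lighter alternative to a full RFC-style regex, the standard library's urllib.parse can reject obviously malformed inputs up front (is_plausible_url is a hypothetical helper, not part of this PR):

from urllib.parse import urlparse

def is_plausible_url(url):
    # Not a full RFC 3986 check, but enough to reject inputs like
    # "https://https://bla.com", whose netloc parses as "https:".
    parsed = urlparse(url)
    return parsed.scheme in ('http', 'https', 'ftp') and '.' in parsed.netloc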
return True
return False

for leader in ('https://', 'http://', 'www', 'ftp://'):
I think 'www.' would be the right choice instead of 'www'. It is not giving us any errors because the split on line 28 fixes it, but it would be better not to propagate an error to be fixed on another line.
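For illustration, the corrected tuple in context (reusing the hypothetical remove_leader helper from the earlier comment):

def remove_leader(string, to_remove):
    if string.startswith(to_remove):
        return string[len(to_remove):]
    return string

url = 'https://www.google.com/contest'
for leader in ('https://', 'http://', 'www.', 'ftp://'):  # 'www.' with the dot
    url = remove_leader(url, leader)
# url is now 'google.com/contest'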
if expected_domain not in url:
    return False
This duplicates the check from lines 18-19; both checks could be deleted. They make a weak check that is covered correctly later in the same function.
for loc, url_piece in enumerate(url[::-1]):
    if loc == 0:
        continue
    elif loc == 1:
        if url_piece == expected_domain:
            return True
        if len(url_piece) > 3:
            return False

    elif loc == 2:
        if url_piece == expected_domain:
            return True
        if len(url_piece) > 3:
            return False

    else:
        return False
The if loc == 0: is not good practice. I see the intention of ignoring the top-level domain (.com, .net, etc.), but if that were really needed, a smarter choice would be to iterate over list[1:], since you always skip the first element.
An even better approach is to strip all top-level domains from the string up front: top_level_domains = ['.com', '.net', '.edu', ... ]. It would be done the same way we remove the leaders.
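A minimal sketch of that approach, mirroring the leader removal (remove_trailer and the shortened TLD list are illustrative assumptions):

top_level_domains = ['.com', '.net', '.edu']  # extend as needed

def remove_trailer(string, to_remove):
    # Mirror of the leader removal, but stripping from the end.
    if string.endswith(to_remove):
        return string[:-len(to_remove)]
    return string

domain = 'google.com'
for tld in top_level_domains:
    domain = remove_trailer(domain, tld)
# domain is now 'google'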
Should I instead just add the full list of possible TLDs, since there could be quite a few?
Good idea, but it looks like we could get a lot of false positives. Since we are only using Codeforces as a scrape target, maybe we could consider only the .com TLD.
Oh! Ok, haha that'll greatly simplify things :)
Description
Added tests for utils/urls.py and validation for sub-domains (www.foobar.google.com is considered part of the google domain while www.google.foobar.com is not).
Closes (Issues)
Closes #16
Additional Information
Added nose==1.3.7 to the dev-requirements.txt.
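With that in place, the new tests can be run locally (assuming dependencies are installed from dev-requirements.txt; nosetests discovers tests automatically):

pip install -r dev-requirements.txt
nosetests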