-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for regex flags in .re()
and .re_first()
methods
#225
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #225 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 5 5
Lines 291 292 +1
Branches 51 51
=========================================
+ Hits 291 292 +1
Continue to review full report at Codecov.
|
@Gallaecio @wRAR, could you take a look? :) |
* if the regex contains a named group called "extract" that will be returned | ||
* if the regex contains multiple numbered groups, all those will be returned (flattened) | ||
* if the regex doesn't contain any group the entire regex matching is returned | ||
""" | ||
if isinstance(regex, str): | ||
regex = re.compile(regex, re.UNICODE) | ||
flags |= re.UNICODE | ||
regex = re.compile(regex, flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep the re.UNICODE
to respect the old behavior, especially because in some docstrings you can find:
Apply the given regex and return the first unicode string which matches
However, this also means that it won't be possible to override this, and the re.UNICODE
will be always applied. I don't think this is a big issue and can be changed in the future, for example when deprecating Python 2.7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen that Python 2.7 has been already deprecated...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Unicode flag also applies to Python 3, though, is not about the Python 2 strings but about pattern behavior.
On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags
) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Unicode flag also applies to Python 3, though, is not about the Python 2 strings but about pattern behavior.
Sorry, I think I wasn't clear enough and mixed things. What I meant was that we could do something like:
regex = re.compile(regex, flags or re.UNICODE)
or doing what I did (always applying re.UNICODE
). The first approach allows us to override the re.UNICODE
, but if we were applying re.UNICODE
always, it could be confusing to don't apply it when using other flags. The reference I made to Python 2.7 was that we could change this in the future but it would be a breaking change, however, it could come along with the Python 2.7 deprecation that would require a new major version. I didn't know it had been already deprecated before; after I saw that, I considered that maybe we need to think better about this decision.
On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.
It seems a good idea, but if it's not required I would love to keep this PR as-is (shorter). We can open a new issue if you want :)
>>> selector.xpath('//a[contains(@href, "image")]/text()').re_first(regex) | ||
'My image 1 ' | ||
|
||
As well as adding regex flags with the ``flags`` argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs an example and/or a link to the Python doc about the flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also be OK with just appending (see :mod:`re`)
to the first sentence of this entire section, right after the first mention of “regular expressions”, since at that point already users may wonder about regular expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only workaround I found was compiling the expression and using a regex flag there
I think this feature makes sense nonetheless, but know that you can define flags in the pattern itself:
>>> from parsel import Selector
>>> text = """
... <script>
... function example() {
...
KeyboardInterrupt
>>> text = """
... <script>
... function example() {
... "name": "Adrian",
... "points": 3,
... }
... </script>
... """
>>> sel = Selector(text=text)
>>> sel.css('script').re_first(r"(?s)example\(\) ({.*})")
'{\n "name": "Adrian",\n "points": 3,\n }'
``regex`` can be either a compiled regular expression or a string which | ||
will be compiled to a regular expression using ``re.compile(regex)``. | ||
will be compiled to a regular expression using ``re.compile()``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we don’t actually compile regular expressions, which could be counter-productive performance-wise given Python’s regular expression caching, what about some rewording instead?
``regex`` is a regular expression (see :mod:`re`), either as a string
or compiled.
* if the regex contains a named group called "extract" that will be returned | ||
* if the regex contains multiple numbered groups, all those will be returned (flattened) | ||
* if the regex doesn't contain any group the entire regex matching is returned | ||
""" | ||
if isinstance(regex, str): | ||
regex = re.compile(regex, re.UNICODE) | ||
flags |= re.UNICODE | ||
regex = re.compile(regex, flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Unicode flag also applies to Python 3, though, is not about the Python 2 strings but about pattern behavior.
On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags
) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.
There are some cases where I need to apply a regex to multiple lines and the only workaround I found was compiling the expression and using a regex flag there.
Look at this example where I want to extract the content of the JavaScript function
example()
(I know the function is not exactly a function, it is just an example):Doing this requires some extra steps that could be avoided by adding support for regex flags to the
re_first()
andre()
methods. And that's what I did. With this new implementation, you can directly use it like this:This could also help a lot when needing to use case-insensitive regexes:
Let me know your thoughts :)