Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for regex flags in .re() and .re_first() methods #225

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

noviluni
Copy link
Member

@noviluni noviluni commented Aug 7, 2021

There are some cases where I need to apply a regex to multiple lines and the only workaround I found was compiling the expression and using a regex flag there.

Look at this example where I want to extract the content of the JavaScript function example() (I know the function is not exactly a function, it is just an example):

>>> import re
>>> from parsel import Selector
>>> text = """
...: <script>
...:     function example() {
...:         "name": "Adrian",
...:         "points": 3,
...:     }
...: </script>
...: """
>>> sel = Selector(text=text)

# using regex strings doesn't work
>>> sel.css('script').re_first(r"example\(\) ({.*})")

# I need to compile the function:
>>> regex = re.compile(r"example\(\) ({.*})", flags=re.DOTALL)
>>> sel.css('script').re_first(regex)
'{\n        "name": "Adrian",\n         "points": 3,\n     }'

Doing this requires some extra steps that could be avoided by adding support for regex flags to the re_first() and re() methods. And that's what I did. With this new implementation, you can directly use it like this:

>>> sel.css('script').re_first(r"example\(\) ({.*})", flags=re.DOTALL)
'{\n        "name": "Adrian",\n         "points": 3,\n     }'

This could also help a lot when needing to use case-insensitive regexes:

>>> text = 'Price: 1000.00€'
>>> sel = Selector(text=text)

# The next works
>>> sel.re_first(r'Price: ([\d.]+)€')
'1000.00'

# however, when lowering the text it stops working:
>>> text2 = 'price: 1000.00€'
>>> sel2 = Selector(text=text2)
>>> sel2.re_first(r'Price: ([\d.]+)€')

# with the new implementation you can directly do:
>>> sel2.re_first(r'Price: ([\d.]+)€', flags=re.I)
'1000.00'

Let me know your thoughts :)

@codecov
Copy link

codecov bot commented Aug 7, 2021

Codecov Report

Merging #225 (df0f41b) into master (d20db09) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #225   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          291       292    +1     
  Branches        51        51           
=========================================
+ Hits           291       292    +1     
Impacted Files Coverage Δ
parsel/selector.py 100.00% <100.00%> (ø)
parsel/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d20db09...df0f41b. Read the comment docs.

@noviluni
Copy link
Member Author

noviluni commented Aug 7, 2021

@Gallaecio @wRAR, could you take a look? :)

* if the regex contains a named group called "extract" that will be returned
* if the regex contains multiple numbered groups, all those will be returned (flattened)
* if the regex doesn't contain any group the entire regex matching is returned
"""
if isinstance(regex, str):
regex = re.compile(regex, re.UNICODE)
flags |= re.UNICODE
regex = re.compile(regex, flags)
Copy link
Member Author

@noviluni noviluni Aug 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep the re.UNICODE to respect the old behavior, especially because in some docstrings you can find:

Apply the given regex and return the first unicode string which matches

However, this also means that it won't be possible to override this, and the re.UNICODE will be always applied. I don't think this is a big issue and can be changed in the future, for example when deprecating Python 2.7.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen that Python 2.7 has been already deprecated...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unicode flag also applies to Python 3, though, is not about the Python 2 strings but about pattern behavior.

On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unicode flag also applies to Python 3, though, is not about the Python 2 strings but about pattern behavior.

Sorry, I think I wasn't clear enough and mixed things. What I meant was that we could do something like:

regex = re.compile(regex, flags or re.UNICODE)

or doing what I did (always applying re.UNICODE). The first approach allows us to override the re.UNICODE, but if we were applying re.UNICODE always, it could be confusing to don't apply it when using other flags. The reference I made to Python 2.7 was that we could change this in the future but it would be a breaking change, however, it could come along with the Python 2.7 deprecation that would require a new major version. I didn't know it had been already deprecated before; after I saw that, I considered that maybe we need to think better about this decision.

On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.

It seems a good idea, but if it's not required I would love to keep this PR as-is (shorter). We can open a new issue if you want :)

>>> selector.xpath('//a[contains(@href, "image")]/text()').re_first(regex)
'My image 1 '

As well as adding regex flags with the ``flags`` argument.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs an example and/or a link to the Python doc about the flags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be OK with just appending (see :mod:`re`) to the first sentence of this entire section, right after the first mention of “regular expressions”, since at that point already users may wonder about regular expressions.

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only workaround I found was compiling the expression and using a regex flag there

I think this feature makes sense nonetheless, but know that you can define flags in the pattern itself:

>>> from parsel import Selector
>>> text = """
... <script>
... function example() {
... 
KeyboardInterrupt
>>> text = """
... <script>
...     function example() {
...         "name": "Adrian",
...         "points": 3,
...     }
... </script>
... """
>>> sel = Selector(text=text)
>>> sel.css('script').re_first(r"(?s)example\(\) ({.*})")
'{\n        "name": "Adrian",\n        "points": 3,\n    }'

Comment on lines 382 to +383
``regex`` can be either a compiled regular expression or a string which
will be compiled to a regular expression using ``re.compile(regex)``.
will be compiled to a regular expression using ``re.compile()``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we don’t actually compile regular expressions, which could be counter-productive performance-wise given Python’s regular expression caching, what about some rewording instead?

        ``regex`` is a regular expression (see :mod:`re`), either as a string 
        or compiled.

* if the regex contains a named group called "extract" that will be returned
* if the regex contains multiple numbered groups, all those will be returned (flattened)
* if the regex doesn't contain any group the entire regex matching is returned
"""
if isinstance(regex, str):
regex = re.compile(regex, re.UNICODE)
flags |= re.UNICODE
regex = re.compile(regex, flags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unicode flag also applies to Python 3, though, is not about the Python 2 strings but about pattern behavior.

On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.

tests/test_selector.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants