Add support for regex flags in .re() and .re_first() methods #225
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -119,7 +119,10 @@ def css(self, query: str) -> "SelectorList[_SelectorType]": | |
return self.__class__(flatten([x.css(query) for x in self])) | ||
|
||
def re( | ||
- self, regex: Union[str, Pattern[str]], replace_entities: bool = True | ||
+ self, | ||
+ regex: Union[str, Pattern[str]], | ||
+ replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> List[str]: | ||
""" | ||
Call the ``.re()`` method for each element in this list and return | ||
|
@@ -129,15 +132,22 @@ def re( | |
corresponding character (except for ``&`` and ``<``. | ||
Passing ``replace_entities`` as ``False`` switches off these | ||
replacements. | ||
|
||
+ It is possible to provide regex flags using the `flags` argument. They | ||
+ will be applied only if the provided regex is not a compiled regular | ||
+ expression. | ||
""" | ||
- return flatten([x.re(regex, replace_entities=replace_entities) for x in self]) | ||
+ return flatten( | ||
+ [x.re(regex, replace_entities=replace_entities, flags=flags) for x in self] | ||
+ ) | ||
|
||
@typing.overload | ||
def re_first( | ||
self, | ||
regex: Union[str, Pattern[str]], | ||
default: None = None, | ||
replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> Optional[str]: | ||
pass | ||
|
||
|
@@ -147,6 +157,7 @@ def re_first( | |
regex: Union[str, Pattern[str]], | ||
default: str, | ||
replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> str: | ||
pass | ||
|
||
|
@@ -155,6 +166,7 @@ def re_first( | |
regex: Union[str, Pattern[str]], | ||
default: Optional[str] = None, | ||
replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> Optional[str]: | ||
""" | ||
Call the ``.re()`` method for the first element in this list and | ||
|
@@ -168,7 +180,7 @@ def re_first( | |
replacements. | ||
""" | ||
for el in iflatten( | ||
- x.re(regex, replace_entities=replace_entities) for x in self | ||
+ x.re(regex, replace_entities=replace_entities, flags=flags) for x in self | ||
): | ||
return el | ||
return default | ||
|
@@ -358,28 +370,38 @@ def _css2xpath(self, query: str) -> Any: | |
return self._csstranslator.css_to_xpath(query) | ||
|
||
def re( | ||
- self, regex: Union[str, Pattern[str]], replace_entities: bool = True | ||
+ self, | ||
+ regex: Union[str, Pattern[str]], | ||
+ replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> List[str]: | ||
""" | ||
Apply the given regex and return a list of unicode strings with the | ||
matches. | ||
|
||
``regex`` can be either a compiled regular expression or a string which | ||
- will be compiled to a regular expression using ``re.compile(regex)``. | ||
+ will be compiled to a regular expression using ``re.compile()``. | ||
Comment on lines 382 to +383:
Assuming we don’t actually compile regular expressions, which could be counter-productive performance-wise given Python’s regular expression caching, what about some rewording instead?
|
||
|
||
By default, character entity references are replaced by their | ||
corresponding character (except for ``&`` and ``<``). | ||
Passing ``replace_entities`` as ``False`` switches off these | ||
replacements. | ||
|
||
+ It is possible to provide regex flags using the `flags` argument. They | ||
+ will be applied only if the provided regex is not a compiled regular | ||
+ expression. | ||
""" | ||
- return extract_regex(regex, self.get(), replace_entities=replace_entities) | ||
+ return extract_regex( | ||
+ regex, self.get(), replace_entities=replace_entities, flags=flags | ||
+ ) | ||
|
||
@typing.overload | ||
def re_first( | ||
self, | ||
regex: Union[str, Pattern[str]], | ||
default: None = None, | ||
replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> Optional[str]: | ||
pass | ||
|
||
|
@@ -389,6 +411,7 @@ def re_first( | |
regex: Union[str, Pattern[str]], | ||
default: str, | ||
replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> str: | ||
pass | ||
|
||
|
@@ -397,6 +420,7 @@ def re_first( | |
regex: Union[str, Pattern[str]], | ||
default: Optional[str] = None, | ||
replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> Optional[str]: | ||
""" | ||
Apply the given regex and return the first unicode string which | ||
|
@@ -407,9 +431,14 @@ def re_first( | |
corresponding character (except for ``&`` and ``<``). | ||
Passing ``replace_entities`` as ``False`` switches off these | ||
replacements. | ||
|
||
+ It is possible to provide regex flags using the `flags` argument. They | ||
+ will be applied only if the provided regex is not a compiled regular | ||
+ expression. | ||
""" | ||
return next( | ||
- iflatten(self.re(regex, replace_entities=replace_entities)), default | ||
+ iflatten(self.re(regex, replace_entities=replace_entities, flags=flags)), | ||
+ default, | ||
) | ||
|
||
def get(self) -> str: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,15 +57,20 @@ def _is_listlike(x: Any) -> bool: | |
|
||
|
||
def extract_regex( | ||
- regex: Union[str, Pattern[str]], text: str, replace_entities: bool = True | ||
+ regex: Union[str, Pattern[str]], | ||
+ text: str, | ||
+ replace_entities: bool = True, | ||
+ flags: int = 0, | ||
) -> List[str]: | ||
"""Extract a list of unicode strings from the given text/encoding using the following policies: | ||
+ * if the regex is a string it will be compiled using the provided flags | ||
* if the regex contains a named group called "extract" that will be returned | ||
* if the regex contains multiple numbered groups, all those will be returned (flattened) | ||
* if the regex doesn't contain any group the entire regex matching is returned | ||
""" | ||
if isinstance(regex, str): | ||
- regex = re.compile(regex, re.UNICODE) | ||
+ flags |= re.UNICODE | ||
+ regex = re.compile(regex, flags) | ||
I keep the
However, this also means that it won't be possible to override this, and the

I've seen that Python 2.7 has already been deprecated...

The Unicode flag also applies to Python 3, though; it is not about Python 2 strings but about pattern behavior. On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and

Sorry, I think I wasn't clear enough and mixed things. What I meant was that we could do something like: `regex = re.compile(regex, flags or re.UNICODE)` or doing what I did (always applying

It seems a good idea, but if it's not required I would love to keep this PR as-is (shorter). We can open a new issue if you want :)
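The two options weighed in this thread, always OR-ing in `re.UNICODE` (as the diff does) versus using it only as a fallback (`flags or re.UNICODE`), can be compared with the standard library alone. The sketch below is illustrative and not part of the PR; the helper names `compile_always_unicode` and `compile_default_unicode` are made up for the comparison.

```python
import re
from typing import Pattern


def compile_always_unicode(pattern: str, flags: int = 0) -> Pattern[str]:
    # Mirrors the diff: re.UNICODE is always OR-ed into the caller's flags.
    return re.compile(pattern, flags | re.UNICODE)


def compile_default_unicode(pattern: str, flags: int = 0) -> Pattern[str]:
    # Mirrors the alternative from the thread: re.UNICODE is only a default.
    return re.compile(pattern, flags or re.UNICODE)


# For ordinary flags the two approaches yield patterns with identical flags,
# because Python 3 applies UNICODE matching to str patterns by default.
a = compile_always_unicode(r"\w+", re.IGNORECASE)
b = compile_default_unicode(r"\w+", re.IGNORECASE)
same_flags = a.flags == b.flags  # True

# They differ for re.ASCII: combining it with re.UNICODE is rejected by `re`.
ascii_words = compile_default_unicode(r"\w+", re.ASCII).findall("café")  # ['caf']
try:
    compile_always_unicode(r"\w+", re.ASCII)
    always_unicode_accepts_ascii = True
except ValueError:
    # "ASCII and UNICODE flags are incompatible"
    always_unicode_accepts_ascii = False
```

Under these assumptions, the practical cost of always applying `re.UNICODE` is that a caller cannot request ASCII-only matching through `flags`; they would have to pass a pre-compiled pattern instead.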
||
|
||
if "extract" in regex.groupindex: | ||
# named group | ||
|
I think this needs an example and/or a link to the Python doc about the flags.
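As one possible shape for such an example, here is a minimal standalone sketch of the behaviour described in the `extract_regex` docstring. It is a simplification and not the library's implementation: `extract_regex_sketch` is a hypothetical name, entity replacement is omitted, and only the standard `re` module is used.

```python
import re
from typing import List, Pattern, Union


def extract_regex_sketch(
    regex: Union[str, Pattern[str]], text: str, flags: int = 0
) -> List[str]:
    # Flags are applied only when the regex is given as a string; a compiled
    # pattern keeps whatever flags it was compiled with.
    if isinstance(regex, str):
        regex = re.compile(regex, flags | re.UNICODE)

    # A named group called "extract" takes priority: return its first match.
    if "extract" in regex.groupindex:
        match = regex.search(text)
        return [match.group("extract")] if match else []

    # Otherwise return all groups (flattened), or whole matches if no groups.
    results: List[str] = []
    for found in regex.findall(text):
        if isinstance(found, tuple):
            results.extend(found)
        else:
            results.append(found)
    return results


text = "<p>Price: 10 EUR</p>\n<p>price: 20 eur</p>"

# Without flags the match is case-sensitive.
case_sensitive = extract_regex_sketch(r"price: (\d+)", text)  # ['20']

# With re.IGNORECASE both lines match.
ignore_case = extract_regex_sketch(r"price: (\d+)", text, flags=re.IGNORECASE)  # ['10', '20']

# Flags are ignored for an already compiled pattern.
compiled = re.compile(r"price: (\d+)")
precompiled = extract_regex_sketch(compiled, text, flags=re.IGNORECASE)  # ['20']
```

The available flag constants (`re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`, and so on) are documented in the Python `re` module reference.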
I would also be OK with just appending (see :mod:`re`) to the first sentence of this entire section, right after the first mention of “regular expressions”, since at that point users may already wonder about regular expressions.