Releases: D4Vinci/Scrapling
v0.2.4
What's changed
Bugs Squashed
- Fixed a bug when retrieving response bytes after using the `network_idle` argument in both the `StealthyFetcher` and `PlayWrightFetcher` classes. That was causing the following error message:
  `Response.body: Protocol error (Network.getResponseBody): No resource with given identifier found`
- The PlayWright API sometimes returns an empty status text with responses, so now Scrapling will calculate it manually if that happens. This affects both the `StealthyFetcher` and `PlayWrightFetcher` classes.
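The release notes don't show how the status text is filled in. One standard-library way to derive a reason phrase from a numeric status code, shown here purely as an illustration of the idea rather than Scrapling's actual implementation, is Python's `http.HTTPStatus`:

```python
from http import HTTPStatus

def fallback_status_text(status_code: int) -> str:
    """Derive a reason phrase when the browser API returns an empty one.

    Illustration only; the name and the empty-string fallback for unknown
    codes are assumptions, not Scrapling's code.
    """
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        return ""

print(fallback_status_text(200))  # OK
print(fallback_status_text(404))  # Not Found
```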
Note
A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort, which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.
v0.2.3
What's changed
Bugs Squashed
- Fixed a bug with pip installation that prevented the stealth mode on PlayWright Fetcher from working entirely.
v0.2.2
What's changed
New features
- Now, if you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:

  ```python
  >> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
  >> page = Fetcher.get('https://example.com', stealthy_headers=True)
  ```

  Otherwise:

  ```python
  >> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
  >> page = Fetcher(auto_match=False).get('https://example.com', stealthy_headers=True)
  ```
Bugs Squashed
- Fixed a bug with the `Response` object, introduced with patch v0.2.1 yesterday, that happened in some cases of nested selecting/parsing.
v0.2.1
What's changed
New features
- Now the `Response` object returned from all fetchers is the same as the `Adaptor` object, except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All of `cookies`, `headers`, and `request_headers` are always of type `dictionary`.

  So your code can now become like:

  ```python
  >> from scrapling import Fetcher
  >> page = Fetcher().get('https://example.com', stealthy_headers=True)
  >> print(page.status)
  200
  >> products = page.css('.product')
  ```

  Instead of before:

  ```python
  >> from scrapling import Fetcher
  >> fetcher = Fetcher().get('https://example.com', stealthy_headers=True)
  >> print(fetcher.status)
  200
  >> page = fetcher.adaptor
  >> products = page.css('.product')
  ```

  But I have left the `.adaptor` property working for backward compatibility.
- Now both the `StealthyFetcher` and `PlayWrightFetcher` classes can take a `proxy` argument with the fetch method, which accepts a string or a dictionary.
- Now the `StealthyFetcher` class has the `os_randomize` argument with the `fetch` method. If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
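The notes say `proxy` accepts a string or a dictionary but don't show the dictionary's keys. As a sketch only: Playwright's own proxy settings use `server`, `username`, and `password` keys, so a plausible normalization between the two accepted forms might look like the following. The function name and the assumption that the dictionary follows Playwright's shape are mine, not from the release notes.

```python
def normalize_proxy(proxy):
    """Turn a proxy given as a plain URL string into dictionary form.

    Hypothetical helper for illustration; the "server" key follows
    Playwright's proxy settings, which is an assumption about what
    the fetchers accept.
    """
    if isinstance(proxy, str):
        return {"server": proxy}
    if isinstance(proxy, dict):
        return proxy
    raise TypeError("proxy must be a string or a dictionary")

print(normalize_proxy("http://127.0.0.1:8080"))
# {'server': 'http://127.0.0.1:8080'}
```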
Bugs Squashed
- Fixed a bug that happens while passing headers with the `Fetcher` class.
- Fixed a bug with parsing JSON responses passed from the fetcher-type classes.
Quality of life changes
- The text functionality's behavior was to try to remove HTML comments before returning the text, but that induced errors in some cases and made the code more complicated than needed. It has now reverted to the default lxml behavior, so you will notice a slight speed increase in all operations that rely on elements' text, like selectors. If you want Scrapling to remove HTML comments from elements before returning the text, to avoid the weird text-splitting behavior that's in lxml/parsel/scrapy, just keep the `keep_comments` argument set to True, as it is by default.
Version 0.2
What's changed
New features
- Introducing the Fetchers feature with 3 new main types to make Scrapling fetch pages for you with a LOT of options!
  - The `Fetcher` class for basic HTTP requests.
  - The `StealthyFetcher` class: a completely stealthy fetcher that uses a stealthy modified version of Firefox.
  - The `PlayWrightFetcher` class, which allows doing browser-based requests with vanilla PlayWright, PlayWright with stealth mode made by me, real browsers through CDP, and NSTBrowser's docker browserless!
- Added the completely new `find_all`/`find` methods to find elements easily on the page with dark magic!
- Added the `filter` and `search` methods to the `Adaptors` class for easier bulk operations on `Adaptor` object groups.
- Added the `css_first` and `xpath_first` methods for easier usage.
- Added the new class type `TextHandlers`, which is used for bulk operations on `TextHandler` objects, like the `Adaptors` class.
- Added the `generate_full_css_selector` and `generate_full_xpath_selector` methods.
Bugs Squashed
- Now the `Adaptors` class version of `re_first` returns the first result that matches across all the `Adaptor` objects inside, instead of the faulty logic of returning the `re_first` results of all `Adaptor` objects.
- Now, if the user selects text-type content to be returned from selected elements (like the css `::text` function) with any method like `.css` or `.xpath`, the `Adaptor` object will return the `TextHandlers` class instead of a list of strings like before. So now you can do `page.css('something::text').re_first(r'regex_pattern').json()` instead of `page.css('something::text')[0].re_first(r'regex_pattern').json()`.
- Now the `Adaptor`/`Adaptors` `re`/`re_first` arguments are consistent with the `TextHandler` ones, so now you have the `clean_match` and `case_sensitive` arguments.
- Now the `auto_match` argument is enabled by default in the initialization of `Adaptor`, but you still have to enable it while selecting elements if you want to use it. (Not a bug but a design decision.)
- A lot of type-annotation corrections here and there for a better auto-completion experience while you are coding with Scrapling.
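The `TextHandler`/`TextHandlers` chaining described above (string results that still expose `re_first`, with the group version returning the first match found across all members) can be sketched in plain Python. This is a minimal illustration of the idea, not Scrapling's actual implementation; the `clean_match` and `case_sensitive` arguments are omitted.

```python
import re

class TextHandler(str):
    """A string subclass whose regex helpers stay chainable (sketch only)."""
    def re_first(self, pattern):
        m = re.search(pattern, self)
        return TextHandler(m.group(0)) if m else None

class TextHandlers(list):
    """Bulk wrapper: re_first returns the first match across ALL members,
    rather than collecting the re_first result of each member."""
    def re_first(self, pattern):
        for text in self:
            result = TextHandler(text).re_first(pattern)
            if result is not None:
                return result
        return None

texts = TextHandlers(['no digits here', 'price: 42 USD'])
print(texts.re_first(r'\d+'))  # 42
```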
Quality of life changes
- Renamed both the `css_selector` and `xpath_selector` methods to `generate_css_selector` and `generate_xpath_selector` for clarity and to not interrupt auto-completion while coding.
- Restructured most of the old code into a `core` subpackage, along with other design decisions, for cleaner and easier maintenance in the future.
- Restructured the tests folder into a cleaner structure and added tests for the new features. Also, tox environments are now cached on GitHub for faster automated tests with each commit.
v0.1.2
Changelog:
- Fixed a bug where the `keep_comments` argument was not working as intended.
- Adjusted the text function to automatically remove HTML comments from elements before extracting their text, to prevent lxml's different behavior. For example:

  ```python
  >>> page = Adaptor('<span>CONDITION: <!-- -->Excellent</span>', keep_comments=True)
  >>> page.css('span::text')
  ['CONDITION: ', 'Excellent']
  ```

  Previously this would result in the output above because of lxml's default behavior, but now it returns the full text 'CONDITION: Excellent'. This behavior is known with parsel/scrapy as well, so I wanted to handle it here.
- Fixed a bug where the SQLite db file created by the library was not deleted when doing `pip uninstall scrapling` or similar.
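The text-splitting that HTML comments cause can be seen even with Python's standard-library `html.parser`, which, like lxml, reports the character data on each side of a comment separately. This is just an illustration of the underlying parser behavior, not Scrapling code:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the raw text chunks the parser emits between markup events."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per contiguous run of text; a comment node
        # interrupts the run, so the text arrives in two pieces.
        self.chunks.append(data)

parser = TextCollector()
parser.feed('<span>CONDITION: <!-- -->Excellent</span>')
print(parser.chunks)  # ['CONDITION: ', 'Excellent']
```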
v0.1.1
v0.1
Full Changelog: https://github.com/D4Vinci/Scrapling/commits/v0.1