Releases: D4Vinci/Scrapling
v0.2.4
What's changed
Bugs Squashed
- Fixed a bug when retrieving response bytes after using the `network_idle` argument in both the `StealthyFetcher` and `PlayWrightFetcher` classes. That was causing the following error message:
  `Response.body: Protocol error (Network.getResponseBody): No resource with given identifier found`
- The PlayWright API sometimes returns an empty status text with responses, so now Scrapling will calculate it manually if that happens. This affects both the `StealthyFetcher` and `PlayWrightFetcher` classes.
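The release notes don't show how the status text is filled in. One standard-library way to derive a reason phrase from a numeric status code, shown here purely as an illustration of the idea rather than Scrapling's actual implementation, is Python's `http.HTTPStatus`:

```python
from http import HTTPStatus

def fallback_status_text(status_code: int) -> str:
    """Derive a reason phrase when the browser API returns an empty one.

    Illustration only; the name and the empty-string fallback for unknown
    codes are assumptions, not Scrapling's code.
    """
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        return ""

print(fallback_status_text(200))  # OK
print(fallback_status_text(404))  # Not Found
```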
Note
A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort, which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.
v0.2.3
What's changed
Bugs Squashed
- Fixed a bug with pip installation that prevented the stealth mode on PlayWright Fetcher from working entirely.
v0.2.2
What's changed
New features
- Now, if you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:

  ```python
  >> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
  >> page = Fetcher.get('https://example.com', stealthy_headers=True)
  ```

  Otherwise:

  ```python
  >> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
  >> page = Fetcher(auto_match=False).get('https://example.com', stealthy_headers=True)
  ```
Bugs Squashed
- Fixed a bug with the `Response` object, introduced with patch v0.2.1 yesterday, that happened in some cases of nested selecting/parsing.
v0.2.1
What's changed
New features
- Now the `Response` object returned from all fetchers is the same as the `Adaptor` object, except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All of `cookies`, `headers`, and `request_headers` are always of type `dictionary`.

  So your code can now become like:

  ```python
  >> from scrapling import Fetcher
  >> page = Fetcher().get('https://example.com', stealthy_headers=True)
  >> print(page.status)
  200
  >> products = page.css('.product')
  ```

  Instead of before:

  ```python
  >> from scrapling import Fetcher
  >> fetcher = Fetcher().get('https://example.com', stealthy_headers=True)
  >> print(fetcher.status)
  200
  >> page = fetcher.adaptor
  >> products = page.css('.product')
  ```

  But I have left the `.adaptor` property working for backward compatibility.
- Now both the `StealthyFetcher` and `PlayWrightFetcher` classes can take a `proxy` argument with the fetch method, which accepts a string or a dictionary.
- Now the `StealthyFetcher` class has the `os_randomize` argument with the `fetch` method. If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
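The notes say `proxy` accepts a string or a dictionary but don't show the dictionary's keys. As a sketch only: Playwright's own proxy settings use `server`, `username`, and `password` keys, so a plausible normalization between the two accepted forms might look like the following. The function name and the assumption that the dictionary follows Playwright's shape are mine, not from the release notes.

```python
def normalize_proxy(proxy):
    """Turn a proxy given as a plain URL string into dictionary form.

    Hypothetical helper for illustration; the "server" key follows
    Playwright's proxy settings, which is an assumption about what
    the fetchers accept.
    """
    if isinstance(proxy, str):
        return {"server": proxy}
    if isinstance(proxy, dict):
        return proxy
    raise TypeError("proxy must be a string or a dictionary")

print(normalize_proxy("http://127.0.0.1:8080"))
# {'server': 'http://127.0.0.1:8080'}
```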
Bugs Squashed
- Fixed a bug that happens while passing headers with the `Fetcher` class.
- Fixed a bug with parsing JSON responses passed from the fetcher-type classes.
Quality of life changes
- The text functionality's behavior was to try to remove HTML comments before returning the text, but that induced errors in some cases and made the code more complicated than needed. It has now reverted to the default lxml behavior, so you will notice a slight speed increase in all operations that rely on elements' text, like selectors. If you want Scrapling to remove HTML comments from elements before returning the text, to avoid the weird text-splitting behavior that's in lxml/parsel/scrapy, just keep the `keep_comments` argument set to True, as it is by default.
Version 0.2
What's changed
New features
- Introducing the Fetchers feature with 3 new main types to make Scrapling fetch pages for you with a LOT of options!
  - The `Fetcher` class for basic HTTP requests.
  - The `StealthyFetcher` class: a completely stealthy fetcher that uses a stealthy modified version of Firefox.
  - The `PlayWrightFetcher` class, which allows doing browser-based requests with vanilla PlayWright, PlayWright with stealth mode made by me, real browsers through CDP, and NSTBrowser's docker browserless!
- Added the completely new `find_all`/`find` methods to find elements easily on the page with dark magic!
- Added the `filter` and `search` methods to the `Adaptors` class for easier bulk operations on `Adaptor` object groups.
- Added the `css_first` and `xpath_first` methods for easier usage.
- Added the new class type `TextHandlers`, which is used for bulk operations on `TextHandler` objects, like the `Adaptors` class.
- Added the `generate_full_css_selector` and `generate_full_xpath_selector` methods.
Bugs Squashed
- Now the `Adaptors` class version of `re_first` returns the first result that matches across all the `Adaptor` objects inside, instead of the faulty logic of returning the `re_first` results of all `Adaptor` objects.
- Now, if the user selects text-type content to be returned from selected elements (like the css `::text` function) with any method like `.css` or `.xpath`, the `Adaptor` object will return the `TextHandlers` class instead of a list of strings like before. So now you can do `page.css('something::text').re_first(r'regex_pattern').json()` instead of `page.css('something::text')[0].re_first(r'regex_pattern').json()`.
- Now the `Adaptor`/`Adaptors` `re`/`re_first` arguments are consistent with the `TextHandler` ones, so now you have the `clean_match` and `case_sensitive` arguments.
- Now the `auto_match` argument is enabled by default in the initialization of `Adaptor`, but you still have to enable it while selecting elements if you want to use it. (Not a bug but a design decision.)
- A lot of type-annotation corrections here and there for a better auto-completion experience while you are coding with Scrapling.
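The `TextHandler`/`TextHandlers` chaining described above (string results that still expose `re_first`, with the group version returning the first match found across all members) can be sketched in plain Python. This is a minimal illustration of the idea, not Scrapling's actual implementation; the `clean_match` and `case_sensitive` arguments are omitted.

```python
import re

class TextHandler(str):
    """A string subclass whose regex helpers stay chainable (sketch only)."""
    def re_first(self, pattern):
        m = re.search(pattern, self)
        return TextHandler(m.group(0)) if m else None

class TextHandlers(list):
    """Bulk wrapper: re_first returns the first match across ALL members,
    rather than collecting the re_first result of each member."""
    def re_first(self, pattern):
        for text in self:
            result = TextHandler(text).re_first(pattern)
            if result is not None:
                return result
        return None

texts = TextHandlers(['no digits here', 'price: 42 USD'])
print(texts.re_first(r'\d+'))  # 42
```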
Quality of life changes
- Renamed both the `css_selector` and `xpath_selector` methods to `generate_css_selector` and `generate_xpath_selector` for clarity and to not interrupt auto-completion while coding.
- Restructured most of the old code into a `core` subpackage, along with other design decisions, for cleaner and easier maintenance in the future.
- Restructured the tests folder into a cleaner structure and added tests for the new features. Also, tox environments are now cached on GitHub for faster automated tests with each commit.
v0.1.2
Changelog:
- Fixed a bug where the `keep_comments` argument was not working as intended.
- Adjusted the text function to automatically remove HTML comments from elements before extracting their text, to prevent lxml's different behavior. For example:

  ```python
  >>> page = Adaptor('<span>CONDITION: <!-- -->Excellent</span>', keep_comments=True)
  >>> page.css('span::text')
  ['CONDITION: ', 'Excellent']
  ```

  Previously this would result in the output above because of lxml's default behavior, but now it returns the full text 'CONDITION: Excellent'. This behavior is known with parsel/scrapy as well, so I wanted to handle it here.
- Fixed a bug where the SQLite db file created by the library was not deleted when doing `pip uninstall scrapling` or similar.
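The text-splitting that HTML comments cause can be seen even with Python's standard-library `html.parser`, which, like lxml, reports the character data on each side of a comment separately. This is just an illustration of the underlying parser behavior, not Scrapling code:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the raw text chunks the parser emits between markup events."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per contiguous run of text; a comment node
        # interrupts the run, so the text arrives in two pieces.
        self.chunks.append(data)

parser = TextCollector()
parser.feed('<span>CONDITION: <!-- -->Excellent</span>')
print(parser.chunks)  # ['CONDITION: ', 'Excellent']
```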
v0.1.1
v0.1
Full Changelog: https://github.com/D4Vinci/Scrapling/commits/v0.1