Using Headless Browsers #37

Cormanz · 2023-05-15T00:13:55Z

It seems like a good idea to integrate a headless browser through a Rust crate such as thirtyfour. That would give fast and easy access to existing plugins that rely on websites, without requiring API keys for tools such as Google, whose setup can be complicated for some users. It could also make it much easier to implement new tools and APIs.

jaykchen · 2023-05-15T02:17:38Z

I propose this combination:

https://github.com/rust-headless-chrome/rust-headless-chrome to fetch the webpage and run JavaScript
https://github.com/kumabook/readability/tree/master to extract the useful text from the webpage

My reasons:

rust-headless-chrome is lighter and more future-proof, and it is actively developed.
We need a library to clean up webpages. As an end user I have long used browser reader-mode plugins built on readability-style libraries, and they have served me well.
I also used the Node.js version of readability in a pet project, and it worked fine.

I'll spend some time gluing the two libraries together and show you the result in a day or two.

jaykchen · 2023-05-16T01:33:40Z

@Cormanz I tinkered for some time and ran into some difficulties:

The Node.js library @mozilla/readability is mature and performs well, but its Rust equivalents are not.
The straight port, https://crates.io/crates/readability, is dated and the project shows little activity.
A more active and long-standing project, https://crates.io/crates/article_scraper/2.0.0-alpha.0, also has a port, but after an hour of fumbling I haven't figured out how to get clean text content out of it. I'll spend more time and try to find out.

Cormanz · 2023-05-16T01:57:41Z

Great work so far. I'll try some experimentation with these myself, too. Keep us posted on your progress!

jaykchen · 2023-05-17T17:27:26Z

I explored two tracks:

Track 1: tune the tag-selection and content-screening settings of the readability module in https://crates.io/crates/article_scraper/2.0.0-alpha.0 so that it outputs more text than originally intended (by default, just the main article body).
If it works, it should be the computationally cheaper approach.
It takes a lot of experimentation to find workable settings.

Track 2: use https://github.com/rust-headless-chrome/rust-headless-chrome to convert the webpage to one long PDF page, then use the pdftotext tool from https://www.xpdfreader.com/ to convert the PDF to text.
It's a hack, but the results are good so far:
story.verge.txt
story.verge.pdf
verge.txt
verge.pdf

My guess as to why the results are good:

Converting the webpage to PDF discards the HTML DOM nesting and other layout information that matters for rendering a webpage but not for a printed page. (Hiding text isn't really a thing in a PDF document, right?)
Converting the PDF back to text means pdftotext works on a "simplified" document and extracts only the text.

The drawbacks are:

Converting the webpage to PDF may fail.
Converting the PDF to text may fail.
The conversions are more expensive.

The unique value is:

On difficult webpages (abusive amounts of JS, challenging layouts), an HTML parser will be beaten. The only way through is to capture the page in some form and then parse the capture.
Some projects capture the webpage as an image and then run OCR.
Here we capture the webpage as a PDF and then extract the text. It's lighter and equally useful for dealing with difficult webpages.
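
A minimal sketch of the capture-to-PDF, convert-to-text pipeline described above. It assumes a Chrome or Chromium binary and xpdf's pdftotext are on the PATH, and it drives both through the command line rather than through the rust-headless-chrome crate used in the experiment. The function names, the `chrome_bin` parameter, and the `page.pdf`/`page.txt` file names are hypothetical.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build the command that asks headless Chrome to print `url` into `pdf`.
/// `chrome_bin` is an assumption: the binary name differs per platform
/// (for example "google-chrome" or "chromium").
fn print_to_pdf_command(chrome_bin: &str, url: &str, pdf: &Path) -> Command {
    let mut cmd = Command::new(chrome_bin);
    cmd.arg("--headless")
        .arg("--disable-gpu")
        .arg(format!("--print-to-pdf={}", pdf.display()))
        .arg(url);
    cmd
}

/// Build the command that converts `pdf` into the plain-text file `txt`.
fn pdf_to_text_command(pdf: &Path, txt: &Path) -> Command {
    let mut cmd = Command::new("pdftotext");
    cmd.arg(pdf).arg(txt);
    cmd
}

/// Run both stages in `work_dir`. Either stage can fail, so each failure
/// is reported with its own message.
fn capture_page_text(chrome_bin: &str, url: &str, work_dir: &Path) -> io::Result<String> {
    let pdf = work_dir.join("page.pdf");
    let txt = work_dir.join("page.txt");

    let status = print_to_pdf_command(chrome_bin, url, &pdf).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "webpage-to-PDF step failed"));
    }

    let status = pdf_to_text_command(&pdf, &txt).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "PDF-to-text step failed"));
    }

    std::fs::read_to_string(&txt)
}
```

Calling `capture_page_text("chromium", "https://example.com", Path::new("/tmp"))` would run both tools and return the extracted text. Keeping the command builders separate from the runner makes the two failure points from the drawbacks list easy to tell apart.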

jaykchen · 2023-05-18T02:06:10Z

@Cormanz I experimented further with the capture-to-PDF, convert-to-text approach; see the code: https://github.com/jaykchen/smartgpt/blob/main/src/plugins/browse/headless/mod.rs
I don't know the current browse module thoroughly, so I just inserted the call into the existing function as follows:
pub async fn browse_url(
    ctx: &mut CommandContext,
    args: Vec<ScriptValue>,
) -> Result<ScriptValue, Box<dyn Error>> {
    let browse_info = ctx.plugin_data.get_data("Browse")?;

    let params: [(&str, &str); 0] = [];
    let url: String = args.get(0).ok_or(BrowseNoArgError)?.clone().try_into()?;

    // Original fetch through the browse plugin, disabled for this experiment:
    // let body = invoke::<String>(
    //     browse_info,
    //     "browse",
    //     BrowseRequest {
    //         url: url.to_string(),
    //         params: params
    //             .iter()
    //             .map(|el| (el.0.to_string(), el.1.to_string()))
    //             .collect::<Vec<_>>(),
    //     },
    // )
    // .await?;

    // Get the page text through the headless capture-to-PDF path instead.
    let content = headless::get_content_headless(&url).await.unwrap();

It's just an attempt to get something running.
I'll keep adapting the readability algorithm as the main approach.

Cormanz · 2023-05-18T02:09:17Z

Wow, that's fascinating! Might be worth integrating into the native browse plugin itself.

jaykchen · 2023-05-20T22:06:23Z

@Cormanz I've just opened a draft PR. The caveat is that it works only on Linux; I don't know how to make it work on other platforms.
With the readability lib, it works well on the small number of webpages centered around one big body of text. On other pages the official version returns nothing or a few odd fragments, and my adapted version returns a mixed bag of data, so it isn't dependable yet.
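
One way to live with an extractor that sometimes returns nothing or very little is to treat a short result as a miss and hand the page to a second extractor, for example the PDF path. The sketch below is only an illustration of that idea, not what the draft PR does; `extract_with_fallback`, its parameters, and the `min_chars` threshold are all hypothetical.

```rust
/// Hypothetical helper: try a cheap extractor first, and fall back to a
/// costlier one when the result is missing or shorter than `min_chars`.
fn extract_with_fallback<P, F>(
    html: &str,
    min_chars: usize,
    primary: P,
    fallback: F,
) -> Option<String>
where
    P: Fn(&str) -> Option<String>,
    F: Fn(&str) -> Option<String>,
{
    match primary(html) {
        // Accept the primary result only if it has enough visible text.
        Some(text) if text.trim().chars().count() >= min_chars => Some(text),
        // Otherwise (no result, or too short) use the fallback extractor.
        _ => fallback(html),
    }
}
```

The threshold is a blunt heuristic; it would need tuning against real pages before it could be relied on.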

philip06 · 2023-05-21T16:35:36Z

Perhaps keep it as a separate drop-in plugin for Linux only? Windows and Mac users can always use Docker or WSL if they wanna try it.
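
A Linux-only code path can be gated at compile time with Rust's `cfg` attributes. The sketch below shows the general shape under that assumption; the function names and the backend labels are hypothetical and are not part of smartgpt.

```rust
/// Compiled only on Linux: the headless path is available.
#[cfg(target_os = "linux")]
fn headless_browse_supported() -> bool {
    true
}

/// Compiled on every other platform: report the headless path as missing.
#[cfg(not(target_os = "linux"))]
fn headless_browse_supported() -> bool {
    false
}

/// Pick a backend label; "default" stands in for the existing browse path.
fn browse_backend() -> &'static str {
    if headless_browse_supported() {
        "headless"
    } else {
        "default"
    }
}
```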

jaykchen · 2023-05-23T22:35:12Z

@philip06 I found an alternative library. It's robust and supports all platforms. Please give it a spin when you're available, thanks.

philip06 · 2023-05-24T19:13:27Z

Sure, I'll try it out on Windows and WSL.
