Using Headless Browsers #37

Open
Cormanz opened this issue May 15, 2023 · 10 comments

Comments

@Cormanz
Owner

Cormanz commented May 15, 2023

It seems like a good idea to integrate headless browsers through a Rust crate such as thirtyfour. That would give fast and easy access to existing plugins that rely on websites, without the API keys that tools such as google currently need, which can be complicated for some users to set up. It could also make it much easier to implement new tools and APIs.

@jaykchen

I propose this combination:

  1. https://github.com/rust-headless-chrome/rust-headless-chrome
    to fetch the webpage and run JavaScript
  2. https://github.com/kumabook/readability/tree/master
    to extract the useful text from the webpage

My reasons:

  • rust-headless-chrome is lighter and more future-proof, and it is actively developed
  • we need some library to clean up webpages; the browser reader-mode plugins I have used as an end user are built on readability-style libraries, and they have served me well
  • I also used the Node.js version of readability in a pet project, and it worked fine

I'll spend some time gluing the two libraries together and show you the result in a day or two.
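
The two-stage combination proposed above can be sketched with plain traits. This is only an illustration under stated assumptions: `PageFetcher`, `TextExtractor`, `StubFetcher`, `TagStripper`, and `browse` are hypothetical names, the fetcher returns canned HTML instead of driving rust-headless-chrome, and the tag stripper is a naive stand-in for readability, not its algorithm.

```rust
/// Stage 1: anything that can turn a URL into rendered HTML.
/// (Hypothetical trait; a real implementation would drive a headless browser.)
trait PageFetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// Stage 2: anything that can turn HTML into readable text.
/// (Hypothetical trait; a real implementation would wrap a readability library.)
trait TextExtractor {
    fn extract(&self, html: &str) -> String;
}

/// Stand-in for the browser stage: returns canned HTML for any non-empty URL.
struct StubFetcher;

impl PageFetcher for StubFetcher {
    fn fetch(&self, url: &str) -> Result<String, String> {
        if url.is_empty() {
            return Err("empty URL".to_string());
        }
        Ok("<html><body><h1>Title</h1><p>Hello, <b>world</b></p></body></html>".to_string())
    }
}

/// Stand-in for the readability stage: drops everything between '<' and '>'.
struct TagStripper;

impl TextExtractor for TagStripper {
    fn extract(&self, html: &str) -> String {
        let mut out = String::new();
        let mut in_tag = false;
        for ch in html.chars() {
            match ch {
                '<' => in_tag = true,
                '>' => {
                    in_tag = false;
                    // Keep words on either side of a tag from fusing together.
                    out.push(' ');
                }
                c if !in_tag => out.push(c),
                _ => {}
            }
        }
        // Collapse the whitespace left behind by the removed tags.
        out.split_whitespace().collect::<Vec<_>>().join(" ")
    }
}

/// Glue: fetch first, then extract. Either stage can be swapped independently.
fn browse(
    fetcher: &dyn PageFetcher,
    extractor: &dyn TextExtractor,
    url: &str,
) -> Result<String, String> {
    let html = fetcher.fetch(url)?;
    Ok(extractor.extract(&html))
}
```

Keeping the two stages behind separate traits means the browser crate and the text-cleaning crate can be evaluated and replaced one at a time.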

@jaykchen

@Cormanz I tinkered for some time and encountered some difficulties:

@Cormanz
Owner Author

Cormanz commented May 16, 2023

Great work so far. I'll try some experimentation with these myself, too. Keep us posted on your progress!

@jaykchen

I explored two tracks:

  1. Manipulating the tag-selection and content-screening settings of the readability module of https://crates.io/crates/article_scraper/2.0.0-alpha.0
    so that it outputs more text than originally intended (just the main article body text).
    This should be the computationally cheaper approach if it succeeds,
    but it takes a lot of experimentation to find workable settings.

  2. Using https://github.com/rust-headless-chrome/rust-headless-chrome to convert the webpage to one long PDF page,
    then using the pdftotext tool from https://www.xpdfreader.com/ to convert the PDF to text.
    It's a hack, but the results are good so far:
    story.verge.txt
    story.verge.pdf
    verge.txt
    verge.pdf
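
The PDF-to-text half of track 2 can be sketched by shelling out to the `pdftotext` command-line tool. This is a minimal sketch, not the code from the branch: it assumes `pdftotext` is on the `PATH`, that the PDF has already been written to disk by the browser stage, and that passing `-` as the output file sends the text to stdout. `pdftotext_args` and `pdf_to_text` are hypothetical helper names.

```rust
use std::ffi::OsString;
use std::io;
use std::path::Path;
use std::process::Command;

/// Builds the argument list for `pdftotext <pdf> -`.
/// The trailing "-" asks pdftotext to write the text to stdout.
fn pdftotext_args(pdf: &Path) -> Vec<OsString> {
    vec![pdf.as_os_str().to_os_string(), OsString::from("-")]
}

/// Runs pdftotext on an existing PDF and returns the extracted text.
/// Fails if the tool is missing, cannot be started, or exits with an error.
fn pdf_to_text(pdf: &Path) -> io::Result<String> {
    let output = Command::new("pdftotext")
        .args(pdftotext_args(pdf))
        .output()?;

    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("pdftotext exited with {}", output.status),
        ));
    }

    // PDFs can yield bytes that are not valid UTF-8; replace them rather than fail.
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Calling `pdf_to_text(Path::new("verge.pdf"))` would return the page text, or an `io::Error` if the conversion fails.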

My guess as to why the results are good:

  • when the webpage is converted to PDF, the HTML DOM nesting and other layout information that matters for webpage rendering but not for a printed page is discarded (hiding text isn't a trick that works in a PDF document, right?)
  • when the PDF is converted back to text, pdftotext works with a "simplified" document and gets only the text out

The drawbacks are:

  • converting the webpage to PDF may fail
  • converting the PDF to text may fail
  • the conversions are more expensive

The unique value is:

  • with difficult webpages (abusive levels of JS, challenging layouts), an HTML parser will be beaten; the only way is to capture the page in some form and then parse the capture
  • there are projects that capture the webpage as an image and then run OCR on it
  • here we capture the webpage as a PDF and then extract the text, which is lighter and equally useful for dealing with difficult webpages
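
Since either conversion can fail and the PDF route is the more expensive one, one way to combine the two tracks is to try the cheap extractor first and fall back to the PDF capture. The sketch below only illustrates that ordering: `ExtractError`, `Strategy`, `extract_with_fallback`, and the three stub functions are hypothetical and stand in for the real readability and PDF code.

```rust
/// Which stage failed, so the caller can report or log it.
#[derive(Debug, PartialEq)]
enum ExtractError {
    Readability(String),
    PdfRender(String),
    PdfToText(String),
    EmptyOutput,
}

/// One extraction strategy: URL in, text or error out.
type Strategy = fn(&str) -> Result<String, ExtractError>;

/// Tries each strategy in order (cheapest first) and returns the first
/// non-empty text. If all of them fail, returns every failure seen.
fn extract_with_fallback(
    url: &str,
    strategies: &[Strategy],
) -> Result<String, Vec<ExtractError>> {
    let mut failures = Vec::new();
    for strategy in strategies {
        match strategy(url) {
            Ok(text) if !text.trim().is_empty() => return Ok(text),
            // A strategy that "succeeds" with no text is treated as a miss.
            Ok(_) => failures.push(ExtractError::EmptyOutput),
            Err(e) => failures.push(e),
        }
    }
    Err(failures)
}

/// Stub: pretends the HTML parser was beaten by a difficult page.
fn readability_stub(_url: &str) -> Result<String, ExtractError> {
    Err(ExtractError::Readability("no article body found".to_string()))
}

/// Stub: pretends the capture-to-PDF route recovered the text.
fn pdf_stub(_url: &str) -> Result<String, ExtractError> {
    Ok("text recovered via PDF".to_string())
}

/// Stub: pretends the browser could not render the page to PDF.
fn failing_pdf_stub(_url: &str) -> Result<String, ExtractError> {
    Err(ExtractError::PdfRender("print to PDF failed".to_string()))
}
```

Returning the full list of failures keeps both the "webpage to PDF" and "PDF to text" error cases visible to the caller instead of hiding them behind the last error.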

@jaykchen

@Cormanz I experimented further with capture-to-PDF-then-convert-to-text, see the code: https://github.com/jaykchen/smartgpt/blob/main/src/plugins/browse/headless/mod.rs
I don't know the current browse module thoroughly, so I just inserted the call into the existing function as follows:
pub async fn browse_url(
    ctx: &mut CommandContext,
    args: Vec<ScriptValue>,
) -> Result<ScriptValue, Box<dyn Error>> {
    let browse_info = ctx.plugin_data.get_data("Browse")?;

    let params: [(&str, &str); 0] = [];
    let url: String = args.get(0).ok_or(BrowseNoArgError)?.clone().try_into()?;

    // Original request-based fetch, commented out for this experiment:
    // let body = invoke::<String>(
    //     browse_info,
    //     "browse",
    //     BrowseRequest {
    //         url: url.to_string(),
    //         params: params
    //             .iter()
    //             .map(|el| (el.0.to_string(), el.1.to_string()))
    //             .collect::<Vec<_>>(),
    //     },
    // )
    // .await?;

    // Fetch the page text through the headless capture-to-PDF path instead.
    let content = headless::get_content_headless(&url).await.unwrap();

It's just an attempt to get something running.
I will keep adapting the readability algorithm as the main approach.

@Cormanz
Owner Author

Cormanz commented May 18, 2023

Wow, that's fascinating! Might be worth integrating into the native browse plugin itself.

@jaykchen

@Cormanz I've just opened a draft PR. The caveat is that it works only on Linux, and I don't know how to make it work on other platforms.
With the readability library, extraction works well on the small number of webpages centered on one big body of text. On other pages, the official version returns nothing or a little odd information, and my adapted version returns a mixed bag of data that is not dependable yet.

@philip06
Contributor

philip06 commented May 21, 2023

Perhaps keep it as a separate drop-in plugin for Linux only? Windows and Mac users can always use Docker or WSL if they want to try it.
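
One way to keep a Linux-only backend from breaking other builds is Rust's conditional compilation. The sketch below is an assumption about how such gating could look, not how the draft PR does it: `headless_available` and `browse_backend` are hypothetical names, and the backend labels are placeholders.

```rust
/// On Linux the headless backend is compiled in.
#[cfg(target_os = "linux")]
fn headless_available() -> bool {
    true
}

/// On every other target the same function exists but reports "not available",
/// so callers compile everywhere without their own cfg checks.
#[cfg(not(target_os = "linux"))]
fn headless_available() -> bool {
    false
}

/// Picks the backend label the browse plugin would use on this platform.
fn browse_backend() -> &'static str {
    if headless_available() {
        "headless"
    } else {
        "http"
    }
}
```

With this shape, Windows and Mac builds fall back to the existing request-based path, while Docker or WSL users get the Linux code path.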

@jaykchen

@philip06 I found an alternative library. It's robust and supports all platforms. Please give it a spin when you're available, thanks.

@philip06
Contributor

Sure, I'll try it out on Windows and WSL.
