Chromedriver stalls after a few hundred pageloads #250
Additional information on this error: it occurs after a more or less random number of page loads, sometimes after just a few. It happens on chromedriver 91.0.4472.114 as well as on 91.0.4472.101. Attempting to close the driver returns a "chrome not reachable" error: Selenium message: chrome not reachable
Now experiencing a similar issue with geckodriver. The problem seems to become more frequent with continued use. Makes me wonder if there's some kind of accumulating problem on the backend, like a cache filling up a little bit every time I use the driver.
@chriscarrollsmith I think you are onto something.
The problem is exactly what you think it is: you are overfilling them (*). Chrome has some faults in its cache and memory limits, and you can only run so many Selenium scripts at the same time with the task scheduler before you have to start manually killing ports. The Fulton County SPMO team's best guess is that around 8-16 scripts can be run simultaneously before errors start happening. We have a monstrosity of a VM that is too big to fit in a Docker container (it's 1 TB of disk space total, and yes, we clean 25 to 100 GB daily). I shall break this up into several posts to go over the three situations.

(*) The duration the script runs and waiting for the page to load; the number of available ports for java.exe processes; chromedriver's memory; chromedriver's cache; and the disk space where your driver is stored.

P.S. If you're wondering why the heck an organization would want a 1 TB Docker container: no one talks about the price of free open data.
Fixing waiting for the page to load, and fixing the duration the script runs

You need to add wait time after each command in Selenium using Sys.sleep(). RSelenium still uses, I believe, Selenium 2 (or 3), so it does not automatically detect page loads; that is a newer feature. Typically we add Sys.sleep(3) after every line in our scripts, and about Sys.sleep(10) after the first page load. Because of the random nature of your failures, I suspect that this is the primary reason your script is erroring out.

Why do you need sleep commands? If no one has ever told you: the truth is that most fraud and illegal activity of a corporation, nonprofit, or government cannot be found in an API or open dataset. You have to web scrape and collect it from webpages or documents. The problem is that most organizations go to great lengths to protect themselves from web scraping and automation. Even people like me, who have legal backing to scrape their websites, have to make sure the bots are not detected. So you have to slow down your bot. There is a good lecture by a professor at my old university on the topic that you can find on his course page: https://poloclub.github.io/cse6242-2022fall-campus/

The total test run time cannot exceed 5 hours and 45 minutes. If you're going to have a longer script, break the code up into two scripts. You want to know the downside of big data? Having to wait 6 hours to download 100 GB of constantly changing data, then remove all the sensitive information and provide it to the general public as a small 20 MB file. Needless to say, my organization has found that while you can make each line of your Selenium script take as long as you want, you generally should not have a test that lasts longer than 5 hours.
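The pacing advice above can be sketched in R. This is a minimal illustration, not RSelenium API: `paced` and `pause_secs` are hypothetical helper names, and the commented usage assumes a live remote driver `remDr` such as the one returned inside rsDriver()'s client.

```r
# Hypothetical pacing helper (not part of RSelenium): run one driver
# command, then sleep so the page has time to settle.
pause_secs <- 3

paced <- function(action, wait = pause_secs) {
  result <- action      # lazy argument: the driver command executes here
  Sys.sleep(wait)       # slow the bot down between commands
  result
}

# Usage against a live RSelenium driver (not run here):
# remDr$navigate("https://example.com"); Sys.sleep(10)  # longer wait on first load
# btn <- paced(remDr$findElement("css selector", "#next"))
# paced(btn$clickElement())
```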
Fixing errors with "the number of available ports for java.exe/chrome.exe processes"
This is the error message of a runaway(*) Selenium web driver, which you would have to close using Task Manager or PowerShell on Windows. Any time there is a network outage, or someone decides to upgrade IT infrastructure or whatnot, you have to make sure it closes. In my organization, it is a cybersecurity risk if any automated process does not automatically close after erroring out.
(*) Like a runaway chemical reaction.
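One common way to guarantee the cleanup described above is a tryCatch() with a finally clause, so the driver is closed whether the scrape succeeds or errors. A minimal sketch, assuming a driver object with a close() method; `with_driver` is a hypothetical helper name, not RSelenium API (with rsDriver() you would typically also stop the server, e.g. via its server object).

```r
# Wrap a scraping body so the driver is always closed, even after an error.
# `with_driver` is a hypothetical helper, not part of RSelenium.
with_driver <- function(driver, body) {
  tryCatch(
    body(driver),
    error = function(e) message("scrape failed: ", conditionMessage(e)),
    finally = driver$close()  # runs on success or error alike
  )
}

# If a driver still escapes, on Windows you can kill stragglers by hand:
#   taskkill /F /IM chromedriver.exe
#   taskkill /F /IM java.exe
```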
Fixing chromedriver's memory, chromedriver's cache, and the disk space where your driver is stored
This is probably the least likely reason (seeing as Firefox gave you an issue too), but I have hit the limit before, so just in case. While you can scale it up a little bit, this is one disadvantage of using Chrome. The fix is simply to split up your R script or web driver calls into more manageable chunks. So instead of doing all 100 million pages, maybe only do 100. Then save the work, close the Selenium driver, and pause for 1 minute. Then call the Selenium driver again, this time starting back up at the 101st page.

P.S. If this is Twitter data: unfortunately, Twitter and Facebook are jerks.
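The chunk-save-restart pattern above can be sketched like this. Only the chunking helper itself is concrete; `scrape_page`, `save_results`, and `new_driver` are hypothetical stand-ins for your own code, not RSelenium functions.

```r
# Split a page vector into fixed-size batches, so each driver session
# only handles a manageable chunk before being torn down.
chunk_pages <- function(pages, chunk_size = 100) {
  split(pages, ceiling(seq_along(pages) / chunk_size))
}

# Sketch of the restart loop (not run here):
# for (chunk in chunk_pages(all_pages, 100)) {
#   driver  <- new_driver()                      # fresh cache and memory
#   results <- lapply(chunk, scrape_page, driver = driver)
#   save_results(results)                        # persist before teardown
#   driver$close()
#   Sys.sleep(60)                                # let ports free up
# }
```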
Windows 10 Home 64 bit
RSelenium package version 1.7.7 installed through CRAN
My Chrome version is 91.0.4472.106, but I am using 91.0.4472.101 version of chromedriver installed through wdman
If I run a prolonged scrape (a few hundred page loads), my script will simply stall. No error message, no timeout, it just... hangs until I manually stop the process. It seems like RSelenium is somehow losing its connection to the driver, because I am thereafter unable to do anything with the driver, including close it, short of closing RStudio and starting a new session. The problem is apparently with chromedriver rather than RSelenium, because I have switched to geckodriver/Firefox, and that combo is working fine. This problem arose in just the last couple of weeks. It began immediately after I had to change my "chromever" parameter in the rsDriver() function, presumably because of an auto-update of either chromedriver or Chrome. I'm not personally looking for an immediate fix, because I got geckodriver working fine, but thought I'd put this here as a starting point for troubleshooting should others experience the same issue.