
[🐛 Bug]: Keda and Selenium Grid mismatch queues #2442

Open
jorgegb95 opened this issue Oct 23, 2024 · 8 comments

Comments

@jorgegb95

What happened?

We have Selenium Grid with KEDA enabled on an AKS cluster, installed using Terraform.
The problem we have detected is that when the number of queued session requests exceeds the maximum replica count KEDA can scale to, KEDA stops detecting that requests are still queued, even though the hub pod does see them in its queue.

Command used to start Selenium Grid with Docker (or Kubernetes)

# -- Repository: https://github.com/SeleniumHQ/docker-selenium/blob/trunk/charts/selenium-grid/values.yaml

hub:
  nameOverride: selenium-router
  imagePullPolicy: IfNotPresent
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  resources:
    requests:
      cpu: ${hub_cpu_requests}
      memory: ${hub_memory_requests}
    limits:
      cpu: ${hub_cpu_limits}
      memory: ${hub_memory_limits}
  extraEnvironmentVariables:
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: ${SE_SESSION_REQUEST_TIMEOUT}

autoscaling:
  enabled: true
  scaleOptions:
    minReplicaCount: 0
    maxReplicaCount: 40

tls:
  create: false

serviceAccount:
  create: true

ingress:
  enabled: true

chromeNode:
  enabled: true
  replicas: 0
  resources:
    requests:
      cpu: ${chrome_node_cpu_requests}
      memory: ${chrome_node_memory_requests}
    limits:
      cpu: ${chrome_node_cpu_limits}
      memory: ${chrome_node_memory_limits}
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  extraEnvironmentVariables:
    - name: SE_NODE_SESSION_TIMEOUT
      value: ${SE_NODE_SESSION_TIMEOUT}

firefoxNode:
  enabled: true
  replicas: 0
  resources:
    requests:
      cpu: ${firefox_node_cpu_requests}
      memory: ${firefox_node_memory_requests}
    limits:
      cpu: ${firefox_node_cpu_limits}
      memory: ${firefox_node_memory_limits}
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  extraEnvironmentVariables:
    - name: SE_NODE_SESSION_TIMEOUT
      value: ${SE_NODE_SESSION_TIMEOUT}


edgeNode:
  enabled: true
  replicas: 0
  resources:
    requests:
      cpu: ${edge_node_cpu_requests}
      memory: ${edge_node_memory_requests}
    limits:
      cpu: ${edge_node_cpu_limits}
      memory: ${edge_node_memory_limits}
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  extraEnvironmentVariables:
    - name: SE_NODE_SESSION_TIMEOUT
      value: ${SE_NODE_SESSION_TIMEOUT}
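
For reference, these values make the chart render one KEDA ScaledJob per node type; roughly the shape below. This is only a sketch: the trigger fields follow the KEDA selenium-grid scaler docs, the service name in the url is assumed, and the actual manifest generated by chart 0.36.1 may differ, so check it with helm template or kubectl get scaledjob -n selenium -o yaml.

# Approximate shape of the ScaledJob created for the Chrome node
# (jobTargetRef with the node pod template omitted for brevity).
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-grid-selenium-chrome-node
  namespace: selenium
spec:
  maxReplicaCount: 40   # from autoscaling.scaleOptions.maxReplicaCount above
  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-router.selenium:4444/graphql'   # Grid GraphQL endpoint (assumed service name)
        browserName: 'chrome'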

Relevant log output

No errors appear in the log. KEDA reports 10 queued jobs, but Selenium shows 20 queued requests, and once the 10 jobs that KEDA created finish, it does not spin up any more browser pods.

INFO	scaleexecutor	Remove a job by reaching the historyLimit	{"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "job.Name": "selenium-grid-selenium-chrome-node-vrjcm", "historyLimit": 0}

2024-10-23T11:40:39Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of running Jobs": 40}

2024-10-23T11:40:39Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of pending Jobs": 10}
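
To compare the two counts directly, you can query the Grid's GraphQL endpoint, which is the same endpoint the KEDA selenium-grid scaler polls; a quick sketch, assuming the router service is reachable at selenium-router.selenium:4444:

# Ask the Grid how many requests are queued and how many sessions are running,
# then compare against KEDA's "Number of pending Jobs" log line above.
curl -s -X POST http://selenium-router.selenium:4444/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query":"{ grid { maxSession, sessionCount, sessionQueueSize } }"}'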

Operating System

Aks Kubernetes version 1.28.9

Docker Selenium version (image tag)

4.25.0-20240922

Selenium Grid chart version (chart version)

0.36.1


@jorgegb95, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

VietND96 commented Oct 23, 2024

Hi, are you using the KEDA images 4.15.1 that are delivered in this repo?
https://github.com/SeleniumHQ/docker-selenium/blob/trunk/.keda/README.md

@jorgegb95
Author

We simply enabled autoscaling in the values as indicated in the documentation (autoscaling.enabled), so we understand it installs the latest available version of KEDA.
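
One way to confirm which KEDA version was actually installed is to read the image tag off the operator pods; a minimal sketch, assuming the KEDA subchart landed in the same namespace and uses the standard keda-operator labels:

# Print each KEDA operator pod together with its image tag.
kubectl -n selenium get pods -l app.kubernetes.io/name=keda-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'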

@VietND96
Copy link
Member

OK, can you try chart version 0.36.4 and remove replicas: 0 from each node config?
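
For clarity, the suggestion is just to drop the fixed replica count from each node block so the KEDA ScaledJob alone owns the number of node Jobs; a sketch based on the values posted above:

chromeNode:
  enabled: true
  # replicas: 0   <- remove this line; KEDA decides the count
firefoxNode:
  enabled: true
  # replicas: 0   <- remove this line
edgeNode:
  enabled: true
  # replicas: 0   <- remove this line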

@VietND96 VietND96 added this to the 4.27.0 milestone Nov 5, 2024
@jorgegb95
Author

We have tried installing chart version 0.37.0 and are seeing the Selenium pod restart with exit code 143.

I should add that we are launching about 100 Chrome nodes and that is when the 143 error occurs. How could I debug which piece of Selenium is producing the error?
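
Exit code 143 means the container received SIGTERM, typically Kubernetes killing it after a failed liveness probe or an eviction, rather than the process crashing on its own. A starting point for narrowing it down, with placeholder pod name and namespace:

# Show why the container was last terminated (reason, exit code, signal).
kubectl -n selenium get pod <router-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Recent events for the pod usually name the failing probe or the eviction.
kubectl -n selenium describe pod <router-pod>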

@VietND96
Member

VietND96 commented Nov 6, 2024

I suspect something around this config key:

global:
  seleniumGrid:
    # -- Specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background.
    revisionHistoryLimit: 10

Can you try updating this to a higher value, e.g. 1000, to see how it behaves?

@jorgegb95
Author

I have made the above changes and the errors still occur. I have been monitoring and it seems to be a Distributor problem. The log shows that the liveness probe failed with the following message: "It seems the Distributor is delayed in processing a new session in the queue."

I have checked and we have no connectivity problems in the cluster (it is an Azure AKS cluster using kubenet) or anything like that. I am not creating an Ingress for Selenium.

@VietND96
Member

VietND96 commented Nov 6, 2024

OK, that message is coming from the Distributor's liveness probe. The probe checks that Nodes are actually being spawned to serve queued requests, i.e. that the session count is not 0 while the queue size is greater than 0.
However, you can switch the probes to another method via:

global:
  seleniumGrid:
    defaultNodeStartupProbe: httpGet
    defaultNodeLivenessProbe: httpGet
    defaultComponentLivenessProbe: httpGet
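
If I read the chart right, httpGet switches the probes from the script that compares queue size against session count to a plain HTTP check of the component's status endpoint, so a busy Distributor is no longer restarted just because the queue is non-empty; that reading of the chart internals is an assumption, so please verify it against the chart's probe templates.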
