
[🐛 Bug]: Keda and Selenium Grid mismatch queues #2442

Open
jorgegb95 opened this issue Oct 23, 2024 · 8 comments

Comments

@jorgegb95

What happened?

We have Selenium Grid with KEDA enabled on an AKS cluster, installed using Terraform.
The problem we have detected is that when the number of queued session requests exceeds the maximum replica count KEDA can scale to, KEDA stops detecting that requests are still queued, even though the hub pod does see them in its queue.

Command used to start Selenium Grid with Docker (or Kubernetes)

# -- Repository: https://github.com/SeleniumHQ/docker-selenium/blob/trunk/charts/selenium-grid/values.yaml

hub:
  nameOverride: selenium-router
  imagePullPolicy: IfNotPresent
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  resources:
    requests:
      cpu: ${hub_cpu_requests}
      memory: ${hub_memory_requests}
    limits:
      cpu: ${hub_cpu_limits}
      memory: ${hub_memory_limits}
  extraEnvironmentVariables:
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: ${SE_SESSION_REQUEST_TIMEOUT}

autoscaling:
  enabled: true
  scaleOptions:
    minReplicaCount: 0
    maxReplicaCount: 40

tls:
  create: false

serviceAccount:
  create: true

ingress:
  enabled: true

chromeNode:
  enabled: true
  replicas: 0
  resources:
    requests:
      cpu: ${chrome_node_cpu_requests}
      memory: ${chrome_node_memory_requests}
    limits:
      cpu: ${chrome_node_cpu_limits}
      memory: ${chrome_node_memory_limits}
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  extraEnvironmentVariables:
    - name: SE_NODE_SESSION_TIMEOUT
      value: ${SE_NODE_SESSION_TIMEOUT}

firefoxNode:
  enabled: true
  replicas: 0
  resources:
    requests:
      cpu: ${firefox_node_cpu_requests}
      memory: ${firefox_node_memory_requests}
    limits:
      cpu: ${firefox_node_cpu_limits}
      memory: ${firefox_node_memory_limits}
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  extraEnvironmentVariables:
    - name: SE_NODE_SESSION_TIMEOUT
      value: ${SE_NODE_SESSION_TIMEOUT}


edgeNode:
  enabled: true
  replicas: 0
  resources:
    requests:
      cpu: ${edge_node_cpu_requests}
      memory: ${edge_node_memory_requests}
    limits:
      cpu: ${edge_node_cpu_limits}
      memory: ${edge_node_memory_limits}
  nodeSelector:
    agentpool: npappspot
  tolerations:
    - key: app
      operator: Equal
      value: app
      effect: NoSchedule
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  extraEnvironmentVariables:
    - name: SE_NODE_SESSION_TIMEOUT
      value: ${SE_NODE_SESSION_TIMEOUT}
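
For reference, these values make the chart render one KEDA ScaledJob per node type; roughly the shape below. This is only a sketch: the trigger fields follow the KEDA selenium-grid scaler docs, the service name in the url is assumed, and the actual manifest generated by chart 0.36.1 may differ, so check it with helm template or kubectl get scaledjob -n selenium -o yaml.

# Approximate shape of the ScaledJob created for the Chrome node
# (jobTargetRef with the node pod template omitted for brevity).
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-grid-selenium-chrome-node
  namespace: selenium
spec:
  maxReplicaCount: 40   # from autoscaling.scaleOptions.maxReplicaCount above
  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-router.selenium:4444/graphql'   # Grid GraphQL endpoint (assumed service name)
        browserName: 'chrome'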

Relevant log output

No errors appear in the log. KEDA reports 10 queued jobs, but Selenium shows 20 queued requests, and once the 10 jobs that KEDA created finish, it does not spin up any more browser pods.

INFO	scaleexecutor	Remove a job by reaching the historyLimit	{"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "job.Name": "selenium-grid-selenium-chrome-node-vrjcm", "historyLimit": 0}

2024-10-23T11:40:39Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of running Jobs": 40}

2024-10-23T11:40:39Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of pending Jobs": 10}
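
To compare the two counts directly, you can query the Grid's GraphQL endpoint, which is the same endpoint the KEDA selenium-grid scaler polls; a quick sketch, assuming the router service is reachable at selenium-router.selenium:4444:

# Ask the Grid how many requests are queued and how many sessions are running,
# then compare against KEDA's "Number of pending Jobs" log line above.
curl -s -X POST http://selenium-router.selenium:4444/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query":"{ grid { maxSession, sessionCount, sessionQueueSize } }"}'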

Operating System

Aks Kubernetes version 1.28.9

Docker Selenium version (image tag)

4.25.0-20240922

Selenium Grid chart version (chart version)

0.36.1


@jorgegb95, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

VietND96 commented Oct 23, 2024

Hi, are you using the KEDA images 4.15.1 that are delivered in this repo?
https://github.com/SeleniumHQ/docker-selenium/blob/trunk/.keda/README.md

@jorgegb95
Author

We simply enabled autoscaling in the values as indicated in the documentation (autoscaling.enabled), so we understand it installs the latest available version of KEDA.
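
One way to confirm which KEDA version was actually installed is to read the image tag off the operator pods; a minimal sketch, assuming the KEDA subchart landed in the same namespace and uses the standard keda-operator labels:

# Print each KEDA operator pod together with its image tag.
kubectl -n selenium get pods -l app.kubernetes.io/name=keda-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'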

@VietND96
Copy link
Member

OK, can you try chart version 0.36.4 and remove replicas: 0 from each node config?
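
For clarity, the suggestion is just to drop the fixed replica count from each node block so the KEDA ScaledJob alone owns the number of node Jobs; a sketch based on the values posted above:

chromeNode:
  enabled: true
  # replicas: 0   <- remove this line; KEDA decides the count
firefoxNode:
  enabled: true
  # replicas: 0   <- remove this line
edgeNode:
  enabled: true
  # replicas: 0   <- remove this line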

@VietND96 VietND96 added this to the 4.27.0 milestone Nov 5, 2024
@jorgegb95
Author

We have tried installing chart version 0.37.0 and are seeing the Selenium pod restart with exit code 143.

I should add that we are launching about 100 Chrome nodes and that is when the 143 error occurs. How could I debug which piece of Selenium is producing the error?
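
Exit code 143 means the container received SIGTERM, typically Kubernetes killing it after a failed liveness probe or an eviction, rather than the process crashing on its own. A starting point for narrowing it down, with placeholder pod name and namespace:

# Show why the container was last terminated (reason, exit code, signal).
kubectl -n selenium get pod <router-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Recent events for the pod usually name the failing probe or the eviction.
kubectl -n selenium describe pod <router-pod>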

@VietND96
Member

VietND96 commented Nov 6, 2024

I suspect something around this config key:

global:
  seleniumGrid:
    # -- Specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background.
    revisionHistoryLimit: 10

Can you try updating this to a higher value, e.g. 1000, to see how it behaves?

@jorgegb95
Author

I have made the above changes and the errors still occur. I have been monitoring and it seems to be a Distributor problem. The log shows that the liveness probe failed with the following message: "It seems the Distributor is delayed in processing a new session in the queue."

I have checked and we have no connectivity problems in the cluster (it is an Azure AKS cluster using kubenet) or anything like that. I am not creating an Ingress for Selenium.

@VietND96
Member

VietND96 commented Nov 6, 2024

OK, that message is coming from the Distributor's liveness probe. The probe checks that Nodes are actually being spawned to serve queued requests, i.e. that the session count is not 0 while the queue size is greater than 0.
However, you can switch the probes to another method via:

global:
  seleniumGrid:
    defaultNodeStartupProbe: httpGet
    defaultNodeLivenessProbe: httpGet
    defaultComponentLivenessProbe: httpGet
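
If I read the chart right, httpGet switches the probes from the script that compares queue size against session count to a plain HTTP check of the component's status endpoint, so a busy Distributor is no longer restarted just because the queue is non-empty; that reading of the chart internals is an assumption, so please verify it against the chart's probe templates.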
