Resource Reclaim Between Different Queues #3842
Comments
I think there is something more going on here; it looks like job 2 has other issues.
I mean that the job is in Pending status: no pod has been scheduled and is waiting for resources. I think it's more related to the controller than to the scheduler.
Is the "Event" referring to
$ kubectl describe vcjob job-b
......
Status:
  Conditions:
    Last Transition Time:  2024-11-26T09:39:23Z
    Status:                Pending
  Min Available:           2
  State:
    Last Transition Time:  2024-11-26T09:39:23Z
    Phase:                 Pending
Events:
  Type     Reason           Age  From                   Message
  ----     ------           ---- ----                   -------
  Warning  PodGroupPending  15h  vc-controller-manager  PodGroup default:job-b unschedule,reason: 2/0 tasks in gang unschedulable: pod group is not ready, 2 minAvailable
When I remove
$ kubectl get vcjob
NAME STATUS MINAVAILABLE RUNNINGS AGE
job-a Running 2 5 15h
job-b Pending 2 15h
$ kubectl delete vcjob job-a
job.batch.volcano.sh "job-a" deleted
$ kubectl get vcjob
NAME STATUS MINAVAILABLE RUNNINGS AGE
job-b Pending 2 15h
$ kubectl get vcjob
NAME STATUS MINAVAILABLE RUNNINGS AGE
job-b Running 2 5 15h
By the way, I'm using Volcano Scheduler version 1.9.0.
I noticed this log later, and it seems like the overcommit plugin is kicking job-b out of the queue. The expectation was that once a job enters the Pending state, it shouldn’t be considered for preemption. After I removed the
I1127 08:26:56.828424 1 enqueue.go:45] Enter Enqueue ...
I1127 08:26:56.828429 1 enqueue.go:63] Added Queue <second> for Job <default/job-b-7a171232-8367-4d99-b301-233e98264f25>
I1127 08:26:56.828438 1 enqueue.go:74] Added Job <default/job-b-7a171232-8367-4d99-b301-233e98264f25> into Queue <second>
I1127 08:26:56.828442 1 enqueue.go:63] Added Queue <first> for Job <default/job-a-b60497e4-2892-4687-929d-5284e94a8871>
I1127 08:26:56.828449 1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1127 08:26:56.828459 1 overcommit.go:128] Resource in cluster is overused, reject job <default/job-b-7a171232-8367-4d99-b301-233e98264f25> to be inqueue
I1127 08:26:56.828483 1 enqueue.go:104] Leaving Enqueue ...
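The "Resource in cluster is overused" rejection in the log above reads like an admission check made at enqueue time. The following is a minimal sketch of that kind of check, not Volcano's actual code: the function name `can_enqueue`, its parameters, the factor of 1.2, and all CPU figures other than the 48-core cluster size are assumptions made for illustration, since the job and queue YAML are not shown in this thread.

```python
def can_enqueue(total_millis, used_millis, inqueue_millis, job_min_millis,
                overcommit_factor=1.2):
    """Hypothetical enqueue admission check (illustrative only).

    The cluster capacity is scaled by an overcommit factor, resources already
    in use are subtracted, and the job is admitted only if its minimum request
    plus the requests of jobs already admitted fits in what is left.
    """
    budget = total_millis * overcommit_factor   # capacity allowed for admission
    idle = budget - used_millis                 # what running pods leave free
    return job_min_millis + inqueue_millis <= idle


# 48 cores comes from the issue; the usage and request values are made up.
TOTAL = 48_000  # millicores

# While another job occupies most of the cluster, the new job is rejected.
print(can_enqueue(TOTAL, used_millis=50_000, inqueue_millis=0,
                  job_min_millis=10_000))

# Once that job is deleted and usage drops, the same request is admitted.
print(can_enqueue(TOTAL, used_millis=5_000, inqueue_millis=0,
                  job_min_millis=10_000))
```

Under these assumptions the rejection happens before reclaim ever sees the job, which would match job-b staying Pending until job-a is deleted.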
Please describe your problem in detail
I am trying to test the effect of queue deserved with the reclaim action, but job-b remains in the Pending state.
The queue and job YAML configurations were modified based on this [Issue](#3729).
Here is part of the volcano-scheduler log. Could you please help me understand why the reclaim process is not triggered?
Below are the related YAML configurations. If additional information is required, I can provide it.
Thank you.
scheduler-config.yaml
queue.yaml
job-a.yaml
job-b.yaml
Current Status
The cluster has approximately 48 cores, and other running Pods are using around 5 cores.
Any other relevant information
No response