-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA: Support scoring for devices and nodes in scheduling #4970
Comments
/wg device-management |
cc @dom4ha |
cc @catblade |
cc @tardieu |
Thanks for filing that John. Let me just share my high-level thoughts about it. I obviously sympathize with the goal and usecase, but I would like us to spend time thinking about implementation.
I didn't yet put enough thoughts into it, but I would like to explore a very different model of a cleaner decision tree. In that model, we stack-rank the preferences and on a given level we immediately reject all options with a score that is different than the highest for this particular scoring function. Obviously we have a compatibility problem, but given we're talking about new feature maybe it's the right moment to seriously consider it now. And once we prove it, maybe we will be able to somehow get rid the old model in the future... |
Thanks @wojtek-t, good points. A few thoughts. I would characterize your first point as "predictability" - as a user I have an idea of what's going to happen. We can try to make that a goal. One factor to consider is who needs to influence the decision, and how that affects predictability. I can think of (at least) four different roles that probably want some say in the scoring:
During the KEP process we will have to figure which of these we want to address, what the scope of control they should have, how to manage weights given to each, what APIs they need to influence decisions, and how to make all that comprehensible and predictable. It should be fun :) On your second point, were those issues there for solely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity? One thing I would put explicitly out-of-scope for this KEP is any kind of cross-pod affinity or gang scheduling. I think we need a different solution than our existing kube-scheduler for that. As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue). For example, if I know that the next workload will need a full GPU, and I can fit the current workload on a MIG on an in-use GPU, but with slightly less performance, ideally I would probably choose the MIG. But in this KEP, I don't want to try to do that. I think it's already hard enough when we only make decisions for the single Pod we're looking at. That said, I think that whatever we do should be useful to solutions like Kueue or a multi-Pod scheduler to leverage as needed. |
cc @jingxu97 |
If we want to support both the end user preference and cluster admin policy (to maintain high utilization), we probably cannot just stack rank, as we may end up either listening to the end user preference unconditionally, or follow admin policy ignoring the user preference (at least in the simplest implementation). Tuning weights might be indeed challenging, but on the other hand reaching higher predictability (sacrificing flexibility) could be achieved by picking more radical weights without changing the whole model. For instance all kinds of DRA factors could have much stronger weights than all other non-DRA factors, without stack ranking all existing plugin types. I bet that scheduler uses flat weights because it's hard to definitely point out which factors are more and which are less important, so unpredictability at some level is probably inevitable. Another small comment is that Autoscaler skips scoring, so it may give suboptimal predictions compared to the real scheduler future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point. |
Thanks for your thoughts, so let me share my further thoughts (some of which I already had before, some are triggered by above comments).
Yes - it's exactly that. And I fully agree that there are multiple personas that would potentially like to affect that. In my mental model I had only 3 (workload owner, cluster admin and provider), but indeed we may need to think if provider shouldn't be split. My thinking about how those 3 affect the decisions are:
I'm happy to hear other thoughts on it, but this is how I was thinking about it.
It's a bit of both:
I have a bunch of thoughts about that too, but I'm fine with making at out-of-scope. If we build a reasonable building block here, it should be enough. And given we want to focus purely on "intra-node" aspect, I think we're fine here.
Technically that's true, but still see my performance point above - we may unnecessary use order of magnitude more resources in that model. If we want predictability, let's not hack around to achieve it, but rather let's do that properly.
CA is a bit different here, because it doesn't look only on a single node. It has an internal concept of scoring, but it's a bit different. I would like to achieve better unification here, but for the sake of making progress, I would also put it out of scope of now. |
I don't disagree, but will share my thoughts nevertheless :) CA indeed has its own scoring mechanism. In fact, it has more than one. The most complex one was indeed score based and good luck reasoning about this one:
More recently we've introduced a way of composing different expanders (i.e. CA scoring functions) in a way that each one can filter out some options that are preferred from its perspective, eventually leading to a single option being chosen. This is very similar to the decision tree proposal above and I agree it is much simpler to reason about (not only as a user, but also as CA maintainer & operator). I think there might be some opportunity to bring the two scoring mechanisms closer, especially if we build some kind of multi-pod scheduler. One might argue CA is already a multi-pod scheduler that can also create nodes. If there's no capacity in the cluster, the nodes created by CA are essentially the only option for pods to be scheduled, so scheduler scoring doesn't matter much in an autoscaled cluster. |
Enhancement Description
DRA supports the concept of "under specifying" a request. This gives the scheduler more flexibility to satisfy a request, increasing the likelihood of success in environments with scarce resources. For example, rather than asking for a specific model of a device, the user can ask for any one of a set of models, as long as it has some minimum specified amount of memory.
Currently, DRA uses a "first fit" algorithm during scheduling. This can lead to inefficient choices. Building on the example above, if the user asks for a device with at least 4GB of memory, if the first device found has 80GB of memory, it will be chosen, even if there is another option with exactly 4GB. If scoring were available, the scheduler could evaluate the "waste" associated with each possible option, and make a more efficient choice.
Scoring is also critical in other situations where there is optionality in how to satisfy a request. For instance, in #4816 the user is allowed to provide a list of preferences. While that works to choose the "best" option on a given node, in reality most nodes have homogeneous selections of devices. So, in a cluster with nodes that meet the first option in the list, and nodes that meet the second option in the list, either one could be chosen. If DRA could score the nodes based on whether they satisfy the first or second option, then preference could be given to the first option across nodes, not just within a node.
The last important place for scoring is to help with fragmentation and bin packing. This is especially relevant with the implementation of #4815 pending. The ability to dynamical choose partitions of a device can lead to fragmentation; scoring can alleviate that to some extent. It is not a complete solution to that problem, but can help.
k/enhancements
) update PR(s):k/k
) update PR(s):k/website
) update PR(s):/assign @johnbelamaric
/cc @pohly @klueska @mortent @alculquicondor @wojtek-t
/sig scheduling
The text was updated successfully, but these errors were encountered: