Multi-NUMA systems for testing on Kubernetes test infrastructure #28211
@swatisehgal: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:
Please see the group list for a listing of the SIGs, working groups, and committees available. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig-node
@swatisehgal Are you adding new tests, or do you want to modify existing tests?
cc @dims
e2e tests don't have permanent machines, so there's nothing to add. You'll need to run a dedicated CI suite with these more expensive machines; we should not jump to machines this large in the standard e2e suites. I forget exactly what we use, but it's something like 1-8 core machines. This sort of thing should already be configurable in the e2e scripts.
Alternatively, we have some cluster e2e CI on AWS if that helps.
We want to make sure that the existing tests, and the ones we add in the future, are actually executed so that we reap the benefit of signals from CI. There is currently no way for us to identify regressions. The main problem is that the existing node-e2e tests related to Topology Manager and Memory Manager, in spite of being configured (e.g. here), are skipped when they don't run on multi-NUMA machines (e.g. here), which happens 100% of the time since we don't have such nodes in the test-infra.
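For context, the skip guard works roughly like the sketch below: count the NUMA nodes the host exposes and bail out when there are fewer than two. This is a simplified, hypothetical illustration, not the actual Kubernetes helper the links above refer to; reading /sys/devices/system/node is one standard way to detect NUMA topology on Linux, and the name countNUMANodes is invented for the example.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// countNUMANodes is a hypothetical helper that counts the NUMA nodes
// exposed by the Linux kernel under /sys/devices/system/node. Each
// online node appears as a directory named "node<N>".
func countNUMANodes() (int, error) {
	entries, err := os.ReadDir("/sys/devices/system/node")
	if err != nil {
		return 0, err
	}
	nodeDir := regexp.MustCompile(`^node\d+$`)
	count := 0
	for _, e := range entries {
		if e.IsDir() && nodeDir.MatchString(e.Name()) {
			count++
		}
	}
	return count, nil
}

func main() {
	n, err := countNUMANodes()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot detect NUMA topology:", err)
		os.Exit(1)
	}
	if n < 2 {
		// Mirrors the behavior described above: on the single-NUMA
		// machines in test-infra, the multi-NUMA e2e tests are skipped.
		fmt.Println("SKIP: requires a multi-NUMA machine, found", n, "node(s)")
		return
	}
	fmt.Println("multi-NUMA machine detected:", n, "nodes")
}
```

On the small 1-8 core machines mentioned above, this always takes the skip path, which is exactly why the CI signal never materializes.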
This is useful information, thanks! Other than creating a PR in the test-infra repo to propose introducing a separate test suite that runs on a dedicated lane with these machines, and getting it reviewed/approved, do I need approval for using these more expensive machines?
After a bit of digging I found
https://instances.vantage.sh/aws/ec2/c5.24xlarge - this seems to be a bit cheaper? AWS docs aren't very helpful unfortunately (or maybe it's just my search skills).
I intended to point to c5.24xlarge in my previous comment; I linked to it correctly but used the wrong machine name 🤦♀️
Sorry, I should have been clear about what I meant by tests. My question is about prow jobs. Is there an intent to introduce new prow jobs to run e2e tests for Topology/Memory Manager?
We already have a few prow jobs for resource managers:
These jobs refer to image-config files (e.g. image-config-serial-resource-managers.yaml and image-config-serial-cpu-manager.yaml) where
@BenTheElder Could you please elaborate on what you mean by a dedicated test suite? We currently have node e2e tests for
I think @BenTheElder refers to a separate job definition for topology manager, with the image config specifying the desired machine type like here:
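As a sketch, such an image config could look something like the following. The file name, image_family, and project values below are assumptions modeled on the existing node e2e image-config files mentioned above, not a verified config; the key point is the machine field requesting a multi-NUMA instance type.

```yaml
# Hypothetical image-config-serial-topology-manager.yaml (sketch only).
# Modeled on the existing node e2e image-config files; the image and
# project values are illustrative, not verified.
images:
  ubuntu:
    image_family: ubuntu-2204-lts   # assumed GCE image family
    project: ubuntu-os-cloud        # assumed GCE image project
    # The key change: request a multi-NUMA machine type instead of the
    # small 1-8 core machines used by the standard e2e suites.
    machine: n2-standard-32
```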
Let's try to create one and see how fast it runs. Once we have numbers, it will be an easier conversation.
yes
Thanks @BenTheElder, this is extremely insightful. It explains the logistics of how we handle VMs for testing and the constraints we are dealing with.
Perfect, I will work on a PR and share it for review. Thanks!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale @swatisehgal is there anything left to do here?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@swatisehgal: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
kubernetes/enhancements#3545 (comment) I think this has regressed. Looking at our tests, I see that the multi-NUMA tests are being skipped.
Once we get these working again, I think maybe we should consider failing the tests if the NUMA alignment constraint is not satisfied. That way we can be notified that there was an issue. |
I'm worried this can lead to permared and/or possibly the test being disabled (!!). But I have zero data supporting my (irrational) fear, so I won't push back too hard if we decide to go in this direction.
Tests that require specific hardware should:
Tests that require specific hardware / non-standard integrations should not be in the default suite and that's fine. |
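To make the trade-off discussed above concrete, here is a minimal, hypothetical Go testing sketch contrasting the two behaviors: skipping (today's default-suite behavior) versus failing loudly (the proposal for a dedicated multi-NUMA lane). requireMultiNUMA and its parameters are invented for illustration and are not the actual Kubernetes e2e helpers; numaNodes would come from a detection helper such as the countNUMANodes sketch earlier in this thread.

```go
package e2e

import "testing"

// requireMultiNUMA is a hypothetical guard contrasting the two options
// discussed above: skip on unsuitable hardware (default suites) or fail
// loudly (dedicated lane), so a misconfigured lane shows up as red
// instead of silently skipping.
func requireMultiNUMA(t *testing.T, numaNodes int, failOnMissing bool) {
	if numaNodes >= 2 {
		return // suitable hardware; run the test
	}
	if failOnMissing {
		// Dedicated-lane behavior proposed above.
		t.Fatalf("expected a multi-NUMA machine, found %d NUMA node(s)", numaNodes)
	}
	// Default-suite behavior: skip rather than fail.
	t.Skipf("requires a multi-NUMA machine, found %d NUMA node(s)", numaNodes)
}
```

The design question in this thread is essentially which value of failOnMissing the dedicated lane should use.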
cross-linking kubernetes/k8s.io#7339
SIG Node community is trying to avoid perma-beta status of features (as per discussion in SIG Node Meeting 2022/12/06), and Topology Manager and Memory Manager are both candidates for GA graduation. Currently, node e2e tests for both these components are skipped (examples: Topology Manager Node e2e test and Memory Manager Node e2e test) as we don't have multi-NUMA hardware in the test infrastructure.
In SIG Node CI meeting on 2022/12/07, it was highlighted that:
Is it possible to add either n2-standard-32 or n2d-standard-32 to the test infrastructure? It is not ideal to skip e2e tests, and it could be a potential blocker for GA graduation of these features.