[RFE] QUADS Self-Scheduling Phase 1 via API #487
Comments
This is awesome! Just to give a high-level view of what we are looking to do with QUADS with our baremetal CPT, here is a quick and dirty visual ;-) Our use-case for integration would be for leases <= 24 hours -- depending on the cluster size we are requesting. Basically these allocations will be ephemeral in nature. Not ideal for long investigations, but great for quick response to hardened performance automation. Let us know if this fits the model you describe here! Thanks Will!
Thanks @sadsfae and @jtaleric! I think this is a great step in the right direction to enable dynamic allocations in the scale lab. After discussing with @jtaleric, the way I have thought of the process is:
Great feedback so far, keep it coming. We have a work-in-progress design document that's mostly done but needs internal review before we share it; I gave @jtaleric and @radez a preview. I think what you'd care about is the tenant API request workflow. Ideally it would look something like this, broken down into a GET request to obtain eligible systems and three API POSTs to get your own workload delivered. You'd receive JSON returns for each of the API requests below that can be fed into any automation you have; note the jump to using the generated API token, etc.
1) Query available, eligible systems (OPEN)
2) Request assignment (OPEN)
3) Acquire assignment (Authenticated)
4) Deploy Assignment (Authenticated)
More details are WIP (excuse the screenshot, it didn't convert to markdown).
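To make that flow a bit more concrete, here is a rough Python sketch of how those four calls could be chained from tenant automation. The endpoint paths, payload fields, and response keys are illustrative assumptions only, not the final QUADS 2.x API:

```python
# Illustrative sketch of the four-step flow described above.
# Endpoint names, payload fields, and response keys are guesses for
# demonstration -- the real QUADS 2.x API may differ.
import requests

BASE = "http://quads2.example.com/api/v3"  # hypothetical host, as in the curl example below

# 1) Query available, eligible systems (open, unauthenticated)
available = requests.get(f"{BASE}/available").json()

# 2) Request an assignment (open); assume the 201 response carries a
#    temporary token, JIRA ticket number, and auto-allocated cloud
resp = requests.post(f"{BASE}/assignments/request",
                     json={"hosts": available[:10], "days": 1})
token = resp.json()["auth_token"]          # hypothetical field name
headers = {"Authorization": f"Bearer {token}"}

# 3) Acquire the assignment (authenticated with the generated token)
requests.post(f"{BASE}/assignments/acquire", headers=headers)

# 4) Deploy the assignment (authenticated); kicks off wipe/validation/release
requests.post(f"{BASE}/assignments/deploy", headers=headers)
```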
This already exists in our metadata models and filters: https://github.com/redhat-performance/quads/blob/latest/docs/quads-host-metadata-search.md. This can also be expanded upon to import anything of value via
You would consume all the frills and features of a normal, deliberately scheduled future QUADS assignment, so all of that would be included. As to what you'd receive, it would not differ at all from what someone with a future assignment scheduled by us would receive, only that you can request it yourself with a few API calls (we need several because we do a lot on the backend, including talking to JIRA and Foreman).
The world is your oyster, but we're just auto-delivering the hardware/networks here. Any additional hour-zero work, like deploying OCP and running your workloads, is yours alone to action. You will be able to release the systems (or extend them) with the same set of APIs, though. One thing to keep in mind is that our provisioning/release time is what it is; there are no significant speedups beyond our usage of asyncio (already in the codebase and current QUADS) and what having multiple, concurrent gunicorn threads/listeners provides. Bare-metal / IPMI / boot mechanics are just slow, prone to weird issues, and often need hands-on work, so the only thing I'd add here is to keep your expectations reasonable: we're not going to have within-the-hour deployments 100% of the time. It may take a few hours to get your systems once the dust clears, longer if hands-on work is required to push them through validation like any normal QUADS assignment.
Thanks for the great explanation @sadsfae,
Yes, it is absolutely needed. We have no other way to ensure systems data, settings, etc. are cleaned. More importantly, we need a way to physically test/validate network functionality. We have a series of fping tests and other validation testing to ensure the hardware is working 100%, that traffic passes on ports, and that data and settings from previous tenants are removed. We have to deploy our own Foreman RHEL because it sets up templates for all of the VLAN interfaces with deliberate IP schemes to facilitate this testing.
Yes, but also no. There is no way to perform any validation at all without wiping the systems for a new tenant. We have an option called "no wipe" which skips all provisioning and even reboots; it simply performs the network VLAN and switch automation needed to "move" an assignment to a new cloud. We can allow this as an option via the API phases, but you would have to be 100% sure they were the same systems you used before and that nobody else has used them since. Otherwise you get stuck with effectively the same running OS/systems as whoever had those systems before you, and then you'd have to burn extra time provisioning them.

No-wipe is used more frequently in expansion scenarios when we can't be sure there aren't conflicting broadcast services on the internal VLANs that would hijack kickstart (like DHCP/PXE from an internal tenant installer or service), or on occasion when we need to resurrect an expired assignment. It's really not designed for new assignments unless you can be sure of the systems' integrity, which I don't see being a tenable or reliable situation in a self-service pool of changing hardware. We also have no way of knowing ahead of time which system you may pick for your bastion node before you get your systems, as availability would only be based on what's free at the current time.

The Foreman-deployed OS, while perfectly fine as a generic RHEL, is more a vehicle for us to validate hardware and network functionality, ensure clean baselines, and catch any hands-on issues that come with bare metal, which happens more often than we'd like when you're dealing with 1000's of systems. There is just no getting around this, and it would cause a lot more headache to try skipping it. An environment doesn't get released until all systems pass our validation phases.
When machines finish their allocation they roll directly to the next tenant anyway if they have an active schedule, but they still need to be wiped/validated/tested and then pass our hardware/network validation and release gating, meaning they immediately go into this process anyway if they have another place to be. What you're asking doesn't save a lot of time in general. Kickstart via Foreman is a fairly fast part of the QUADS workflow (network automation is the speediest, taking 5-10 seconds or so per machine). It takes around 5-8 minutes or less, not counting reboots, for a modern system to fully kickstart with our local SDN mirror, and it's all done in parallel via the asyncio-capable Foreman QUADS library. Even if we used image-based deployments it would still take a comparable amount of time, because disk images have to be written and reboots still have to happen, and it's less flexible because we'd have to maintain many different flavors of images to account for hardware variety. In general this step is fundamental to ensuring we deliver and properly validate 100% functional systems to tenants.
@sadsfae ack! Thanks for the detailed responses! My concern was that if we are trying to have a highly dynamic, transactional lease with QUADS, there would be an enormous amount of time spent on validation/provisioning. From the above description, I don't think we are talking hours, but maybe minutes depending on the size of the request. Initially, I envision the request size around 25 nodes: 1 bastion, 3 masters, 3 infra, 18 workers -- this would be our "largest" deployment to start with in this POC. The lease would be 24-48 hours. We might want the ability to run 2-3 jobs in parallel, so my hope would be that the "dynamic pool" of machines is ~75, if we can accommodate that for a POC?
Hey Joe, I would set your expectations to 45 minutes to an hour if nothing goes wrong, from the time you finish all the API POSTs needed until you receive fully validated, fresh hardware/networks. That's a reasonable goal. The number of systems doesn't matter as much because almost everything is done in parallel, but of course one machine not passing validation holds up the rest, because we demand 100% validation integrity. Most of the time this entire process goes off without a hitch, unless tenants really trash the systems before you get them.
You'll have whatever limits we need for testing, as I think you'll be our first "beta" adopters, but afterwards we will be setting per-user limits for multiple concurrent self-scheduled requests; we simply don't want one user occupying the entire self-schedule pool. We don't know what that limit is yet, but we'll figure out something that allows what you're trying to do while curbing hogging as well. I just don't know what our dynamic pool will look like when this is ready; it depends on what's free at the time, because regardless of the priority of this feature we need to operate the R&D and product scale business first and foremost. Yes, I think 75+ to let it really grind is a great number to aim for, and very reachable looking at usage so far this summer.

One other note about delivery expectations: there are certainly areas where we can speed up the QUADS phases (move_and_rebuild and validate_environment), but our resources are aimed at functionality first, so we'd expect the same or slightly faster delivery than QUADS 1.1.x due to moving to independent gunicorn listeners and nginx benefits. We will need to pull out some profiling tools after this is working well to tighten things up where it makes sense as future RFEs. One area we need to revisit is the Foreman library: while we are using asyncio, we do limit API activity via semaphores because we've overloaded it in the past and never got back to digging into RoR and their Sinatra API too deeply. I think this needs revisiting, and there's likely tuning on the Foreman API side we need to do too, beyond what we do with https://quads.dev/2019/10/04/concurrent-provisioning-with-asyncio/

Edit: I just checked a few 30+ node assignments that went out flawlessly in the last week and they were completely validated and released in around 40-45 minutes, so I think 1 hour is a good target, possibly two hours maximum assuming no hardware/switch issue requiring hands-on work.
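For illustration, the semaphore-limited asyncio pattern mentioned above looks roughly like this. The coroutine names, the concurrency cap, and the placeholder call are assumptions, not the actual QUADS Foreman library:

```python
# Minimal sketch of semaphore-limited concurrent provisioning calls.
# build_host(), the cap of 10, and the sleep placeholder are invented
# for illustration -- not the real QUADS Foreman library.
import asyncio

async def build_host(hostname: str, sem: asyncio.Semaphore) -> None:
    async with sem:                      # only N Foreman calls in flight at once
        # placeholder for the real Foreman API call (e.g. flagging a rebuild)
        await asyncio.sleep(0.1)
        print(f"triggered rebuild for {hostname}")

async def main(hosts: list[str]) -> None:
    sem = asyncio.Semaphore(10)          # assumed concurrency cap
    # every host is kicked off in parallel; the semaphore throttles them
    await asyncio.gather(*(build_host(h, sem) for h in hosts))

if __name__ == "__main__":
    asyncio.run(main([f"host{i}.example.com" for i in range(30)]))
```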
Super helpful to set expectations, thanks Will!
Hey for sure, no problem. Better to set a target we're pretty sure we can hit and then start profiling and looking to see where we can reduce the delivery time going forward. So much has changed design/architecture-wise that I don't want to be too aggressive with estimates. Once we have 2.0 running in production we'll have a better baseline; ideally I'd like to aim for 30 minutes or less as a start-to-end delivery goal.
Thanks @grafuls -- so for our BM CPT we will focus on 2 paths, I assume
Thanks for throwing this together!
Hey Joe, this is how it'll likely work.
You'll first do an open GET request to obtain a JSON list of systems (max 10) and store it somewhere; you'll pass it along later as part of a bearer-auth authenticated POST payload once the first successful 201 response comes back with your temporary token, the generated JIRA ticket number, and the cloud that's auto-allocated. We are using the concept of users like
The way we have it scoped now, you'll just pick from the actual physical systems (max of 10) that get returned, but we do have the capability to support filtering based on hardware, model, or capability to curate this API response.
For self-scheduling we expect users not to pass any cloud, in which case QUADS will auto-select an available cloud name on the backend. We are also opening up the possibility of passing a cloud name (which should also be available) to override the auto-selection.
These are kind of separate concepts.

curl http://quads2.example.com/api/v3/available?interfaces.vendor=Mellanox+Technologies&model=R640&disks.disk_type=nvme
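A small Python sketch of those two separate concepts, mirroring the curl query above; the assignment request payload shape and the cloud name are assumptions for illustration only:

```python
# Sketch of the two concepts above: filtering the availability query
# vs. optionally overriding the auto-selected cloud.
# The request payload shape and "cloud32" are hypothetical.
import requests

BASE = "http://quads2.example.com/api/v3"

# Filtered availability query, equivalent to the curl example above
params = {
    "interfaces.vendor": "Mellanox Technologies",
    "model": "R640",
    "disks.disk_type": "nvme",
}
hosts = requests.get(f"{BASE}/available", params=params).json()

# Self-schedule request: omit "cloud" and QUADS auto-selects a free cloud...
payload = {"hosts": hosts[:10]}
# ...or pass an explicit (and still free) cloud name to override the selection
payload_with_override = {"hosts": hosts[:10], "cloud": "cloud32"}
```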
This is the skeleton feature that will enable future self-scheduling via the UI here: #98
Design a flexible API scheduling framework using dedicated RBAC roles / token auth (perhaps we pre-define a set number of "slots" like we do with cloud environments as users)
Design a threshold activation mechanism based on both cloud capacity % and per-model % usage, e.g. a host-level metadata attribute like ss_enabled: true/false toggled via something like selfservice_manager.py --enable-ss true/false (see the sketch after this list)
Design what the tenant workflow might look like, e.g.
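Purely as a hypothetical sketch of the ss_enabled toggle idea above (only the script name and the --enable-ss flag come from that item; the storage layout and other options are invented):

```python
#!/usr/bin/env python3
# Hypothetical sketch of the selfservice_manager.py toggle described above.
# Only the script name and --enable-ss flag come from the issue text;
# the JSON state file and --hosts option are invented for illustration.
import argparse
import json
from pathlib import Path

STATE_FILE = Path("/etc/quads/selfservice_hosts.json")  # assumed location

def set_ss_enabled(hosts: list[str], enabled: bool) -> None:
    """Record the ss_enabled metadata flag for the given hosts."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for host in hosts:
        state[host] = {"ss_enabled": enabled}
    STATE_FILE.write_text(json.dumps(state, indent=2))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Toggle self-scheduling per host")
    parser.add_argument("--enable-ss", choices=["true", "false"], required=True)
    parser.add_argument("--hosts", nargs="+", required=True)
    args = parser.parse_args()
    set_ss_enabled(args.hosts, args.enable_ss == "true")
```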