v1.41.0 Adoption of Slurm 24.05 and Improvements to GKE Support
What's Changed
Key New Features 🎉
New Modules 🧱
- resource-policy module implemented by @sharabiani in #3066
- gke-topology-scheduler module implemented by @sharabiani in #3080
- Add GKE support for Parallelstore through the gke-storage module by @chengcongdu in #3120
Module Improvements 🔨
- Added compatibility check for GPUDirect and GKE version by @sharabiani in #3079
- Support template file for kueue configuration in kubectl-apply module by @sharabiani in #3111
- Implement xpk-gke-a3-megagpu blueprint by @sharabiani in #3108
- Use sackd for the login nodes by @mr0re1 in #3126
- Fixed gke-node-pool default name conflict by @sharabiani in #3127
- Improve dws_flex UX by @abbas1902 in #3122
- Include deployment name in Spack and Ramble bucket names (like startup-script) by @rohitramu in #3136
Improvements 🛠
- Create and use non-default service accounts in GKE by @annuay-google in #3123
- Added documentation on cloud-ops-agent installation and Stackdriver removal by @jrossthomson in #3029
- Ensure local SSD filesystem is assembled into a RAID even after power off/on cycles by @tpdownes in #3129
Deprecations 💤
- Freeze Slurm-GCP v5 hybrid blueprints at the latest Cluster Toolkit version that supports them by @harshthakkar01 in #3117
- Update Slurm-GCP v5 deprecation details by @harshthakkar01 in #3118
- Update badges for Slurm-GCP v5 and Slurm-GCP v6 by @harshthakkar01 in #3116
Version Updates ⏫
- Update A3-High NeMo to 24.07 and NCCL solution to the latest recommended values by @akiki-liang0 in #3130
- Update Slurm-GCP to 6.8.2 by @tpdownes in #3132
Bug fixes 🐞
- Fixed the exact-number constraint problem for additional VPCs in gpu_direct checks by @sharabiani in #3078
- Provide explicit project information by @wiktorn in #3060
- Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
- Add a Parallelstore mount service so that Parallelstore is remounted on every reboot by @harshthakkar01 in #3125
New Contributors
- @akiki-liang0 made their first contribution in #3130
- @ighosh98 made their first contribution in #3124
Full Changelog: v1.40.1...v1.41.0