Releases: GoogleCloudPlatform/cluster-toolkit
v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration
What's Changed
Key New Features 🎉
- Add support for custom Docker daemon configuration by @tpdownes in #3201
- Adopt local SSD storage for A3 docker images by @tpdownes in #3206
- Adopt google Terraform plugin v6.10.0 and drop support for 5.x by @tpdownes in #3189
- Add support to perform GCP maintenance as slurm job by @harshthakkar01 in #3152
- Add support for Filestore deletion protection by @tpdownes in #3183
Module Improvements 🔨
- Updating notebook module to use workbench_instance by @jrossthomson in #3139
- Initial commit for new logging output by @cdunbar13 in #3150
- SlurmGCP. "All or nothing" bulk insert on requests with placements by @mr0re1 in #3157
- Remove redundant provisioner for printing image name by @cdunbar13 in #3151
- Add direct Terraform support for Slurm SchedulerParameters and PrivateData by @tpdownes in #3164
- Add use_job_duration option by @abbas1902 in #3142
- Improvements for CloudSQL by @wiktorn in #3147
- Improve Error Message with Reservation Validation by @arajmane-g in #3174
Improvements 🛠
- Use local paths to embedded modules throughout Toolkit by @tpdownes in #3102
- Update default value for subnetwork_project to null by @alyssa-sm in #3193
- Gke update default taints for user node pools by @ankitkinra in #3200
- Update MTU for a3 mega for GKE based on best practices by @ankitkinra in #3175
- add training example for gke parallelstore blueprint by @chengcongdu in #3181
- Update maintenance.py to support additional format by @alyssa-sm in #3208
- Allow latest Terraform google plugin by @tpdownes in #3213
- update a3 machines local ssd to use nvme instead of scsi for better performance by @chengcongdu in #3232
- Improve fetching and caching job details by @harshthakkar01 in #3194
- SlurmGCP. Add set -e to prolog mux by @mr0re1 in #3215
- add gpu health check in prolog and epilog by @NinaCai in #3134
Deprecations 💤
- Delete the new-project module to support adoption of TPG v6 by @RachaelSTamakloe in #3171
- Delete Daos Example Blueprints to support adoption of TPG v6 by @RachaelSTamakloe in #3172
Version Updates ⏫
- Bump integration test to support Go 1.23 by @mohitchaurasia91 in #3154
- Bump go version 1.21 -> 1.22 by @mohitchaurasia91 in #3156
- Update bucket module within Slurm controller module by @tpdownes in #3161
- update vm-instance module to support TPG v6 by @RachaelSTamakloe in #3166
- Update IP address module within VPC module by @tpdownes in #3160
- update Batch module to be compatible with TPG v6 by @RachaelSTamakloe in #3187
- update HTCondor modules to be compatible with TPG v6 by @RachaelSTamakloe in #3186
- Update Slurm-GCP v5 to 5.12.1 by @tpdownes in #3185
- Update workload-identity submodule from v29 to v34 by @RachaelSTamakloe in #3196
- Update ml-slurm examples to use recent copies of pytorch and tensorflow by @tpdownes in #3226
- Make gke-node-pool compatible with TPG 6.x by @tpdownes in #3230
Bug fixes 🐞
- Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
- Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
- Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
- SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266
New Contributors
- @linsword13 made their first contribution in #3211
- @NinaCai made their first contribution in #3134
Full Changelog: v1.41.0...v1.42.0
v1.41.0 Adoption of Slurm 24.05 and Improvements to GKE Support
What's Changed
Key New Features 🎉
New Modules 🧱
- resource-policy module implemented by @sharabiani in #3066
- gke-topology-scheduler module implemented by @sharabiani in #3080
- add GKE support for parallelstore through gke-storage module by @chengcongdu in #3120
Module Improvements 🔨
- Added compatibility check for GPUDirect and GKE version by @sharabiani in #3079
- Support template file for kueue configuration in kubectl-apply module by @sharabiani in #3111
- Implement xpk-gke-a3-megagpu blueprint by @sharabiani in #3108
- Use sackd for the login nodes by @mr0re1 in #3126
- gke-node-pool default name conflict fixed by @sharabiani in #3127
- improve dws_flex ux by @abbas1902 in #3122
- Include deployment name in Spack and Ramble bucket names (like startup-script) by @rohitramu in #3136
Improvements 🛠
- Create and use non-default service accounts in GKE by @annuay-google in #3123
- Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
- Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129
Deprecations 💤
- Freeze slurm-gcp v5 hybrid blueprints with the latest cluster toolkit version support by @harshthakkar01 in #3117
- Update Slurm-gcp v5 deprecation details by @harshthakkar01 in #3118
- Update badge for slurm-gcp v5 and slurm-gcp v6 by @harshthakkar01 in #3116
Version Updates ⏫
- Update A3-High NeMo to 24.07 and NCCL solution to latest recommended values by @akiki-liang0 in #3130
- Update Slurm-GCP to 6.8.2 by @tpdownes in #3132
Bug fixes 🐞
- Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
- Provide explicit project information by @wiktorn in #3060
- Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
- Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125
New Contributors
- @akiki-liang0 made their first contribution in #3130
- @ighosh98 made their first contribution in #3124
Full Changelog: v1.40.1...v1.41.0
v1.40.1 Fix issue that affected GKE blueprints due to dynamic provisioning
What's Changed
Other changes
- Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115
Full Changelog: v1.40.0...v1.40.1
v1.40.0: A3 Mega and A3 High families supported in GKE
What's Changed
Important
All HPC VM images based upon CentOS 7 have been deprecated. This means that
referring to the "hpc-centos-7" family in the "cloud-hpc-image-public"
project will fail. We recommend migrating to the "hpc-rocky-linux-8" family
that is the new default throughout the Toolkit. If CentOS 7 is truly needed,
the final HPC CentOS 7 image can be used by its name: "hpc-centos-7-v20240712".
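The migration described above can be sketched as a small helper. This is only an illustration: the function name `migrate_image` and the dictionary keys `project`, `family`, and `name` are assumptions made for this example and are not part of the Toolkit; only the family, project, and image names come from the note above.

```python
from typing import Dict

# Values taken from the deprecation note above.
IMAGE_PROJECT = "cloud-hpc-image-public"
DEPRECATED_FAMILY = "hpc-centos-7"
RECOMMENDED_FAMILY = "hpc-rocky-linux-8"
FINAL_CENTOS7_IMAGE = "hpc-centos-7-v20240712"


def migrate_image(image: Dict[str, str], keep_centos7: bool = False) -> Dict[str, str]:
    """Return an updated copy of a (hypothetical) image reference mapping.

    References that do not use the deprecated family are returned unchanged.
    """
    uses_deprecated_family = (
        image.get("project") == IMAGE_PROJECT
        and image.get("family") == DEPRECATED_FAMILY
    )
    if not uses_deprecated_family:
        return dict(image)
    if keep_centos7:
        # Pin the final CentOS 7 image by name, since the family no longer resolves.
        return {"project": IMAGE_PROJECT, "name": FINAL_CENTOS7_IMAGE}
    # Recommended path: switch to the new default family.
    return {"project": IMAGE_PROJECT, "family": RECOMMENDED_FAMILY}
```

Pinning by name is the fallback only when CentOS 7 is truly required; the default branch follows the recommended move to the Rocky Linux 8 family.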
Key New Features 🎉
- GKE A3 High blueprint and GKE A3 Mega blueprint with automated GPU networking performance enhancements
- Add enable-maintenance-reservation flag in slurm to control reservation for scheduled maintenance by @harshthakkar01 in #2987
- adding documentation for versioned blueprint feature by @RachaelSTamakloe in #3055
- adding unit test for version blueprint caching mechanism by @RachaelSTamakloe in #3052
New Modules 🧱
- implement kubectl-apply module by @sharabiani in #2980
Module Improvements 🔨
- Default to zonal bulkInsert by @mr0re1 in #3005
- Add machine type availability checks by @annuay-google in #3003
- add support for enabling tcpx/o in a3 and a3mega vm, provide script for injecting rxdm sidecar and other required components into user workload by @chengcongdu in #3012
- support ghpc_stage function in kubectl-apply module by @sharabiani in #3036
- Validate Reservations in GKE Blueprints by @arajmane-g in #3024
- Fix multivpc missing region by @wiktorn in #3046
- Add initial_node_count support to gke-node-pool by @sharabiani in #3068
Improvements 🛠
- Update gVNIC driver in A3 Mega solution by @tpdownes in #2957
- Implement udev-based approach to mounting aperture devices by @tpdownes in #2955
- Update Debian 12 image in A3 Mega solution by @tpdownes in #2958
- adding module cache to prevent repeated module downloads during modul… by @RachaelSTamakloe in #3010
- add additional vpc validation for a3/a3mega machine by @chengcongdu in #3049
- Adds option to allow Kueue/Jobset to be installed on a GKE cluster via blueprints by @ankitkinra in #3017
- update readme for gpudirect by @chengcongdu in #3059
Deprecations 💤
- SlurmGCP V6. Remove CentOS7 image support. by @mr0re1 in #3038
- removing deprecated spack setup variables by @RachaelSTamakloe in #3040
- removing deprecated ramble setup variables by @RachaelSTamakloe in #3041
Version Updates ⏫
- Update NeMo 23.11 to 24.07 by @akiki-liang0 in #3090
Bug fixes 🐞
- Retry mounting daos container by @harshthakkar01 in #3045
- add argparse dependency to cloud build by @chengcongdu in #3057
- Allow users to provide a commit hash instead of git tag for Spack and Ramble installations by @rohitramu in #3073
- resolving error when var.initial_node_count is null by @RachaelSTamakloe in #3081
- A3 High blueprint prolog solution updates by @tpdownes in #3088
Other changes
- NeMo readme instructions for preloading gpt2 tokenizer by @koallison in #3075
New Contributors
- @koallison made their first contribution in #3075
- @akiki-liang0 made their first contribution in #3090
Full Changelog: v1.39.0...v1.40.0
v1.39.0: Slurm reservations during maintenance windows, Improved GKE Support, removed CentOS 7 references
What's Changed
Key New Features 🎉
- Add reservation support in slurm sync for scheduled maintenance by @harshthakkar01 in #2880
- Support multivpc with GKE by @sharabiani in #2797
- adding optional fields to redirect use of embedded modules to pull fr… by @RachaelSTamakloe in #2945
Module Improvements 🔨
- Make CloudSQL secret replication configurable by @dgouju in #2828
- GKE Blueprints to support reservations by @arajmane-g in #2891
- Expose maintenance interval as a blueprint setting for node pools in GKE by @annuay-google in #2971
- Support named placements in GKE node pools by @arajmane-g in #2969
- Add machine type availability checks to slurm-gcp-v6-nodeset by @annuay-google in #2962
- Revisit the Reservation Interface for GKE Blueprints by @arajmane-g in #2997
Improvements 🛠
- Add sort_nodes.py by @mr0re1 in #2853
- replacing centos7 with rocky8 in vm-instance modules by @RachaelSTamakloe in #2900
- replacing centos7 with rocky8 in nfs-server modules by @RachaelSTamakloe in #2901
- replacing centos7 with rocky8 in packer modules by @RachaelSTamakloe in #2899
- Update batch image to hpc-rocky-linux-8 by @ankitkinra in #2884
- OFE - various updates and fixes by @scott-nag in #2921
- Don't set automaticRestart: false by @mr0re1 in #2981
Bug fixes 🐞
- Add slurmgcp-managed infix to resource policy name by @mr0re1 in #2892
- Move pytest and other package installation to make by @annuay-google in #2890
- Prevent use of google provider 6.0 where breaking changes are in use by @tpdownes in #2978
- Fix local_ssd_config issue that forces node-pool recreation by @sharabiani in #2968
- kubernetes provider added to gke-cluster module by @sharabiani in #2985
- Fix for cleanup script. The last input is optional by @cdunbar13 in #2993
- Catch "None" fields in slurm job datetime data for BigQuery by @fdmalone in #2992
Other changes
- Use local-ssd for enroot temp space. by @samskillman in #3011
New Contributors
- @scott-nag made their first contribution in #2921
- @abbas1902 made their first contribution in #2956
- @fdmalone made their first contribution in #2992
Full Changelog: v1.38.0...v1.39.0
v1.38.0: Slurm GCP v6 for a3-highgpu-8g and added ability to disable automatic updates
What's Changed
Key New Features 🎉
- Add Slurm-GCP v6 based solution for provisioning a3-highgpu-8g compute nodes by @tpdownes in #2859
- Add allow_automatic_updates flag by @rohitramu in #2778
- Update slurm-gcp module to use custom endpoints. by @cdunbar13 in #2653
- Add local ssd RAID0 startup script by @alyssa-sm in #2720
New Modules 🧱
- Move GKE Modules to Core by @chengcongdu in #2758
Module Improvements 🔨
- Move slurm_files to the repo. by @mr0re1 in #2739
- Fix cleanup compute for different versions of gcloud by @cdunbar13 in #2794
- change default disk_type for GKE nodepool to null by @chengcongdu in #2818
- Add instance_properties var to nodeset by @mr0re1 in #2843
- Enable local SSD formatting solution to set POSIX permissions by @tpdownes in #2863
- support for min_cpu_platform usage on vm-instance by @RachaelSTamakloe in #2873
Improvements 🛠
- Gke optional accelerator by @ankitkinra in #2736
- add test for gke n2 pool with default driver by @chengcongdu in #2811
- Update local ssd examples to use local ssd startup solution by @alyssa-sm in #2870
- Update a3-megagpu-8 example to use local ssd solution by @alyssa-sm in #2871
Deprecations 💤
Version Updates ⏫
Bug fixes 🐞
- Fix construction of cloud.conf by @mr0re1 in #2810
- SlurmGCP. Fix broken --trace-api flag. by @mr0re1 in #2817
- SlurmGCP6. Fix nodes stuck in down* state. by @mr0re1 in #2856
- SlurmGCP. Fix bugs around nodeset zones by @mr0re1 in #2864
- Roll back changes in go.mod to release v1.37.2 by @nick-stroud in #2934
New Contributors
- @chengcongdu made their first contribution in #2758
- @ctk21 made their first contribution in #2761
- @arajmane-g made their first contribution in #2854
Full Changelog: v1.37.2...v1.38.0
v1.37.2 Fix SlurmGCP cleanup of resource policies
v1.37.1: Documentation update
Fix minor typographical errors in documentation
Full Changelog: v1.37.0...v1.37.1
v1.37.0
The HPC Toolkit has been rebranded to Cluster Toolkit. More details will follow shortly. The GitHub repository has been renamed to match. This should not break existing workflows; references to the old name should seamlessly redirect to the updated repo. The binary has been renamed to gcluster (formerly ghpc), but ghpc has been symlinked and will continue to work. If any unexpected behavior is noticed as part of this transition, please report it.
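For wrapper scripts that have to run against both older and newer installations, one possible approach is to look for the new binary name first and fall back to the old one. The sketch below assumes the binaries are on PATH; the helper name `resolve_toolkit_binary` and the injectable `which` parameter are hypothetical and exist only for this illustration.

```python
import shutil
from typing import Callable, Optional


def resolve_toolkit_binary(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first Toolkit binary found, or None.

    The new name (gcluster) is preferred; the old name (ghpc) is the fallback.
    """
    for name in ("gcluster", "ghpc"):
        path = which(name)
        if path is not None:
            return path
    return None
```

Passing the lookup function as a parameter keeps the helper easy to exercise without either binary installed; in normal use it can be called with no arguments.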
What's Changed
Key New Features 🎉
- Rename binary ghpc -> gcluster by @mr0re1 in #2813
- Update references to HPC Toolkit to Cluster Toolkit by @alyssa-sm in #2829
Other changes
- Roll version number to v1.37.0 by @nick-stroud in #2839
Full Changelog: v1.36.1...v1.37.0
v1.36.1: Fix Slurm GCP Cloud Parameter Defaults
What's Changed
Bug fixes 🐞
- Hot fix to add defaults to cloud params by @nick-stroud in #2812
Full Changelog: v1.36.0...v1.36.1