Commit

Merge branch 'develop' of github.com:annuay-google/cluster-toolkit into annuay/fix-orphaned-resource-states
annuay-google committed Oct 3, 2024
2 parents e53ee50 + 68d270c commit 19c5409
Showing 7 changed files with 46 additions and 0 deletions.
1 change: 1 addition & 0 deletions modules/compute/gke-node-pool/README.md
@@ -294,6 +294,7 @@ limitations under the License.
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Disk type for each node. | `string` | `null` | no |
| <a name="input_enable_gcfs"></a> [enable\_gcfs](#input\_enable\_gcfs) | Enable the Google Container Filesystem (GCFS). See [restrictions](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#gcfs_config). | `bool` | `false` | no |
| <a name="input_enable_secure_boot"></a> [enable\_secure\_boot](#input\_enable\_secure\_boot) | Enable secure boot for the nodes. Keep enabled unless custom kernel modules need to be loaded. See [here](https://cloud.google.com/compute/shielded-vm/docs/shielded-vm#secure-boot) for more info. | `bool` | `true` | no |
| <a name="input_gke_version"></a> [gke\_version](#input\_gke\_version) | GKE version | `string` | n/a | yes |
| <a name="input_guest_accelerator"></a> [guest\_accelerator](#input\_guest\_accelerator) | List of the type and count of accelerator cards attached to the instance. | <pre>list(object({<br/> type = optional(string)<br/> count = optional(number, 0)<br/> gpu_driver_installation_config = optional(list(object({<br/> gpu_driver_version = string<br/> })))<br/> gpu_partition_size = optional(string)<br/> gpu_sharing_config = optional(list(object({<br/> gpu_sharing_strategy = optional(string)<br/> max_shared_clients_per_gpu = optional(number)<br/> })))<br/> }))</pre> | `null` | no |
| <a name="input_host_maintenance_interval"></a> [host\_maintenance\_interval](#input\_host\_maintenance\_interval) | Specifies the frequency of planned maintenance events. | `string` | `""` | no |
| <a name="input_image_type"></a> [image\_type](#input\_image\_type) | The default image type used by NAP once a new node pool is being created. Use either COS\_CONTAINERD or UBUNTU\_CONTAINERD. | `string` | `"COS_CONTAINERD"` | no |
28 changes: 28 additions & 0 deletions modules/compute/gke-node-pool/gpu_direct.tf
@@ -33,6 +33,12 @@ locals {
updated_workload_path = replace(local.workload_path_tcpx, ".yaml", "-tcpx.yaml")
rxdm_version = "v2.0.12" # matching nccl-tcpx-installer version v3.1.9
min_additional_networks = 4
major_minor_version_acceptable_map = {
"1.27" = "1.27.7-gke.1121000"
"1.28" = "1.28.8-gke.1095000"
"1.29" = "1.29.3-gke.1093000"
"1.30" = "1.30.2-gke.1023000"
}
}
"a3-megagpu-8g" = {
# Manifest to be installed for enabling TCPXO on a3-megagpu-8g machines
@@ -43,10 +49,25 @@ locals {
updated_workload_path = replace(local.workload_path_tcpxo, ".yaml", "-tcpxo.yaml")
rxdm_version = "v1.0.10" # matching nccl-tcpxo-installer version v1.0.4
min_additional_networks = 8
major_minor_version_acceptable_map = {
"1.28" = "1.28.9-gke.1250000"
"1.29" = "1.29.4-gke.1542000"
"1.30" = "1.30.4-gke.1129000"
}
}
}

min_additional_networks = try(local.gpu_direct_settings[var.machine_type].min_additional_networks, 0)

gke_version_regex = "(\\d+\\.\\d+)\\.(\\d+)-gke\\.(\\d+)" # GKE version format: 1.X.Y-gke.Z , regex output: ["1.X" , "Y", "Z"]

gke_version_parts = regex(local.gke_version_regex, var.gke_version)
gke_version_major = local.gke_version_parts[0]

major_minor_version_acceptable_map = try(local.gpu_direct_settings[var.machine_type].major_minor_version_acceptable_map, null)
minor_version_acceptable = try(contains(keys(local.major_minor_version_acceptable_map), local.gke_version_major), false) ? local.major_minor_version_acceptable_map[local.gke_version_major] : "1.0.0-gke.0"
minor_version_acceptable_parts = regex(local.gke_version_regex, local.minor_version_acceptable)
gke_gpudirect_compatible = local.gke_version_parts[1] > local.minor_version_acceptable_parts[1] || (local.gke_version_parts[1] == local.minor_version_acceptable_parts[1] && local.gke_version_parts[2] >= local.minor_version_acceptable_parts[2])
}

check "gpu_direct_check_multi_vpc" {
@@ -55,3 +76,10 @@ check "gpu_direct_check_multi_vpc" {
error_message = "To achieve optimal performance for ${var.machine_type} machine, at least ${local.min_additional_networks} additional vpc is recommended. You could configure it in the blueprint through modules/network/multivpc with network_count set as ${local.min_additional_networks}"
}
}

check "gke_version_requirements" {
assert {
condition = local.gke_gpudirect_compatible
error_message = "GPUDirect is not supported on GKE version ${var.gke_version} for ${var.machine_type} machine. For supported version details visit https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#requirements"
}
}
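
The comparison behind `gke_gpudirect_compatible` can be tried in isolation. The following is a minimal standalone sketch, not the module's code: the names `version_regex`, `minimum_by_minor`, `actual`, `minimum` and the output `compatible` are hypothetical, the default version is only an example, and `lookup()` with `tonumber()` are substituted for the `try(contains(keys(...)))` expression and the implicit string-to-number conversion used in the diff. As in the diff, a major.minor that is absent from the map falls back to `1.0.0-gke.0`, which every version satisfies, so the check only constrains the minor versions that are listed.

```hcl
# Standalone sketch of the version comparison; no providers are required.
# Evaluate with `terraform init && terraform plan` in an empty directory.

variable "gke_version" {
  description = "Hypothetical cluster version under test."
  type        = string
  default     = "1.29.4-gke.1542000"
}

locals {
  # Same pattern as in the diff: captures ["1.X", "Y", "Z"] from 1.X.Y-gke.Z.
  version_regex = "(\\d+\\.\\d+)\\.(\\d+)-gke\\.(\\d+)"

  # Minimum versions copied from the a3-highgpu-8g entry in the diff.
  minimum_by_minor = {
    "1.27" = "1.27.7-gke.1121000"
    "1.28" = "1.28.8-gke.1095000"
    "1.29" = "1.29.3-gke.1093000"
    "1.30" = "1.30.2-gke.1023000"
  }

  actual = regex(local.version_regex, var.gke_version)

  # Unlisted minor versions fall back to a floor that always passes,
  # mirroring the "1.0.0-gke.0" default in the diff.
  minimum = regex(
    local.version_regex,
    lookup(local.minimum_by_minor, local.actual[0], "1.0.0-gke.0")
  )

  # Compare the patch number first, then the GKE build number, as numbers.
  compatible = (
    tonumber(local.actual[1]) > tonumber(local.minimum[1]) ||
    (
      tonumber(local.actual[1]) == tonumber(local.minimum[1]) &&
      tonumber(local.actual[2]) >= tonumber(local.minimum[2])
    )
  )
}

output "compatible" {
  value = local.compatible
}
```

With the default above, patch `4` exceeds the required patch `3` for `1.29`, so `compatible` evaluates to `true`. Passing `-var gke_version=1.29.3-gke.1000000` makes the patch numbers equal and the build number too low, so it evaluates to `false`.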
5 changes: 5 additions & 0 deletions modules/compute/gke-node-pool/variables.tf
@@ -360,3 +360,8 @@ variable "initial_node_count" {
type = number
default = null
}

variable "gke_version" {
description = "GKE version"
type = string
}
1 change: 1 addition & 0 deletions modules/scheduler/gke-cluster/README.md
@@ -197,6 +197,7 @@ limitations under the License.
| <a name="output_cluster_id"></a> [cluster\_id](#output\_cluster\_id) | An identifier for the resource with format projects/{{project\_id}}/locations/{{region}}/clusters/{{name}}. |
| <a name="output_gke_cluster_endpoint"></a> [gke\_cluster\_endpoint](#output\_gke\_cluster\_endpoint) | GKE cluster endpoint. |
| <a name="output_gke_cluster_exists"></a> [gke\_cluster\_exists](#output\_gke\_cluster\_exists) | A static flag that signals to downstream modules that a cluster has been created. Needed by community/modules/scripts/kubernetes-operations. |
| <a name="output_gke_version"></a> [gke\_version](#output\_gke\_version) | GKE cluster's version. |
| <a name="output_instructions"></a> [instructions](#output\_instructions) | Instructions on how to connect to the created cluster. |
| <a name="output_k8s_service_account_name"></a> [k8s\_service\_account\_name](#output\_k8s\_service\_account\_name) | Name of k8s service account. |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
5 changes: 5 additions & 0 deletions modules/scheduler/gke-cluster/outputs.tf
@@ -89,3 +89,8 @@ output "access_token" {
description = "Access token."
value = data.google_client_config.default.access_token
}

output "gke_version" {
description = "GKE cluster's version."
value = google_container_cluster.gke_cluster.master_version
}
1 change: 1 addition & 0 deletions modules/scheduler/pre-existing-gke-cluster/README.md
@@ -111,4 +111,5 @@ limitations under the License.
|------|-------------|
| <a name="output_cluster_id"></a> [cluster\_id](#output\_cluster\_id) | An identifier for the gke cluster with format projects/{{project\_id}}/locations/{{region}}/clusters/{{name}}. |
| <a name="output_gke_cluster_exists"></a> [gke\_cluster\_exists](#output\_gke\_cluster\_exists) | A static flag that signals to downstream modules that a cluster exists. |
| <a name="output_gke_version"></a> [gke\_version](#output\_gke\_version) | GKE cluster's version. |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
5 changes: 5 additions & 0 deletions modules/scheduler/pre-existing-gke-cluster/outputs.tf
@@ -26,3 +26,8 @@ output "gke_cluster_exists" {
data.google_container_cluster.existing_gke_cluster
]
}

output "gke_version" {
description = "GKE cluster's version."
value = data.google_container_cluster.existing_gke_cluster.master_version
}
