Move changes from experimental to develop #3265

Draft: wants to merge 49 commits into base: develop
Showing changes from 40 commits

Commits (49):
83dc146
Updating vm instance to allow for RDMA nic-types (private preview only
cdunbar13 Aug 6, 2024
a916e29
Adding specific google-provider and updating modules for network profile
cdunbar13 Aug 20, 2024
7507dc5
Updated version to match functional version of google-private
cdunbar13 Aug 27, 2024
fcffeb4
Updating slurm to use new branch for experimental features
cdunbar13 Aug 29, 2024
5d19f30
Merge pull request #2974 from cdunbar13/private_preview
cdunbar13 Sep 3, 2024
3ddad0b
create some delta
annuay-google Sep 5, 2024
d6ea335
Updating private provider versions and slurm-gcp references
cdunbar13 Sep 9, 2024
61b7542
Update to RDMA VPC for reducing repeated code
cdunbar13 Sep 11, 2024
17987fc
Merge pull request #3021 from cdunbar13/experimental
cdunbar13 Sep 12, 2024
ac4964a
merge develop
annuay-google Sep 24, 2024
c8011ca
Update network_ip to use empty string
tpdownes Sep 26, 2024
3361e82
Merge pull request #3076 from GoogleCloudPlatform/fix_network_ip
tpdownes Sep 26, 2024
cb8c933
Merge branch 'experimental' of github.com:GoogleCloudPlatform/cluster…
annuay-google Sep 30, 2024
2a88693
Merge branch 'develop' of github.com:annuay-google/cluster-toolkit in…
annuay-google Sep 30, 2024
99493df
Merge pull request #3070 from annuay-google/merge-develop-to-experime…
cdunbar13 Sep 30, 2024
00c34d8
RDMA Support in GKE Modules
arajmane-g Oct 9, 2024
01f7bf6
Merge pull request #1 from annuay-google/a3u_rdma
arajmane-g Oct 9, 2024
5ca7d82
Address Feedback
arajmane-g Oct 11, 2024
658bbb1
Merge pull request #3114 from annuay-google/experimental
cdunbar13 Oct 11, 2024
321e6b9
Use template_subnetworks to generate output_subnets_gke
arajmane-g Oct 16, 2024
044d6b4
Merge pull request #3137 from arajmane-g/experimental
arajmane-g Oct 22, 2024
52ef947
Support NCCL and add blueprint
annuay-google Oct 28, 2024
9c86011
Delete blueprint
annuay-google Oct 28, 2024
651ac17
Merge pull request #3173 from arajmane-g/experimental
annuay-google Oct 28, 2024
7ea3677
Updating rdma-vpc to require users to enter a nic type to be associat…
cdunbar13 Nov 1, 2024
dbcef0b
Merge branch 'develop' of github.com:GoogleCloudPlatform/cluster-tool…
cdunbar13 Nov 1, 2024
a92939d
Merge pull request #3205 from cdunbar13/experimental
tpdownes Nov 1, 2024
1d982d2
Support Extended Reservations
arajmane-g Nov 5, 2024
b6ab698
delete terraform dir
annuay-google Nov 8, 2024
71b4b9e
add ssd config for a3u
annuay-google Nov 8, 2024
99e10b8
Merge pull request #3238 from annuay-google/annuay/add-ssd-config-for…
annuay-google Nov 8, 2024
a7830eb
GKE doesn't support shared extended reservations yet
arajmane-g Nov 11, 2024
2dedd06
Merge pull request #3218 from arajmane-g/experimental
annuay-google Nov 11, 2024
128bf7e
merge complete
annuay-google Nov 12, 2024
58aeb27
Invert SSD and NVME counts
annuay-google Nov 12, 2024
e419839
Merge pull request #3247 from annuay-google/sync-develop
ankitkinra Nov 13, 2024
cd02cea
upgrade kueue default version to v0.9.0 to support TAS
ighosh98 Nov 14, 2024
7385b1a
Update kueue error message
ighosh98 Nov 14, 2024
d4e2b20
Make variables.tf more modular and set default kueue version to v0.8.1
ighosh98 Nov 15, 2024
1d1cfbe
Merge pull request #3260 from ighosh98/experimental-kueue
ighosh98 Nov 15, 2024
42d35eb
merge develop
annuay-google Nov 18, 2024
5a82958
Merge pull request #3278 from annuay-google/exp-copy-annuay
annuay-google Nov 18, 2024
f5e8171
Updating RDMA-VPC to use v9.3 CFT modules
cdunbar13 Nov 20, 2024
700db8b
Update community/modules/network/rdma-vpc/vpc-submodule/versions.tf
cdunbar13 Nov 20, 2024
dbd2053
Merge pull request #3293 from cdunbar13/update-rdma-vpc
cdunbar13 Nov 20, 2024
467d116
Moving from private provider to google-beta 6.13.0
cdunbar13 Nov 26, 2024
75b952a
Merge pull request #3308 from cdunbar13/rdma-vpc-update
cdunbar13 Nov 26, 2024
49bd1c1
Changing empty network_ip from empty string to null
cdunbar13 Nov 26, 2024
ad9e1bd
Merge pull request #3309 from cdunbar13/experimental
cdunbar13 Nov 26, 2024
@@ -38,6 +38,7 @@ locals {
"a2-ultragpu-8g" = { type = "nvidia-a100-80gb", count = 8 },
"a3-highgpu-8g" = { type = "nvidia-h100-80gb", count = 8 },
"a3-megagpu-8g" = { type = "nvidia-h100-mega-80gb", count = 8 },
+ "a3-ultragpu-8g" = { type = "nvidia-h200-141gb", count = 8 },
"g2-standard-4" = { type = "nvidia-l4", count = 1 },
"g2-standard-8" = { type = "nvidia-l4", count = 1 },
"g2-standard-12" = { type = "nvidia-l4", count = 1 },
2 changes: 1 addition & 1 deletion community/modules/compute/pbspro-execution/README.md
@@ -102,7 +102,7 @@ No resources.
| <a name="input_machine_type"></a> [machine\_type](#input\_machine\_type) | Machine type to use for the instance creation | `string` | `"c2-standard-60"` | no |
| <a name="input_metadata"></a> [metadata](#input\_metadata) | Metadata, provided as a map | `map(string)` | `{}` | no |
| <a name="input_name_prefix"></a> [name\_prefix](#input\_name\_prefix) | Name prefix for PBS execution hostnames | `string` | `null` | no |
- | <a name="input_network_interfaces"></a> [network\_interfaces](#input\_network\_interfaces) | A list of network interfaces. The options match that of the terraform<br/>network\_interface block of google\_compute\_instance. For descriptions of the<br/>subfields or more information see the documentation:<br/>https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance#nested_network_interface<br/><br/>**\_NOTE:\_** If `network_interfaces` are set, `network_self_link` and<br/>`subnetwork_self_link` will be ignored, even if they are provided through<br/>the `use` field. `bandwidth_tier` and `enable_public_ips` also do not apply<br/>to network interfaces defined in this variable.<br/><br/>Subfields:<br/>network (string, required if subnetwork is not supplied)<br/>subnetwork (string, required if network is not supplied)<br/>subnetwork\_project (string, optional)<br/>network\_ip (string, optional)<br/>nic\_type (string, optional, choose from ["GVNIC", "VIRTIO\_NET"])<br/>stack\_type (string, optional, choose from ["IPV4\_ONLY", "IPV4\_IPV6"])<br/>queue\_count (number, optional)<br/>access\_config (object, optional)<br/>ipv6\_access\_config (object, optional)<br/>alias\_ip\_range (list(object), optional) | <pre>list(object({<br/> network = string,<br/> subnetwork = string,<br/> subnetwork_project = string,<br/> network_ip = string,<br/> nic_type = string,<br/> stack_type = string,<br/> queue_count = number,<br/> access_config = list(object({<br/> nat_ip = string,<br/> public_ptr_domain_name = string,<br/> network_tier = string<br/> })),<br/> ipv6_access_config = list(object({<br/> public_ptr_domain_name = string,<br/> network_tier = string<br/> })),<br/> alias_ip_range = list(object({<br/> ip_cidr_range = string,<br/> subnetwork_range_name = string<br/> }))<br/> }))</pre> | `[]` | no |
+ | <a name="input_network_interfaces"></a> [network\_interfaces](#input\_network\_interfaces) | A list of network interfaces. The options match that of the terraform<br/>network\_interface block of google\_compute\_instance. For descriptions of the<br/>subfields or more information see the documentation:<br/>https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance#nested_network_interface<br/><br/>**\_NOTE:\_** If `network_interfaces` are set, `network_self_link` and<br/>`subnetwork_self_link` will be ignored, even if they are provided through<br/>the `use` field. `bandwidth_tier` and `enable_public_ips` also do not apply<br/>to network interfaces defined in this variable.<br/><br/>Subfields:<br/>network (string, required if subnetwork is not supplied)<br/>subnetwork (string, required if network is not supplied)<br/>subnetwork\_project (string, optional)<br/>network\_ip (string, optional)<br/>nic\_type (string, optional, choose from ["GVNIC", "VIRTIO\_NET", "RDMA", "IRDMA", "MRDMA"])<br/>stack\_type (string, optional, choose from ["IPV4\_ONLY", "IPV4\_IPV6"])<br/>queue\_count (number, optional)<br/>access\_config (object, optional)<br/>ipv6\_access\_config (object, optional)<br/>alias\_ip\_range (list(object), optional) | <pre>list(object({<br/> network = string,<br/> subnetwork = string,<br/> subnetwork_project = string,<br/> network_ip = string,<br/> nic_type = string,<br/> stack_type = string,<br/> queue_count = number,<br/> access_config = list(object({<br/> nat_ip = string,<br/> public_ptr_domain_name = string,<br/> network_tier = string<br/> })),<br/> ipv6_access_config = list(object({<br/> public_ptr_domain_name = string,<br/> network_tier = string<br/> })),<br/> alias_ip_range = list(object({<br/> ip_cidr_range = string,<br/> subnetwork_range_name = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_network_self_link"></a> [network\_self\_link](#input\_network\_self\_link) | The self link of the network to attach the VM. | `string` | `"default"` | no |
| <a name="input_network_storage"></a> [network\_storage](#input\_network\_storage) | An array of network attached storage mounts to be configured. | <pre>list(object({<br/> server_ip = string,<br/> remote_mount = string,<br/> local_mount = string,<br/> fs_type = string,<br/> mount_options = string,<br/> client_install_runner = map(string)<br/> mount_runner = map(string)<br/> }))</pre> | `[]` | no |
| <a name="input_on_host_maintenance"></a> [on\_host\_maintenance](#input\_on\_host\_maintenance) | Describes maintenance behavior for the instance. If left blank this will default to `MIGRATE` except for when `placement_policy`, spot provisioning, or GPUs require it to be `TERMINATE` | `string` | `null` | no |
2 changes: 1 addition & 1 deletion community/modules/compute/pbspro-execution/variables.tf
@@ -203,7 +203,7 @@ variable "network_interfaces" {
subnetwork (string, required if network is not supplied)
subnetwork_project (string, optional)
network_ip (string, optional)
- nic_type (string, optional, choose from ["GVNIC", "VIRTIO_NET"])
+ nic_type (string, optional, choose from ["GVNIC", "VIRTIO_NET", "RDMA", "IRDMA", "MRDMA"])
stack_type (string, optional, choose from ["IPV4_ONLY", "IPV4_IPV6"])
queue_count (number, optional)
access_config (object, optional)
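
For readers trying out the widened `nic_type` list, a value for `network_interfaces` might look like the sketch below. It is only an illustration built from the subfield list and object type shown in the README diff above: the subnetwork path is a hypothetical placeholder, and whether a given NIC type is accepted for a particular machine type is not something this diff establishes. Because the object type declares every attribute without `optional()`, unused fields are written out explicitly as `null` or empty lists.

```hcl
# Illustrative only: one interface requesting an RDMA NIC type.
# The subnetwork self link is a made-up placeholder, not taken from this PR.
network_interfaces = [
  {
    network            = null # omitted because subnetwork is supplied
    subnetwork         = "projects/my-project/regions/us-central1/subnetworks/subnet-0"
    subnetwork_project = null
    network_ip         = null # left unset; a later commit in this PR moves the empty value to null
    nic_type           = "MRDMA" # one of the values added by this change
    stack_type         = null
    queue_count        = null
    access_config      = []
    ipv6_access_config = []
    alias_ip_range     = []
  }
]
```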
@@ -38,6 +38,7 @@ locals {
"a2-ultragpu-8g" = { type = "nvidia-a100-80gb", count = 8 },
"a3-highgpu-8g" = { type = "nvidia-h100-80gb", count = 8 },
"a3-megagpu-8g" = { type = "nvidia-h100-mega-80gb", count = 8 },
+ "a3-ultragpu-8g" = { type = "nvidia-h200-141gb", count = 8 },
"g2-standard-4" = { type = "nvidia-l4", count = 1 },
"g2-standard-8" = { type = "nvidia-l4", count = 1 },
"g2-standard-12" = { type = "nvidia-l4", count = 1 },
@@ -74,7 +74,7 @@ modules. For support with the underlying modules, see the instructions in the

| Name | Source | Version |
|------|--------|---------|
- | <a name="module_slurm_nodeset_template"></a> [slurm\_nodeset\_template](#module\_slurm\_nodeset\_template) | github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template | 6.8.5 |
+ | <a name="module_slurm_nodeset_template"></a> [slurm\_nodeset\_template](#module\_slurm\_nodeset\_template) | github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template | b0575ab |

## Resources

@@ -38,6 +38,7 @@ locals {
"a2-ultragpu-8g" = { type = "nvidia-a100-80gb", count = 8 },
"a3-highgpu-8g" = { type = "nvidia-h100-80gb", count = 8 },
"a3-megagpu-8g" = { type = "nvidia-h100-mega-80gb", count = 8 },
+ "a3-ultragpu-8g" = { type = "nvidia-h200-141gb", count = 8 },
"g2-standard-4" = { type = "nvidia-l4", count = 1 },
"g2-standard-8" = { type = "nvidia-l4", count = 1 },
"g2-standard-12" = { type = "nvidia-l4", count = 1 },
@@ -56,7 +56,7 @@ locals {
}

module "slurm_nodeset_template" {
- source = "github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template?ref=6.8.5"
+ source = "github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template?ref=b0575ab"

project_id = var.project_id
region = var.region
@@ -38,6 +38,7 @@ locals {
"a2-ultragpu-8g" = { type = "nvidia-a100-80gb", count = 8 },
"a3-highgpu-8g" = { type = "nvidia-h100-80gb", count = 8 },
"a3-megagpu-8g" = { type = "nvidia-h100-mega-80gb", count = 8 },
+ "a3-ultragpu-8g" = { type = "nvidia-h200-141gb", count = 8 },
"g2-standard-4" = { type = "nvidia-l4", count = 1 },
"g2-standard-8" = { type = "nvidia-l4", count = 1 },
"g2-standard-12" = { type = "nvidia-l4", count = 1 },
82 changes: 82 additions & 0 deletions community/modules/network/rdma-vpc/README.md
@@ -0,0 +1,82 @@
## Description

This is an experimental VPC module.

Documentation will be updated at a later point.

## License

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 0.15.0 |

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_vpc"></a> [vpc](#module\_vpc) | ./vpc-submodule | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_allowed_ssh_ip_ranges"></a> [allowed\_ssh\_ip\_ranges](#input\_allowed\_ssh\_ip\_ranges) | A list of CIDR IP ranges from which to allow ssh access | `list(string)` | `[]` | no |
| <a name="input_delete_default_internet_gateway_routes"></a> [delete\_default\_internet\_gateway\_routes](#input\_delete\_default\_internet\_gateway\_routes) | If set, ensure that all routes within the network specified whose names begin with 'default-route' and with a next hop of 'default-internet-gateway' are deleted | `bool` | `false` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | The name of the current deployment | `string` | n/a | yes |
| <a name="input_enable_iap_rdp_ingress"></a> [enable\_iap\_rdp\_ingress](#input\_enable\_iap\_rdp\_ingress) | Enable a firewall rule to allow Windows Remote Desktop Protocol access using IAP tunnels | `bool` | `false` | no |
| <a name="input_enable_iap_ssh_ingress"></a> [enable\_iap\_ssh\_ingress](#input\_enable\_iap\_ssh\_ingress) | Enable a firewall rule to allow SSH access using IAP tunnels | `bool` | `true` | no |
| <a name="input_enable_iap_winrm_ingress"></a> [enable\_iap\_winrm\_ingress](#input\_enable\_iap\_winrm\_ingress) | Enable a firewall rule to allow Windows Remote Management (WinRM) access using IAP tunnels | `bool` | `false` | no |
| <a name="input_enable_internal_traffic"></a> [enable\_internal\_traffic](#input\_enable\_internal\_traffic) | Enable a firewall rule to allow all internal TCP, UDP, and ICMP traffic within the network | `bool` | `true` | no |
| <a name="input_extra_iap_ports"></a> [extra\_iap\_ports](#input\_extra\_iap\_ports) | A list of TCP ports for which to create firewall rules that enable IAP for TCP forwarding (use dedicated enable\_iap variables for standard ports) | `list(string)` | `[]` | no |
| <a name="input_firewall_log_config"></a> [firewall\_log\_config](#input\_firewall\_log\_config) | Firewall log configuration for Toolkit firewall rules (var.enable\_iap\_ssh\_ingress and others) | `string` | `"DISABLE_LOGGING"` | no |
| <a name="input_firewall_rules"></a> [firewall\_rules](#input\_firewall\_rules) | List of firewall rules | `any` | `[]` | no |
| <a name="input_mtu"></a> [mtu](#input\_mtu) | The network MTU (default: 8896). Recommended values: 0 (use Compute Engine default), 1460 (default outside HPC environments), 1500 (Internet default), or 8896 (for Jumbo packets). Allowed are all values in the range 1300 to 8896, inclusively. | `number` | `8896` | no |
| <a name="input_network_address_range"></a> [network\_address\_range](#input\_network\_address\_range) | IP address range (CIDR) for global network | `string` | `"10.0.0.0/9"` | no |
| <a name="input_network_description"></a> [network\_description](#input\_network\_description) | An optional description of this resource (changes will trigger resource destroy/create) | `string` | `""` | no |
| <a name="input_network_name"></a> [network\_name](#input\_network\_name) | The name of the network to be created (if unsupplied, will default to "{deployment\_name}-net") | `string` | `null` | no |
| <a name="input_network_profile"></a> [network\_profile](#input\_network\_profile) | Profile name for VPC configuration | `string` | `null` | no |
| <a name="input_network_routing_mode"></a> [network\_routing\_mode](#input\_network\_routing\_mode) | The network routing mode (default "GLOBAL") | `string` | `"GLOBAL"` | no |
| <a name="input_nic_type"></a> [nic\_type](#input\_nic\_type) | NIC type for use in modules that use the output | `string` | `"MRDMA"` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | Project in which the HPC deployment will be created | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The default region for Cloud resources | `string` | n/a | yes |
| <a name="input_secondary_ranges"></a> [secondary\_ranges](#input\_secondary\_ranges) | Secondary ranges that will be used in some of the subnets. Please see https://goo.gle/hpc-toolkit-vpc-deprecation for migration instructions. | `map(list(object({ range_name = string, ip_cidr_range = string })))` | `{}` | no |
| <a name="input_shared_vpc_host"></a> [shared\_vpc\_host](#input\_shared\_vpc\_host) | Makes this project a Shared VPC host if 'true' (default 'false') | `bool` | `false` | no |
| <a name="input_subnetworks_template"></a> [subnetworks\_template](#input\_subnetworks\_template) | Rules for creating subnetworks within the VPC | <pre>object({<br/> count = number<br/> name_prefix = string<br/> ip_range = string<br/> region = string<br/> private_access = optional(bool)<br/> })</pre> | <pre>{<br/> "count": 8,<br/> "ip_range": "192.168.0.0/16",<br/> "name_prefix": "subnet",<br/> "region": null<br/>}</pre> | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_network_id"></a> [network\_id](#output\_network\_id) | ID of the new VPC network |
| <a name="output_network_name"></a> [network\_name](#output\_network\_name) | Name of the new VPC network |
| <a name="output_network_self_link"></a> [network\_self\_link](#output\_network\_self\_link) | Self link of the new VPC network |
| <a name="output_subnetwork_interfaces"></a> [subnetwork\_interfaces](#output\_subnetwork\_interfaces) | Full list of subnetwork objects belonging to the new VPC network (compatible with vm-instance) |
| <a name="output_subnetwork_interfaces_gke"></a> [subnetwork\_interfaces\_gke](#output\_subnetwork\_interfaces\_gke) | Full list of subnetwork objects belonging to the new VPC network (compatible with gke-node-pool) |
| <a name="output_subnetwork_name_prefix"></a> [subnetwork\_name\_prefix](#output\_subnetwork\_name\_prefix) | Prefix of the RDMA subnetwork names |
| <a name="output_subnetworks"></a> [subnetworks](#output\_subnetworks) | Full list of subnetwork objects belonging to the new VPC network |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
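
Since the new README defers documentation, here is a rough sketch of how the module might be called directly from Terraform, based only on the inputs and outputs tables above. The module label `rdma_net`, the project, region, and deployment name are hypothetical placeholders, and the relative `source` path assumes the root module sits at the repository root. `network_profile` is left at its `null` default because the diff does not show what profile names are valid.

```hcl
# Sketch under stated assumptions; all concrete values are placeholders.
module "rdma_net" {
  source = "./community/modules/network/rdma-vpc" # path from this PR, relative to an assumed repo-root caller

  # The three inputs marked as required in the table above.
  deployment_name = "rdma-demo"   # hypothetical
  project_id      = "my-project"  # hypothetical
  region          = "us-central1" # hypothetical

  # Optional inputs, shown with the defaults listed in the table.
  nic_type = "MRDMA"
  mtu      = 8896

  subnetworks_template = {
    count       = 8
    name_prefix = "subnet"
    ip_range    = "192.168.0.0/16"
    region      = "us-central1" # table default is null; set explicitly here as an assumption
  }
}

# The README describes this output as compatible with vm-instance.
output "rdma_subnetwork_interfaces" {
  value = module.rdma_net.subnetwork_interfaces
}
```

The `subnetwork_interfaces_gke` output could be exposed the same way for use with gke-node-pool, per its description in the outputs table.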