Generalize RES configuration of login nodes and user/group json
Add new config variable:
ExternalLoginNodes

Resolves #250

=====
Fix SlurmFsxLustre ingress rule.

CDK creates egress rule without matching ingress rule.

Resolves #252

=====

Fix FSxZ egress rules

Compensate for a bug in ParallelCluster that requires egress rules.

Leave the bug open so that the rules can be removed when the ParallelCluster bug is fixed.

Addresses #253

=====

Document FSx configuration

=====

Add IAM policy required to mount FSx

Add AttachRolePolicy, DetachRolePolicy for HeadNodePolicy.

Resolves #254

=====

Fix SNS notification bug when the CreateParallelCluster Lambda is missing a parameter
cartalla committed Sep 11, 2024
1 parent 2d84608 commit 482e804
Showing 14 changed files with 660 additions and 421 deletions.
@@ -122,6 +122,9 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
for dst_sg_name, dst_sg in lustre_security_groups.items():
src_sg.connections.allow_to(dst_sg, ec2.Port.tcp(988), f"{src_sg_name} to {dst_sg_name} lustre")
src_sg.connections.allow_to(dst_sg, ec2.Port.tcp_range(1018, 1023), f"{src_sg_name} to {dst_sg_name} lustre")
# It shouldn't be necessary to do allow_to and allow_from, but CDK left off the ingress rule from lustre to lustre if I didn't add the allow_from.
dst_sg.connections.allow_from(src_sg, ec2.Port.tcp(988), f"{src_sg_name} to {dst_sg_name} lustre")
dst_sg.connections.allow_from(src_sg, ec2.Port.tcp_range(1018, 1023), f"{src_sg_name} to {dst_sg_name} lustre")

# Rules for FSx Ontap
for fsx_client_sg_name, fsx_client_sg in fsx_client_security_groups.items():
@@ -138,12 +141,21 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
fsx_client_sg.connections.allow_to(fsx_ontap_sg, ec2.Port.udp(4046), f"{fsx_client_sg_name} to {fsx_ontap_sg_name} Network status monitor for NFS")

for fsx_zfs_sg_name, fsx_zfs_sg in zfs_security_groups.items():
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.tcp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.udp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.tcp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.udp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.tcp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_to(slurm_fsx_zfs_sg, ec2.Port.udp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.tcp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.udp(111), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.tcp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.udp(2049), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.tcp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_to(fsx_zfs_sg, ec2.Port.udp_range(20001, 20003), f"{fsx_client_sg_name} to {fsx_zfs_sg_name} NFS mount, status monitor, and lock daemon")
# There is a bug in PC 3.10.1 that requires outbound traffic to be enabled even though ZFS doesn't require it.
# Remove these rules when the bug in ParallelCluster is fixed.
# Tracked by https://github.com/aws-samples/aws-eda-slurm-cluster/issues/253
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.tcp(111), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.udp(111), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} rpc for NFS")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.tcp(2049), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.udp(2049), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS server daemon")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.tcp_range(20001, 20003), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS mount, status monitor, and lock daemon")
fsx_client_sg.connections.allow_from(fsx_zfs_sg, ec2.Port.udp_range(20001, 20003), f"{fsx_zfs_sg_name} to {fsx_client_sg_name} NFS mount, status monitor, and lock daemon")

for sg_name, sg in security_groups.items():
CfnOutput(self, f"{sg_name}Id",
docs/config.md (67 additions, 0 deletions)
@@ -14,6 +14,16 @@ This project creates a ParallelCluster configuration file that is documented in
<a href="#timezone">TimeZone</a>: str
<a href="#additionalsecuritygroupsstackname">AdditionalSecurityGroupsStackName</a>: str
<a href="#resstackname">RESStackName</a>: str
<a href="#externalloginnodes">ExternalLoginNodes</a>:
- <a href="#tags">Tags</a>:
- Key: str
Values: [ str ]
SecurityGroupId: str
<a href="#domainjoinedinstance">DomainJoinedInstance</a>:
- <a href="#tags">Tags</a>:
- Key: str
Values: [ str ]
SecurityGroupId: str
<a href="#slurm">slurm</a>:
<a href="#parallelclusterconfig">ParallelClusterConfig</a>:
<a href="#version">Version</a>: str
@@ -212,6 +222,63 @@ This requires you to [configure security groups for external login nodes](../dep
The Slurm binaries will be compiled for the OS of the desktops, and an environment modulefile will be created
so that users just need to load the cluster modulefile to use the cluster.

### ExternalLoginNodes

An array of specifications for instances that should automatically be configured as Slurm login nodes.
Each array element contains one or more tags that will be used to select login node instances.
It also includes the security group id that must be attached to the login nodes to give them access to the Slurm cluster.
The tags for a group of instances are specified as an array of tag keys, each with an array of values.

A lambda function processes each login node specification.
It uses the tags to select running instances.
If an instance does not have the security group attached, then the lambda attaches it.
It then runs a script on each instance to configure it as a login node for the Slurm cluster.
To use the cluster, users simply load the environment modulefile that is created by the script.

For example, to configure RES virtual desktops as Slurm login nodes, the following configuration is added.

```
---
ExternalLoginNodes:
  - Tags:
      - Key: 'res:EnvironmentName'
        Values: [ 'res-eda' ]
      - Key: 'res:NodeType'
        Values: ['virtual-desktop-dcv-host']
    SecurityGroupId: <SlurmLoginNodeSGId>
```
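
As a rough illustration of the tag-based selection and security group attachment described above, the sketch below uses boto3. It is not the project's actual lambda implementation; the function name, region handling, and filter construction are assumptions.

```
# Illustrative sketch only: approximates the tag-based selection and security
# group attachment described above; not the project's actual lambda code.
import boto3

def find_and_prepare_login_nodes(tags, security_group_id, region):
    ec2 = boto3.client('ec2', region_name=region)
    # Build instance filters from the ExternalLoginNodes tag specification.
    filters = [{'Name': 'instance-state-name', 'Values': ['running']}]
    for tag in tags:
        filters.append({'Name': f"tag:{tag['Key']}", 'Values': tag['Values']})
    instance_ids = []
    for reservation in ec2.describe_instances(Filters=filters)['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])
            # Attach the login node security group if it isn't already attached.
            for eni in instance['NetworkInterfaces']:
                group_ids = [group['GroupId'] for group in eni['Groups']]
                if security_group_id not in group_ids:
                    ec2.modify_network_interface_attribute(
                        NetworkInterfaceId=eni['NetworkInterfaceId'],
                        Groups=group_ids + [security_group_id])
    return instance_ids
```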

### DomainJoinedInstance

A specification for a domain-joined instance that will be used to create and update users_groups.json.
It also includes the security group id that must be attached to the instance to give it access to the Slurm head node so that it can mount the Slurm configuration file system.
The tags for the instance are specified as an array of tag keys, each with an array of values.

A lambda function processes the specification.
It uses the tags to select a running instance.
If the instance does not have the security group attached, then the lambda attaches it.
It then runs a script on the instance that saves all of the users and groups into a json file that
is used to create local users and groups on compute nodes when they boot.

For example, to configure the RES cluster manager, the following configuration is added.

```
---
DomainJoinedInstance:
  - Tags:
      - Key: 'Name'
        Values: [ 'res-eda-cluster-manager' ]
      - Key: 'res:EnvironmentName'
        Values: [ 'res-eda' ]
      - Key: 'res:ModuleName'
        Values: [ 'cluster-manager' ]
      - Key: 'res:ModuleId'
        Values: [ 'cluster-manager' ]
      - Key: 'app'
        Values: ['virtual-desktop-dcv-host']
    SecurityGroupId: <SlurmLoginNodeSGId>
```
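
As a rough sketch of the users/groups snapshot described above, the following Python dumps NSS-visible users and groups to a json file using the standard pwd and grp modules. This is illustrative only; the project's actual script, output format, and handling of directory-service enumeration may differ, and the UID/GID cutoff is an assumption.

```
# Illustrative sketch only: one way to snapshot users and groups to json.
# The real script's format and file location may differ; min_id is an assumption.
import grp
import json
import pwd

def dump_users_groups(output_path='users_groups.json', min_id=1000):
    users = {}
    for user in pwd.getpwall():
        if user.pw_uid >= min_id:
            users[user.pw_name] = {'uid': user.pw_uid, 'gid': user.pw_gid,
                                   'home': user.pw_dir, 'shell': user.pw_shell}
    groups = {}
    for group in grp.getgrall():
        if group.gr_gid >= min_id:
            groups[group.gr_name] = {'gid': group.gr_gid, 'members': list(group.gr_mem)}
    with open(output_path, 'w') as output_file:
        json.dump({'users': users, 'groups': groups}, output_file, indent=4)

if __name__ == '__main__':
    dump_users_groups()
```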

## slurm

Slurm configuration parameters.
docs/deployment-prerequisites.md (55 additions, 0 deletions)
@@ -441,3 +441,58 @@ slurm:
ansys:
Count: 1
```

### Configure File Systems

The Storage/ExtraMounts parameter allows you to configure additional file systems to mount on compute nodes.
Note that the security groups for the file systems must allow connections from the compute nodes.

#### Lustre

The following example shows how to add an FSx for Lustre file system.
The mount information can be found in the FSx console.

```
storage:
  ExtraMounts:
    - dest: /lustre
      src: <FileSystemId>.fsx.<Region>.amazonaws.com@tcp:/<MountName>
      StorageType: FsxLustre
      FileSystemId: <FileSystemId>
      type: lustre
      options: relatime,flock
```

#### ONTAP

The following example shows how to add an FSx for NetApp ONTAP file system.
The mount information can be found in the FSx console.

```
storage:
  ExtraMounts:
    - dest: /ontap
      src: <SvmId>.<FileSystemId>.fsx.<Region>.amazonaws.com:/vol1
      StorageType: FsxOntap
      FileSystemId: <FileSystemId>
      VolumeId: <VolumeId>
      type: nfs
      options: default
```

#### ZFS

The following example shows how to add an FSx for OpenZFS file system.
The mount information can be found in the FSx console.

```
storage:
  ExtraMounts:
    - dest: /zfs
      src: <FileSystemId>.fsx.<Region>.amazonaws.com:/fsx
      StorageType: FsxOpenZfs
      FileSystemId: <FileSystemId>
      VolumeId: <VolumeId>
      type: nfs
      options: noatime,nfsvers=3,sync,nconnect=16,rsize=1048576,wsize=1048576
```
