Llama training with FP8 #331
Conversation
Overall, this looks like a great feature addition to 10.FSDP. Do you think TE support can be added to the existing test case instead of creating a new one?
@@ -0,0 +1,2 @@
checkpoints
slurm-*.out
slurm-*.out should already be excluded by https://github.com/aws-samples/awsome-distributed-training/blob/main/.gitignore
@@ -0,0 +1,183 @@
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
You may want to refer to the original code: https://github.com/NVIDIA/TransformerEngine/blob/16a469df6bbc77e1c32e48e8e5fd3082dbc2d18e/docs/examples/te_llama/te_llama.py
@KeitaW thanks for the review! I was thinking about adding FP8 support to the FSDP example, but there are two reasons why I decided to create a separate example for this:
So, in terms of importance, this example is about Llama with FP8; the FSDP training here is just scaffolding.
@pbelevich FYI the AWS DLC for PyTorch also includes TE
# The three must-be-built packages.
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
Suggested change:
- ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
+ ARG AWS_OFI_NCCL_VERSION=1.9.2-aws
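The ENV-to-ARG suggestions in this review reflect a general Dockerfile distinction: ARG values exist only while the image is being built, while ENV values are baked into the image and inherited by every container started from it. A hedged sketch (version numbers are illustrative, not the PR's pins):

```dockerfile
# ARG: build-time only; good for pinning package versions without
# leaking them into the runtime environment of the final image.
ARG EFA_INSTALLER_VERSION=1.33.0

# ENV: persists into the image; every container sees this variable.
ENV NCCL_DEBUG=INFO

# ARG values are usable in build steps like this one.
RUN echo "building with EFA installer ${EFA_INSTALLER_VERSION}"
```

Since these version variables are only needed to drive the build steps, ARG is the better fit, which is what the suggestions propose.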
# The three must-be-built packages.
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
Suggested change:
- ENV EFA_INSTALLER_VERSION=1.30.0
+ ARG EFA_INSTALLER_VERSION=1.33.0
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_TESTS_VERSION=master
Suggested change:
- ENV NCCL_TESTS_VERSION=master
+ ARG NCCL_TESTS_VERSION=2.13.9
# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
# aws-ofi-nccl.
####################################################################################################
ENV NCCL_VERSION=2.19.3-1
Let's move that to the top.
####################################################################################################
ENV OPEN_MPI_PATH=/opt/amazon/openmpi
# OpenMPI build script claims PMIX_VERSION, and complains if we use it.
Is this really needed?
## For G4dn and other G5, comment out all
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
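The "comment out all" note above gates these exports by hand. A hedged sketch of doing that gating in the script itself; the hard-coded instance family is an assumption for illustration (a real script might query EC2 instance metadata):

```shell
#!/bin/bash
# Hypothetical: the instance family would normally be detected at runtime;
# hard-coded here so the sketch is self-contained.
INSTANCE_FAMILY="p4d"

case "$INSTANCE_FAMILY" in
  p4d|p4de|p5)
    # GPUDirect RDMA over EFA is only supported on these families;
    # on G4dn/G5 this variable should stay unset.
    export FI_EFA_USE_DEVICE_RDMA=1
    ;;
esac

export FI_EFA_FORK_SAFE=1
echo "FI_EFA_USE_DEVICE_RDMA=${FI_EFA_USE_DEVICE_RDMA:-unset}"
```

This keeps one launch script usable across instance types instead of maintaining commented-out variants.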
Remove