Llama training with FP8 #331
Conversation
Overall, this looks like a great feature addition to 10.FSDP. Do you think TE support can be added to the existing test case instead of creating a new one?
@@ -0,0 +1,2 @@
checkpoints
slurm-*.out
slurm-*.out should already be excluded by https://github.com/aws-samples/awsome-distributed-training/blob/main/.gitignore
@@ -0,0 +1,183 @@
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
You may want to refer to the original code: https://github.com/NVIDIA/TransformerEngine/blob/16a469df6bbc77e1c32e48e8e5fd3082dbc2d18e/docs/examples/te_llama/te_llama.py
@KeitaW thanks for the review! I was thinking about adding FP8 support to the FSDP example, but there are two reasons why I decided to create a separate example for this:
So, in terms of importance, this example is about Llama with FP8; the FSDP training here is just scaffolding.
@pbelevich FYI the AWS DLC for PyTorch also includes TE
# The three must-be-built packages.
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
Suggested change:
- ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
+ ARG AWS_OFI_NCCL_VERSION=1.9.2-aws
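The ENV-to-ARG suggestions in this review reflect a general Dockerfile distinction: ARG values exist only while the image is being built, while ENV values are baked into the image and inherited by every container started from it. A hedged sketch (version numbers are illustrative, not the PR's pins):

```dockerfile
# ARG: build-time only; good for pinning package versions without
# leaking them into the runtime environment of the final image.
ARG EFA_INSTALLER_VERSION=1.33.0

# ENV: persists into the image; every container sees this variable.
ENV NCCL_DEBUG=INFO

# ARG values are usable in build steps like this one.
RUN echo "building with EFA installer ${EFA_INSTALLER_VERSION}"
```

Since these version variables are only needed to drive the build steps, ARG is the better fit, which is what the suggestions propose.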
# The three must-be-built packages.
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
Suggested change:
- ENV EFA_INSTALLER_VERSION=1.30.0
+ ARG EFA_INSTALLER_VERSION=1.33.0
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_TESTS_VERSION=master
Suggested change:
- ENV NCCL_TESTS_VERSION=master
+ ARG NCCL_TESTS_VERSION=2.13.9
# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
# aws-ofi-nccl.
####################################################################################################
ENV NCCL_VERSION=2.19.3-1
Let's move that to the top.
####################################################################################################
ENV OPEN_MPI_PATH=/opt/amazon/openmpi
# OpenMPI build script claims PMIX_VERSION, and complains if we use it.
Is this really needed?
## For G4dn and other G5, comment out all
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
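The "comment out all" note above gates these exports by hand. A hedged sketch of doing that gating in the script itself; the hard-coded instance family is an assumption for illustration (a real script might query EC2 instance metadata):

```shell
#!/bin/bash
# Hypothetical: the instance family would normally be detected at runtime;
# hard-coded here so the sketch is self-contained.
INSTANCE_FAMILY="p4d"

case "$INSTANCE_FAMILY" in
  p4d|p4de|p5)
    # GPUDirect RDMA over EFA is only supported on these families;
    # on G4dn/G5 this variable should stay unset.
    export FI_EFA_USE_DEVICE_RDMA=1
    ;;
esac

export FI_EFA_FORK_SAFE=1
echo "FI_EFA_USE_DEVICE_RDMA=${FI_EFA_USE_DEVICE_RDMA:-unset}"
```

This keeps one launch script usable across instance types instead of maintaining commented-out variants.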
Remove