
Llama training with FP8 #331

Draft · pbelevich wants to merge 1 commit into base: main
Conversation

pbelevich (Collaborator)
No description provided.


@KeitaW (Collaborator) left a comment:


Overall, this looks like a great feature addition to 10.FSDP. Do you think TE support could be added to the existing test case instead of creating a new one?

@@ -0,0 +1,2 @@
checkpoints
slurm-*.out

@@ -0,0 +1,183 @@
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.

@pbelevich (Collaborator, Author)

@KeitaW thanks for the review! I was thinking about adding FP8 support to the FSDP example, but there are two reasons I decided to create a separate example instead:

  1. Transformer Engine requires NVIDIA's container to run (or, as an alternative, a relatively complicated build from source with CUDA headers, cuDNN, etc.), and I don't want to complicate the FSDP example with that.
  2. This example is bound to the Llama model (taken from the TE examples), whereas the FSDP example supports multiple models that I don't want to rewrite with FP8.

So, in terms of importance, this example is about Llama with FP8; the FSDP training here is just scaffolding.
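For context on what "training with FP8" means here: Transformer Engine's recipes rely on amax-based scaling, where a tensor is scaled so its observed absolute maximum fits the narrow FP8 range (448 for the E4M3 format) before casting. A minimal pure-Python sketch of that scaling idea (illustrative only; these helper names are hypothetical, not TE's API, and mantissa rounding is ignored):

```python
# Illustrative sketch of amax-based FP8 scaling (hypothetical helpers, not TE's API).
E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def fp8_scale(amax: float) -> float:
    """Scale factor mapping the observed absolute maximum onto the FP8 range."""
    return E4M3_MAX / amax

def fp8_round_trip(x: float, amax: float) -> float:
    """Simulate a cast to FP8 and back: scale into range, clamp, unscale.

    Mantissa rounding is deliberately omitted; only the range/clamping
    behavior of the scaled cast is modeled.
    """
    s = fp8_scale(amax)
    clamped = max(-E4M3_MAX, min(E4M3_MAX, x * s))
    return clamped / s

# Values within the observed range survive; outliers beyond amax are clamped.
print(fp8_round_trip(0.5, amax=1.0))  # 0.5 (in range)
print(fp8_round_trip(2.0, amax=1.0))  # 1.0 (clamped to the scaled maximum)
```

TE's delayed-scaling recipe tracks the amax history across iterations rather than recomputing it per tensor per step, but the range-mapping principle is the same.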

@pbelevich changed the title from "Llama FSDP training with FP8" to "Llama training with FP8" on May 15, 2024
@sbhavani

@pbelevich FYI, the AWS Deep Learning Container (DLC) for PyTorch also includes TE

# The three must-be-built packages.
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
Contributor

Suggested change:
- ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
+ ARG AWS_OFI_NCCL_VERSION=1.9.2-aws
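The review suggestions replace ENV with ARG for build-only version pins. As a brief sketch of the distinction (generic Dockerfile semantics, not code from this PR): ARG exists only during the build and can be overridden per build with `--build-arg`, while ENV is baked into the image and visible in every container started from it.

```dockerfile
# Build-time pin: override with `docker build --build-arg EFA_INSTALLER_VERSION=1.33.0 .`;
# leaves no trace in the final image's environment.
ARG EFA_INSTALLER_VERSION=1.33.0

# Runtime setting: persists into every container started from the image.
ENV OPEN_MPI_PATH=/opt/amazon/openmpi
```

Version pins that only drive RUN steps are good ARG candidates; paths the runtime actually needs belong in ENV.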


# The three must-be-built packages.
# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
Contributor

Suggested change:
- ENV EFA_INSTALLER_VERSION=1.30.0
+ ARG EFA_INSTALLER_VERSION=1.33.0

# Efa-installer>=1.29.1 required for nccl>=2.19.0 to avoid libfabric NCCL error.
ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_TESTS_VERSION=master
Contributor

Suggested change:
- ENV NCCL_TESTS_VERSION=master
+ ARG NCCL_TESTS_VERSION=2.13.9

# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
# aws-ofi-nccl.
####################################################################################################
ENV NCCL_VERSION=2.19.3-1
Contributor

Let's move that to the top.

####################################################################################################
ENV OPEN_MPI_PATH=/opt/amazon/openmpi

# OpenMPI build script claims PMIX_VERSION, and complains if we use it.
Contributor

Is this really needed?

## For G4dn and other G5, comment out all
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
Contributor

Remove

4 participants