aws-samples · KeitaW · Mar 16, 2024 · Mar 17, 2024 · Mar 17, 2024 · Mar 17, 2024
diff --git a/3.test_cases/torchtitan/README.md b/3.test_cases/torchtitan/README.md
@@ -0,0 +1,8 @@
+
+**Torchtitan** is a library for large-scale LLM training using native PyTorch. It showcases PyTorch's latest distributed training features in a clean, minimal codebase.
+
+Characteristics of Torchtitan include:
+
+* User-friendly design, making it easy to understand, use, and extend for various training purposes.
+* Minimal modifications required to the model code for applying 1D, 2D, or upcoming 3D parallelism.
+* A modular approach over a monolithic codebase, facilitating quick start-ups.
diff --git a/3.test_cases/torchtitan/pretrain.sbatch b/3.test_cases/torchtitan/pretrain.sbatch
@@ -0,0 +1,85 @@
+#!/bin/bash
+
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+
+#SBATCH --job-name=pretrain
+#SBATCH --nodes=2
+#SBATCH --ntasks=2
+#SBATCH --gpus-per-node=8 # Number of GPU per node
+#SBATCH --output=logs/%x_%j.out # logfile for stdout
+#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs
+#SBATCH --wait-all-nodes=1
+#SBATCH --exclusive
+set -euxo pipefail
+
+##################################################################
+############# Load environment variables #########################
+##################################################################
+# Load environment variables
+if [ ! -f .env ]
+then
+    echo "Please create a .env file with the required environment variables"
+    exit 1
+else
+    source .env
+fi
+
+##################################################################
+######### Define EFA/NCCL/Slurm environment variables ############
+##################################################################
+## EFA settings
+export FI_LOG_LEVEL=1
+export FI_PROVIDER=efa # change to eth if you want to use ENA for comparisons
+export FI_EFA_USE_HUGE_PAGE=0
+# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
+# https://github.com/pytorch/pytorch/issues/68893
+export NCCL_SOCKET_IFNAME=en
+export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+export NCCL_DEBUG=INFO
+export HOSTNAMES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
+export COUNT_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)
+# Bash cannot export arrays, so NODES_ARRAY stays local to this script.
+NODES_ARRAY=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
+export HEAD_NODE=${NODES_ARRAY[0]}
+# The batch script runs on the first allocated node, so its address serves as the master address.
+export MASTER_ADDR=$(hostname --ip-address)
+export MASTER_PORT=$RANDOM
+export NNODES=$SLURM_JOB_NUM_NODES
+export NPROC=$SLURM_GPUS_PER_NODE
+export WORLD_SIZE=$(( $NNODES * $NPROC ))
+
+##################################################################
+############### Create train config ##############################
+##################################################################
+
+if [ ! -d ${FSX_PATH}/tmp ]; then
+    mkdir -p ${FSX_PATH}/tmp
+fi
+cat ${PWD}/train_configs/pretrain_llama3_70b.toml | envsubst > ${FSX_PATH}/tmp/pretrain_llama3_70b.toml
+
+##################################################################
+################# Set arguments ##################################
+##################################################################
+
+: "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}"
+declare -a SRUN_ARGS=(
+    --container-image $ENROOT_IMAGE
+    --container-mounts $CONTAINER_MOUNT
+)
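+declare -a TORCHRUN_ARGS=(
+    # change this to match the number of gpus per node:
+    --nproc_per_node=8
+    --nnodes=$SLURM_JOB_NUM_NODES
+    # --master_addr and --master_port are used only by the static rendezvous
+    # backend; with c10d, torchrun takes the rendezvous host from --rdzv_endpoint.
+    # The batch script runs on the first allocated node, so $(hostname) is the head node.
+    --rdzv_backend=c10d
+    --rdzv_endpoint=$(hostname)
+)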
+declare -a TRAIN_ARGS=(
+    --job.config_file ${FSX_PATH}/tmp/pretrain_llama3_70b.toml
+)
+
+srun -l "${SRUN_ARGS[@]}" \
+    torchrun "${TORCHRUN_ARGS[@]}" ${PWD}/../torchtitan/train.py "${TRAIN_ARGS[@]}"
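A note on the `.env` file that `pretrain.sbatch` sources: the script reads `FSX_PATH` and `ENROOT_IMAGE` from it, and `envsubst` only substitutes variables that are exported to the environment. The sketch below shows one possible shape for that file together with an early check. The values are placeholders and `require_env` is a hypothetical helper, neither is part of this PR.

```shell
#!/bin/bash
# Sketch only: the values below are placeholders, not paths taken from the PR.
workdir=$(mktemp -d)
cat > "${workdir}/.env" <<'EOF'
export FSX_PATH=/fsx
export ENROOT_IMAGE=/fsx/torchtitan.sqsh
EOF

# Hypothetical helper: report every required variable that is unset or empty.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "Missing required variable: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Load the file the same way the job script does, then validate it.
. "${workdir}/.env"
require_env FSX_PATH ENROOT_IMAGE && echo "environment looks complete"
```

Running a check like this right after `source .env` would surface a missing variable before the config is rendered, rather than partway through the job script.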
diff --git a/3.test_cases/torchtitan/slurm/torchtitan.dockerfile b/3.test_cases/torchtitan/slurm/torchtitan.dockerfile
@@ -0,0 +1,234 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+
+####################################################################################################
+# This is a sample Dockerfile, with optional stanzas. Please read through this Dockerfile,
+# understand what it does, then create your own Dockerfile.
+#
+# Sample build instructions:
+#
+#     docker build --progress=plain -t nvidia-pt-od:latest -f torchtitan.dockerfile .
+#     rm /fsx/nvidia-pt-od__latest.sqsh ; enroot import -o /fsx/nvidia-pt-od__latest.sqsh dockerd://nvidia-pt-od:latest
+#
+# Compute nodes (aka build nodes) are transient, so we need to keep the docker image on shared fs,
+# which head node can load into its local registry.
+#
+#     # Build node: save image to file
+#     docker save nvidia-pt-od:latest > /fsx/nvidia-pt-od__latest.tar
+#
+#     # Load image to local docker registry -> on head node, or new compute/build node.
+#     docker load < /fsx/nvidia-pt-od__latest.tar
+####################################################################################################
+FROM nvcr.io/nvidia/pytorch:24.04-py3
+ENV DEBIAN_FRONTEND=noninteractive
+
+# The three must-be-built packages.
+# Efa-installer>=1.29.0 required for nccl>=2.19.0 to avoid libfabric NCCL error.
+ARG EFA_INSTALLER_VERSION=1.31.0
+ARG AWS_OFI_NCCL_VERSION=v1.8.1-aws
+ARG NCCL_TESTS_VERSION=2.13.9
+ARG NCCL_VERSION=2.20.3-1
+
+RUN apt-get update -y
+RUN apt-get remove -y --allow-change-held-packages \
+    libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1
+
+# We noticed that since 23.09, we can't just delete the whole /opt/hpcx/, otherwise `import torch`
+# complains about missing libuc?.so.
+RUN rm -rf /opt/hpcx/ompi \
+    && rm -rf /usr/local/mpi \
+    && rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \
+    && ldconfig
+ENV OPAL_PREFIX=
+RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
+    git \
+    gcc \
+    vim \
+    kmod \
+    openssh-client \
+    openssh-server \
+    build-essential \
+    curl \
+    autoconf \
+    libtool \
+    gdb \
+    automake \
+    cmake \
+    apt-utils \
+    libhwloc-dev \
+    aptitude && \
+    DEBIAN_FRONTEND=noninteractive apt autoremove -y
+
+# EFA
+RUN apt-get update && \
+    cd /tmp && \
+    curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz  && \
+    tar -xf aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
+    cd aws-efa-installer && \
+    # ONLY add `--skip-kmod`, `--no-verify` and `--skip-limit-conf` flags to container image.
+    # Those three flags must NOT be used on the host.
+    #
+    # Explanations:
+    # - to build EFA in the Dockerfile, we added --skip-kmod and --no-verify. Without these flags,
+    #   the Dockerfile will fail to build. If installing EFA on the host and not in a container,
+    #   please remove these flags.
+    # - The --skip-limit-conf can be retained in Dockerfile, but it's redundant as the host already
+    #   has these limits set by efa_installer.
+    ./efa_installer.sh -y -g -d --skip-kmod --no-verify --skip-limit-conf && \
+    ldconfig && \
+    rm -rf /tmp/aws-efa-installer /var/lib/apt/lists/*
+ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
+ENV PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
+
+
+####################################################################################################
+# [CUSTOM_NCCL_OPTION_1] Uncomment below stanza to install another NCCL version using the official
+# binaries.
+#
+# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
+# aws-ofi-nccl.
+####################################################################################################
+# RUN cd /opt && \
+#     wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb && \
+#     dpkg -i cuda-keyring_1.0-1_all.deb && \
+#     apt update && \
+#     apt install -y libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} && \
+#     echo NCCL_SOCKET_IFNAME=^docker0,lo >> /etc/nccl.conf
+
+
+####################################################################################################
+# [CUSTOM_NCCL_OPTION_2] Install NCCL from source to the same location as the built-in ones. The
+# benefits of installing to the same location as the built-in version are:
+#
+# 1. There's only ever a single libnccl version offered by this image, preventing applications from
+#    mistakenly choosing the wrong version.
+# 2. No longer needing extra settings for LD_LIBRARY_PATH or LD_PRELOAD.
+#
+# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
+# aws-ofi-nccl.
+####################################################################################################
+RUN cd /tmp \
+    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION} \
+    && cd nccl \
+    && make -j src.build BUILDDIR=/usr \
+    # Build for p4 & p5.
+    NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80" \
+    && rm -rf /tmp/nccl \
+    && echo NCCL_SOCKET_IFNAME=^docker0,lo >> /etc/nccl.conf
+
+
+####################################################################################################
+# Rebuild OpenMPI with custom PMIX version. E.g., to match what host's Slurm is built with (see
+# /opt/pmix/ on host, or run pmix_info on host).
+#
+# May be needed on rare occasions when `srun --mpi=pmix --container-image=... <mpi_application>`
+# mysteriously crashes.
+#
+# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the
+# aws-ofi-nccl.
+####################################################################################################
+ENV OPEN_MPI_PATH=/opt/amazon/openmpi
+
+# OpenMPI build script claims PMIX_VERSION, and complains if we use it.
+ENV CUSTOM_PMIX_VERSION=4.2.6
+RUN apt-get update && apt-get install -y libevent-dev \
+    && cd /tmp \
+    && wget https://github.com/openpmix/openpmix/releases/download/v${CUSTOM_PMIX_VERSION}/pmix-${CUSTOM_PMIX_VERSION}.tar.gz \
+    && tar -xzf pmix-${CUSTOM_PMIX_VERSION}.tar.gz \
+    && rm pmix-${CUSTOM_PMIX_VERSION}.tar.gz \
+    && cd pmix-${CUSTOM_PMIX_VERSION}/ \
+    && ./autogen.pl \
+    && ./configure --prefix=/opt/pmix \
+    && make -j \
+    && make install \
+    && echo /opt/pmix/lib > /etc/ld.so.conf.d/pmix.conf \
+    && ldconfig \
+    && cd / \
+    && rm -fr /tmp/pmix-${CUSTOM_PMIX_VERSION}/
+# To silence this runtime error message:
+# [p4de-st-p4de-2:110912] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
+ENV PMIX_GDS_MODULE=^ds12 \
+    PMIX_MCA_gds=^ds12
+
+# Rebuild openmpi with DLC style (which it remarks as "without libfabric"), with the above pmix.
+ENV OMPI_VERSION=4.1.6
+RUN rm -fr ${OPEN_MPI_PATH} \
+    && mkdir /tmp/openmpi \
+    && cd /tmp/openmpi \
+    && wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz \
+    && tar zxf openmpi-${OMPI_VERSION}.tar.gz \
+    && rm openmpi-${OMPI_VERSION}.tar.gz \
+    && cd openmpi-${OMPI_VERSION} \
+    && ./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH --with-cuda=${CUDA_HOME} --with-slurm --with-pmix=/opt/pmix \
+    && make -j $(nproc) all \
+    && make install \
+    && ldconfig \
+    && cd / \
+    && rm -rf /tmp/openmpi \
+    && ompi_info --parsable --all | grep mpi_built_with_cuda_support:value \
+    # Verify pmix from /opt/pmix/
+    && ldd /opt/amazon/openmpi/lib/openmpi/mca_pmix_ext3x.so | grep '/opt/pmix/lib/libpmix.so.* ' > /opt/amazon/openmpi-pmix.txt
+####################################################################################################
+
+
+## NCCL EFA Plugin
+#RUN mkdir -p /tmp && \
+#    cd /tmp && \
+#    curl -LO https://github.com/aws/aws-ofi-nccl/archive/refs/tags/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
+#    tar -xzf /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
+#    rm /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \
+#    mv aws-ofi-nccl-${AWS_OFI_NCCL_VERSION} aws-ofi-nccl && \
+#    cd /tmp/aws-ofi-nccl && \
+#    ./autogen.sh && \
+#    ./configure --prefix=/opt/amazon/efa \
+#        --with-libfabric=/opt/amazon/efa \
+#        --with-cuda=/usr/local/cuda \
+#        --enable-platform-aws \
+#        --with-mpi=/opt/amazon/openmpi && \
+#    make -j$(nproc) install && \
+#    rm -rf /tmp/aws-ofi-nccl
+
+###################################################
+## Install AWS-OFI-NCCL plugin
+RUN apt-get install libtool autoconf cmake nasm unzip pigz parallel nfs-common build-essential hwloc libhwloc-dev libjemalloc2 libnuma-dev numactl libjemalloc-dev preload htop iftop liblapack-dev libgfortran5 ipcalc wget curl devscripts debhelper check libsubunit-dev fakeroot pkg-config dkms -y
+RUN export OPAL_PREFIX="" \
+    && git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
+    && cd /opt/aws-ofi-nccl \
+    && git checkout ${AWS_OFI_NCCL_VERSION} \
+    && ./autogen.sh \
+    && ./configure --prefix=/opt/aws-ofi-nccl/install \
+    --with-mpi=/opt/amazon/openmpi \
+    --with-libfabric=/opt/amazon/efa \
+    --with-cuda=/usr/local/cuda \
+    --enable-platform-aws \
+    && make -j $(nproc) && make install
+
+
+# Do this to minimize the ld path env vars that users need to define when running this image.
+RUN echo "/usr/local/lib"      >> /etc/ld.so.conf.d/local.conf && \
+    echo "/opt/amazon/openmpi/lib" >> /etc/ld.so.conf.d/efa.conf && \
+    ldconfig
+
+ENV OMPI_MCA_pml=^cm,ucx            \
+    OMPI_MCA_btl=tcp,self           \
+    OMPI_MCA_btl_tcp_if_exclude=lo,docker0 \
+    OPAL_PREFIX=/opt/amazon/openmpi \
+    # https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
+    # https://github.com/pytorch/pytorch/issues/68893
+    NCCL_SOCKET_IFNAME=^docker,lo
+
+ENV LD_LIBRARY_PATH="/usr/local/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
+
+# NCCL-tests: always good to include this as a diagnostic tool.
+RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests \
+    && cd /opt/nccl-tests \
+    && git checkout v${NCCL_TESTS_VERSION} \
+    && make MPI=1 \
+    MPI_HOME=/opt/amazon/openmpi \
+    CUDA_HOME=/usr/local/cuda \
+    NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80"
+
+
+RUN pip install accelerate appdirs loralib bitsandbytes datasets fire peft "transformers>=4.40.0" sentencepiece wandb vllm gradio openai
+RUN pip install hydra-core huggingface_hub safetensors tiktoken "blobfile>=2" tqdm "torchao==0.1" "lm_eval==0.4.*"
+RUN pip uninstall -y transformer-engine
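A note on version specifiers in `RUN pip install` lines: a requirement such as `transformers>=4.40.0` needs quoting, because the shell that executes `RUN` otherwise parses `>` as an output redirection. The sketch below illustrates the effect, with `echo` standing in for `pip install` (an assumption made so it runs offline).

```shell
# Work in a scratch directory so the stray file is easy to spot.
demo_dir=$(mktemp -d)
cd "${demo_dir}" || exit 1

# `echo` stands in for `pip install` so the sketch runs without network access.
# Unquoted: the shell parses `>=4.40.0` as "redirect stdout to a file named =4.40.0".
unquoted=$(echo install transformers>=4.40.0 sentencepiece)

# Quoted: the whole requirement reaches the command as a single argument.
quoted=$(echo install "transformers>=4.40.0" sentencepiece)

echo "unquoted output: '${unquoted}'"
echo "quoted output:   '${quoted}'"
ls
```

In the unquoted case the command never sees the version constraint, its output lands in a file named `=4.40.0`, and the captured output is empty. With the quotes, the constraint is passed through intact.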
diff --git a/3.test_cases/torchtune/.gitignore b/3.test_cases/torchtune/.gitignore
@@ -0,0 +1,2 @@
+torchtune
+.env
diff --git a/3.test_cases/torchtune/README.md b/3.test_cases/torchtune/README.md
@@ -0,0 +1,28 @@
+# End-to-End LLM Model Development with Torchtune <!-- omit in toc -->
+
+This guide demonstrates the comprehensive process of developing a Large Language Model (LLM) from start to finish using [Torchtune](https://github.com/pytorch/torchtune). The journey of creating an LLM encompasses five pivotal steps:
+
+![LLMOps](docs/LLMOps.png)
+
+1. **(Continuous) Pretraining the Language Model**: The language model is first pretrained on a vast corpus of text data. This step can be skipped when starting from an already pretrained model. Pretraining is essential for the model to learn the general patterns and structures of language. Refer to the `torchtitan` test case for large-scale pretraining with the latest techniques such as 3D parallelism and `torch.compile`.
+
+2. **Instruction Tuning**: The pretrained model is then fine-tuned to cater to specific tasks by updating its parameters with a new dataset. This process involves partially retraining the model with samples that exemplify the desired behavior, thus refining the model weights for the particular application.
+
+3. **Alignment**: The instruction-tuned model is further trained so that its outputs better match human preferences and intended behavior, for example with preference-based methods such as RLHF or DPO.
+
+4. **Evaluation**: Evaluating the LLM's performance is a critical step. It involves using various metrics to assess the model's accuracy and effectiveness. This step is vital for validating new techniques and objectively comparing different model releases.
+
+5. **Deployment**: Upon achieving the desired performance, the model is deployed as an API. This deployment enables the model's integration into applications, making it accessible to users and other systems.
+
+Following these steps allows for the iterative development and refinement of a Large Language Model to meet specific needs and ensure its successful deployment. This guide covers all of the steps above; data preparation is out of scope. Pretraining is handled by Torchtitan, while Torchtune handles the fine-tuning and evaluation phases.
+
+**Torchtune** is a PyTorch-native library for easily authoring, fine-tuning, and experimenting with LLMs. At the time of writing, it is in its alpha release.
+
+Features of Torchtune encompass:
+
+* Native-PyTorch implementations of renowned LLMs using composable and modular building blocks.
+* Straightforward and adaptable training recipes for popular fine-tuning techniques such as LoRA and QLoRA, emphasizing a PyTorch-centric approach without the need for trainers or frameworks.
+* YAML configurations for simplifying the setup of training, evaluation, quantization, or inference recipes.
+* Comprehensive support for numerous popular dataset formats and prompt templates, ensuring a smooth start to training endeavors.
+
+This test case provides examples for two schedulers, Slurm and Kubernetes, with detailed instructions in the `slurm` and `kubernetes` subdirectories.
diff --git a/3.test_cases/torchtune/docs/LLMOps.png b/3.test_cases/torchtune/docs/LLMOps.png
diff --git a/3.test_cases/torchtune/kubernetes/.gitkeep b/3.test_cases/torchtune/kubernetes/.gitkeep