Pull requests: aws-samples/awsome-distributed-training

Pull requests list

Update NCCL tests for both slurm and k8s
#506 opened Nov 23, 2024 by KeitaW
name SSM doc
#502 opened Nov 20, 2024 by sean-smith
Change aws ofi plugin version 1.13.0
#501 opened Nov 19, 2024 by mhuguesaws
Update pcluster architecture guidance (label: enhancement)
#464 opened Oct 23, 2024 by KeitaW Draft
add GPU accounting for SMHP
#462 opened Oct 21, 2024 by KeitaW
fix nccl test eks
#455 opened Oct 9, 2024 by roywei
add nginx
#451 opened Oct 7, 2024 by KeitaW Draft
Update bionemo test case + propose to subdirectories per orchastrator (label: documentation)
#396 opened Aug 5, 2024 by KeitaW Draft
Esm2 on Sagemaker Hyperpod
#387 opened Jul 25, 2024 by awsankur
update dependencies of PyTorch base image
#375 opened Jul 15, 2024 by KeitaW
Neuron distributed
#359 opened Jun 13, 2024 by KeitaW
End-to-End LLM Model Development with Torchtitan and Torchtune (label: enhancement)
#341 opened May 20, 2024 by KeitaW
Llama training with FP8
#331 opened May 15, 2024 by pbelevich Draft
Add draft gpu troubles
#290 opened Apr 30, 2024 by mhuguesaws Draft
[WIP] torchtune usecase
#260 opened Apr 12, 2024 by pbelevich Draft
Bump pytorch dockerfile template
#211 opened Mar 12, 2024 by verdimrc
SMHP: slurm exporter to report gpu metrics
#181 opened Mar 6, 2024 by verdimrc
Update organization and tag to V1
#150 opened Feb 22, 2024 by perifaws