[Bug] misaligned address during all_reduce in SyncBuffersHook when using bf16 with deepspeed
#1557
Open
Labels
bug
Prerequisite
Environment
python -c "from mmengine.utils.dl_utils import collect_env; print(collect_env())"
Reproduces the problem - code sample
The bug is very strange, and I have not found a minimal reproducible example yet. Some observations:
- bfloat16.
- The misaligned address error only occurs after the epoch, due to SyncBuffersHook.
- SyncBuffersHook.
I was fine-tuning LLaVA. The buffers include RoPE embeddings.
Reproduces the problem - command or script
See above
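Until a minimal reproduction exists, one way to narrow this down might be to inspect the buffers that SyncBuffersHook would synchronize. The sketch below is not part of the original report: `report_buffer_alignment` and `misaligned_buffers` are hypothetical helpers, and the 16-byte boundary is an assumption rather than a confirmed cause of the error. It relies only on `named_buffers()`, `data_ptr()`, `dtype`, `numel()` and `element_size()`, which PyTorch modules and tensors provide.

```python
from typing import Any, List, NamedTuple


class BufferReport(NamedTuple):
    """Summary of one module buffer (hypothetical helper type)."""
    name: str
    dtype: str
    nbytes: int
    address: int
    aligned: bool


def report_buffer_alignment(model: Any, alignment: int = 16) -> List[BufferReport]:
    """Describe every buffer of `model` and whether its address is aligned.

    `alignment` is in bytes. The default of 16 is an assumption, not a
    value taken from the bug report.
    """
    reports = []
    for name, buf in model.named_buffers():
        address = buf.data_ptr()  # start address of the tensor's data
        reports.append(
            BufferReport(
                name=name,
                dtype=str(buf.dtype),
                nbytes=buf.numel() * buf.element_size(),
                address=address,
                aligned=address % alignment == 0,
            )
        )
    return reports


def misaligned_buffers(model: Any, alignment: int = 16) -> List[BufferReport]:
    """Return only the buffers whose start address is not aligned."""
    return [r for r in report_buffer_alignment(model, alignment) if not r.aligned]
```

Calling `misaligned_buffers(model)` on the model at the end of an epoch, once with bfloat16 and once with another dtype, would show whether any buffer (for example the RoPE embeddings) starts at an unexpected offset. An empty result would suggest looking elsewhere.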
Reproduces the problem - error message
Additional information
No response