FSDP Training with Sentence Transformer #3023

ShengYun-Peng opened this issue Oct 27, 2024 · 9 comments
@ShengYun-Peng

Given that there are so many LLM-based models at the top of the MTEB benchmark nowadays, is there a canonical way to train with FSDP now? I'm trying to explore in this direction, but I want to ask whether some examples already exist before I reinvent the wheel.

ShengYun-Peng changed the title from "FSDP Training" to "FSDP Training with Sentence Transformer" on Oct 27, 2024
@ShengYun-Peng (Author)

I took a stab at training with FSDP and encountered quite a few issues: huggingface/accelerate#3201

@tomaarsen (Collaborator)

Hello!

There are some details here from when I originally got it running: https://sbert.net/docs/sentence_transformer/training/distributed.html#fsdp
But I stopped trying to build a neat and convenient integration once I realised that DDP outperformed FSDP for most small models. I'm definitely open to improving on it, though.
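
Roughly, a minimal sketch of that setup looks like the following (launched with accelerate launch; the model, dataset, and FSDP options here are illustrative, and the transformer layer class to wrap depends on your backbone):

```python
# Minimal FSDP training sketch with Sentence Transformers (illustrative).
# Launch with: accelerate launch train_fsdp.py
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("bert-base-uncased")
train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=32,
    # FSDP options are inherited from transformers.TrainingArguments;
    # the layer class to wrap depends on the backbone (here: BERT).
    fsdp=["full_shard", "auto_wrap"],
    fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"},
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```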

  • Tom Aarsen

@ShengYun-Peng (Author)

Thanks! I am working in this direction now and would like to hear your input. When you subclassed the transformers Trainer class to create the SentenceTransformerTrainer, was there a guideline you followed for writing the customized trainer? I notice that you override compute_loss, prepare_inputs, and other class methods. Do you follow some template or guidelines, or do you go through the Trainer source code line by line to make changes?

@tomaarsen (Collaborator)

I don't really check it line by line, but I'm somewhat familiar with the overall structure of the transformers Trainer. It's set up in quite a modular way, which makes it fairly feasible to override some "high level" methods like compute_loss and get_train_dataloader while leaving lower-level methods like training_step, _inner_training_loop, etc. intact.

That's why the Sentence Transformers trainer file is only ~900 lines long, compared to ~5k for the base Trainer.
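
For illustration, the subclassing pattern looks roughly like this (not the actual SentenceTransformerTrainer code; collect_features is a hypothetical helper):

```python
# Illustrative sketch of the subclassing pattern: override a few "high level"
# methods and leave training_step / _inner_training_loop untouched.
# Not the actual SentenceTransformerTrainer; collect_features is hypothetical.
from torch.utils.data import DataLoader
from transformers import Trainer


class MyEmbeddingTrainer(Trainer):
    def __init__(self, *args, loss_fn=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_fn = loss_fn  # e.g. an nn.Module loss that holds the model

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Delegate the loss computation to the loss module instead of
        # expecting the model's forward() to return a loss.
        features, labels = self.collect_features(inputs)
        loss = self.loss_fn(features, labels)
        return (loss, None) if return_outputs else loss

    def get_train_dataloader(self) -> DataLoader:
        # Custom batching / sampling logic would go here; the base class's
        # inner training loop remains untouched.
        return super().get_train_dataloader()

    def collect_features(self, inputs):
        # Hypothetical helper: split the raw batch into the per-column
        # features and labels that the loss module expects.
        labels = inputs.pop("label", None)
        return inputs, labels
```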

@ShengYun-Peng (Author)

Hi @tomaarsen, if the model is wrapped, can we directly update the model in the loss function with loss_fn.model = self.model here before calling loss_fn.forward? Basically, I'm curious about the purpose of the override_model_in_loss method in the SentenceTransformerTrainer.

@ShengYun-Peng (Author)

Based on my experiments, the evaluator does not work out of the box with FSDP; it keeps throwing RuntimeError: 'weight' must be 2-D. I also recall that the docs say the evaluator doesn't work with FSDP. I'm curious why that is the case.

@ShengYun-Peng (Author)

I have successfully fine-tuned Llama 3 for text embeddings with FSDP and Sentence Transformers, with some modifications.

@ShengYun-Peng (Author)

The core issue is making the model inside the loss function aware of the FSDP wrapping. I may create a PR if necessary.

@tomaarsen (Collaborator)

> Hi @tomaarsen, if the model is wrapped, can we directly update the model in the loss function with loss_fn.model = self.model here before calling loss_fn.forward? Basically, I'm curious about the purpose of the override_model_in_loss method in the SentenceTransformerTrainer.

Apologies for the delay! The override_model_in_loss method is necessary because the losses in Sentence Transformers are a bit unusual: they are torch.nn.Module subclasses that receive the model as an attribute. So, when trainer.model is wrapped/compiled, loss.model isn't. As a result, when we call loss(features, labels), the loss still calls the original, unwrapped model.

This is why we have to override the original loss.model if the Trainer wrapped the model, so that the actual inference happens with the wrapped model.
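
In sketch form, the pattern looks like this (simplified for illustration; not the actual trainer implementation):

```python
# Simplified illustration of why the loss's model reference must be
# re-pointed once the Trainer wraps/compiles the model.
import torch
from torch import nn


class ExampleLoss(nn.Module):
    """Stand-in for a Sentence Transformers loss: an nn.Module that is
    handed the model as an attribute and calls it in forward()."""

    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, features, labels):
        # Inference runs through self.model. If self.model is still the
        # original model, any FSDP/torch.compile wrapper on trainer.model
        # is silently bypassed.
        embeddings = [self.model(f) for f in features]
        return torch.stack(embeddings).mean()  # placeholder loss


def override_model_in_loss(loss: nn.Module, wrapped_model: nn.Module) -> nn.Module:
    # Simplified version of the re-pointing step: make the loss call the
    # wrapped model so the actual inference happens inside the wrapper.
    if hasattr(loss, "model"):
        loss.model = wrapped_model
    return loss
```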

As for the evaluator: FSDP shards the model across the various devices. That happens inside the transformers Trainer code, which is fairly advanced. The evaluator, on the other hand, is much simpler and lives exclusively in the Sentence Transformers project. It's only calculated on the first process to avoid running the same evaluations multiple times:

```python
with nullcontext() if self.is_local_process_zero() else disable_logging(logging.INFO):
    evaluator_metrics = self.evaluator(self.model)
```

Ideally, we'd have e.g. DDP support here so that the evaluator computations could be split across devices, but that hasn't been implemented. FSDP would be even nicer, but also a lot more complex. Either way, the evaluator breaks with FSDP because we only run it on the first process.

Nice work on getting it to work! I'm open to PRs.

  • Tom Aarsen
