RuntimeError: _share_filename_: only available on CPU #3014

Open
msciancalepore98 opened this issue Oct 23, 2024 · 3 comments


msciancalepore98 commented Oct 23, 2024

Hi,

I am trying to run a fairly simple training test with the following arguments:

args = SentenceTransformerTrainingArguments(
        # Required parameter:
        output_dir=output_dir.as_posix(),
        # Optional training parameters:
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=1e-3,
        warmup_ratio=0.1,
        dataloader_num_workers=2,
        use_mps_device=False,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=2,
        logging_steps=100,
        run_name="mpnet-base-all-nli-triplet",  # Will be used in W&B if `wandb` is installed
    )

but I get the following error:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
  File ".../lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File ".../lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File ".../projects/Modules/Training/pretrain-text-encoder.py", line 270, in <module>
    train_model(model, ds_train, ds_val, output_dir, num_epochs=3, batch_size=16)
  File ".../projects/Modules/Training/pretrain-text-encoder.py", line 217, in train_model
    trainer.train()
  File ".../lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File ".../lib/python3.10/site-packages/transformers/trainer.py", line 2236, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File ".../lib/python3.10/site-packages/accelerate/data_loader.py", line 547, in __iter__
    dataloader_iter = self.base_dataloader.__iter__()
  File ".../lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 484, in __iter__
    return self._get_iterator()
  File ".../lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 415, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File ".../lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1138, in __init__
    w.start()
  File ".../lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File ".../lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File ".../lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File ".../lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File ".../lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File ".../lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File ".../lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File ".../lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 607, in reduce_storage
    metadata = storage._share_filename_cpu_()
  File ".../lib/python3.10/site-packages/torch/storage.py", line 437, in wrapper
    return fn(self, *args, **kwargs)
  File ".../lib/python3.10/site-packages/torch/storage.py", line 516, in _share_filename_cpu_
    return super()._share_filename_cpu_(*args, **kwargs)
RuntimeError: _share_filename_: only available on CPU

If I switch to dataloader_num_workers=0, everything works. Setting use_mps_device to True or False makes no difference (I am running these tests locally).

If I try the old .fit API instead, I get:

Traceback (most recent call last):
  File ".../autotagging/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File ".../autotagging/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File ".../Modules/Training/pretrain-text-encoder.py", line 227, in <module>
    train_model(model, ds_train, ds_val, output_dir, num_epochs=3, batch_size=16)
  File ".../Modules/Training/pretrain-text-encoder.py", line 175, in train_model
    model.fit(
  File ".../autotagging/lib/python3.10/site-packages/sentence_transformers/fit_mixin.py", line 260, in fit
    for batch in data_loader:
  File ".../autotagging/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 479, in __iter__
    self._iterator = self._get_iterator()
  File ".../autotagging/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 415, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File ".../autotagging/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1138, in __init__
    w.start()
  File ".../autotagging/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File ".../autotagging/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File ".../autotagging/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File ".../autotagging/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File ".../autotagging/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File ".../autotagging/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File ".../autotagging/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'FitMixin.fit.<locals>.identity'

I am using torch == 2.5.0 and sentence_transformers == 3.2.1.

msciancalepore98 (Author) commented:
OK, after some more digging I found that setting the DataLoader's multiprocessing context fixes it:

multiprocessing_context="fork" if torch.backends.mps.is_available() else None

Unfortunately, this forces me to use the old .fit API, since that is the only interface where I can construct the DataLoader myself and set this field!
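For context, the workaround above can be sketched as a small helper when building the DataLoader by hand (the make_loader name is illustrative, not from the thread):

```python
import torch
from torch.utils.data import DataLoader

def make_loader(dataset, batch_size, num_workers=2):
    # On macOS the default start method is "spawn"; spawned workers
    # re-pickle the dataset, and MPS-backed storages cannot be shared
    # that way ("_share_filename_: only available on CPU").
    # "fork" inherits the parent's memory instead, avoiding the re-pickle.
    ctx = "fork" if torch.backends.mps.is_available() and num_workers > 0 else None
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        multiprocessing_context=ctx,
    )
```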

tomaarsen (Collaborator) commented Oct 28, 2024:

Hmm, that looks like a tricky bug. Could you perhaps keep using the new Trainer with a subclass like this:

class CustomTrainer(SentenceTransformerTrainer):
    def get_train_dataloader(self):
        dataloader = super().get_train_dataloader()
        dataloader.multiprocessing_context = "fork" if torch.backends.mps.is_available() else None
        return dataloader

# do the same for the eval/test dataloaders
Tom Aarsen
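Filling in the "eval/test dataloaders" note above, the same patch could be applied to all three factory methods with a mixin; this is a sketch under assumptions (the ForkDataloaderMixin name is invented here, and it relies on the standard transformers Trainer method names):

```python
import torch

class ForkDataloaderMixin:
    # Force the "fork" start method on any dataloader the trainer builds,
    # but only when MPS is present and worker processes are actually used;
    # everywhere else this is a no-op.
    def _force_fork(self, dataloader):
        if torch.backends.mps.is_available() and dataloader.num_workers > 0:
            dataloader.multiprocessing_context = "fork"
        return dataloader

    def get_train_dataloader(self):
        return self._force_fork(super().get_train_dataloader())

    def get_eval_dataloader(self, eval_dataset=None):
        return self._force_fork(super().get_eval_dataloader(eval_dataset))

    def get_test_dataloader(self, test_dataset):
        return self._force_fork(super().get_test_dataloader(test_dataset))

# Usage (assuming sentence_transformers is installed):
# class CustomTrainer(ForkDataloaderMixin, SentenceTransformerTrainer):
#     pass
```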

msciancalepore98 (Author) commented:
Yeah, in the end I fixed it like that. Would it be worth fixing this upstream, or at least updating the docs?
Nowadays lots of folks at companies use MacBooks for local dry runs before switching to CUDA :)
