Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data collator class type integrity is not intact throughout the runtime #34830

Open
2 of 4 tasks
kmehant opened this issue Nov 20, 2024 · 1 comment
Open
2 of 4 tasks
Labels

Comments

@kmehant
Copy link

kmehant commented Nov 20, 2024

System Info

transformers: v4.46.3
python: 3.11

Who can help?

trainer: @muellerzr @SunMarc

Anyone else in the community is as well open to comment or suggest. Thank you.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The original data type / class of the data collator meta information is completely lost after the below step which happens at couple of places as listed below.

data_collator = self._get_collator_with_removed_columns(data_collator, description="training")

data_collator = self._get_collator_with_removed_columns(data_collator, description="evaluation")

data_collator = self._get_collator_with_removed_columns(data_collator, description="test")

The meta information loss happens since the collator is completely replaced with RemoveColumnsCollator wrapper collator class when the original dataset is NOT of type datasets.Dataset, mostly meant to support datasets.IterableDataset and others.

This raises issues for complex usecases where we wish to inject custom behaviour as part of the Trainer object when the collator is of certain class type and etc. Since the current code completely removes that piece of information and changes it to RemoveColumnsCollator, its really hard to know what was the original datacollator class. There are workarounds like writing case specific code handling RemoveColumnsCollator with special care, however, given the growing transformers code, things could change in the future and break such case specific code. On the other hand, it would be great to actually handle this situation better by preserving the original collator class information.

I propose the following to options

  • Dynamically modify the RemoveColumnsCollator class to subclass from the original data collator class that was passed.

This is bit of a fancy way of doing by creating a custom class/type using Python's type() API.

OR

  • Monkey patch the data collator object's caller functions (like __call__ etc) to include the remove columns logic on top of it. This would mean to remove RemoveColumnsCollator completely and do a simple monkey patch.

Example of monkey patch being already adopted in existing HF code making it a good option for this fix following the existing code style

https://github.com/huggingface/accelerate/blob/d7b1b368e9f484a18636a71600566b757d5cf87e/src/accelerate/utils/operations.py#L819

I am happy to discuss and raise a PR to fix this behaviour.

Expected behavior

The class type information of the original data collator has to be intact and preserved throughout the runtime.

@kmehant kmehant added the bug label Nov 20, 2024
@kmehant kmehant changed the title Data collator class type integrity is not intact until the start/end of the training Data collator class type integrity is not intact throughout the runtime Nov 20, 2024
@SunMarc
Copy link
Member

SunMarc commented Nov 21, 2024

Thanks for the report. Just a quick question, you should still have access to the original data_collator in the data_collator field of RemoveColumnsCollator. Other than that, I don't mind doing both options you suggested, maybe the monkeypatch will better. cc @muellerzr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants