Anyone else in the community is also welcome to comment or make suggestions. Thank you.
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
The class information of the original data collator is completely lost when the Trainer wraps it, which happens in a couple of places in trainer.py.
The loss happens because the collator is replaced outright with the RemoveColumnsCollator wrapper class when the dataset is NOT of type datasets.Dataset. The wrapper is mostly meant to support datasets.IterableDataset and other dataset types.
This causes problems for complex use cases where we want to inject custom behaviour into the Trainer object depending on the collator's class. Since the current code discards that information and substitutes RemoveColumnsCollator, it is hard to know what the original data collator class was. One workaround is to write case-specific code that handles RemoveColumnsCollator specially; however, as the transformers codebase grows, things could change and break such code. It would be better to handle this properly by preserving the original collator class information.
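To make the problem concrete, here is a minimal sketch. `MyCollator` and `RemoveColumnsWrapper` are hypothetical names; the wrapper is a simplified stand-in for RemoveColumnsCollator, not the real transformers class, so the snippet runs without transformers installed.

```python
class MyCollator:
    """Hypothetical user-defined collator (an assumption for this sketch)."""

    def __call__(self, features):
        return {key: [f[key] for f in features] for key in features[0]}


class RemoveColumnsWrapper:
    """Simplified stand-in for RemoveColumnsCollator, not the real class."""

    def __init__(self, data_collator, signature_columns):
        self.data_collator = data_collator
        self.signature_columns = set(signature_columns)

    def __call__(self, features):
        # Drop columns the model does not accept, then delegate.
        features = [
            {k: v for k, v in f.items() if k in self.signature_columns}
            for f in features
        ]
        return self.data_collator(features)


original = MyCollator()
wrapped = RemoveColumnsWrapper(original, signature_columns=["input_ids"])

# Any type-based dispatch stops working once the collator is wrapped.
type_check_before = isinstance(original, MyCollator)
type_check_after = isinstance(wrapped, MyCollator)
batch = wrapped([{"input_ids": [1, 2], "extra": "x"}])
```

The wrapped object still collates correctly, but code that branches on `isinstance(collator, MyCollator)` no longer recognises it.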
I propose the following two options:
Dynamically modify the RemoveColumnsCollator class to subclass the original data collator class that was passed.
This is a slightly fancy approach that creates a custom class/type using Python's type() API.
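A rough sketch of the first option, under stated assumptions: `MyCollator` and `with_removed_columns` are hypothetical names, not transformers APIs, and the column filtering is a simplified version of what the wrapper does.

```python
import copy


class MyCollator:
    """Hypothetical user-defined collator (an assumption for this sketch)."""

    def __init__(self, pad_id=0):
        self.pad_id = pad_id

    def __call__(self, features):
        return {key: [f[key] for f in features] for key in features[0]}


def with_removed_columns(data_collator, signature_columns):
    """Hypothetical helper: return a copy whose class subclasses the original."""
    original_cls = type(data_collator)
    keep = set(signature_columns)

    def __call__(self, features):
        # Drop unused columns first, then defer to the original collator logic.
        features = [{k: v for k, v in f.items() if k in keep} for f in features]
        return original_cls.__call__(self, features)

    # Build the subclass at runtime with type(name, bases, namespace).
    dynamic_cls = type(
        f"RemoveColumns{original_cls.__name__}",
        (original_cls,),
        {"__call__": __call__},
    )

    # Shallow-copy so the caller's object keeps its class, then swap the class.
    wrapped = copy.copy(data_collator)
    wrapped.__class__ = dynamic_cls
    return wrapped


original = MyCollator(pad_id=7)
wrapped = with_removed_columns(original, ["input_ids"])
batch = wrapped([{"input_ids": [1, 2], "extra": "x"}])
```

With this, `isinstance(wrapped, MyCollator)` holds and instance attributes such as `pad_id` carry over. The sketch assumes a class-based collator; a collator that is a plain function cannot be handled this way, because the function type cannot be subclassed.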
OR
Monkey patch the data collator object's caller functions (like __call__ etc.) to include the remove-columns logic on top. This would mean removing RemoveColumnsCollator completely and doing a simple monkey patch.
Monkey patching is already used in existing HF code, which makes it a good option for this fix in keeping with the existing code style.
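A minimal sketch of the second option. One caveat it has to work around: Python looks up `__call__` on the type when evaluating `collator(features)`, so assigning `__call__` on an instance has no effect on that call. The sketch therefore assumes a hypothetical collator whose `__call__` delegates to a regular method (named `collate` here for illustration) and patches that method on the instance. `MyCollator` and `patch_remove_columns` are illustrative names, not transformers APIs.

```python
import functools


class MyCollator:
    """Hypothetical collator whose __call__ delegates to a regular method."""

    def __call__(self, features):
        return self.collate(features)

    def collate(self, features):
        return {key: [f[key] for f in features] for key in features[0]}


def patch_remove_columns(data_collator, signature_columns, method_name="collate"):
    """Hypothetical helper: patch column removal onto one collator instance."""
    original_method = getattr(data_collator, method_name)  # bound method
    keep = set(signature_columns)

    @functools.wraps(original_method)
    def patched(features):
        # Drop unused columns, then run the original collation logic.
        features = [{k: v for k, v in f.items() if k in keep} for f in features]
        return original_method(features)

    # Instance-level patch: other instances of the class are unaffected.
    setattr(data_collator, method_name, patched)
    return data_collator


collator = MyCollator()
patched_collator = patch_remove_columns(collator, ["input_ids"])
batch = patched_collator([{"input_ids": [1, 2], "extra": "x"}])
untouched_batch = MyCollator()([{"input_ids": [3], "extra": "y"}])
```

The object keeps its original class, so type checks continue to work. For collators that implement everything directly in `__call__`, an instance-level patch like this would not take effect, and the dynamic subclass approach may be the better fit.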
kmehant changed the title from "Data collator class type integrity is not intact until the start/end of the training" to "Data collator class type integrity is not intact throughout the runtime" on Nov 20, 2024
Thanks for the report. Just a quick question: you should still have access to the original data_collator in the data_collator field of RemoveColumnsCollator. Other than that, I don't mind either of the options you suggested; maybe the monkey patch would be better. cc @muellerzr
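The access path mentioned in this comment could look roughly like the following. The sketch assumes, as the comment states, that the wrapper keeps the original collator in a `data_collator` attribute; `RemoveColumnsWrapper`, `MyCollator`, and `original_collator` are hypothetical stand-ins so the snippet runs without transformers.

```python
class MyCollator:
    """Hypothetical user-defined collator (an assumption for this sketch)."""

    def __call__(self, features):
        return {key: [f[key] for f in features] for key in features[0]}


class RemoveColumnsWrapper:
    """Simplified stand-in for RemoveColumnsCollator, not the real class."""

    def __init__(self, data_collator):
        self.data_collator = data_collator

    def __call__(self, features):
        return self.data_collator(features)


def original_collator(collator):
    """Return the wrapped collator if there is one, else the collator itself."""
    return getattr(collator, "data_collator", collator)


plain = MyCollator()
wrapped = RemoveColumnsWrapper(plain)

unwrapped = original_collator(wrapped)
unchanged = original_collator(plain)
```

This is the kind of case-specific handling the issue describes as fragile: it relies on the attribute name staying stable across releases.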
System Info
transformers: v4.46.3
python: 3.11
Who can help?
trainer: @muellerzr @SunMarc
Locations where the data collator is replaced (commit f297af5):
transformers/src/transformers/trainer.py, line 975
transformers/src/transformers/trainer.py, line 1071
transformers/src/transformers/trainer.py, line 1113
Example of an existing monkey patch in HF code (accelerate): https://github.com/huggingface/accelerate/blob/d7b1b368e9f484a18636a71600566b757d5cf87e/src/accelerate/utils/operations.py#L819
I am happy to discuss and raise a PR to fix this behaviour.
Expected behavior
The class type information of the original data collator has to be intact and preserved throughout the runtime.