I am having a difficult time getting my data pipeline to the throughput levels I would like before starting training with the t5x library.
Initially I planned to use a mixture of ~40 tasks (1-2 TB of text) for training and started doing some benchmarking, following general TPU and dataset performance tips. Here are some useful guides that I tried to follow:
- seqio (dataset_providers.py) and t5x (train.py) source code
All of my datasets/tasks are JSON Lines files (output from earlier Dataflow jobs), with each dataset split across 200 to 1000 files.
I used Colab notebooks or an E2 instance with 32 vCPUs for my benchmarking experiments, mounting the bucket that holds all ~40 datasets I plan to use. I sampled 16 different files as training files for each task source, since it is recommended not to read too many files from GCS.
FileDataSource
I switched from FunctionDataSource to FileDataSource, mainly so that sharding can read individual files without having to read all the data, which I assume would be slower, especially for larger datasets. Here we can see the reading and deserialization performance of a single task source.
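For reference, here is a stripped-down sketch of how one of my JSON-lines tasks is registered with a FileDataSource-style source (seqio.TextLineDataSource). The bucket path, task name, and the "text" field are placeholders for this issue, and the py_function-based parser is only to keep the sketch short (my real parser avoids py_function); the vocab is the standard public T5 one.

```python
import json

import seqio
import tensorflow as tf

# Standard T5 vocab as a stand-in; my real tasks use their own model.
VOCAB = seqio.SentencePieceVocabulary(
    "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model")

OUTPUT_FEATURES = {
    "targets": seqio.Feature(vocabulary=VOCAB, add_eos=True),
}


@seqio.map_over_dataset
def parse_json_line(line):
  """Pulls the 'text' field out of one JSON line (py_function for brevity)."""
  def _parse(s):
    return json.loads(s.numpy().decode("utf-8"))["text"]

  text = tf.py_function(_parse, inp=[line], Tout=tf.string)
  text.set_shape([])
  return {"targets": text}


seqio.TaskRegistry.add(
    "my_jsonl_task",  # placeholder name
    source=seqio.TextLineDataSource(
        split_to_filepattern={
            "train": "gs://my-bucket/my_dataset/train-*.jsonl",  # placeholder
        }),
    preprocessors=[
        parse_json_line,
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=OUTPUT_FEATURES,
    metric_fns=[])
```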
Single Task

Then I register my seqio tasks with the full pipeline (including preprocessors) and test the performance of a single task.

Mixture

When I benchmark the performance of the mixture, it drops significantly (~10x).
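This is roughly the loop I use for timing a single registered task vs. the mixture. Names, sequence length, and example count are placeholders; "my_mixture" stands in for my real ~40-task mixture registered elsewhere with seqio.MixtureRegistry.add.

```python
import time

import seqio
import tensorflow as tf


def benchmark(name, num_examples=2000):
  """Times how fast examples come out of a registered task or mixture."""
  ds = seqio.get_mixture_or_task(name).get_dataset(
      sequence_length={"targets": 512},
      split="train",
      shuffle=True,
      seed=42)
  ds = ds.prefetch(tf.data.AUTOTUNE)
  start = time.perf_counter()
  for _ in ds.take(num_examples):
    pass
  elapsed = time.perf_counter() - start
  print(f"{name}: {num_examples / elapsed:.1f} examples/sec")


benchmark("my_jsonl_task")  # single task: fine
benchmark("my_mixture")     # mixture: ~10x slower for me
```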
Follow Up Thoughts

Please let me know if you have any feedback regarding the following comments and questions:

In my experiments, reading from GCS vs. local files didn't differ much, so streaming directly from GCS is probably the better option (no need to download TBs of data), as long as the bucket is in the same region as the TPU and the number of files is not too large. The docs recommend files of 10s to 100s of MB and 10s to 100s of files; in my case I have datasets with 200-1000 files in the 100 MB-1 GB range. Should I reduce the number of files, for example by making each file ~1 GB, and would this help pipeline performance?
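For context, this is roughly how I measured raw line-reading throughput from GCS (bucket path, cycle_length, and line count are placeholders); the parallelism here is capped by how many files exist, which is why I'm unsure about consolidating into fewer, larger files.

```python
import time

import tensorflow as tf

# Read JSON lines from GCS with several files open in parallel.
files = tf.data.Dataset.list_files(
    "gs://my-bucket/my_dataset/train-*.jsonl", shuffle=True, seed=0)

ds = files.interleave(
    tf.data.TextLineDataset,
    cycle_length=16,              # number of files read in parallel
    block_length=16,
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)

start = time.perf_counter()
count = sum(1 for _ in ds.take(200_000))
elapsed = time.perf_counter() - start
print(f"{count / elapsed:.0f} lines/sec")
```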
I also experimented with TFExampleDataSource vs. FileDataSource and didn't see any performance gain from TFExample compared to JSON. Is there an absolute best way to store data for seqio pipeline performance? For example, would registering a TFDS dataset be better, as explained here? In my experience, Dataflow jobs output as many files as there are workers, so the count can be much higher than 100s. Is this OK, or should we keep the number of files in the 128-256 range?
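The TFExample variant I compared against looked roughly like this (file pattern and feature name are placeholders; the TFRecord files came from a separate Dataflow job):

```python
import functools

import seqio
import tensorflow as tf

# TFRecord/TFExample variant of the same task source.
tfexample_source = seqio.TFExampleDataSource(
    split_to_filepattern={
        "train": "gs://my-bucket/my_dataset_tfrecord/train-*.tfrecord",  # placeholder
    },
    feature_description={
        "text": tf.io.FixedLenFeature([], tf.string),
    })

# In the task registration this replaces the JSON parser with a simple rekey:
rekey = functools.partial(
    seqio.preprocessors.rekey, key_map={"targets": "text"})
```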
This is more of a T5X question, but it still might be related. My understanding is that when we get a dataset from a mixture, each task is iterated, and if shard info is specified, that shard is returned as the data; later the same sample_fn is used to sample from these task datasets at the given rates. I don't fully understand how data parallelism plays together with model parallelism in t5x; it may depend on the model size and the number of TPU cores we have. Is it correct to assume that each TPU core is a worker and data gets distributed to them when sharding? If so, would it make sense to have the number of files be a multiple of the number of cores (e.g. 8x for a v3-8, 32x for a v3-32)? I also read that the batch is automatically distributed across TPU cores during computation, which I guess is why 8 x 128 is emphasized. Does that mean we don't necessarily need to care about the number of files / sharding and can still use a single source file?
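For what it's worth, this is my current (possibly wrong) mental model of how the shards get requested. seqio.ShardInfo is real seqio API, but the one-shard-per-host assumption and the use of jax.process_index()/jax.process_count() are my guesses about what t5x does, not something I have verified in train.py.

```python
import jax
import seqio

# Guess: each host/process pulls one shard of the mixture, and the per-host
# batch is then split across that host's local TPU cores.
shard_info = seqio.ShardInfo(
    index=jax.process_index(),       # which host this is
    num_shards=jax.process_count())  # total number of hosts

ds = seqio.get_mixture_or_task("my_mixture").get_dataset(
    sequence_length={"targets": 512},
    split="train",
    shuffle=True,
    shard_info=shard_info)
```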
Notes from codelab:
The rule of thumb is to split your data across several (10s to 100s) largish files (10s to 100s of MB). If you have too many files, thousands of files for example, the time to access each file might start getting in the way. If you have too few files, like one or two, then you are not getting the benefits of streaming from multiple files in parallel.