Training checkpoints (of large models) are unreadable due to metadata #57
I dug a little into this; I think it is related to torch using its own ZIP implementation, PyTorchFileWriter. A ZIP file contains a metadata section, the "central directory", which is a collection of headers that describe the files included in the ZIP and the offsets at which they are stored in the byte stream. These headers (one for each file in the zip) are located at the end of the file, with the offset to the start stored at the very end, so that a ZIP client can show you the contents of an archive without unzipping the whole thing. An example header (the first one) looks like this, taken from an original PTL training checkpoint larger than 2GB (2.6GB):
In the corrupted checkpoint (the one we added our metadata to), this first header is exactly the same. So far so good. The central directory is just a long list of these headers, one for each file, in the order in which they were added to the zip. Note that the header tells you the size of the file (Compressed Length and Uncompressed Length). Going through the headers, they are the same for the original and the corrupted checkpoint at the beginning. But at some point the cumulative size of all files becomes greater than 2GB. This is where the original and the corrupted headers start to differ. Original:
Corrupted:
Note the addition of an extra ZIP64 field at the end, marked with the arrow. So the Python ZipFile implementation is rewriting the headers of the PyTorch data in the central directory. I believe this is what causes the corruption. For whatever reason, the PyTorch ZIP implementation does not add or expect these fields in the central directory. The PyTorch ZIP implementation does seem to follow the ZIP64 spec, because it has ZIP64 records at the end of the central directory; it just does not also add that extra field to each individual header.
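If you want to check this on your own checkpoints, here is a rough sketch of how to compare the central directories with Python's zipfile module (the file names are placeholders; zipfile exposes each entry's header offset and the raw extra-field bytes read from the central directory):

```python
import zipfile

ZIP64_EXTRA_ID = 0x0001  # extra-field ID of the ZIP64 extended information record


def central_directory_entries(path):
    """Return (filename, header_offset, has_zip64_extra) for every entry."""
    entries = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            # The extra field is a sequence of (id, length, data) records.
            extra, has_zip64 = info.extra, False
            while len(extra) >= 4:
                field_id = int.from_bytes(extra[0:2], "little")
                field_len = int.from_bytes(extra[2:4], "little")
                if field_id == ZIP64_EXTRA_ID:
                    has_zip64 = True
                extra = extra[4 + field_len:]
            entries.append((info.filename, info.header_offset, has_zip64))
    return entries


# Compare an untouched checkpoint with one that had metadata appended
# (both paths are placeholders).
original = central_directory_entries("original.ckpt")
patched = central_directory_entries("with-metadata.ckpt")
for a, b in zip(original, patched):
    if a != b:
        print(f"{a} -> {b}")
```

On checkpoints like the ones described above, the entries should start to differ once the header offsets pass 2GB, with the ZIP64 flag appearing only in the rewritten archive.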
Are there any updates on this? I am also experiencing the same issue.
As a temporary solution, you can 'patch' your training checkpoint to remove the metadata by doing:
Re-commenting because of a bug in the old steps. Hi, I had this issue trying to run inference on a 9km model (the checkpoint is 3.3GB). With @gmertes's help, the following steps resolved the issue. This is a script you can run to fix your checkpoints; you just pass the checkpoint you want fixed as the only argument (a short usage note follows the script).

```bash
set -xe

checkpoint=$1
file_name="ai-models.json"

# Locate the metadata json inside the checkpoint zip.
file_zip_path=$(unzip -l "$checkpoint" | grep "$file_name" | awk '{print $NF}')
parent_dirs=$(dirname "$file_zip_path")

unzip -j "$checkpoint" "$file_zip_path"     # copy the json out of the zip
mkdir -p "$parent_dirs"
mv "$file_name" "$parent_dirs"              # recreate its path on disk
zip -d "$checkpoint" "$file_zip_path"       # delete the json inside the zip
zip "$checkpoint" "$file_zip_path"          # add it back at the same path
unzip -l "$checkpoint" | grep "$file_name"  # check it worked by printing the copied path
rm -rf "$file_zip_path"                     # clean up the extracted copy
```
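For example, if you save the script as `fix_checkpoint.sh` (the name is just illustrative), run it as `bash fix_checkpoint.sh /path/to/checkpoint.ckpt`. Since `zip -d` rewrites the archive in place, it may be safest to try it on a copy of the checkpoint first.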
Can we deactivate saving of metadata for these checkpoints by default? Otherwise, training somewhat larger models becomes quite difficult.
Patching this in #166 for transfer learning.
What happened?
Training checkpoints for large models (num_channels equal to or greater than 912) become unreadable by PyTorch and hence can't be used to resume or fork runs.
Note - for now this issue can be overcome by:
These are temporary solutions: if we reach a point where our inference checkpoints are also that large, we would need a fix to be able to run inference.
What are the steps to reproduce the bug?
run_id: previous_run_id
or
fork_run_id: previous_run_id
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
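For a standalone illustration of the failure mode, here is a sketch (not taken from the original report; it assumes the metadata is appended to the saved checkpoint with Python's zipfile in append mode, as the analysis in the comments suggests, and it writes a ~2.5GB file to disk):

```python
import json
import zipfile

import torch

ckpt_path = "big-checkpoint.ckpt"  # placeholder path

# Save a checkpoint larger than 2GB so that entry offsets cross the ZIP64 threshold.
torch.save({"weights": torch.zeros(2_600_000_000, dtype=torch.uint8)}, ckpt_path)
torch.load(ckpt_path, map_location="cpu")  # loads fine before the metadata is added

# Appending a small json with zipfile rewrites the central directory and adds
# ZIP64 extra fields to the existing entries' headers.
with zipfile.ZipFile(ckpt_path, "a") as zf:
    zf.writestr("metadata/ai-models.json", json.dumps({"example": True}))

# Now loading reportedly fails with:
#   RuntimeError: PytorchStreamReader failed reading zip archive:
#   invalid header or archive is corrupted
torch.load(ckpt_path, map_location="cpu")
```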
Version
0.1.0
Platform (OS and architecture)
ATOS
Relevant log output
Accompanying data
No response
Organisation
No response