
Training checkpoints (of large models) are unreadable due to metadata #57

Open
anaprietonem opened this issue Sep 16, 2024 · 6 comments
Labels
bug Something isn't working


@anaprietonem
Contributor

anaprietonem commented Sep 16, 2024

What happened?

Training checkpoints for large models (num_channels equal to or greater than 912) become unreadable by PyTorch and hence can't be used to resume or fork runs.

Note: for now this issue can be worked around by:

  1. patching an existing checkpoint: zip -d model.ckpt archive/anemoi-metadata/ai-models.json
  2. commenting out the line in the checkpoint callback that writes the metadata into the checkpoint: https://github.com/ecmwf/anemoi-training/blob/91d8c6e9a8ac275ded43f781aac031ccca36499d/src/anemoi/training/diagnostics/callbacks/__init__.py#L888C1-L889C1

These are temporary solutions: if our inference checkpoints also reach that size, we would need a proper fix in order to run inference as well.
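For context, the offending step amounts to appending a JSON member to the already-saved checkpoint (which torch stores as a ZIP archive) with Python's zipfile. A minimal sketch of what that looks like, with a hypothetical append_metadata helper (the actual anemoi-training code lives in the callback linked above and may differ):

import json
import zipfile

def append_metadata(ckpt_path: str, metadata: dict) -> None:
    # Hypothetical sketch: append a JSON member to an existing torch checkpoint
    # (itself a ZIP archive). On archives above ~2GB, appending with ZipFile
    # rewrites the central directory with per-entry ZIP64 extra fields, which
    # torch's PyTorchFileReader then refuses to read.
    with zipfile.ZipFile(ckpt_path, "a") as zf:
        base = zf.namelist()[0].split("/")[0]  # usually "archive"
        zf.writestr(f"{base}/anemoi-metadata/ai-models.json", json.dumps(metadata))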

What are the steps to reproduce the bug?

  1. Update the config to use a model with num_channels equal to or greater than 912.
  2. Run anemoi-training train --config-name filename
  3. Once finished, try to resume or fork the above run with run_id: previous_run_id or fork_run_id: previous_run_id
  4. Code crashes when loading the checkpoint with the following error: RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

Version

0.1.0

Platform (OS and architecture)

ATOS

Relevant log output

Error executing job with overrides: []
Traceback (most recent call last):
  File "/etc/ecmwf/nfs/dh2_home_a/ecm1715/anemoi-training/src/anemoi/training/train/train.py", line 355, in main
    AnemoiTrainer(config).train()
  File "/etc/ecmwf/nfs/dh2_home_a/ecm1715/anemoi-training/src/anemoi/training/train/train.py", line 341, in train
    trainer.fit(
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 955, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 397, in _restore_modules_and_callbacks
    self.resume_start(checkpoint_path)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 79, in resume_start
    loaded_checkpoint = self.trainer.strategy.load_checkpoint(checkpoint_path)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 368, in load_checkpoint
    return self.checkpoint_io.load_checkpoint(checkpoint_path)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/lightning_fabric/plugins/io/torch_io.py", line 83, in load_checkpoint
    return pl_load(path, map_location=map_location)
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/lightning_fabric/utilities/cloud_io.py", line 56, in _load
    return torch.load(f, map_location=map_location)  # type: ignore[arg-type]
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/torch/serialization.py", line 1004, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/perm/ecm1715/conda/envs/aifs-dev/lib/python3.10/site-packages/torch/serialization.py", line 456, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

Accompanying data

No response

Organisation

No response

@anaprietonem added the bug label Sep 16, 2024
@anaprietonem changed the title from "Writing metadata using zipfile on checkpoints files with a size equal or greater to 2GB" to "Training checkpoints (of large models) are unreadable due to metadata" Sep 16, 2024
@gmertes
Member

gmertes commented Oct 1, 2024

I dug a little into this; I think it is related to torch using its own ZIP implementation, PyTorchFileWriter.

A ZIP file contains a metadata "Central Directory", which is a collection of headers that describe the files included in the ZIP and the offsets of where they are stored in the byte stream. These headers (one for each file in the zip) are located at the end of the file, with the offset to the start stored at the very end, so that a ZIP client can show you the contents of an archive without unzipping the whole thing.
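(For reference, the same information can be pulled out with Python's zipfile, which is happy with both files; a quick inspection sketch, not part of any fix:)

import zipfile

# Print one line per central-directory header: member name, stored size,
# local header offset, and the length of any "extra" field bytes.
with zipfile.ZipFile("model.ckpt") as zf:
    for info in zf.infolist():
        print(f"{info.filename:40s} size={info.file_size:12d} "
              f"offset={info.header_offset:#010x} extra={len(info.extra)}B")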

An example header (the first one) looks like this, taken from an original PTL training checkpoint larger than 2GB (2.6GB):

A5A780B8 CENTRAL HEADER #1     02014B50
A5A780BC Created Zip Spec      00 '0.0'
A5A780BD Created OS            00 'MS-DOS'
A5A780BE Extract Zip Spec      00 '0.0'
A5A780BF Extract OS            00 'MS-DOS'
A5A780C0 General Purpose Flag  0808
         [Bit  3]              1 'Streamed'
         [Bit 11]              1 'Language Encoding'
A5A780C2 Compression Method    0000 'Stored'
A5A780C4 Last Mod Time         00000000 'Fri Nov 30 00:00:00 1979'
A5A780C8 CRC                   56EFB34A
A5A780CC Compressed Length     0003CFBF
A5A780D0 Uncompressed Length   0003CFBF
A5A780D4 Filename Length       0010
A5A780D6 Extra Length          0000
A5A780D8 Comment Length        0000
A5A780DA Disk Start            0000
A5A780DC Int File Attributes   0000
         [Bit 0]               0 'Binary Data'
A5A780DE Ext File Attributes   00000000
A5A780E2 Local Header Offset   00000000
A5A780E6 Filename              'archive/data.pkl'

In the corrupted checkpoint (the one where we added our metadata to), this first header is exactly the same. So far so good.

So the central directory is just a long list of these headers, one for each file, in the order in which they were added to the zip. Note that the header tells you the size of the file (Compressed Length and Uncompressed Length).

Going through the headers, the original and the corrupted file are identical at the beginning. But at some point the cumulative size of all files exceeds 2GB, and this is where the original and the corrupted headers start to differ.

Original:

A5A82E4C CENTRAL HEADER #2CF   02014B50
A5A82E50 Created Zip Spec      00 '0.0'
A5A82E51 Created OS            00 'MS-DOS'
A5A82E52 Extract Zip Spec      00 '0.0'
A5A82E53 Extract OS            00 'MS-DOS'
A5A82E54 General Purpose Flag  0808
         [Bit  3]              1 'Streamed'
         [Bit 11]              1 'Language Encoding'
A5A82E56 Compression Method    0000 'Stored'
A5A82E58 Last Mod Time         00000000 'Fri Nov 30 00:00:00 1979'
A5A82E5C CRC                   9EE748F7
A5A82E60 Compressed Length     00001000
A5A82E64 Uncompressed Length   00001000
A5A82E68 Filename Length       0010
A5A82E6A Extra Length          0000
A5A82E6C Comment Length        0000
A5A82E6E Disk Start            0000
A5A82E70 Int File Attributes   0000
         [Bit 0]               0 'Binary Data'
A5A82E72 Ext File Attributes   00000000
A5A82E76 Local Header Offset   812ABB94
A5A82E7A Filename              'archive/data/743'

Corrupted:

A5A90992 CENTRAL HEADER #2CF   02014B50
A5A90996 Created Zip Spec      2D '4.5'
A5A90997 Created OS            00 'MS-DOS'
A5A90998 Extract Zip Spec      2D '4.5'
A5A90999 Extract OS            00 'MS-DOS'
A5A9099A General Purpose Flag  0808
         [Bit  3]              1 'Streamed'
         [Bit 11]              1 'Language Encoding'
A5A9099C Compression Method    0000 'Stored'
A5A9099E Last Mod Time         00000000 'Fri Nov 30 00:00:00 1979'
A5A909A2 CRC                   F99F2265
A5A909A6 Compressed Length     00000004
A5A909AA Uncompressed Length   00000004
A5A909AE Filename Length       0010
A5A909B0 Extra Length          000C
A5A909B2 Comment Length        0000
A5A909B4 Disk Start            0000
A5A909B6 Int File Attributes   0000
         [Bit 0]               0 'Binary Data'
A5A909B8 Ext File Attributes   00000000
A5A909BC Local Header Offset   FFFFFFFF
A5A909C0 Filename              'archive/data/743'
A5A909D0 Extra ID #0001        0001 'ZIP64'.   <-----
A5A909D2   Length              0008.  <-----
A5A909D4   Offset to Central   0000000080AABF50 <-----
         Dir <-----

Note the extra ZIP64 fields added at the end, marked with the arrows. So the Python ZipFile implementation is rewriting the headers of the PyTorch data in the central directory. I believe this is what causes the corruption. For whatever reason, the PyTorch ZIP implementation does not add or expect these fields in the central directory.

The PyTorch ZIP implementation does seem to follow the ZIP64 spec, because it has ZIP64 records at the end of the central directory; it just does not also add that extra field to each individual header:

A5A87114 ZIP64 END CENTRAL DIR 06064B50
         RECORD
A5A87118 Size of record        000000000000002C
A5A87120 Created Zip Spec      1E '3.0'
A5A87121 Created OS            03 'Unix'
A5A87122 Extract Zip Spec      2D '4.5'
A5A87123 Extract OS            00 'MS-DOS'
A5A87124 Number of this disk   00000000
A5A87128 Central Dir Disk no   00000000
A5A8712C Entries in this disk  00000000000003E2
A5A87134 Total Entries         00000000000003E2
A5A8713C Size of Central Dir   000000000000F05C
A5A87144 Offset to Central dir 00000000A5A780B8

A5A8714C ZIP64 END CENTRAL DIR 07064B50
         LOCATOR
A5A87150 Central Dir Disk no   00000000
A5A87154 Offset to Central dir 00000000A5A87114
A5A8715C Total no of Disks     00000001
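If it is useful to check other checkpoints for the same pattern, a rough standard-library sketch like the one below should flag central-directory entries that carry the per-entry ZIP64 extra field (id 0x0001) shown above; the function name is illustrative, not an existing anemoi utility:

import struct
import zipfile

def entries_with_zip64_extra(ckpt_path: str) -> list[str]:
    # Walk the central directory and report members whose "extra" bytes
    # contain a ZIP64 extra field (tag 0x0001), i.e. the field torch's
    # reader does not expect.
    flagged = []
    with zipfile.ZipFile(ckpt_path) as zf:
        for info in zf.infolist():
            extra = info.extra
            while len(extra) >= 4:
                tag, size = struct.unpack("<HH", extra[:4])
                if tag == 0x0001:
                    flagged.append(info.filename)
                    break
                extra = extra[4 + size:]
    return flagged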

@icedoom888

Are there any updates on this? I am also experiencing the same issue.

@anaprietonem
Contributor Author

anaprietonem commented Oct 2, 2024

Are there any updates on this? I am also experiencing the same issue.

As a temporary solution you can 'patch' your training checkpoint to remove the metadata by running:
zip -d model.ckpt archive/anemoi-metadata/ai-models.json
After running that command you should be able to fork/resume your run, since the checkpoint (model.ckpt) becomes readable by PyTorch again.
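To double-check the patched file, a quick sanity check is simply to open it with torch again (the same call the trainer ends up making; this assumes the usual Lightning checkpoint dict):

import torch

# Should no longer raise the PytorchStreamReader error after patching.
ckpt = torch.load("model.ckpt", map_location="cpu")
print(list(ckpt.keys()))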

@cathalobrien
Contributor

cathalobrien commented Nov 1, 2024

Re-commenting because of a bug in the old steps.

Hi, I had this issue trying to run inference on a 9km model (the checkpoint is 3.3GB). With @gmertes' help, the following steps resolved the issue. Below is a script that will fix your checkpoint; you just pass the checkpoint you want fixed as the only argument.

#!/bin/bash
set -xe
checkpoint=$1
file_name="ai-models.json"
file_zip_path=$(unzip -l "$checkpoint" | grep "$file_name" | awk '{print $NF}')
parent_dirs=$(dirname "$file_zip_path")
unzip -j "$checkpoint" "$file_zip_path"       # extract the json from the zip into the current directory
mkdir -p "$parent_dirs"
mv "$file_name" "$parent_dirs"                # recreate the path the json had inside the zip
zip -d "$checkpoint" "$file_zip_path"         # delete the json inside the zip
zip "$checkpoint" "$file_zip_path"            # re-add the json under its original path
unzip -l "$checkpoint" | grep "$file_name"    # check it worked by printing the re-added path within the zip
rm -rf "$file_zip_path"                       # remove the local copy
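Save it as e.g. fix_checkpoint.sh (the name is up to you) and run it with the checkpoint path as the only argument; the zip/unzip steps modify the checkpoint in place.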

@ssmmnn11
Member

Can we deactivate saving of metadata for these checkpoints by default? Otherwise training of somewhat larger models becomes quite difficult.

@icedoom888

Patching this in #166 for transfer learning.
