
🐛[BUG]: LBFGS optimizer doesn't work for PINN training #492

Closed
hasethinvd opened this issue May 9, 2024 · 2 comments
Labels: ? - Needs Triage, bug

hasethinvd (Contributor) commented May 9, 2024

Version

24.01

On which installation method(s) does this occur?

Docker, Pip, Source

Describe the issue

After setting the optimizer to bfgs in the config file, max_steps is overridden to 0.

Minimum reproducible example

#config
defaults :
  - modulus_default
  - arch:
      - fourier
      - modified_fourier
      - fully_connected
      - multiscale_fourier
  - scheduler: tf_exponential_lr
  - optimizer: bfgs
  - loss: sum


training:
  rec_results_freq: 1000
  max_steps : 150000

Relevant log output

[23:53:04] - lbfgs optimizer selected. Setting max_steps to 0
[23:53:05] - [step:     100000] lbfgs optimization in running
Error executing job with overrides: []
Traceback (most recent call last):
  File "/mount/data/test/eikonal/eikonal.py", line 313, in run
    slv.solve()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
    self._train_loop(sigterm_handler)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 543, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 730, in _cuda_graph_training_step
    self.apply_gradients()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 185, in bfgs_apply_gradients
    self.optimizer.step(self.bfgs_closure_func)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 379, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lbfgs.py", line 298, in step
    max_iter = group['max_iter']
KeyError: 'max_iter'
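
Editor's note for context: in plain PyTorch (an assumption here, not the Modulus-Sym code path), a freshly constructed LBFGS optimizer always carries the max_iter key in its parameter group, so the KeyError suggests the group the trainer ended up with was not built by LBFGS itself:

# Minimal sketch, plain PyTorch only (not Modulus-Sym internals): a fresh
# LBFGS param group contains 'max_iter', so its absence points to the group
# having been created or restored by something else.
import torch

p = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.LBFGS([p], max_iter=1000)
print(sorted(opt.param_groups[0].keys()))
# includes 'max_iter' (the exact key set may vary by PyTorch version)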

Environment details

No response

hasethinvd added the ? - Needs Triage and bug labels on May 9, 2024
avidcoder123 commented

This issue is still active and needs fixing.

ktangsali self-assigned this on Oct 15, 2024
ktangsali (Collaborator) commented

This is expected behavior of the LBFGS optimizer in Modulus-Sym: when LBFGS is selected, Modulus-Sym sets max_steps to zero. If training is started from scratch, this issue should not show up and training should run successfully. Reference:

[18:49:00] - attempting to restore from: outputs/helmholtz
[18:49:00] - optimizer checkpoint not found
[18:49:00] - model wave_network.0.pth not found
[18:49:00] - lbfgs optimizer selected. Setting max_steps to 0
/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py:120: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
  with torch.cuda.amp.autocast(enabled=False):
[18:49:00] - [step:          0] lbfgs optimization in running
[18:49:58] - lbfgs optimization completed after 1000 steps
[18:49:58] - [step:          0] record constraint batch time:  5.987e-02s
[18:50:00] - [step:          0] record validators time:  2.309e+00s
[18:50:01] - [step:          0] saved checkpoint to outputs/helmholtz
[18:50:01] - [step:          0] loss:  1.007e+04
[18:50:01] - [step:          0] reached maximum training steps, finished training!
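
To make this concrete, here is a minimal plain-PyTorch sketch (an illustration, not the Modulus-Sym trainer code): torch.optim.LBFGS is closure-based, so a single call to optimizer.step(closure) runs up to max_iter internal iterations. That is why the whole optimization above is reported under [step: 0] and max_steps can be set to zero.

# Minimal sketch, plain PyTorch only: one outer step drives the full LBFGS
# optimization through the closure.
import torch

param = torch.nn.Parameter(torch.randn(10))
optimizer = torch.optim.LBFGS([param], max_iter=1000)  # 1000 mirrors the log above

def closure():
    optimizer.zero_grad()
    loss = (param ** 2).sum()  # stand-in for the PINN loss
    loss.backward()
    return loss

optimizer.step(closure)  # up to 1000 LBFGS iterations inside one outer step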

However, the above error occurs if you switch the optimizer in the middle of training, for example going from adam to bfgs after a few steps. While this is technically possible, Modulus-Sym does not currently allow such workflows. For such cases, it is recommended to check the main Modulus library.
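
As a rough illustration of why the mid-training switch fails (an assumption about the restore path, not the actual Modulus-Sym code): if an optimizer checkpoint written by Adam is loaded into a torch.optim.LBFGS instance, the restored parameter group no longer contains LBFGS-specific keys such as max_iter, which reproduces the KeyError from the traceback above.

# Hypothetical repro, NOT the Modulus-Sym restore code: loading an Adam
# checkpoint into LBFGS drops LBFGS-specific keys from the param group.
import torch

param = torch.nn.Parameter(torch.randn(3))

adam = torch.optim.Adam([param], lr=1e-3)
adam_state = adam.state_dict()      # param_groups carry Adam keys only

lbfgs = torch.optim.LBFGS([param])
lbfgs.load_state_dict(adam_state)   # param group now lacks 'max_iter'

def closure():
    lbfgs.zero_grad()
    loss = (param ** 2).sum()
    loss.backward()
    return loss

lbfgs.step(closure)                 # raises KeyError: 'max_iter'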
