`Trainer.fit` stopped: `max_steps=400000` reached. #40

zhibeiyou135 · 2024-02-08T06:07:39Z

Epoch 2: : 17671it [1:05:34, 4.49it/s, loss=2.15, v_num=3hqp]wandb: Network error (TransientError), entering retry loop.
Epoch 2: : 25832it [1:35:47, 4.49it/s, loss=2.23, v_num=3hqp]wandb: Network error (TransientError), entering retry loop.
Epoch 2: : 115816it [7:02:37, 4.57it/s, loss=2.1, v_num=3hqp]Epoch 2, global step 400000: 'val/AP' was not in top 1
self._num_logged_artifact() = 1
num_ckpt_logged_before = 1
num_new_cktps = 1
Trainer.fit stopped: max_steps=400000 reached.
Epoch 2: : 115816it [7:03:13, 4.56it/s, loss=2.1, v_num=3hqp]
wandb: Waiting for W&B process to finish... (success).

The provided code reached max_steps after only two epochs. Is there a problem somewhere? If I want to train for more epochs, what should I do?


magehrig · 2024-02-10T18:35:41Z

Hi @zhibeiyou135

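The config sets the maximum number of training steps to 400k, so training stops there regardless of the epoch counter. The epoch counter in the terminal is misleading: the dataloader builds `batch_size` parallel streams that each iterate over the whole dataset, so you have actually seen the data about `batch_size` times the number of epochs shown. This comes from how data loading is set up here: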

RVT/data/utils/stream_concat_datapipe.py

Lines 70 to 72 in af1786c

    
    streams = Zipper(*(Concater(*(self.augmentation_dp(x.to_iter_datapipe())
                                  for x in self.random_torch_shuffle_list(datapipe_list)))
                       for _ in range(batch_size)))
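
To make the effect concrete, here is a minimal sketch that mimics the structure of the snippet above with plain `itertools` instead of the actual datapipes. It is not the RVT code: `make_batched_stream` and the toy `sequences` are hypothetical names, `zip` stands in for `Zipper`, and `chain` stands in for `Concater`. It only illustrates that one displayed epoch of the zipped stream covers roughly `batch_size` passes over the data.

```python
import random
from itertools import chain


def make_batched_stream(sequences, batch_size, seed=0):
    """Hypothetical stand-in for the Zipper/Concater construction.

    Each of the `batch_size` streams concatenates ALL sequences in its
    own shuffled order; zipping the streams yields one batch per step.
    """
    rng = random.Random(seed)
    streams = []
    for _ in range(batch_size):
        order = list(sequences)
        rng.shuffle(order)  # independent shuffle per stream
        streams.append(chain.from_iterable(order))
    return zip(*streams)


# Toy dataset: 4 sequences of 5 samples each, 20 samples in total.
sequences = [[f"seq{i}_t{t}" for t in range(5)] for i in range(4)]
batch_size = 3

batches = list(make_batched_stream(sequences, batch_size))

dataset_size = sum(len(s) for s in sequences)
steps_per_displayed_epoch = len(batches)
samples_seen = sum(len(b) for b in batches)
passes_over_data = samples_seen / dataset_size

print(steps_per_displayed_epoch, samples_seen, passes_over_data)
```

In this toy setup one displayed epoch takes as many steps as there are samples, but every sample is consumed once per stream, which is why the counter understates how much data has been seen.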

If you want to train for more iterations, just increase `max_steps` in the config to the value you want.
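
As a rough guide for picking a value, the log above can be used to estimate how many steps one displayed epoch takes. This is only a back-of-the-envelope sketch under stated assumptions: that the global step advances once per iteration, that epochs 0 and 1 were completed in full, and that all epochs have the same length. The helper `max_steps_for` is a hypothetical name, not part of the repository.

```python
# Values read from the log: training stopped at global step 400000,
# which was iteration 115816 of the epoch displayed as "Epoch 2".
total_steps = 400_000
iters_into_epoch_2 = 115_816
completed_epochs = 2  # assumption: epochs 0 and 1 ran to completion

# Assumption: one optimizer step per iteration, equal-length epochs.
steps_per_epoch = (total_steps - iters_into_epoch_2) // completed_epochs


def max_steps_for(displayed_epochs: int) -> int:
    """Estimate the max_steps needed for a number of displayed epochs."""
    return displayed_epochs * steps_per_epoch


print(steps_per_epoch)
print(max_steps_for(5))
```

If the run uses gradient accumulation or epochs of varying length, the estimate will be off, so treat it as a starting point and check it against the progress bar.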
