
🐛[BUG]: CorrDiff loss is scaled by hyper-parameter #605

Closed
chychen opened this issue Jul 17, 2024 · 3 comments
Labels: ? - Needs Triage, bug

Comments

chychen (Contributor) commented Jul 17, 2024

Version

latest

On which installation method(s) does this occur?

Source

Describe the issue

The CorrDiff loss is scaled by hyper-parameters, so we cannot run a hyper-parameter search: each run cannot be compared to the others.

Example (see the sketch after this list):

- if `batch_gpu_total` = 1, `loss_accum` = L; when `batch_gpu_total` = 2, `loss_accum` = L/2
- if `batch_size_gpu` = 1, `loss_accum` = L; when `batch_size_gpu` = 2, `loss_accum` = 2*L
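
A minimal sketch of this scaling behaviour on a single rank (an illustration with an assumed helper name, not the actual CorrDiff training loop): with identical per-sample losses L = 1, the accumulated value changes when only the batch split changes.

```python
import torch

# Illustrative sketch only (assumed helper, not repository code): reproduce the
# current normalization  loss.sum() * loss_scaling / batch_gpu_total  followed by
# loss_accum += loss / num_accumulation_rounds  for one GPU.
def accumulate_current(per_sample_losses, batch_size_gpu, loss_scaling=1.0):
    batch_gpu_total = len(per_sample_losses)
    num_accumulation_rounds = batch_gpu_total // batch_size_gpu
    loss_accum = 0.0
    for r in range(num_accumulation_rounds):
        chunk = per_sample_losses[r * batch_size_gpu:(r + 1) * batch_size_gpu]
        loss = chunk.sum() * (loss_scaling / batch_gpu_total)
        loss_accum += loss / num_accumulation_rounds
    return float(loss_accum)

# Per-sample loss L = 1 everywhere, so a comparable metric should stay at 1.
print(accumulate_current(torch.ones(1), batch_size_gpu=1))  # batch_gpu_total=1 -> 1.0
print(accumulate_current(torch.ones(2), batch_size_gpu=1))  # batch_gpu_total=2 -> 0.5 (L/2)
print(accumulate_current(torch.ones(2), batch_size_gpu=2))  # same data, batch_size_gpu=2 -> 1.0 (2x the previous run)
```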

Why not just normalize by `batch_size_global` instead, as shown below?

Current Implementation

```python
for round_idx in range(num_accumulation_rounds):
    with ddp_sync(ddp, (round_idx == num_accumulation_rounds - 1)):
        ...
        loss = loss.sum().mul(loss_scaling / batch_gpu_total)
        loss_accum += loss / num_accumulation_rounds
        loss.backward()

loss_sum = torch.tensor([loss_accum], device=device)
if dist.world_size > 1:
    torch.distributed.all_reduce(loss_sum, op=torch.distributed.ReduceOp.SUM)
average_loss = loss_sum / dist.world_size
if dist.rank == 0:
    wb.log({"training loss": average_loss}, step=cur_nimg)
```

Proposed Modification

```python
for round_idx in range(num_accumulation_rounds):
    with ddp_sync(ddp, (round_idx == num_accumulation_rounds - 1)):
        ...
        loss = loss.sum().mul(loss_scaling / batch_size_global)  ### Modified
        loss_accum += loss  ### Modified
        loss.backward()

loss_sum = torch.tensor([loss_accum], device=device)
if dist.world_size > 1:
    torch.distributed.all_reduce(loss_sum, op=torch.distributed.ReduceOp.SUM)
average_loss = loss_sum / dist.world_size
if dist.rank == 0:
    wb.log({"training loss": average_loss}, step=cur_nimg)
```

Minimum reproducible example

see README

Relevant log output

example:
- if `batch_gpu_total` = 1, `loss_accum` = L, when `batch_gpu_total` = 2, `loss_accum` = L/2
- if `batch_size_gpu` = 1, `loss_accum` = L, when `batch_size_gpu` = 2, `loss_accum` = 2*L

Environment details

No response

chychen added the ? - Needs Triage and bug labels Jul 17, 2024
mnabian (Collaborator) commented Oct 10, 2024

Hi @chychen, thanks for reporting the issue. I agree with the proposed modification. Could you please open a PR?

mnabian (Collaborator) commented Oct 17, 2024

@chychen did you have a chance to make a PR for this modification?

chychen (Contributor, Author) commented Oct 23, 2024

This is a 3-month-old issue; it seems the latest version has already solved it.

chychen closed this as completed Oct 23, 2024
chychen reopened this Oct 23, 2024
chychen closed this as completed Oct 23, 2024