Training Collapse

Hey, thanks for your excellent work. I'm currently following the open-source code and have a few questions about the training procedure:

1. I pulled the latest code from GitHub and ran the stage1 training code on ImageNet from scratch on an 8-GPU A100 machine, but the training log looks abnormal. The reconstruction loss appears to diverge and the visualization results get worse. (See the attached images.)

2. The training command uses '-num_nodes 4'. What does this hyperparameter mean?

3. The default training code saves a checkpoint every n steps rather than keeping the top-K checkpoints by 'val/recon_loss'. Should I use a top-K checkpoint callback instead?
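
For context on question 2, here is a minimal sketch of the arithmetic a node count usually implies in multi-node data-parallel training: the launcher expects `num_nodes × GPUs per node` processes in total, and the global batch size scales with that number. Both helper names and the per-GPU batch size of 16 are hypothetical and not taken from this repository. Whether the defaults were actually tuned for 4 nodes is an assumption that only the authors can confirm.

```python
# Hypothetical helpers for illustration; they are not part of the repository.

def expected_world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of processes a multi-node data-parallel run expects."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return num_nodes * gpus_per_node


def global_batch_size(per_gpu_batch: int, num_nodes: int, gpus_per_node: int) -> int:
    """Number of samples consumed per optimizer step across all GPUs."""
    return per_gpu_batch * expected_world_size(num_nodes, gpus_per_node)


if __name__ == "__main__":
    # 4 nodes with 8 GPUs each, versus the single 8-GPU machine described above.
    # The per-GPU batch size of 16 is an assumed value.
    print(expected_world_size(4, 8), global_batch_size(16, 4, 8))
    print(expected_world_size(1, 8), global_batch_size(16, 1, 8))
```

If the default learning rate was chosen for the larger global batch, running the same configuration on one machine changes the effective batch size by a factor of four, which may be worth ruling out when looking into the divergence.
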
<img width="2085" alt="training_2024-07-25 11 14 21" src="https://github.com/user-attachments/assets/e7209607-65f7-47d6-9d9d-91065aec26e8">

Training:
<img width="640" alt="train-recon" src="https://github.com/user-attachments/assets/1c3157b8-7aa3-439f-8b00-dc67fd07a9f3">

Validation:
<img width="640" alt="val-recon" src="https://github.com/user-attachments/assets/72176972-e0e0-46bd-81db-168353bfdac1">
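
To make question 3 concrete, the sketch below shows what a top-K policy on 'val/recon_loss' does: it keeps only the K checkpoints with the lowest validation loss and reports which file to delete. `TopKTracker` and the checkpoint file names are hypothetical, written in plain Python for illustration. If the training script is built on PyTorch Lightning (an assumption, not confirmed here), the equivalent behaviour comes from `ModelCheckpoint(monitor="val/recon_loss", save_top_k=K, mode="min")`.

```python
import heapq
from typing import List, Optional, Tuple


class TopKTracker:
    """Keep the k checkpoints with the lowest monitored loss (hypothetical helper)."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be positive")
        self.k = k
        # Min-heap over negated losses, so the worst kept checkpoint sits on top.
        self._heap: List[Tuple[float, str]] = []

    def update(self, loss: float, path: str) -> Optional[str]:
        """Register a checkpoint; return the path that should be deleted, if any."""
        entry = (-loss, path)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return None
        # Push the new entry, then pop whichever entry has the highest loss.
        _, evicted_path = heapq.heappushpop(self._heap, entry)
        return evicted_path

    def best(self) -> List[Tuple[float, str]]:
        """Kept checkpoints as (loss, path), sorted from best to worst."""
        return sorted((-neg_loss, path) for neg_loss, path in self._heap)


if __name__ == "__main__":
    tracker = TopKTracker(k=2)
    # Made-up losses that improve and then diverge.
    for step, loss in [(1000, 0.5), (2000, 0.3), (3000, 0.4), (4000, 0.9)]:
        tracker.update(loss, f"step{step}.ckpt")
    print(tracker.best())
```

With a diverging run, a top-K policy retains the checkpoints from before the collapse, while periodic saving retains the most recent state. The two are not mutually exclusive, so both could be used side by side.
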






Training Collapse #18
