Hey, thanks for your excellent work. I'm currently following the open-sourced code and have a few questions about the training procedure:
- I pulled the latest code from GitHub and ran the stage 1 training code on ImageNet from scratch on an 8-GPU A100 machine, but the training log looks abnormal. The recon loss seems to diverge and the visualization results turn bad. (See the image attached to the email.)
- The training code uses '-num_nodes 4'. What does this hyperparameter mean?
- The default training code saves a checkpoint every n steps rather than keeping the top-K checkpoints by 'val/recon_loss'. Should I use the top-K checkpoint callback instead?
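
To make the last question concrete, here is a minimal sketch of what I mean by top-K selection. It is plain Python, not the repository's callback: `TopKCheckpoints` and its methods are hypothetical names, and the loss values in the usage note are made up for illustration.

```python
import heapq
from typing import List, Tuple


class TopKCheckpoints:
    """Track the k checkpoints with the lowest monitored loss (hypothetical helper)."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # heapq is a min-heap, so losses are stored negated:
        # the worst (largest-loss) kept checkpoint sits at index 0.
        self._heap: List[Tuple[float, int]] = []

    def update(self, step: int, val_recon_loss: float) -> bool:
        """Record a validation result; return True if this checkpoint is kept."""
        entry = (-val_recon_loss, step)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        worst_kept_loss = -self._heap[0][0]
        if val_recon_loss < worst_kept_loss:
            # Evict the worst kept checkpoint in favor of the new one.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def kept(self) -> List[Tuple[int, float]]:
        """Return the kept (step, loss) pairs, lowest loss first."""
        return sorted(((step, -neg) for neg, step in self._heap), key=lambda p: p[1])
```

For example, with `k=2` and illustrative validation results `(1000, 0.9)`, `(2000, 0.5)`, `(3000, 0.7)`, `(4000, 1.4)`, the tracker keeps steps 2000 and 3000. Saving every n steps would keep all four, including the step 4000 checkpoint taken after the loss went up.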
