Training Collapse #18

@JewelChen2019

Description

Hey, thanks for your excellent work. I'm currently following the open-sourced code and have a few questions about the training procedure:

  1. I pulled the latest code from GitHub and ran the stage1 training code on ImageNet from scratch on an 8-GPU A100 machine, but the training log looks abnormal. The recon loss seems to diverge and the visualization results turn bad. (See the attached images.)

  2. The training command uses '-num_nodes 4'. What does this hyperparameter mean?

  3. The default training code saves a checkpoint every n steps rather than keeping the top-K checkpoints by 'val/recon_loss'. Should I use the top-K checkpoint callback instead?

[Image: training_2024-07-25 11 14 21]

training: [Image: train-recon]

validation: [Image: val-recon]
