Reproducing test cases from paper #489

asglover · 2025-04-14T15:06:33Z

asglover
Apr 14, 2025

Hello,

My name is Austin Glover, I'm writing on behalf of the OpenEquivariance project.

Sorry to make an unrelated request in the middle of your big refactor / new-api roll out, but I was wondering if you could point me the configs and scripts you used to generate the results in your paper. (MD-17, water, LIPS)

We're trying to accelerate models like Nequip, and we extracted some the key convolutions and show a speedup, but want to evaluate the overall impact of the faster kernels at the user level (running models for inference, maybe training)

If you have some scripts that run the models-config + data combinations from the paper, that would be great. Or if you have recommendations for where to find other real world load cases, that would be appreciated!

asglover · 2025-04-14T15:08:44Z

asglover
Apr 14, 2025
Author

I'm sorry, I was just pointed to this repo.

https://github.com/mir-group/nequip-input-files

You can ignore this discussion.

0 replies

asglover · 2025-04-21T22:24:51Z

asglover
Apr 21, 2025
Author

Actually, I'd like to reopen this:

In the time since the paper was published, the yaml specification has changed a bit. Do you have any guidance about what I should watch out for as I try and convert the .yaml configs associated with the paper reproduction to .yaml configs which work with the software in its current state.

Is there a chance you could provide an example of a single one of the .yaml configs from this repo updated to work with the modern software state that I could use as a template for updating the rest of the configs?

0 replies

asglover · 2025-04-22T21:42:27Z

asglover
Apr 22, 2025
Author

Something in particular is that I'd like some advice on:

https://github.com/mir-group/nequip-input-files/blob/64e395afdeab93184374d770cc709e148cdd4e73/ccsdt/ccsd-aspirin-n950-l3.yaml#L127-L150

In the paper models, both global and per species shifts and scales are used. But with the modern software, when I try to convert the .yaml specification to the modern syntax, I trip this Error.

nequip/nequip/model/_scaling.py

Lines 137 to 148 in 81a4db4

    
           # Check for common double shift mistake with defaults 
        
           if "RescaleEnergyEtc" in config.get("model_builders", []): 
        
               # if the defaults are enabled, then we will get bad double shift 
        
               # THIS CHECK IS ONLY GOOD ENOUGH FOR EMITTING WARNINGS 
        
               has_global_shift = config.get("global_rescale_shift", None) is not None 
        
               if has_global_shift: 
        
                   if config.get(module_prefix + "_shifts", True) is not None: 
        
                       # using default of per_atom shift 
        
                       raise RuntimeError( 
        
                           "A global_rescale_shift was provided, but the default per-atom energy shift was not disabled." 
        
                       ) 
        
               del has_global_shift

I think it's saying that it's no longer valid to use the config used in the paper? Where trainable per species shifts and scales are used on top of global non-trainable shifts and scales. Could you share some insight about this behavior?

nequip/configs/full.yaml

Lines 370 to 387 in 81a4db4

    
           # # full block needed for per specie rescale 
        
           # global_rescale_shift: null 
        
           # global_rescale_shift_trainable: false 
        
           # global_rescale_scale: dataset_forces_rms 
        
           # global_rescale_scale_trainable: false 
        
           # per_species_rescale_shifts_trainable: false 
        
           # per_species_rescale_scales_trainable: true 
        
           # per_species_rescale_shifts: dataset_per_species_total_energy_mean 
        
           # per_species_rescale_scales: dataset_per_species_forces_rms 
        
           # # full block needed for global rescale 
        
           # global_rescale_shift: dataset_total_energy_mean 
        
           # global_rescale_shift_trainable: false 
        
           # global_rescale_scale: dataset_forces_rms 
        
           # global_rescale_scale_trainable: false 
        
           # per_species_rescale_trainable: false 
        
           # per_species_rescale_shifts: null 
        
           # per_species_rescale_scales: null

In the example, you only show one or the other, and combined with the error I'm getting, It makes me wonder if it's not valid to have both or if I'm setting something wrong in my config file. Below is my attempt at modifying the config file.

@cw-tan

root: /pscratch/sd/a/aglover/nequip-root-dir/ccsdt/aspirin-ccsd-n950-l3
run_name: aspirin-ccsd-n950-l3-run_name
# workdir: /pscratch/sd/a/aglover/nequip-work-dir/ccsdt/aspirin-ccsd-n950-l3

# requeue: true
seed: 0                                                                           # random number seed for numpy and torch
append: true                                                                      # set True if a restarted run should append to the plevious log file
default_dtype: float32                                                            # type of float to use, e.g. float32 and float64
allow_tf32: false                                                                  # whether to use TensorFloat32 if it is available

model_builders:
 - EnergyModel
 - PerSpeciesRescale
 - ForceOutput 
 - RescaleEnergyEtc  


# network
r_max: 4.0                                                                        # cutoff radius in length units

num_layers: 5                                                                     # number of interaction blocks, we found 5-6 to work best
chemical_embedding_irreps_out: 64x0e                                              # irreps for the chemical embedding of species
feature_irreps_hidden: 64x0o + 64x0e + 64x1o + 64x1e + 64x2o + 64x2e + 64x3o + 64x3e                # irreps used for hidden features, here we go up to lmax=2, with even and odd parities
irreps_edge_sh: 0e + 1o + 2e + 3o                                                 # irreps of the spherical harmonics used for edges. If a single integer, indicates the full SH up to L_max=that_integer
conv_to_output_hidden_irreps_out: 16x0e                                           # irreps used in hidden layer of output block

nonlinearity_type: gate                                                           # may be 'gate' or 'norm', 'gate' is recommended

nonlinearity_scalars:
  e: silu
  o: tanh
nonlinearity_gates:
  e: silu
  o: tanh

resnet: false                                                                     # set true to make interaction block a resnet-style update
num_basis: 8                                                                      # number of basis functions used in the radial basis
BesselBasis_trainable: true                                                       # set true to train the bessel weights
PolynomialCutoff_p: 6                                                             # p-exponent used in polynomial cutoff function

# radial network
invariant_layers: 3                                                               # number of radial layers, we found it important to keep this small, 1 or 2
invariant_neurons: 64                                                             # number of hidden neurons in radial function, smaller is faster
avg_num_neighbors: 10                                                             # number of neighbors to divide by, None => no normalization.
use_sc: true                                                                      # use self-connection or not, usually gives big improvement
# compile_model: false                                                              # whether to compile the constructed model to TorchScript


# data set
# the keys used need to be stated at least once in key_mapping, npz_fixed_field_keys or npz_keys
# key_mapping is used to map the key in the npz file to the NequIP default values (see data/_key.py)
# all arrays are expected to have the shape of (nframe, natom, ?) except the fixed fields
dataset: npz                                                                       # type of data set, can be npz or ase
dataset_file_name: /pscratch/sd/a/aglover/nequip-datasets/ccsd/aspirin_ccsd/aspirin_ccsd-train.npz # path to data set file 
key_mapping:
  z: atomic_numbers                                                  # atomic species, integers
  E: total_energy                                                           # total potential eneriges to train to
  F: forces                                                                   # atomic forces to train to
  R: pos                                                                      # raw atomic positions
npz_fixed_field_keys:                                                              # fields that are repeated across different examples
  - atomic_numbers

chemical_symbols:
  - H 
  - C 
  - O 

# logging
# wandb: true                                                                        # we recommend using wandb for logging, we'll turn it off here as it's optional
# wandb_project: nequip-paper-results-ccsdt                                                             # project name used in wandb
# wandb_resume: true                                                                 # if true and restart is true, wandb run data will be restarted and updated.
                                                                                   # if false, a new wandb run will be generated
verbose: info                                                                      # the same as python logging, e.g. warning, info, debug, error. case insensitive
log_batch_freq: 1000000                                                                  # batch frequency, how often to print training errors withinin the same epoch
log_epoch_freq: 1                                                                  # epoch frequency, how often to print and save the model
save_checkpoint_freq: -1                                                            # frequency to save the intermediate checkpoint. no saving when the value is not positive.
save_ema_checkpoint_freq: -1                                                        # frequency to save the intermediate ema checkpoint. no saving when the value is not positive.

# training
n_train: 950                                                                       # number of training data
n_val: 50                                                                          # number of validation data
learning_rate: 0.01                                                                # learning rate, we found values between 0.01 and 0.005 to work best 
batch_size: 5                                                                      # batch size, we found it important to keep this small for most applications (1-5)
max_epochs: 1000000                                                                # stop training after _ number of epochs
train_val_split: random                                                            # can be random or sequential
shuffle: true                                                                      # If true, the data loader will shuffle the data, usually a good idea
metrics_key: validation_loss                                                       # metrics used for scheduling and saving best model. Options: loss, or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse
use_ema: true        # if true, use exponential moving average on weights for val/test, usually helps a lot with training, in particular for energy errors
ema_decay: 0.99                                                                    # ema weight, commonly set to 0.999
ema_use_num_updates: true                                                          # whether to use number of updates when computing averages
report_init_validation: true                                                          

# early stopping based on metrics values. 
# LR, wall and any keys printed in the log file can be used. 
# The key can start with Training or Validation. If not defined, the validation value will be used.
early_stopping_patiences:                                                          # stop early if a metric value stopped decreasing for n epochs
  validation_loss: 1000                                                              # 
early_stopping_lower_bounds:                                                       # stop early if a metric value is lower than the bound
  LR: 1.0e-6                                                                        # 

# loss function
loss_coeffs:                                                                       # different weights to use in a weighted loss functions
  forces: 1000                                                                      # for MD applications, we recommed a force weight of 100 and an energy weight of 1
  total_energy: 1                                                                  # alternatively, if energies are not of importance, a force weight 1 and an energy weight of 0 also works.

# output metrics
metrics_components:
  - - forces                               # key
    - rmse                                 # "rmse" or "mse"
    - PerSpecies: True                     # if true, per species contribution is counted separately
      report_per_component: False          # if true, statistics on each component (i.e. fx, fy, fz) will be counted separately
  - - forces
    - mae
    - PerSpecies: True
      report_per_component: False
  - - total_energy
    - mae
  - - total_energy 
    - rmse

# optimizer, may be any optimizer defined in torch.optim
# the name `optimizer_name`is case sensitive
optimizer_name: Adam                                                               # default optimizer is Adam in the amsgrad mode
optimizer_amsgrad: true
optimizer_betas: !!python/tuple
  - 0.9
  - 0.999
optimizer_eps: 1.0e-08
optimizer_weight_decay: 0

# lr scheduler, currently only supports the two options listed below, if you need more please file an issue
# first: on-plateau, reduce lr by factory of lr_scheduler_factor if metrics_key hasn't improved for lr_scheduler_patience epoch
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 50
lr_scheduler_factor: 0.8

# the deafult is to scale the energies and forces by scaling them by the force standard deviation and to shift the energy by its mean
# in certain cases, it can be useful to have a trainable shift/scale and to also have species-dependent shifts/scales for each atom

# global energy shift. When "dataset_energy_mean" (the default), the mean energy of the dataset. When None, disables the global shift. When a number, used directly.
global_rescale_shift: dataset_total_energy_mean
# whether the shift of the final global energy rescaling should be trainable
global_rescale_shift_trainable: false

# global energy scale. When "dataset_force_rms", the RMS of force components in the dataset. When "dataset_energy_std", the stdev of energies in the dataset. When None, disables the global scale. When a number, used directly.
# If not provided, defaults to either dataset_force_rms or dataset_energy_std, depending on whether forces are being trained.
global_rescale_scale: dataset_forces_rms
# whether the scale of the final global energy rescaling should be trainable
global_rescale_scale_trainable: false

# whether to apply a shift and scale, defined per-species, to the atomic energies
per_species_rescale_scales_trainable: true

# if the PerSpeciesScaleShift is enabled, whether the shifts and scales are trainable
per_species_rescale_shifts_trainable: true

# optional initial atomic energy shift for each species. order should be the same as the allowed_species used in train.py. Defaults to zeros.
# per_species_rescale_shifts: [0.0, 0.0, 0.0]

# optional initial atomic energy scale for each species. order should be the same as the allowed_species used in train.py. Defaults to ones.
per_species_rescale_scales: [1.0, 1.0, 1.0]

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reproducing test cases from paper #489

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Reproducing test cases from paper #489

Uh oh!

asglover Apr 14, 2025

Replies: 3 comments

Uh oh!

asglover Apr 14, 2025 Author

Uh oh!

Uh oh!

asglover Apr 21, 2025 Author

Uh oh!

asglover Apr 22, 2025 Author

asglover
Apr 14, 2025

asglover
Apr 14, 2025
Author

asglover
Apr 21, 2025
Author

asglover
Apr 22, 2025
Author