Hello, dear developers.
I have a question about how to train on large datasets and how to set the training parameters.
Background
I have 10 ordinary organic molecules containing the elements H, C, N, O, and S, each with 20,000 conformers. If I train a separate model for each molecule, I get accurate energies and forces (errors of about 0.1-0.3 kcal/mol); I have to say the individual models are very accurate. But when I combine them into one big dataset (200,000 points) and train a single model, problems appear: the total-energy error is about 1-1.5 kcal/mol, and when testing each molecule individually, three of the molecules show very large errors (about 3 kcal/mol). I think one model should achieve the same accuracy for these organic molecules, just as when a single model is trained on all of the MD17 datasets.
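For reference, the per-molecule test errors described above can be computed with a short script like the following. This is only a minimal sketch: the function name, the array layout, and the eV-to-kcal/mol conversion are my assumptions, not part of the original post.

```python
import numpy as np

EV_TO_KCAL_PER_MOL = 23.0605  # approximate unit conversion, assumed here

def per_group_energy_mae(e_pred, e_ref, mol_ids):
    """Mean absolute energy error for each molecule in a mixed test set.

    e_pred, e_ref : predicted / reference total energies (eV), one per frame
    mol_ids       : integer molecule label for each frame
    Returns a dict {molecule_id: MAE in kcal/mol}.
    """
    e_pred, e_ref, mol_ids = map(np.asarray, (e_pred, e_ref, mol_ids))
    errors = np.abs(e_pred - e_ref) * EV_TO_KCAL_PER_MOL
    return {int(m): float(errors[mol_ids == m].mean()) for m in np.unique(mol_ids)}

# tiny illustrative call with made-up numbers (two frames of molecule 0, one of molecule 1)
maes = per_group_energy_mae([1.0, 2.0, 3.1], [1.01, 2.02, 3.0], [0, 0, 1])
```

Grouping the errors this way makes it easy to spot the few molecules whose error is much larger than the rest of the combined dataset.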
Question
How should I train on a large dataset, and how should the training parameters be set for large datasets? Below is my config.yaml.
```yaml
# a simple example config file
# Two folders will be used during the training: 'root'/process and 'root'/'run_name'
# run_name contains logfiles and saved models
# process contains processed data sets
# if 'root'/'run_name' exists, 'root'/'run_name'_'year'-'month'-'day'-'hour'-'min'-'s' will be used instead.
root: results/l2
run_name: 10monomer
seed: 123                        # model seed
dataset_seed: 456                # data set seed
append: true                     # set true if a restarted run should append to the previous log file
default_dtype: float32           # type of float to use, e.g. float32 and float64

# network
r_max: 6.0                       # cutoff radius in length units, here Angstrom; this is an important hyperparameter to scan
num_layers: 4                    # number of interaction blocks; we find 3-5 to work best
l_max: 2                         # the maximum irrep order (rotation order) for the network's features; l=1 is a good default, l=2 is more accurate but slower
parity: true                     # whether to include features with odd mirror parity; often turning parity off gives equally good but faster networks, so do consider this
num_features: 32                 # the multiplicity of the features; 32 is a good default for an accurate network. To be more accurate, go larger; to be faster, go lower
nonlinearity_type: gate          # may be 'gate' or 'norm'; 'gate' is recommended

# scalar nonlinearities to use -- available options are silu, ssp (shifted softplus), tanh, and abs.
# Different nonlinearities are specified for e (even) and o (odd) parity;
# note that only tanh and abs are correct for o (odd) parity.
# silu typically works best for even parity.
nonlinearity_scalars:
  e: silu
  o: tanh

nonlinearity_gates:
  e: silu
  o: tanh

# radial network basis
num_basis: 8                     # number of basis functions used in the radial basis; 8 usually works best
BesselBasis_trainable: true      # set true to train the Bessel weights
PolynomialCutoff_p: 6            # p-exponent used in the polynomial cutoff function; smaller p corresponds to stronger decay with distance

# radial network
invariant_layers: 3              # number of radial layers; usually 1-3 works best, smaller is faster
invariant_neurons: 128           # number of hidden neurons in the radial function; smaller is faster
avg_num_neighbors: auto          # number of neighbors to divide by; null => no normalization, auto computes it from the dataset
use_sc: true                     # use self-connection or not; usually gives a big improvement

# data set
# the keys used need to be stated at least once in key_mapping, npz_fixed_field_keys or npz_keys
# key_mapping is used to map the key in the npz file to the NequIP default values (see data/_key.py)
# all arrays are expected to have the shape of (nframe, natom, ?) except the fixed fields
# note that if your data set uses pbc, you need to also pass an array that maps to the nequip "pbc" key
dataset: ase                                           # type of data set, can be npz or ase
dataset_file_name: datasets/train10_ext.xyz            # path to training data set file
validation_dataset: ase
validation_dataset_file_name: datasets/test10_ext.xyz  # path to validation data set file
key_mapping:
  z: atomic_numbers              # atomic species, integers
  E: total_energy                # total potential energies to train to
  F: forces                      # atomic forces to train to
  R: pos                         # raw atomic positions
npz_fixed_field_keys:            # fields that are repeated across different examples
  - atomic_numbers

# A list of atomic types to be found in the data. The NequIP types will be named
# with the chemical symbols, and inputs with the correct atomic numbers will be
# mapped to the corresponding types.
chemical_symbols:
  - H
  - C
  - N
  - O
  - S

# logging
# wandb: false                   # we recommend using wandb for logging
# wandb_project: toluene-example # project name used in wandb
verbose: info                    # the same as python logging, e.g. warning, info, debug, error; case insensitive
log_batch_freq: 10               # batch frequency, how often to print training errors within the same epoch
log_epoch_freq: 1                # epoch frequency, how often to print
save_checkpoint_freq: 100        # frequency to save the intermediate checkpoint; no intermediate checkpoints are saved when the value is not positive
save_ema_checkpoint_freq: -1     # frequency to save the intermediate EMA checkpoint; no intermediate checkpoints are saved when the value is not positive

# training
n_train: 200000                  # number of training data
n_val: 20000                     # number of validation data
learning_rate: 0.005             # learning rate; we found values between 0.01 and 0.005 to work best - this is often one of the most important hyperparameters to tune
batch_size: 5                    # batch size; we found it important to keep this small for most applications including forces (1-5); for energy-only training, higher batch sizes work better
validation_batch_size: 10        # batch size for evaluating the model during validation. This does not affect the training results, but using the highest value possible (<= n_val) without running out of memory will speed up your training
max_epochs: 10000                # stop training after this number of epochs; we set a very large number, e.g. 1 million, and then just use early stopping rather than training the full number of epochs
train_val_split: random          # can be random or sequential. If sequential, the first n_train elements are training and the next n_val are validation; usually random is the right choice
shuffle: true                    # if true, the data loader will shuffle the data, usually a good idea
metrics_key: validation_loss     # metric used for scheduling and saving the best model.
                                 # Options: `set`_`quantity`; set can be either "train" or "validation", quantity can be loss
                                 # or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse
use_ema: true                    # if true, use an exponential moving average of the weights for val/test; usually helps a lot with training, in particular for energy errors
ema_decay: 0.99                  # EMA weight, typically set to 0.99 or 0.999
ema_use_num_updates: true        # whether to use the number of updates when computing averages
report_init_validation: true     # if true, report the validation error for the just-initialized model

# early stopping based on metric values
early_stopping_patiences:        # stop early if a metric value has stopped decreasing for n epochs
  validation_loss: 50
early_stopping_lower_bounds:     # stop early if a metric value is lower than the bound
  LR: 1.0e-5

# loss function
loss_coeffs:
  forces: 1                      # if using PerAtomMSELoss, a default weight of 1:1 on each should work well
  total_energy:
    - 1
    - PerAtomMSELoss

# output metrics
metrics_components:
  - - forces                     # key
    - mae                        # "rmse" or "mae"
  - - forces
    - rmse
    - PerSpecies: False            # if true, per-species contributions are counted separately
      report_per_component: False  # if true, statistics on each component (i.e. fx, fy, fz) are counted separately
  - - total_energy
    - mae
  - - total_energy
    - rmse
    - PerAtom: False             # if true, energy is normalized by the number of atoms

# optimizer, may be any optimizer defined in torch.optim
# the name `optimizer_name` is case sensitive
optimizer_name: Adam             # default optimizer is Adam
optimizer_amsgrad: false

# lr scheduler, currently only supports the two options listed in full.yaml, i.e. on-plateau and
# cosine annealing with warm restarts; if you need more, please file an issue
# here: on-plateau, reduce lr by a factor of lr_scheduler_factor if metrics_key hasn't improved for lr_scheduler_patience epochs
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 100
lr_scheduler_factor: 0.5

# we provide a series of options to shift and scale the data
# these are for advanced use and usually the defaults work very well
# the default is to scale the atomic energies and forces by the force standard deviation and to shift the energy by the mean atomic energy
# in certain cases, it can be useful to have a trainable shift/scale and to also have species-dependent shifts/scales for each atom

# whether the shifts and scales are trainable. Defaults to False. Optional
per_species_rescale_shifts_trainable: false
per_species_rescale_scales_trainable: false

# initial atomic energy shift for each species. Defaults to the mean per-atom energy. Optional
# the value can be a constant float, an array for each species, or a string that defines a statistic over the training dataset
per_species_rescale_shifts: dataset_per_atom_total_energy_mean

# initial atomic energy scale for each species. Optional
# the value can be a constant float, an array for each species, or a string
per_species_rescale_scales: dataset_forces_rms

# if explicit numbers are given for the shifts/scales, this parameter must specify whether the given
# numbers are unitless shifts/scales or are in the units of the dataset. If `True`, any global
# rescalings will correctly be applied to the per-species values.
# per_species_rescale_arguments_in_dataset_units: True
```
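The config's own comments note that `per_species_rescale_shifts` can also be an explicit array, one value per species, instead of a dataset statistic. A minimal sketch of that form is below; the numbers are placeholders I made up for illustration, not values from my dataset, and I am assuming the array follows the order of `chemical_symbols`:

```yaml
# HYPOTHETICAL fragment: explicit per-species energy shifts (placeholder numbers)
# assumed order matches chemical_symbols above: H, C, N, O, S
per_species_rescale_shifts:
  - -13.6     # H  (placeholder)
  - -1029.0   # C  (placeholder)
  - -1485.0   # N  (placeholder)
  - -2042.0   # O  (placeholder)
  - -10831.0  # S  (placeholder)
per_species_rescale_scales: dataset_forces_rms
# per the comment above, this must be set when explicit numbers are supplied,
# to state whether they are in the units of the dataset:
per_species_rescale_arguments_in_dataset_units: true
```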
Can Allegro speed up the training process? Could you give a guide to YAML parameter settings for large datasets, for both NequIP and Allegro?