Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics
nnRNN - NeurIPS 2019
expRNN code taken from here
EURNN tests based on code taken from here
Changes from the paper:
- Tested the Adam optimizer with betas (0.0, 0.9) on expRNN and nnRNN
- Added gradient clipping (a setup sketch follows this list)
- Note the large improvements in nnRNN
- expRNN did not improve under the new optimizer, but did improve when searching over higher learning rates
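A minimal sketch of the optimizer and clipping setup described above, assuming PyTorch (the repo's scripts are Python). `model` is a stand-in module rather than the repo's nnRNN class; the learning rate and clipping value come from the PTB nnRNN rows in the tables below.

```python
import torch

# Stand-in model; the repo's actual nnRNN module is not shown here.
model = torch.nn.RNN(input_size=1, hidden_size=128)
# Adam with betas (0.0, 0.9) as tested in these experiments.
optimizer = torch.optim.Adam(model.parameters(), lr=0.002, betas=(0.0, 0.9))

def training_step(inputs, targets, criterion):
    optimizer.zero_grad()
    outputs, _ = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    # Gradient clipping as added in these experiments. Clipping by norm is
    # shown; the repo may clip by value instead ("Grad Clipping Value").
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
    optimizer.step()
    return loss.item()
```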
**PTB character-level prediction: test bits per character (BPC; lower is better), mean ± std.** The first two result columns fix the number of parameters (~1.32 M); the last two fix the number of hidden units (N = 1024).

| Model | T_PTB = 150 (fixed # params) | T_PTB = 300 (fixed # params) | T_PTB = 150 (fixed # units) | T_PTB = 300 (fixed # units) |
|---|---|---|---|---|
| RNN | 2.89 ± 0.002 | 2.90 ± 0.002 | 2.89 ± 0.002 | 2.90 ± 0.002 |
| RNN-orth | 1.62 ± 0.004 | 1.66 ± 0.006 | 1.62 ± 0.004 | 1.66 ± 0.006 |
| EURNN | 1.61 ± 0.001 | 1.62 ± 0.001 | 1.69 ± 0.001 | 1.68 ± 0.001 |
| expRNN | 1.43 ± 0.002 | 1.44 ± 0.002 | 1.45 ± 0.002 | 1.48 ± 0.008 |
| nnRNN | 1.40 ± 0.003 | 1.42 ± 0.003 | 1.40 ± 0.003 | 1.42 ± 0.003 |
**PTB character-level prediction: test accuracy (%; higher is better), mean ± std.** Same column layout as above.

| Model | T_PTB = 150 (fixed # params) | T_PTB = 300 (fixed # params) | T_PTB = 150 (fixed # units) | T_PTB = 300 (fixed # units) |
|---|---|---|---|---|
| RNN | 40.01 ± 0.026 | 39.97 ± 0.025 | 40.01 ± 0.026 | 39.97 ± 0.025 |
| RNN-orth | 66.29 ± 0.07 | 65.53 ± 0.09 | 66.29 ± 0.07 | 65.53 ± 0.09 |
| EURNN | 65.68 ± 0.002 | 65.55 ± 0.002 | 64.01 ± 0.002 | 64.20 ± 0.003 |
| expRNN | 69.02 ± 0.0005 | 68.98 ± 0.0003 | 68.69 ± 0.0004 | 68.57 ± 0.0004 |
| nnRNN | 69.89 ± 0.001 | 69.54 ± 0.001 | 69.89 ± 0.001 | 69.54 ± 0.001 |
**Copy task hyperparameters:**

| Model | Hidden Size | Optimizer | LR | Orth. LR | δ | T decay | Recurrent init |
|---|---|---|---|---|---|---|---|
| RNN | 128 | RMSprop α=0.9 | 0.001 | | | | Glorot Normal |
| RNN-orth | 128 | RMSprop α=0.99 | 0.0002 | | | | Random orth |
| EURNN | 128 | RMSprop α=0.5 | 0.001 | | | | |
| EURNN | 256 | RMSprop α=0.5 | 0.001 | | | | |
| expRNN | 128 | RMSprop α=0.99 | 0.001 | 0.0001 | | | Henaff |
| expRNN | 176 | RMSprop α=0.99 | 0.001 | 0.0001 | | | Henaff |
| nnRNN | 128 | RMSprop α=0.99 | 0.0005 | 1e-6 | 0.0001 | 1e-6 | Cayley |
**sMNIST hyperparameters:**

| Model | Hidden Size | Optimizer | LR | Orth. LR | δ | T decay | Recurrent init |
|---|---|---|---|---|---|---|---|
| RNN | 512 | RMSprop α=0.9 | 0.0001 | | | | Glorot Normal |
| RNN-orth | 512 | RMSprop α=0.99 | 5e-5 | | | | Random orth |
| EURNN | 512 | RMSprop α=0.9 | 0.0001 | | | | |
| EURNN | 1024 | RMSprop α=0.9 | 0.0001 | | | | |
| expRNN | 512 | RMSprop α=0.99 | 0.0005 | 5e-5 | | | Cayley |
| expRNN | 722 | RMSprop α=0.99 | 5e-5 | | | | Cayley |
| nnRNN | 512 | RMSprop α=0.99 | 0.0002 | 2e-5 | 0.1 | 0.0001 | Cayley |
| LSTM | 512 | RMSprop α=0.99 | 0.0005 | | | | Glorot Normal |
| LSTM | 257 | RMSprop α=0.9 | 0.0005 | | | | Glorot Normal |
**PTB hyperparameters:**

| Model | Hidden Size | Optimizer | LR | Orth. LR | δ | T decay | Recurrent init | Grad clipping value |
|---|---|---|---|---|---|---|---|---|
| **Length = 150** | | | | | | | | |
| RNN | 1024 | RMSprop α=0.9 | 1e-5 | | | | Glorot Normal | |
| RNN-orth | 1024 | RMSprop α=0.9 | 0.0001 | | | | Cayley | |
| EURNN | 1024 | RMSprop α=0.9 | 0.001 | | | | | |
| EURNN | 2048 | RMSprop α=0.9 | 0.001 | | | | | |
| expRNN | 1024 | RMSprop α=0.9 | 0.001 | | | | Cayley | |
| expRNN | 1386 | RMSprop α=0.9 | 0.008 | 0.0008 | | | Cayley | |
| nnRNN | 1024 | Adam β=(0.0, 0.9) | 0.002 | 0.0002 | 0.0001 | 1e-5 | Cayley | 10 |
| **Length = 300** | | | | | | | | |
| RNN | 1024 | RMSprop α=0.9 | 1e-5 | | | | Glorot Normal | |
| RNN-orth | 1024 | RMSprop α=0.9 | 0.0001 | | | | Cayley | |
| EURNN | 1024 | RMSprop α=0.9 | 0.001 | | | | | |
| EURNN | 2048 | RMSprop α=0.9 | 0.001 | | | | | |
| expRNN | 1024 | RMSprop α=0.9 | 0.001 | | | | Cayley | |
| expRNN | 1386 | RMSprop α=0.9 | 0.001 | | | | Cayley | |
| nnRNN | 1024 | Adam β=(0.0, 0.9) | 0.002 | 0.0002 | 0.0001 | 1e-6 | Cayley | 5 |
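A note on the δ and T decay columns: per the option descriptions below, δ (`alam`) scales a regularization penalty on the recurrent matrix, and T decay (`Tdecay`) is weight decay on its upper-triangular values. The sketch below shows one way such penalties could enter the loss; the exact penalty form (eigenvalue magnitudes pulled toward the unit circle) and all variable names are assumptions, not the repo's code.

```python
import torch

def nnrnn_penalties(eig_magnitudes, T_upper, delta, t_decay):
    # delta (the paper's δ): assumed here to pull the magnitudes of the
    # recurrent matrix's eigenvalues toward the unit circle.
    circle_penalty = delta * ((eig_magnitudes - 1.0) ** 2).sum()
    # T decay: L2 weight decay on the strictly upper-triangular entries
    # of the recurrent matrix (matching the Tdecay option below).
    t_penalty = t_decay * (T_upper ** 2).sum()
    return circle_penalty + t_penalty

# Illustrative usage with placeholder tensors:
eig_magnitudes = torch.ones(128)
T_upper = torch.triu(torch.randn(128, 128), diagonal=1)
reg = nnrnn_penalties(eig_magnitudes, T_upper, delta=0.0001, t_decay=1e-5)
```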
Copy task:

```
python copytask.py [args]
```
Options (an example invocation follows the list):
- net-type : type of RNN to use in the test
- nhid : number of hidden units
- cuda : use CUDA
- T : recall delay (number of time steps between the input sequence and when the output is required)
- labels : number of labels in the input and output, maximum 8
- c-length : sequence length
- onehot : use one-hot encoding for labels and inputs
- vari : use variable-length sequences
- random-seed : random seed for the experiment
- batch : batch size
- lr : learning rate for the optimizer
- lr_orth : learning rate for the orthogonal optimizer
- alpha : α value for the optimizer (always RMSprop)
- rinit : recurrent weight matrix initialization, options: [xavier, henaff, cayley, random orth.]
- iinit : input weight matrix initialization, options: [xavier, kaiming]
- nonlin : nonlinearity type, options: [None, tanh, relu, modrelu]
- alam : strength of the regularization penalty (δ in the paper)
- Tdecay : weight decay on the upper-triangular matrix values
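For example, a run matching the nnRNN row of the copy task table above might look like this (the `--nonlin`, `--T`, and `--c-length` values are placeholders not taken from the table, and the exact flag formats are assumptions):

```
python copytask.py --net-type nnRNN --nhid 128 --lr 0.0005 --lr_orth 1e-6 \
    --rinit cayley --nonlin modrelu --alam 0.0001 --Tdecay 1e-6 --T 200 --c-length 10
```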
Sequential MNIST:

```
python sMNIST.py [args]
```
Options (an example invocation follows the list):
- net-type : type of RNN to use in the test
- nhid : number of hidden units
- epochs : number of epochs
- cuda : use CUDA
- permute : permute the order of the input pixels
- random-seed : random seed for the experiment (the permutation order uses its own independent seed)
- batch : batch size
- lr : learning rate for the optimizer
- lr_orth : learning rate for the orthogonal optimizer
- alpha : α value for the optimizer (always RMSprop)
- rinit : recurrent weight matrix initialization, options: [xavier, henaff, cayley, random orth.]
- iinit : input weight matrix initialization, options: [xavier, kaiming]
- nonlin : nonlinearity type, options: [None, tanh, relu, modrelu]
- alam : strength of the regularization penalty (δ in the paper)
- Tdecay : weight decay on the upper-triangular matrix values
- save_freq : how often (in epochs) to save data and the network
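For example, the nnRNN row of the sMNIST table above translates to roughly the following (the `--epochs` value is a placeholder, and the exact flag formats are assumptions):

```
python sMNIST.py --net-type nnRNN --nhid 512 --epochs 100 --lr 0.0002 \
    --lr_orth 2e-5 --rinit cayley --alam 0.1 --Tdecay 0.0001 --permute
```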
Adapted from here
PTB character-level language modeling:

```
python language_task.py [args]
```
Options (an example invocation follows the list):
- net-type : type of RNN to use in the test
- emsize : size of the word embeddings
- nhid : number of hidden units
- epochs : number of epochs
- bptt : sequence length for backpropagation through time
- cuda : use CUDA
- seed : random seed for the experiment
- batch : batch size
- log-interval : reporting interval
- save : path for saving the final model and test info
- lr : learning rate for the optimizer
- lr_orth : learning rate for the orthogonal optimizer
- rinit : recurrent weight matrix initialization, options: [xavier, henaff, cayley, random orth.]
- iinit : input weight matrix initialization, options: [xavier, kaiming]
- nonlin : nonlinearity type, options: [None, tanh, relu, modrelu]
- alam : strength of the regularization penalty (δ in the paper)
- Tdecay : weight decay on the upper-triangular matrix values
- optimizer : choice of optimizer, RMSprop or Adam
- alpha : α value (used when the optimizer is RMSprop)
- betas : β values (used when the optimizer is Adam)
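For example, the nnRNN row of the PTB table (Length = 150) might be run as follows (how `--betas` is passed on the command line is an assumption, as are the other flag formats):

```
python language_task.py --net-type nnRNN --nhid 1024 --bptt 150 --optimizer Adam \
    --betas 0.0 0.9 --lr 0.002 --lr_orth 0.0002 --alam 0.0001 --Tdecay 1e-5 --rinit cayley
```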