My notes and solutions for CS224N 2019
- Course page CS224N
- Lecture videos 2019 YouTube
- Stanford online hub
- Updating Word Vectors - Pitfalls
- Computation Graphs and Backpropagation
[downstream gradient] = [upstream gradient] * [local gradient] (a minimal numeric sketch follows this group of topics)
- Implementation
- You should understand backprop
- Regularization
- Non-linearities
- Parameter Initialization
- Optimizers
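A minimal numeric sketch of the downstream = upstream * local rule above, for a single multiply node z = x * y (the values are arbitrary, just for illustration):

```python
# One node of a computation graph: z = x * y
x, y = 3.0, -2.0
z = x * y

# Backprop has already given us the upstream gradient dL/dz from the nodes above
upstream = 4.0

# Local gradients of this node: dz/dx = y, dz/dy = x
local_dx, local_dy = y, x

# [downstream gradient] = [upstream gradient] * [local gradient]
downstream_dx = upstream * local_dx   # dL/dx = 4.0 * (-2.0) = -8.0
downstream_dy = upstream * local_dy   # dL/dy = 4.0 *  3.0  = 12.0
print(downstream_dx, downstream_dy)
```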
- Meaning of words in context
- Semi-supervised. Unsupervised step learns embeddings (e.g. word2vec) + an RNN LM
- Learned embedding representations (without context) are concatenated with the LM embeddings (hidden states) to form the final representation, which is then trained in a supervised fashion for the final task, e.g. NER
- 2-layer biLSTM LM
- Task-specific weighted average of the hidden states of all layers (see the sketch after this group)
- Character representation of words to build the initial representation: 2048 char n-gram CNN filters, 2 highway layers, 512-dim projection
- 4096-dim hidden/cell LSTM states with a 512-dim projection to the next input
- Residual connections
- Learns task-specific combinations of biLM representations
- Uses all levels of the hidden layers of the LM for the representation (unlike TagLM, which just uses the middle layer)
- The pre-trained LM is kept frozen. A new task-specific model is fine-tuned on top of it.
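A rough sketch of the ELMo-style task-specific weighted average over the frozen biLM layers (softmax-normalized per-layer scalars plus a global scale gamma); the class name and tensor shapes are my own illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted average over the (frozen) biLM layer states."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_k, softmax-normalized below
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific global scale

    def forward(self, layer_states):
        # layer_states: list of num_layers tensors, each of shape (batch, seq_len, dim)
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed

# Example: token layer + 2 biLSTM layers = 3 layer states from the frozen biLM
states = [torch.randn(8, 20, 1024) for _ in range(3)]
elmo_repr = ScalarMix(num_layers=3)(states)   # (8, 20, 1024), fed into the task model (e.g. NER)
```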
- Universal LM for text classification
- Pre-trained LM on big corpus
- Then fine-tune the pre-trained LM on specific domain
- So it uses the same pre-trained model for the end task
- Finally, uses the same LM with a text classification objective: the LM is (initially) frozen and a new classification layer is introduced on top of it.
- Discriminative fine-tuning: different learning rates for each layer
- Slanted Triangular Learning Rate schedule (STLR) (see the sketch after this group)
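A small sketch of the slanted triangular schedule (short linear warm-up, long linear decay); the constants are the defaults reported in the ULMFiT paper, the function name is mine:

```python
def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """STLR: increase linearly for cut_frac of training, then decay linearly."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                       # warm-up fraction
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))    # decay fraction
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Learning rate at a few points of a 1000-step run (peaks at step 100)
print([round(slanted_triangular_lr(t, 1000), 5) for t in (0, 50, 100, 500, 999)])
```

In the same spirit, discriminative fine-tuning gives each layer its own learning rate, decreasing it layer by layer as you go down the network.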
- Aim for parallelism: RNNs are sequential.
- The Annotated Transformer
- Only Attention. Attention everywhere.
- Multi-head attention: Q, K, V have their dimensions reduced by applying W projection matrices; the multiple attention heads are then concatenated and piped through a final linear layer (see the sketch after this block).
- Transformer Block
- Multihead
- 2-layer FF NNet
- Residual connection
- Layer Normalization (LayerNorm)
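A condensed sketch of the multi-head attention piece of the block above (project Q, K, V with learned W matrices, attend per head with scaled dot products, concatenate the heads, apply a final linear layer); the default sizes just mirror the base Transformer:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.h, self.d_k = num_heads, d_model // num_heads
        # W_Q, W_K, W_V project to reduced per-head dimensions; W_O mixes the concatenated heads
        self.w_q, self.w_k, self.w_v, self.w_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # Project, then split into heads: (B, h, seq_len, d_k)
        q, k, v = [w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
                   for w, x in ((self.w_q, q), (self.w_k, k), (self.w_v, v))]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # scaled dot-product attention
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)   # concatenate heads, then the final linear layer

x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 512])
```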
- Bidirectional
- GPT is left-to-right
- ELMo is a bidirectional LM, but the two directions are trained separately (no attention between the forward and backward models)
- BERT bidirectional with attention
- Objective:
- Predict masked words (see the masking sketch after this group)
- Second loss function: next sentence prediction (classification: given SentA and SentB, is SentB the next sentence? Yes/No)
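A toy sketch of how inputs for the masked-word objective could be prepared (15% of tokens are selected; of those, 80% become [MASK], 10% a random token, 10% stay unchanged, as in the BERT paper); the tokenization and vocabulary here are stand-ins:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Return corrupted input tokens and the labels the model must predict."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                      # not predicted
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```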
- Attention is crucial for NMT (acts as a memory)
- Representation
- Same path length between all pieces (any two positions interact directly)
- Causality is added by masking future words in the decoder (see the sketch at the end of these notes)
- Residual connections carry position information
- Motifs
- Translation invariance
- How far apart you are when comparing things
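A small sketch of the decoder causality mask mentioned above: position i may only attend to positions <= i, implemented by filling the disallowed scores with -inf before the softmax:

```python
import torch

seq_len = 5
# Entry (i, j) is True iff decoder position i is allowed to attend to position j (j <= i)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.randn(seq_len, seq_len)                    # raw attention scores
scores = scores.masked_fill(~causal_mask, float('-inf'))  # block attention to future words
attn = torch.softmax(scores, dim=-1)                      # rows sum to 1 over allowed positions
print(attn)
```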