My notes and solutions for CS224N 2019
- Course page CS224N
- Lecture videos 2019 YouTube
- Stanford online hub
- Updating Word Vectors - Pitfalls
- Computation Graphs and Backpropagation
[downstream gradient] = [upstream gradient] * [local gradient] (a minimal numeric sketch follows this group of topics)
- Implementation
- You should understand backprop
- Regularization
- Non-linearities
- Parameter Initialization
- Optimizers
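A minimal numeric sketch of the downstream = upstream * local rule above, for a single multiply node z = x * y (the values are arbitrary, just for illustration):

```python
# One node of a computation graph: z = x * y
x, y = 3.0, -2.0
z = x * y

# Backprop has already given us the upstream gradient dL/dz from the nodes above
upstream = 4.0

# Local gradients of this node: dz/dx = y, dz/dy = x
local_dx, local_dy = y, x

# [downstream gradient] = [upstream gradient] * [local gradient]
downstream_dx = upstream * local_dx   # dL/dx = 4.0 * (-2.0) = -8.0
downstream_dy = upstream * local_dy   # dL/dy = 4.0 *  3.0  = 12.0
print(downstream_dx, downstream_dy)
```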
- Meaning of words in context
- Semi-supervised. Unsupervised step learns embeddings (e.g. word2vec) + an RNN LM
- Learned embedding representations (without context) are concatenated with the LM embeddings (hidden states) to form the final representation, which is then trained in a supervised fashion for the final task, e.g. NER
- 2-layer biLSTM LM
- Task-specific weighted average of the hidden states of all layers (see the sketch after this group)
- Character representation of words to build the initial representation: 2048 char n-gram CNN filters, 2 highway layers, 512-dim projection
- 4096-dim hidden/cell LSTM states with a 512-dim projection to the next input
- Residual connections
- Learns task-specific combinations of biLM representations
- Uses all levels of the hidden layers of the LM for the representation (unlike TagLM, which just uses the middle layer)
- The pre-trained LM is kept frozen. A new task-specific model is fine-tuned on top of it.
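A rough sketch of the ELMo-style task-specific weighted average over the frozen biLM layers (softmax-normalized per-layer scalars plus a global scale gamma); the class name and tensor shapes are my own illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted average over the (frozen) biLM layer states."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_k, softmax-normalized below
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific global scale

    def forward(self, layer_states):
        # layer_states: list of num_layers tensors, each of shape (batch, seq_len, dim)
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed

# Example: token layer + 2 biLSTM layers = 3 layer states from the frozen biLM
states = [torch.randn(8, 20, 1024) for _ in range(3)]
elmo_repr = ScalarMix(num_layers=3)(states)   # (8, 20, 1024), fed into the task model (e.g. NER)
```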
- Universal LM for text classification
- Pre-trained LM on big corpus
- Then fine-tune the pre-trained LM on specific domain
- So it uses the same pre-trained model for the end task
- Finally, uses the same LM with a text classification objective: the LM is (initially) frozen and a new classification layer is introduced on top of it.
- Discriminative fine-tuning: different learning rates for each layer
- Slanted Triangular Learning Rate schedule (STLR) (see the sketch after this group)
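A small sketch of the slanted triangular schedule (short linear warm-up, long linear decay); the constants are the defaults reported in the ULMFiT paper, the function name is mine:

```python
def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """STLR: increase linearly for cut_frac of training, then decay linearly."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                       # warm-up fraction
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))    # decay fraction
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Learning rate at a few points of a 1000-step run (peaks at step 100)
print([round(slanted_triangular_lr(t, 1000), 5) for t in (0, 50, 100, 500, 999)])
```

In the same spirit, discriminative fine-tuning gives each layer its own learning rate, decreasing it layer by layer as you go down the network.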
- Aim for parallelism: RNNs are sequential.
- The Annotated Transformer
- Only Attention. Attention everywhere.
- Multi-head attention: Q, K, V have their dimensions reduced by applying W projection matrices; the multiple attention heads are then concatenated and piped through a final linear layer (see the sketch after this block).
- Transformer Block
- Multihead
- 2-layer FF NNet
- Residual connection
- Layer Normalization (LayerNorm)
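A condensed sketch of the multi-head attention piece of the block above (project Q, K, V with learned W matrices, attend per head with scaled dot products, concatenate the heads, apply a final linear layer); the default sizes just mirror the base Transformer:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.h, self.d_k = num_heads, d_model // num_heads
        # W_Q, W_K, W_V project to reduced per-head dimensions; W_O mixes the concatenated heads
        self.w_q, self.w_k, self.w_v, self.w_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # Project, then split into heads: (B, h, seq_len, d_k)
        q, k, v = [w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
                   for w, x in ((self.w_q, q), (self.w_k, k), (self.w_v, v))]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # scaled dot-product attention
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)   # concatenate heads, then the final linear layer

x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 512])
```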
- Bidirectional
- GPT is left-to-right
- ELMo is a bidirectional LM, but the two directions are trained separately (no attention between the forward and backward models)
- BERT bidirectional with attention
- Objective:
- Predict masked words (see the masking sketch after this group)
- Second loss function: next sentence prediction (classification: given SentA and SentB, is SentB the next sentence? Yes/No)
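A toy sketch of how inputs for the masked-word objective could be prepared (15% of tokens are selected; of those, 80% become [MASK], 10% a random token, 10% stay unchanged, as in the BERT paper); the tokenization and vocabulary here are stand-ins:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Return corrupted input tokens and the labels the model must predict."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                      # not predicted
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```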
- Attention is crucial for NMT (acts as a memory)
- Representation
- Same path length between all pieces (any two positions interact directly)
- Causality is added by masking future words in the decoder (see the sketch at the end of these notes)
- Residual connections carry position information
- Motifs
- Translation invariance
- How far apart you are when comparing things
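A small sketch of the decoder causality mask mentioned above: position i may only attend to positions <= i, implemented by filling the disallowed scores with -inf before the softmax:

```python
import torch

seq_len = 5
# Entry (i, j) is True iff decoder position i is allowed to attend to position j (j <= i)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.randn(seq_len, seq_len)                    # raw attention scores
scores = scores.masked_fill(~causal_mask, float('-inf'))  # block attention to future words
attn = torch.softmax(scores, dim=-1)                      # rows sum to 1 over allowed positions
print(attn)
```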