diff --git a/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Readme.md b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Readme.md
index 6e3e59c4..57270512 100644
--- a/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Readme.md
+++ b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Readme.md
@@ -7,4 +7,26 @@ the translation and as we know, with large sequences, the information tends to g
LSTMs and GRUs can help to overcome the vanishing gradient problem, but even those will fail to process long sequences.
-2.
\ No newline at end of file
+2. In a conventional encoder-decoder architecture, the model would again take T timesteps to compute the translation.
+
+
+## Transformers - Basics
+```
+TL;DR:
+1. In RNNs, parallel computation is difficult to implement.
+2. For long sequences, RNNs lose information.
+3. RNNs face the problem of vanishing gradients.
+4. The transformer architecture is the solution.
+```
+
+1. Transformers are based on attention and don't require any sequential computation per layer; only a single step is needed.
+2. Additionally, the number of gradient steps needed to get from the last output to the first input in a transformer is just one.
+3. Transformers therefore don't suffer from the vanishing-gradient problems that are tied to the length of the sequence.
+
+4. The transformer differs from a sequence-to-sequence model by using multi-head attention layers instead of recurrent layers.
+
+
+5. Transformers also use positional encoding to capture sequential information. The positional encoding outputs values that are added to the word embeddings, so every input word given to the model carries information about its order and position in the sequence.
+
+
+6. Unlike a recurrent layer, the multi-head attention layer computes the output for each input in the sequence independently, which allows the computation to be parallelized. However, it cannot model the order of the inputs on its own, which is why the positional encoding stage is incorporated into the transformer model (both are sketched below).
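+
+The snippet below is a minimal PyTorch sketch of the sinusoidal positional encoding described above. The sequence length of 10 and model dimension of 16 are arbitrary toy values chosen for illustration, not values taken from a particular model.
+
+```python
+import math
+
+import torch
+
+
+def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
+    """Return a (seq_len, d_model) tensor of sinusoidal positional encodings."""
+    position = torch.arange(seq_len).unsqueeze(1)                                    # (seq_len, 1)
+    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
+    pe = torch.zeros(seq_len, d_model)
+    pe[:, 0::2] = torch.sin(position * div_term)                                     # even dimensions
+    pe[:, 1::2] = torch.cos(position * div_term)                                     # odd dimensions
+    return pe
+
+
+# Toy word embeddings for a 10-token sequence; adding the encoding makes each
+# input carry information about its position in the sequence.
+embeddings = torch.randn(10, 16)
+inputs = embeddings + sinusoidal_positional_encoding(10, 16)
+```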
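+
+And here is a similarly minimal sketch of a multi-head attention layer, using PyTorch's built-in `nn.MultiheadAttention`; the sizes (16-dimensional model, 4 heads, batch of 2, 10 tokens) are again arbitrary toy values. It illustrates the point above: a single call produces the outputs for all positions at once, with no recurrence over timesteps.
+
+```python
+import torch
+import torch.nn as nn
+
+d_model, num_heads, seq_len, batch = 16, 4, 10, 2
+attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
+
+# Self-attention: query, key and value are all the same (order-aware) inputs,
+# e.g. the embeddings plus positional encodings from the previous sketch.
+x = torch.randn(batch, seq_len, d_model)
+out, weights = attn(x, x, x)          # one step, all positions computed in parallel
+
+print(out.shape)                      # torch.Size([2, 10, 16]) - one output per input position
+print(weights.shape)                  # torch.Size([2, 10, 10]) - each position attends over all positions
+```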
diff --git a/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/2. basic encoder-decoder.png b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/2. basic encoder-decoder.png
new file mode 100644
index 00000000..f322ae99
Binary files /dev/null and b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/2. basic encoder-decoder.png differ
diff --git a/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/3. transformer model.png b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/3. transformer model.png
new file mode 100644
index 00000000..03310820
Binary files /dev/null and b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/3. transformer model.png differ
diff --git a/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/4. multi-head attention.png b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/4. multi-head attention.png
new file mode 100644
index 00000000..be380235
Binary files /dev/null and b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/4. multi-head attention.png differ
diff --git a/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/5. positional encoding.png b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/5. positional encoding.png
new file mode 100644
index 00000000..3725ce96
Binary files /dev/null and b/Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/images/5. positional encoding.png differ