
\ifx\PREAMBLE\undefined
\input{preamble}
\begin{document}
\fi
\newcommand{\angless}[2]{#1^{\left\langle #2\right\rangle}}
\newcommand{\bracketss}[2]{#1^{\left[ #2\right]}}
\newcommand{\vect}[1]{\mathbf{\boldsymbol{#1}}}
\newcommand{\pcond}[2]{P\left(\left.#1\right\vert #2\right)}
\chapter{Sequence Models}
Examples of sequence model use cases:
\begin{itemize}
  \item Speech recognition: audio clip $\rightarrow$ transcript
  \item Music generation: genre of music $\rightarrow$ music clip
  \item Sentiment classification: ``there is nothing to like in this movie'' $\rightarrow$ one star
  \item DNA sequence analysis: DNA sequence $\rightarrow$ sequences encoding proteins
  \item Machine translation: French sentence $\rightarrow$ English sentence
  \item Video activity recognition: video of people running $\rightarrow$ running
  \item Named entity recognition: text $\rightarrow$ names in the text
\end{itemize}
Notation (using named entity recognition as an example):
\begin{itemize}
\item Input $x$: \textit{Harry Potter} and \textit{Hermione Granger} invented a new spell. $\angless{x}{t}$ is the $t^{th}$ word.
\item Output $y$ (whether each word is part of a name): 110110000. $\angless{y}{t}$ is the label for the $t^{th}$ word.
\item Length of the sequences $T_x=9, T_y=9$
\item Same notation as before for training example index: $X^{(i)\langle t\rangle}$, $Y^{(i)\langle t\rangle}$, $T_x^{(i)}$, $T_y^{(i)}$.
\item Content of $\angless{x}{t}$: define a vocabulary containing all words. Each word is represented by a one-hot vector (one at the index of the word in the vocabulary) whose dimension is the vocabulary size. The vocabulary contains an \textit{unknown} item to represent words not in the vocabulary.
\end{itemize}
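As a small illustration of the one-hot representation, assume (hypothetically) a vocabulary of $10{,}000$ words in which \textit{Harry} happens to sit at index $4075$; both numbers are made up for this sketch:
\[
\angless{x}{1}=\left[0,\ \cdots,\ 0,\ \underbrace{1}_{\text{index }4075},\ 0,\ \cdots,\ 0\right]^{\top}\in\mathbf{R}^{10000},\qquad \sum_{i}\angless{x}{1}_i=1
\]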
\section{Recurrent Neural Networks}
\subsection{Basic RNN}
\subsubsection{Problems of Standard NN}
\begin{itemize}
  \item Inputs and outputs of different training examples can have different lengths.
  \item Features learned at one position of the text are not shared with other positions.
  \item The number of parameters is large because the input (a sequence of one-hot vectors) is large.
\end{itemize}
\subsubsection{RNN Structure}
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input] (neuron\i) [right=of neuron\last] {$\cdots$};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{T_x}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{T_x}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{T_y}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{\i}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{\i}$};
    \fi
    \fi
  }

  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (input\i) -- (neuron\i);
    \draw[thick, -latex] (neuron\i) -- (output\i);
    \fi
  }
  \end{tikzpicture} 
\end{center}
\begin{itemize}
  \item The activation $\angless{a}{t}$ computed at step $t$ is fed as an input to step $t+1$, together with $\angless{x}{t+1}$.\footnote{Later items in the sequence are not used when predicting $\angless{\hat{y}}{t}$. Bidirectional RNNs (BRNN) address this limitation.}
  \item Parameters are shared among time steps.
  \item $\angless{a}{0}$ is a dummy activation for time step 0, usually the zero vector.
  \item Here we assume $T_x=T_y$. 
\end{itemize}
\subsubsection{Calculation}
\begin{align*}
\angless{a}{t}&=g_a\left(W_{aa}\angless{a}{t-1}+W_{ax}\angless{x}{t}+b_a\right)=g_a\left(W_{a}\left[\angless{a}{t-1}, \angless{x}{t}\right]+b_a\right)\\
\angless{\hat{y}}{t}&=g_y\left(W_{ya}\angless{a}{t}+b_y\right)=g_y\left(W_{y}\angless{a}{t}+b_y\right)
\end{align*}
\begin{itemize}
\item $W_a\equiv\left[W_{aa}, W_{ax}\right], W_y\equiv W_{ya}$
\item $g_a$ is usually $\tanh$ or ReLU.
\item $g_y$ depends on the type of output $\hat{y}$ (e.g. sigmoid for binary output, softmax for multi-class output).
\item The calculation of each unit can be summarized with the following figure:
\begin{center}
  \begin{tikzpicture}[
    rect/.style={rectangle, draw=black, thick, minimum size=5mm},
    noboundary/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[noboundary]  (at1)                                {$\angless{a}{t-1}$};
  \node[noboundary]  (prodat1waa)  [right=1cm of at1]     {$\bigotimes$};
  \node[noboundary]  (waa)         [below=of prodat1waa]  {$W_{aa}$};
  \node[noboundary]  (aplus)       [right=of prodat1waa]  {$\bigoplus$};
  \node[noboundary]  (prodxtwax)   [below=of aplus]       {$\bigotimes$};
  \node[noboundary]  (ba)          [above=of aplus]       {$b_a$};
  \node[noboundary]  (wax)         [right=of prodxtwax]   {$W_{ax}$};
  \node[noboundary]  (xt)          [below=of prodxtwax]   {$\angless{x}{t}$};
  \node[rect]        (tanh)        [right=of aplus]       {$\tanh$};
  \node[noboundary]  (at)          [right=of tanh]        {};
  \node[noboundary]  (atout)       [right=2cm of at]      {$\angless{a}{t}$};
  \node[noboundary]  (prodatwya)   [above=of at]          {$\bigotimes$};
  \node[noboundary]  (wya)         [right=of prodatwya]   {$W_{ya}$};
  \node[noboundary]  (yplus)       [above=of prodatwya]   {$\bigoplus$};
  \node[noboundary]  (by)          [right=of yplus]       {$b_y$};
  \node[rect]        (softmax)     [above=of yplus]       {softmax};
  \node[noboundary]  (yt)          [above=of softmax]     {$\angless{\hat{y}}{t}$};
  
  \draw[-latex, thick]  (waa) -- (prodat1waa);
  \draw[-latex, thick]  (at1) -- (prodat1waa);
  \draw[-latex, thick]  (prodat1waa) -- (aplus);
  \draw[-latex, thick]  (wax) -- (prodxtwax);
  \draw[-latex, thick]  (xt) -- (prodxtwax);
  \draw[-latex, thick]  (prodxtwax) -- (aplus);
  \draw[-latex, thick]  (ba) -- (aplus);
  \draw[-latex, thick]  (aplus) -- (tanh);
  \draw[-latex, thick]  (tanh) -- (atout);
  \draw[-latex, thick]  (at.center) -- (prodatwya);
  \draw[-latex, thick]  (wya) -- (prodatwya);
  \draw[-latex, thick]  (prodatwya) -- (yplus);
  \draw[-latex, thick]  (by) -- (yplus);
  \draw[-latex, thick]  (yplus) -- (softmax);
  \draw[-latex, thick]  (softmax) -- (yt);

  \draw[thick, red] (1.05, -1.7) rectangle (5, 1.7);
  \draw[thick, green] (1, -1.75) rectangle (7.6, 3.7);

  \end{tikzpicture} 
\end{center}
\end{itemize}
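To make the stacked notation concrete, here is a dimension check under assumed sizes: $n_a=100$ hidden units and a vocabulary of $n_x=10{,}000$ words (both values are illustrative, not prescribed by the model). In the product below, $\left[\angless{a}{t-1}, \angless{x}{t}\right]$ is read as the two vectors stacked vertically:
\begin{align*}
W_{aa}&\in\mathbf{R}^{100\times 100},\quad W_{ax}\in\mathbf{R}^{100\times 10000},\quad W_a=\left[W_{aa}, W_{ax}\right]\in\mathbf{R}^{100\times 10100}\\
W_a\begin{bmatrix}\angless{a}{t-1}\\ \angless{x}{t}\end{bmatrix}&=W_{aa}\angless{a}{t-1}+W_{ax}\angless{x}{t}\in\mathbf{R}^{100}
\end{align*}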
\subsubsection{Backpropagation through Time}
Loss function (binary cross-entropy at each time step, summed over the sequence):
\begin{align*}
  \angless{\mathcal{L}}{t}\left(\angless{\hat{y}}{t}, \angless{y}{t}\right)&=-\angless{y}{t}\log\angless{\hat{y}}{t}-\left(1-\angless{y}{t}\right)\log\left(1-\angless{\hat{y}}{t}\right)\\
  \mathcal{L}&=\displaystyle\sum_{t=1}^{T_x}\angless{\mathcal{L}}{t}\left(\angless{\hat{y}}{t}, \angless{y}{t}\right)
\end{align*}
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input] (neuron\i) [right=of neuron\last] {$\cdots$};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{T_x}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{T_x}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{T_y}$};
    \node[input]  (cost\i)    [above=of output\i]     {$\angless{\mathcal{L}}{T_y}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{\i}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{\i}$};
    \node[input]  (cost\i)    [above=of output\i]     {$\angless{\mathcal{L}}{\i}$};
    \fi
    \fi
  }
  \node[input]    (cost)      [above=of cost3]        {$\mathcal{L}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] ($(neuron\last.east) + (0, 0.1)$) -- ($(neuron\i.west) + (0, 0.1)$);
    \draw[thick, latex-, red] ($(neuron\last.east) + (0, -0.1)$) -- ($(neuron\i.west) + (0, -0.1)$);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (input\i) -- (neuron\i);
    \draw[thick, -latex] ($(neuron\i.north) + (-0.1,0)$) -- ($(output\i.south) + (-0.1,0)$);
    \draw[thick, latex-, red] ($(neuron\i.north) + (0.1,0)$) -- ($(output\i.south) + (0.1,0)$);
    \draw[thick, -latex] ($(output\i.north) + (-0.1,0)$) -- ($(cost\i.south) + (-0.1,0)$);
    \draw[thick, latex-, red] ($(output\i.north) + (0.1,0)$) -- ($(cost\i.south) + (0.1,0)$);
    \draw[thick, -latex] ($(cost\i.north) + (-0.1, 0)$) -- (cost.south);
    \draw[thick, latex-, red] ($(cost\i.north) + (0.1, 0)$) -- (cost.south);
    \draw[thick, -latex, red];
    \fi
  }
  \end{tikzpicture} 
\end{center}
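The red arrows in the figure can be written out as a recursion. The following is a sketch for the case $g_a=\tanh$ and a sigmoid $g_y$ with the loss above. The symbols $\angless{z_a}{t}$, $\angless{z_y}{t}$, $\angless{\delta_a}{t}$ and $\angless{\delta_y}{t}$ are helper notation introduced only for this sketch, and $\ast$ denotes element-wise multiplication:
\begin{align*}
\angless{z_a}{t}&=W_{aa}\angless{a}{t-1}+W_{ax}\angless{x}{t}+b_a,\quad \angless{a}{t}=\tanh\left(\angless{z_a}{t}\right),\quad \angless{z_y}{t}=W_{ya}\angless{a}{t}+b_y\\
\angless{\delta_y}{t}&\equiv\frac{\partial\mathcal{L}}{\partial\angless{z_y}{t}}=\angless{\hat{y}}{t}-\angless{y}{t}\\
\frac{\partial\mathcal{L}}{\partial\angless{a}{t}}&=W_{ya}^{\top}\angless{\delta_y}{t}+W_{aa}^{\top}\angless{\delta_a}{t+1},\qquad \angless{\delta_a}{T_x+1}\equiv\mathbf{0}\\
\angless{\delta_a}{t}&\equiv\frac{\partial\mathcal{L}}{\partial\angless{z_a}{t}}=\left(1-\angless{a}{t}\ast\angless{a}{t}\right)\ast\frac{\partial\mathcal{L}}{\partial\angless{a}{t}}\\
\frac{\partial\mathcal{L}}{\partial W_{ya}}&=\sum_{t=1}^{T_x}\angless{\delta_y}{t}\left(\angless{a}{t}\right)^{\top}\\
\frac{\partial\mathcal{L}}{\partial W_{aa}}&=\sum_{t=1}^{T_x}\angless{\delta_a}{t}\left(\angless{a}{t-1}\right)^{\top},\qquad
\frac{\partial\mathcal{L}}{\partial W_{ax}}=\sum_{t=1}^{T_x}\angless{\delta_a}{t}\left(\angless{x}{t}\right)^{\top}
\end{align*}
Because the parameters are shared among time steps, each parameter gradient is a sum of contributions from every step, and the recursion runs from $t=T_x$ down to $t=1$.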
\subsection{Different Types of RNN}
\subsubsection{Many-to-many}
The RNN we saw above is a many-to-many architecture, satisfying $T_x=T_y$:

\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input] (neuron\i) [right=of neuron\last] {$\cdots$};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{T_x}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{T_x}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{T_y}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{\i}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{\i}$};
    \fi
    \fi
  }

  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (input\i) -- (neuron\i);
    \draw[thick, -latex] (neuron\i) -- (output\i);
    \fi
  }
  \end{tikzpicture} 
\end{center}
\subsubsection{Many-to-one}
The assumption $T_x=T_y$ may not always hold. For example, sentiment classification is a many-to-one architecture:
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input] (neuron\i) [right=of neuron\last] {$\cdots$};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{T_x}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{T_x}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (input\i)   [below=of neuron\i]     {$\angless{x}{\i}$};
    \fi
    \fi
  }
  \node[input] (output) [above=of neuron5] {$\hat{y}$};
  \draw[thick, -latex] (neuron5) -- (output);
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (input\i) -- (neuron\i);
    \fi
  }
  \end{tikzpicture} 
\end{center}
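In equation form, a sketch consistent with the figure: the recurrence is unchanged and only the last activation is mapped to an output (reusing $g_y$, $W_{ya}$ and $b_y$ from the basic RNN):
\[\hat{y}=g_y\left(W_{ya}\angless{a}{T_x}+b_y\right)\]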
\subsubsection{One-to-many}
Music generation is a one-to-many architecture:
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input] (neuron\i) [right=of neuron\last] {$\cdots$};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{T_y}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{T_y}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{\i}$};
    \fi
    \fi
  }
  \node[input] (input) [below=of neuron1] {$x$};
  \draw[thick, -latex] (input) -- (neuron1);

  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (neuron\i) -- (output\i);
    \fi
  }
  \end{tikzpicture} 
\end{center}
In this architecture, the output $\angless{\hat{y}}{t}$ is often fed to step $t+1$ as its input:
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input]  (neuron\i) [right=of neuron\last]     {$\cdots$};
    \node[input]  (output\i)  [above=of neuron\i]       {};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{T_y}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{T_y}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{\i}$};
    \fi
    \fi
  }
  \node[input]  (input1)   [below=of neuron1]     {$x$};
  
  \draw[thick, -latex] (input1) -- (neuron1);
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (neuron\i) -- (output\i);
    \fi
    \ifnum\i<5
    \draw[thick, -latex, red] (output\i.east) .. controls +(360:5mm) and +(240:18mm) .. ($(neuron\next.south)+(-0.1,0)$);
    \fi
  }
  \end{tikzpicture} 
\end{center}
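Written as equations, the feedback drawn in red amounts to the following sketch, where the single input $x$ is consumed at the first step:
\[\angless{x}{1}=x,\qquad \angless{x}{t}=\angless{\hat{y}}{t-1}\quad\text{for } t=2,\cdots,T_y\]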
\subsubsection{One-to-one}
The one-to-one structure is simply a standard NN (not really an RNN):
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[neuron] (neuron)  {$\angless{a}{1}$};
  \node[input]  (input)   [below=of neuron] {$\angless{x}{1}$};
  \node[input]  (output)  [above=of neuron] {$\angless{\hat{y}}{1}$};
  
  \draw[thick, -latex] (input) -- (neuron);
  \draw[thick, -latex] (neuron) -- (output);
  \end{tikzpicture}
\end{center}
\subsubsection{Many-to-many (machine translation)}
Machine translation uses a special many-to-many architecture:
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]   (neuron0) {$\angless{a}{0}$};
  \node[neuron]  (neuron1) [right=of neuron0] {$\angless{a}{1}$};
  \node[neuron]  (neuron2) [right=of neuron1] {$\angless{a}{2}$};
  \node[input]   (neuron3) [right=of neuron2] {$\cdots$};
  \node[neuron]  (neuron4) [right=of neuron3] {$\angless{a}{T_x}$};
  \node[neuron]  (neuron5) [right=of neuron4] {$\angless{a'}{1}$};
  \node[neuron]  (neuron6) [right=of neuron5] {$\angless{a'}{2}$};
  \node[input]   (neuron7) [right=of neuron6] {$\cdots$};
  \node[neuron]  (neuron8) [right=of neuron7] {$\angless{a'}{T_y}$};

  \node[input]   (input1)  [below=of neuron1] {$\angless{x}{1}$};
  \node[input]   (input2)  [below=of neuron2] {$\angless{x}{2}$};
  \node[input]   (input4)  [below=of neuron4] {$\angless{x}{T_x}$};
  
  \node[input]   (output5) [above=of neuron5] {$\angless{\hat{y}}{1}$};
  \node[input]   (output6) [above=of neuron6] {$\angless{\hat{y}}{2}$};
  \node[input]   (output8) [above=of neuron8] {$\angless{\hat{y}}{T_y}$};
  

  \foreach \i [remember=\i as \last (initially 0)] in {1,...,8} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
  }
  \foreach \i in {1,2,4} {
    \draw[thick, -latex] (input\i) -- (neuron\i);
  }
  \foreach \i in {5,6,8} {
    \draw[thick, -latex] (neuron\i) -- (output\i);
  }
  \end{tikzpicture} 
\end{center}
It comprises an encoder (time steps $1$ to $T_x$), which reads the input sentence, and a decoder (time steps $T_x+1$ to $T_x+T_y$), which generates the translation.
\subsection{Language Model \& Sequence Generation}
A language model estimates the probability of a sentence:
\[P\left(\angless{y}{1},\angless{y}{2},\cdots, \angless{y}{T_y}\right)\]
In speech recognition, for example, on hearing the sentence \textit{the apple and pear salad}, a good system should assign
\[P(\textit{the apple and {\color{red}pear} salad})\gg P(\textit{the apple and {\color{red}pair} salad})\]
To build a language model using an RNN, the training set is a corpus\footnote{NLP terminology for a large body of text.} of text. Each sentence is tokenized into a sequence of tokens:
\begin{center}
  \begin{tabular}{ccccccccc}
    Cats & average & 15 & hours & of & sleep & a & day & $\langle\text{EOS}\rangle$\footnotemark\\
    $\angless{y}{1}$ & $\angless{y}{2}$ & $\angless{y}{3}$ & $\angless{y}{4}$ & $\angless{y}{5}$ & $\angless{y}{6}$ & $\angless{y}{7}$ & $\angless{y}{8}$ &  $\angless{y}{9}$\\
  \end{tabular}
\end{center}
\footnotetext{EOS means end of sentence.}
If a word does not belong to the vocabulary, it is replaced with a special token $\langle\text{UNK}\rangle$, which stands for an unknown word. After tokenization, we build an RNN to model the probability of different sequences:
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}=\mathbf{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \ifnum\i=4
    \node[input]  (neuron\i) [right=of neuron\last]     {$\cdots$};
    \node[input]  (output\i)  [above=of neuron\i]       {};
    \else\ifnum\i=5
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{9}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{9}$};
    \else
    \node[neuron] (neuron\i)  [right=of neuron\last]  {$\angless{a}{\i}$};
    \node[input]  (output\i)  [above=of neuron\i]     {$\angless{\hat{y}}{\i}$};
    \fi
    \fi
  }
  \node[input]  (input1)   [below=of neuron1]     {$\angless{x}{1}=\mathbf{0}$};
  
  \draw[thick, -latex] (input1) -- (neuron1);
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
    \ifnum\i=4
    \else
    \draw[thick, -latex] (neuron\i) -- (output\i);
    \fi
    \ifnum\i<5
    \draw[thick, -latex, red] (output\i.east) .. controls +(360:5mm) and +(240:18mm) .. ($(neuron\next.south)+(-0.1,0)$);
    \fi
  }
  \end{tikzpicture} 
\end{center}
\begin{itemize}
  \item $\angless{\hat{y}}{t}$ is a softmax output representing the conditional probability distribution of the $t^{th}$ word given the first $t-1$ words. It is a vector whose dimension is the vocabulary size; its $i^{th}$ entry is the probability that the $t^{th}$ word is $w_i$, the $i^{th}$ word of the vocabulary:
  \[\angless{\hat{y}}{t}_i=P\left(w_i\left\vert \angless{y}{1}\angless{y}{2}\cdots \angless{y}{t-1}\right.\right)\]
  \item The input $\angless{x}{t}$ is the previous token $\angless{y}{t-1}$ for $t\geq 2$; $\angless{x}{1}$ is the zero vector.
  \item Loss function (cross-entropy, with each $\angless{y}{t}$ a one-hot vector):
  \[\mathcal{L}=\sum_{t}\angless{\mathcal{L}}{t}\left(\angless{\hat{y}}{t},\angless{y}{t}\right)=-\sum_{t}\sum_{i}\angless{y}{t}_i\log\angless{\hat{y}}{t}_i\]
  \item After training the RNN on a training set, the obtained model can calculate the probability of a sentence:
  \begin{align*}
    P\left(\angless{y}{1},\angless{y}{2},\angless{y}{3}\right)&=P\left(\angless{y}{1}\right)\cdot P\left(\angless{y}{2}\left\vert \angless{y}{1}\right.\right)\cdot P\left(\angless{y}{3}\left\vert \angless{y}{1}\angless{y}{2}\right.\right)\\
    &=\angless{\hat{y}}{1}_{\angless{y}{1}}\cdot\angless{\hat{y}}{2}_{\angless{y}{2}}\cdot\angless{\hat{y}}{3}_{\angless{y}{3}}
  \end{align*}
  \item Character-level model: uses characters (including punctuation and spaces) instead of words as tokens. There is no unknown token, but sequences are much longer, so the model is worse at capturing long-range dependencies and more expensive to train.
  \item Sampling novel sequences: randomly sample a word according to the probability distribution given by $\angless{\hat{y}}{t}$, feed it to the next step as input, and repeat until $\langle\text{EOS}\rangle$ is sampled or a length limit is reached. This process produces a randomly generated sequence of words.
\end{itemize}
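As a worked example with made-up numbers, suppose that for a three-token sentence the trained model assigns probabilities $0.1$, $0.2$ and $0.5$ to the correct tokens at steps 1, 2 and 3. Because each $\angless{y}{t}$ is one-hot, the loss of the sentence equals the negative log of its probability (natural logarithm assumed):
\begin{align*}
  P\left(\angless{y}{1},\angless{y}{2},\angless{y}{3}\right)&=0.1\cdot 0.2\cdot 0.5=0.01\\
  \mathcal{L}&=-\log 0.1-\log 0.2-\log 0.5=-\log 0.01\approx 4.61
\end{align*}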
\subsection{Solving Vanishing Gradients}
Like very deep feed-forward networks, RNNs suffer from the vanishing gradients problem, which makes it hard for a basic RNN to capture long-range dependencies. Exploding gradients, which cause numerical overflow and unstable parameter updates during back-propagation, also occur, but they can be handled by gradient clipping.
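
One common form of gradient clipping rescales the gradient when its norm exceeds a threshold. In the sketch below, $g$ is the gradient of $\mathcal{L}$ with respect to all parameters and $\tau>0$ is a threshold; both symbols are introduced here for illustration, and clipping each component to a fixed range is another variant:
\[g\leftarrow g\cdot\min\left(1,\ \frac{\tau}{\lVert g\rVert_2}\right)\]
For example, with $\tau=5$ and $\lVert g\rVert_2=20$ the gradient is scaled by $0.25$, while with $\lVert g\rVert_2=3$ it is left unchanged.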
\subsubsection{Gated Recurrent Unit (GRU)}
\begin{itemize}
  \item Simplified GRU: 
  \begin{center}
    \begin{tikzpicture}[
      rect/.style={rectangle, draw=black, thick, minimum size=5mm},
      gate/.style={rectangle, draw=black, thick, minimum height=9mm, minimum width=1.2cm, align=center, font=\footnotesize},
      noboundary/.style={rectangle, thick, minimum size=7mm},
      circ/.style={circle, draw=black, thick, minimum size=7mm},
      node distance=5mm and 5mm
    ]
    \node[noboundary] (ct1)                                             {$\angless{c}{t-1}$};
    \node[noboundary] (ct1xt)           [right=1.5cm of ct1]                  {};
    \node[noboundary] (timesoldanchor)  [right=1cm of ct1]            {};
    \node[noboundary] (update)          [right=1cm of timesoldanchor]   {};
    \node[noboundary] (timesnewanchor)  [right=1cm of update]           {};
    \node[noboundary] (tanh)            [right=2cm of update]           {};
    \node[circ]       (timesold)        [above=1.5cm of timesoldanchor]   {$*$};
    \node[noboundary] (xt)              [below=of ct1xt]                {$\angless{x}{t}$};
    \node[gate]       (update_g)        [above=of update]               {update\\gate};
    \node[gate]       (tanh_g)          [above=of tanh]                 {tanh};
    \node[circ]       (ctplus)          [above=3cm of timesnewanchor]   {+};
    \node[noboundary] (softmaxanchor)   [right=1.5cm of ctplus]          {};
    \node[rect]       (softmax)         [above=of softmaxanchor]    {softmax};
    \node[noboundary] (ctout)           [right=1.5cm of softmaxanchor]  {$\angless{c}{t}$};
    \node[circ]       (timesnew)        [above=1.5cm of timesnewanchor]   {$*$};
    \node[noboundary] (yt)              [above=of softmax]              {$\angless{\hat{y}}{t}$};
    \draw[thick, -latex] (xt) -- (ct1xt.center);
    \draw[thick, -latex] (ct1) -| node[near end, left, scale=0.7]{$\sigma$} node[near end, right, scale=0.7]{$\makecell{ W_u\\b_u}$}(update_g);
    \draw[thick, -latex] (ct1) -| node[near end, left, scale=0.7]{$\tanh$}  node[near end, right, scale=0.7]{$\makecell{ W_c\\b_c}$}(tanh_g);
    \draw[thick, -latex] (update_g.north) |- node[near end, above, scale=0.7]{$\angless{\Gamma_u}{t}$} (timesnew.west);
    \draw[thick, -latex] (tanh_g.north) |- node[near end, above, scale=0.7]{$\angless{\tilde{c}}{t}$}(timesnew.east);
    \draw[thick, -latex] (timesnew) -- (ctplus);
    \draw[thick, -latex] (ctplus) -- (ctout);
    \draw[thick] (softmaxanchor.center) -- (softmax);  
    \draw[thick, -latex] (softmax) -- (yt);
    \draw[thick, -latex] (timesoldanchor.center) -- (timesold);
    \draw[thick, -latex] (update_g.north) |- node[near end, above, scale=0.7]{$1-\angless{\Gamma_u}{t}$} (timesold.east);
    \draw[thick, -latex] (timesold.north) |- (ctplus.west);
    
    \draw[thick, red] (1.05, -0.5) rectangle (8.95, 4.3);
    \draw[thick, green] (1, -0.55) rectangle (9, 5.3);

    \end{tikzpicture} 
  \end{center}
  \begin{align*}
    \angless{\tilde{c}}{t}&=\tanh\left(W_c\left[\angless{c}{t-1},\angless{x}{t}\right]+b_c\right)\\
    \Gamma_u&=\sigma\left(W_u\left[\angless{c}{t-1},\angless{x}{t}\right]+b_u\right)\\
    \angless{c}{t}&=\Gamma_u\ast \angless{\tilde{c}}{t}+\left(1-\Gamma_u\right)\ast \angless{c}{t-1}
  \end{align*}
  \begin{itemize}
    \item $c$ means memory cell. For GRU, $\angless{a}{t}=\angless{c}{t}$.
    \item $\angless{\tilde{c}}{t}$ is a candidate to replace the old value $\angless{c}{t-1}$.
    \item $\Gamma_u$ is an update gate controlling whether the old value $\angless{c}{t-1}$ is replaced by $\angless{\tilde{c}}{t}$. For most inputs, $\Gamma_u$ (a sigmoid output) is close to 1 or 0. When it is close to 1, the replacement is carried out. When it is close to 0, the old value is kept, i.e. $\angless{c}{t}\approx\angless{c}{t-1}$.
    \item GRU mitigates the vanishing gradients problem because, with $\Gamma_u\approx 0$, the memory cell value is carried almost unchanged across many steps, allowing long-range dependencies to be captured.
    \item $\angless{c}{t}, \angless{\tilde{c}}{t}, \Gamma_u$ are vectors of the same dimension. $\ast$ is element-wise multiplication.
  \end{itemize}
  \item Full GRU:
  \begin{center}
    \begin{tikzpicture}[
      rect/.style={rectangle, draw=black, thick, minimum size=5mm},
      gate/.style={rectangle, draw=black, thick, minimum height=9mm, minimum width=1.2cm, align=center, font=\footnotesize},
      noboundary/.style={rectangle, thick, minimum size=7mm},
      circ/.style={circle, draw=black, thick, minimum size=7mm},
      node distance=5mm and 5mm
    ]
    \node[noboundary] (ct1)                                             {$\angless{c}{t-1}$};
    \node[noboundary] (ct1xt)           [right=1.5cm of ct1]            {};
    \node[noboundary] (timesoldanchor)  [right=1cm of ct1]              {};
    \node[noboundary] (update)          [right=1cm of timesoldanchor]   {};
    \node[noboundary] (relevance)       [right=1cm of update]           {};
    \node[circ]       (timesnewanchor)  [right=1cm of relevance]        {$*$};
    \node[noboundary] (tanh)            [right=1cm of timesnewanchor]   {};
    \node[circ]       (timesold)        [above=1.5cm of timesoldanchor] {$*$};
    \node[noboundary] (xt)              [below=of ct1xt]                {$\angless{x}{t}$};
    \node[gate]       (update_g)        [above=of update]               {update\\gate};
    \node[gate]       (relevance_g)     [above=of relevance]            {relevance\\gate};
    \node[gate]       (tanh_g)          [above=of tanh]                 {tanh};
    \node[circ]       (ctplus)          [above=3cm of timesnewanchor]   {+};
    \node[noboundary] (softmaxanchor)   [right=1.5cm of ctplus]         {};
    \node[rect]       (softmax)         [above=of softmaxanchor]        {softmax};
    \node[noboundary] (ctout)           [right=1.5cm of softmaxanchor]  {$\angless{c}{t}$};
    \node[circ]       (timesnew)        [above=1.5cm of timesnewanchor]               {$*$};
    \node[noboundary] (yt)              [above=of softmax]              {$\angless{\hat{y}}{t}$};
    \draw[thick, -latex] (xt) -| (update_g);
    \draw[thick, -latex] (xt) -| (relevance_g);
    \draw[thick, -latex] (xt) -| (tanh_g);
    \draw[thick, -latex] (ct1) -| node[near end, left, scale=0.7]{$\sigma$} node[near end, right, scale=0.7]{$\makecell{W_u\\b_u}$}(update_g);
    \draw[thick, -latex] (ct1) -| node[near end, left, scale=0.7]{$\sigma$} node[near end, right, scale=0.7]{$\makecell{W_r\\b_r}$}(relevance_g);
    \draw[thick, -latex] (ct1) -- (timesnewanchor);
    \draw[thick, -latex] (update_g.north) |- node[near end, above, scale=0.7]{$\angless{\Gamma_u}{t}$} (timesnew.west);
    \draw[thick, -latex] (tanh_g.north) |- node[near end, above, scale=0.7]{$\angless{\tilde{c}}{t}$}(timesnew.east);
    \draw[thick, -latex] (timesnew) -- (ctplus);
    \draw[thick, -latex] (ctplus) -- (ctout);
    \draw[thick] (softmaxanchor.center) -- (softmax);  
    \draw[thick, -latex] (softmax) -- (yt);
    \draw[thick, -latex] (timesoldanchor.center) -- (timesold);
    \draw[thick, -latex] (update_g.north) |- node[near end, above, scale=0.7]{$1-\angless{\Gamma_u}{t}$} (timesold.east);
    \draw[thick, -latex] (timesold.north) |- (ctplus.west);
    \draw[thick, -latex] (relevance_g) -| node[near end, right, scale=0.7]{$\angless{\Gamma_r}{t}$} (timesnewanchor);
    \draw[thick, -latex] (timesnewanchor) -| node[near end, left, scale=0.7]{$\tanh$} node[near end, right, scale=0.7]{$\makecell{W_c\\b_c}$}(tanh_g);
    
    \draw[thick, red] (1.05, -0.5) rectangle (9.6, 4.3);
    \draw[thick, green] (1, -0.55) rectangle (10.2, 5.3);

    \end{tikzpicture} 
  \end{center}
  \begin{align*}
    \angless{\tilde{c}}{t}&=\tanh\left(W_c\left[\Gamma_r\ast \angless{c}{t-1},\angless{x}{t}\right]+b_c\right)\\
    \Gamma_u&=\sigma\left(W_u\left[\angless{c}{t-1},\angless{x}{t}\right]+b_u\right)\\
    \Gamma_r&=\sigma\left(W_r\left[\angless{c}{t-1},\angless{x}{t}\right]+b_r\right)\\
    \angless{c}{t}&=\Gamma_u\ast \angless{\tilde{c}}{t}+\left(1-\Gamma_u\right)\ast \angless{c}{t-1}
  \end{align*}
  \begin{itemize}
    \item $\Gamma_r$ is a relevance gate controlling how much $\angless{c}{t-1}$ contributes to the computation of the candidate $\angless{\tilde{c}}{t}$.
    \item The addition of $\Gamma_r$ is a result of research practice: researchers converged on it after trying many architecture variants.
  \end{itemize}
\end{itemize}
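A rough numerical illustration of why the update gate helps, under simplifying assumptions made only for this sketch: a scalar memory cell, gates treated as constants, and illustrative values. Along the direct path $\angless{c}{t-1}\rightarrow\angless{c}{t}$ the derivative is $1-\Gamma_u$, so over $k$ steps:
\begin{align*}
  \frac{\partial\angless{c}{t+k}}{\partial\angless{c}{t}}&\approx\prod_{s=t+1}^{t+k}\left(1-\angless{\Gamma_u}{s}\right)\\
  &\approx 0.99^{50}\approx 0.605\quad\text{for }\Gamma_u=0.01,\ k=50
\end{align*}
By comparison, a per-step factor of $0.5$ (a magnitude assumed here for a saturating basic RNN) gives $0.5^{50}\approx 8.9\times 10^{-16}$ over the same 50 steps.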
\subsubsection{Long Short-Term Memory (LSTM)}
\begin{center}
  \begin{tikzpicture}[
    rect/.style={rectangle, draw=black, thick, minimum size=5mm},
    gate/.style={rectangle, draw=black, thick, minimum height=9mm, minimum width=1.2cm, align=center, font=\footnotesize},
    noboundary/.style={rectangle, thick, minimum size=7mm},
    circ/.style={circle, draw=black, thick, minimum size=7mm},
    node distance=5mm and 5mm
  ]
  \node[noboundary] (at1)                                             {$\angless{a}{t-1}$};
  \node[noboundary] (at1xt)           [right=of at1]                  {};
  \node[noboundary] (forget)          [right=1cm of at1xt]            {};
  \node[noboundary] (update)          [right=1cm of forget]           {};
  \node[noboundary] (timesnewanchor)  [right=1mm of update]           {};
  \node[noboundary] (tanh)            [right=1cm of update]           {};
  \node[noboundary] (output)          [right=1cm of tanh]             {};
  \node[noboundary] (xt)              [below=of at1xt]                {$\angless{x}{t}$};
  \node[gate]       (forget_g)        [above=of forget]               {forget\\gate};
  \node[gate]       (update_g)        [above=of update]               {update\\gate};
  \node[gate]       (tanh_g)          [above=of tanh]                 {tanh};
  \node[gate]       (output_g)        [above=of output]               {output\\gate};
  \node[noboundary] (ct1)             [above=4cm of at1]              {$\angless{c}{t-1}$};
  \node[circ]       (timesold)        [above=4cm of forget]           {$*$};
  \node[circ]       (ctplus)          [above=4cm of timesnewanchor]   {+};
  \node[noboundary] (ctfirst)         [above=4cm of output]           {};
  \node[noboundary] (ctout)           [right=2.5cm of ctfirst]        {$\angless{c}{t}$};
  \node[circ]       (timesnew)        [above=2cm of timesnewanchor]   {$*$};
  \node[circ]       (timesout)        [above=2cm of output]           {$*$};
  \node[noboundary] (atout)           [right=2.5cm of timesout]       {$\angless{a}{t}$};
  \node[rect]       (tanhout)         [below=3mm of ctfirst]          {tanh};
  \node[noboundary] (softmaxanchor)   [right=3mm of timesout]             {};
  \node[rect]       (softmax)         [above=3cm of softmaxanchor]    {softmax};
  \node[noboundary] (yt)              [above=of softmax]              {$\angless{\hat{y}}{t}$};
  \draw[thick, -latex] (xt) -- (at1xt.center);
  \draw[thick, -latex] (at1) -| node[near end, left, scale=0.7]{$\sigma$} node[near end, right, scale=0.7]{$\makecell{ W_f\\b_f}$}(forget_g);
  \draw[thick, -latex] (at1) -| node[near end, left, scale=0.7]{$\sigma$} node[near end, right, scale=0.7]{$\makecell{ W_u\\b_u}$}(update_g);
  \draw[thick, -latex] (at1) -| node[near end, left, scale=0.7]{$\tanh$}  node[near end, right, scale=0.7]{$\makecell{ W_c\\b_c}$}(tanh_g);
  \draw[thick, -latex] (at1) -| node[near end, left, scale=0.7]{$\sigma$} node[near end, right, scale=0.7]{$\makecell{ W_o\\b_o}$}(output_g);
  \draw[thick, -latex] (ct1) -- (timesold);
  \draw[thick, -latex] (forget_g) -- node[near start, left, scale=0.7]{$\angless{\Gamma_f}{t}$}(timesold);
  \draw[thick, -latex] (timesold) -- (ctplus);
  \draw[thick, -latex] (update_g.north) |- node[near start, left, scale=0.7]{$\angless{\Gamma_u}{t}$}(timesnew.west);
  \draw[thick, -latex] (tanh_g.north) |- node[near start, left, scale=0.7]{$\angless{\tilde{c}}{t}$}(timesnew.east);
  \draw[thick, -latex] (timesnew) -- (ctplus);
  \draw[thick, -latex] (ctplus) -| (tanhout);
  \draw[thick, -latex] (ctplus) -- (ctout);
  \draw[thick, -latex] (tanhout) -- (timesout);
  \draw[thick, -latex] (output_g) -- node[near start, left, scale=0.7]{$\angless{\Gamma_o}{t}$}(timesout);
  \draw[thick, -latex] (timesout) -- (atout);
  \draw[thick] (softmaxanchor.center) -- ($(softmaxanchor.center)+(0, 1.75)$);
  \draw[thick, dotted] ($(softmaxanchor.center)+(0, 1.75)$) -- ($(softmaxanchor.center)+(0, 2.25)$); 
  \draw[thick, -latex] ($(softmaxanchor.center)+(0, 2.25)$) -- (softmax);  
  \draw[thick, -latex] (softmax) -- (yt);
  
  \draw[thick, red] (1.05, -0.5) rectangle (9.1, 5.5);
  \draw[thick, green] (1, -0.55) rectangle (10.2, 6.8);
  
  \end{tikzpicture} 
\end{center}
\begin{align*}
  \angless{\tilde{c}}{t}&=\tanh\left(W_c\left[\angless{a}{t-1},\angless{x}{t}\right]+b_c\right)\\
  \Gamma_u&=\sigma\left(W_u\left[\angless{a}{t-1},\angless{x}{t}\right]+b_u\right)\\
  \Gamma_f&=\sigma\left(W_f\left[\angless{a}{t-1},\angless{x}{t}\right]+b_f\right)\\
  \Gamma_o&=\sigma\left(W_o\left[\angless{a}{t-1},\angless{x}{t}\right]+b_o\right)\\
  \angless{c}{t}&=\Gamma_u\ast \angless{\tilde{c}}{t}+\Gamma_f \ast \angless{c}{t-1}\\
  \angless{a}{t}&=\Gamma_o\ast \tanh\left(\angless{c}{t}\right)
\end{align*}
\begin{itemize}
  \item $\Gamma_u$: update gate. $\Gamma_f$: forget gate. $\Gamma_o$: output gate.
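  \item Peephole connection: a variant in which the previous cell state $\angless{c}{t-1}$ is also fed into the gate calculations, alongside $\angless{a}{t-1}$ and $\angless{x}{t}$.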
\end{itemize}
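As a hedged numerical illustration of the update equations above, consider a single scalar LSTM unit. The gate and state values below are made up for the example rather than computed from real weights: $\angless{c}{t-1}=1$, $\angless{\tilde{c}}{t}=-0.5$, $\Gamma_u=0.2$, $\Gamma_f=0.9$, $\Gamma_o=0.5$.
\begin{align*}
  \angless{c}{t}&=\Gamma_u\ast \angless{\tilde{c}}{t}+\Gamma_f \ast \angless{c}{t-1}=0.2\times(-0.5)+0.9\times 1=0.8\\
  \angless{a}{t}&=\Gamma_o\ast \tanh\left(\angless{c}{t}\right)=0.5\times\tanh(0.8)\approx 0.5\times 0.664=0.332
\end{align*}
With $\Gamma_f$ close to 1 and $\Gamma_u$ close to 0, most of the old cell state is carried over, which is how the LSTM can keep a memory across many steps.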
\subsection{Complex RNNs}
Using the standard RNN, GRU and LSTM units as building blocks, we can build more complex RNN structures.
\subsubsection{Bidirectional RNN (BRNN)}
The RNNs above are uni-directional: the prediction at step $t$ uses no information from later steps. A BRNN solves this problem by also computing activations in the reverse direction, from the end of the sequence to the beginning.
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron0) {$\angless{a}{0}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,8} {
    \pgfmathtruncatemacro{\index}{(\i+1)/2}
    \ifodd\i
    \node[neuron] (neuron\i) [right=of neuron\last]{$\angless{\overleftarrow{a}}{\index}$};
    \node[input] (input\index) [below right=0.5cm and -0.3cm of neuron\i] {$\angless{x}{\index}$};
    \node[input] (output\index) [above right=0.5cm and -0.3cm of neuron\i] {$\angless{\hat{y}}{\index}$};
    \else
    \node[neuron] (neuron\i) [right=0.1cm of neuron\last]{$\angless{\overrightarrow{a}}{\index}$};
    \fi
  }
  \node[input] (neuron9) [right=of neuron8] {$\angless{a}{5}$};
  \foreach \i [remember=\i as \last (initially 0)] in {1,...,4} {
    \pgfmathtruncatemacro{\righti}{\i*2}
    \pgfmathtruncatemacro{\lefti}{\i*2-1}
    \pgfmathtruncatemacro{\lastrighti}{\i*2-2}
    \pgfmathtruncatemacro{\nextlefti}{\i*2+1}
    \draw[thick, -latex, red]   (input\i) -- (neuron\lefti);
    \draw[thick, -latex, green] (input\i) -- (neuron\righti);
    \draw[thick, -latex, red]   (neuron\lefti) -- (output\i);
    \draw[thick, -latex, green] (neuron\righti) -- (output\i);
    \draw[thick, -latex, red]   (neuron\nextlefti.south) to [bend left=20] (neuron\lefti.south);
    \draw[thick, latex-, green] (neuron\righti.south) to [bend left=20] (neuron\lastrighti.south);
  }
  \end{tikzpicture}
\end{center}
Now the softmax output at each step becomes:
\begin{align*}
  \angless{\hat{y}}{t}=g\left(W_y\left[\angless{\overrightarrow{a}}{t},\angless{\overleftarrow{a}}{t}\right]+b_y\right)
\end{align*}
\subsubsection{Deep RNN}
\newcommand{\drnnunit}[3]{#1^{\left[#2\right]\left\langle #3\right\rangle}}
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]  (neuron10) {$\drnnunit{a}{1}{0}$};
  \node[input]  (neuron20) [above=of neuron10] {$\drnnunit{a}{2}{0}$};
  \node[input]  (neuron30) [above=of neuron20] {$\drnnunit{a}{3}{0}$};
  
  \foreach \i [remember=\i as \lasti (initially 0)] in {1,2,3} {
    \foreach \j [remember=\j as \lastj (initially 0)] in {1,2,3,4} {
      \node[neuron] (neuron\i\j) [right=of neuron\i\lastj] {$\drnnunit{a}{\i}{\j}$};
      \draw[thick, -latex] (neuron\i\lastj) -- (neuron\i\j);
    }
  }
  \foreach \j in {1,2,3,4} {
    \node[input] (neuron4\j) [above=of neuron3\j] {$\angless{\hat{y}}{\j}$};
    \foreach \i in {1,2,3} {
      \pgfmathtruncatemacro{\nexti}{\i+1}
      \draw[thick, -latex] (neuron\i\j) -- (neuron\nexti\j);
    }
  }
  \end{tikzpicture}
\end{center}
\begin{itemize}
  \item To calculate the activation:
  \begin{align*}
    \drnnunit{a}{l}{t}=g\left(\bracketss{W_a}{l}\left[\drnnunit{a}{l}{t-1},\drnnunit{a}{l-1}{t}\right]+\bracketss{b_a}{l}\right)
  \end{align*}
  \item $\bracketss{W_a}{l},\bracketss{b_a}{l}$ are shared by all time steps within layer $l$.
  \item Unlike a CNN, a DRNN with $n_l=3$ is already considered deep, because the temporal dimension already makes the unrolled network large.
  \item Optionally, further layers can be added along the vertical direction in the figure above (i.e. new layers without the horizontal (temporal) connections).
\end{itemize}
\section{Word Embeddings}
\newcommand{\embedding}[1]{\mathbf{e}_{#1}}
\newcommand{\onehot}[1]{\mathbf{o}_{#1}}
\subsection{What Are Word Embeddings}
\begin{itemize}
  \item Words were represented as one-hot vectors $\onehot{i}$, whose dimension is the vocabulary size. This representation views each word as an isolated entity, failing to capture the similarities between words, e.g. orange and apple are both fruits, King and Queen are both royal, man and woman are both human, etc. 
  \item A featurized representation called \textit{word embedding}\footnote{Each word is embedded in the n-dimensional space formed by the features, and hence the name.} that gives each word a series of feature values solves the problem. 
  \item The features can be learned by ML, yet the learned features are generally not easy to interpret. Assuming that interpretable features do exist, the learned embeddings are probably their linear combinations.
  \item \textit{t-SNE} is a technique used to visualize word embeddings on a 2D plane via non-linear dimensionality reduction.
\end{itemize}
\subsubsection{Transfer Learning}
\begin{itemize}
  \item Word embedding can be applied effectively to some NLP applications in combination with transfer learning:
    \begin{enumerate}
      \item Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
      \item Transfer the embedding to a new task with a smaller training set (100k words).
      \item Optional: fine-tune the embedding with new data.
    \end{enumerate}
  \item Transfer learning + word embedding works for: named entity recognition, text summarization, co-reference resolution, parsing. Less useful for: language modeling, machine translation.
  \item Word embedding is similar to the Siamese network used in face encoding. The difference is that in face encoding, the NN outputs an encoding for any image, even one it has never seen before, whereas word embedding outputs a result only for words in the vocabulary.
\end{itemize}
\subsubsection{Analogy Reasoning}
Analogy reasoning answers questions like: man is to woman as king is to what? (The answer is, of course, queen.) The answer can be found using word embedding:
\begin{align*}
  w=\argmax_{w} sim\left(\embedding{w}, \embedding{king} - \embedding{man} + \embedding{woman}\right)
\end{align*}
in which $\embedding{w}$ is the embedding of $w$. The most widely used similarity function is cosine similarity: 
\begin{align*}
  sim\left(\mathbf{u},\mathbf{v}\right)=\frac{\mathbf{u}^{\mathsf{T}}\mathbf{v}}{\Vert\mathbf{u}\Vert_2\Vert\mathbf{v}\Vert_2}
\end{align*}
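As a toy illustration of analogy reasoning with cosine similarity, assume hypothetical 2-dimensional embeddings whose features are (gender, royalty). The values are made up for the example and are not learned embeddings:
\begin{align*}
  \embedding{man}&=(-1,0)^{\mathsf{T}}, & \embedding{woman}&=(1,0)^{\mathsf{T}}, & \embedding{king}&=(-1,1)^{\mathsf{T}}, & \embedding{queen}&=(1,1)^{\mathsf{T}}
\end{align*}
Then $\embedding{king}-\embedding{man}+\embedding{woman}=(1,1)^{\mathsf{T}}$, and
\begin{align*}
  sim\left(\embedding{queen},(1,1)^{\mathsf{T}}\right)&=\frac{1\times 1+1\times 1}{\sqrt{2}\cdot\sqrt{2}}=1\\
  sim\left(\embedding{woman},(1,1)^{\mathsf{T}}\right)&=\frac{1\times 1+0\times 1}{1\cdot\sqrt{2}}\approx 0.707
\end{align*}
so \textit{queen} has the highest similarity and is selected.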
\subsubsection{Embedding Matrix}
Stacking the embeddings of the whole vocabulary gives us the embedding matrix:
\begin{align*}
  E=\left[\embedding{1}, \embedding{2}, \cdots, \embedding{n_v}\right]
\end{align*}
Obviously we have
\begin{align*}
  E\cdot \onehot{i}=\embedding{i}
\end{align*}
because $\onehot{i}$ is the unit vector whose only non-zero entry is at index $i$, so the product selects the $i^{th}$ column of $E$.
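As a small sanity check with made-up numbers, take a hypothetical vocabulary of $n_v=3$ words with 2-dimensional embeddings:
\begin{align*}
  E\cdot\onehot{2}=\begin{pmatrix}1 & 3 & 5\\ 2 & 4 & 6\end{pmatrix}\begin{pmatrix}0\\1\\0\end{pmatrix}=\begin{pmatrix}3\\4\end{pmatrix}=\embedding{2}
\end{align*}
Since all but one entry of $\onehot{i}$ are zero, an implementation would typically read out column $i$ of $E$ directly rather than carry out the full matrix multiplication.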
\subsection{Learning Word Embeddings}
A feasible method to learn word embeddings is to build a language model. Suppose we are building a language model using NN to predict the next word following a sequence.
\begin{center}
  \begin{tabular}{ccccccc}
    I & want & a & glass & of & orange & \underline{\phantom{juice}} \\
    4343 & 9665 & 1 & 3852 & 6163 & 6257 & 
  \end{tabular}
\end{center}
Suppose $E$ is the embedding matrix (300$\times$10000, 300-dimension encoding, 10000-word vocabulary). We can build the following NN, in which $E$ becomes parameters, to solve the problem:
\begin{center}
  \begin{tabular}{cccccc}
    I & $\onehot{4343}$ & $\xrightarrow{E}$ & $\embedding{4343}$ & \multirow{6}{*}{$\boxed{\makecell{\circ\\\circ\\\circ\\\circ}}$} & \multirow{6}{*}{softmax}\\  
    want & $\onehot{9665}$ & $\xrightarrow{E}$ & $\embedding{9665}$ \\
    a & $\onehot{1}$ & $\xrightarrow{E}$ & $\embedding{1}$  \\
    glass & $\onehot{3852}$ & $\xrightarrow{E}$ & $\embedding{3852}$ \\ 
    of & $\onehot{6163}$ & $\xrightarrow{E}$ & $\embedding{6163}$  \\
    orange & $\onehot{6257}$ & $\xrightarrow{E}$ & $\embedding{6257}$ \\
  \end{tabular}
\end{center}
Here we are using a \textit{historical window} of size 6, i.e. we would like to predict a word according to the preceding 6 words. The NN takes a 6$\times$300=1800 dimensional input. Besides the embedding matrix, it also has parameters $\bracketss{W}{1,2}, \bracketss{b}{1,2}$ for the NN layer and the softmax output.

The language model built above uses the preceding 6 words as the context of the prediction and the following word as the target of it. Other choices are available. For example, the 4 words on the left \& right, the preceding 1 word, 1 word nearby, etc.
\subsubsection{Word2Vec (Skip-Grams Model)}
\begin{itemize}
  \item Model construction: a word is randomly selected as the context, and another word near it (within a $\pm n$ window, where $n$ could be 5, 10, etc.) is randomly selected as the target. Such context-target pairs form a training set for a supervised learning problem. The goal is to learn good word embeddings, not necessarily to do well on the supervised learning problem per se.
\begin{align*}
  \onehot{c}\xrightarrow{E}\embedding{c}=E\cdot\onehot{c}\xrightarrow{softmax}\hat{y}
\end{align*}
\item Output and loss function of the softmax:
\begin{align*}
  p\left(t\vert c\right)&=\frac{e^{\vect{\theta}_{t}^{\mathsf{T}}\embedding{c}}}{\displaystyle\sum_{j=1}^{n_v}e^{\vect{\theta}_{j}^{\mathsf{T}}\embedding{c}}}\\
  \mathcal{L}\left(\hat{y},y\right)&=-\displaystyle\sum_{i=1}^{n_v}y_i\log\hat{y}_i
\end{align*}
$\vect{\theta}_{t}$ is the parameter vector associated with output word $t$. There is one such vector for each word in the vocabulary.
\item Efficiency issue: the softmax classification is slow because its denominator sums over the whole vocabulary. \textit{Hierarchical softmax classification} solves the problem by completing the classification in multiple steps following a binary-tree style structure, which reduces the average time complexity of the calculation from $O\left(n_v\right)$ to $O\left(\log\left(n_v\right)\right)$\footnote{In practice the binary tree is not necessarily perfectly balanced.}.
\item Context selection: in practice, the context is not sampled uniformly at random from the corpus. Various heuristics are used to keep common words like ``the, of, a'' from being selected too frequently.
\end{itemize}
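As a worked example of the softmax output and its loss, assume a hypothetical vocabulary of $n_v=3$ words and made-up scores $\vect{\theta}_{1}^{\mathsf{T}}\embedding{c}=2$, $\vect{\theta}_{2}^{\mathsf{T}}\embedding{c}=1$, $\vect{\theta}_{3}^{\mathsf{T}}\embedding{c}=0$:
\begin{align*}
  p\left(1\vert c\right)&=\frac{e^2}{e^2+e^1+e^0}\approx\frac{7.389}{11.107}\approx 0.665\\
  p\left(2\vert c\right)&\approx 0.245, \qquad p\left(3\vert c\right)\approx 0.090
\end{align*}
If the true target is word 1, i.e. $y=(1,0,0)^{\mathsf{T}}$, the cross-entropy loss is $-\log 0.665\approx 0.408$ (natural logarithm). The denominator has one term per vocabulary word, which is the source of the efficiency issue for a realistic $n_v$.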
\subsubsection{Negative Sampling}
The negative sampling algorithm creates a new supervised learning problem whose goal is to tell whether a given pair of words forms a context-target pair. For each context word, a group of training examples contains 1 positive example and $k$ negative examples\footnote{$k=5\sim 20$ for small training sets, $2\sim 5$ for big training sets, as proposed by the original authors.}.
\begin{center}
  \begin{tabular}{ccc}
    context ($c$) & word ($t$) & target ($y$) \\
    orange & juice & 1\\
    orange & king & 0\\
    orange & book & 0\\
    orange & the & 0\\
    orange & of & 0
  \end{tabular}
\end{center}
Instead of one giant softmax output, the model has one binary logistic regression classifier for every word $t$ in the vocabulary:
\begin{align*}
  P\left(y=1|c,t\right)=\sigma\left(\vect{\theta}_{t}^{\mathsf{T}}\embedding{c}\right)
\end{align*}
With the choice of training examples above, each group of examples trains only $k+1$ of these logistic regressions instead of updating all $n_v$ outputs.

Sampling methods of the negative examples:
\begin{itemize}
  \item According to empirical frequencies of the words: $f\left(w_i\right)$. Problem: a lot of ``the, of, a, and''.
  \item Uniformly among vocabulary words: $\frac{1}{n_v}$. Problem: does not reflect real-world distribution of words.
  \item Mixture: $\frac{f\left(w_i\right)^{3/4}}{\sum_{j=1}^{n_v}f\left(w_j\right)^{3/4}}$ proposed by the original authors.
\end{itemize}
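To see the effect of the $3/4$ power, assume a hypothetical three-word vocabulary whose words appear 81, 16 and 1 times in the corpus (counts chosen so that the powers are integers). Since the normalizing constant cancels, the counts can be used in place of the frequencies, and $81^{3/4}+16^{3/4}+1^{3/4}=27+8+1=36$:
\begin{align*}
  \text{empirical: }& \frac{81}{98}\approx 0.827, & &\frac{16}{98}\approx 0.163, & &\frac{1}{98}\approx 0.010\\
  \text{mixture: }& \frac{27}{36}=0.750, & &\frac{8}{36}\approx 0.222, & &\frac{1}{36}\approx 0.028
\end{align*}
The most frequent word is sampled less often than its empirical frequency and the rarest word more often, without going all the way to the uniform $\frac{1}{3}$.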
\subsubsection{GloVe}
GloVe: global vectors for word representation. Let $X_{ij}$ represent the number of times $j$ appears in $i$'s context.
The target of the algorithm is to minimize 
\begin{align*}
  \displaystyle\sum_{i=1}^{n_v}\displaystyle\sum_{j=1}^{n_v}f\left(X_{ij}\right)\left(\vect{\theta}_{i}^{\mathsf{T}}\embedding{j}+b_i+b'_j-\log X_{ij}\right)^2
\end{align*}
$f\left(X_{ij}\right)$ is a weight term that
\begin{itemize}
  \item Handles the case $X_{ij}=0$, where $\log X_{ij}$ is undefined. In that case $f\left(X_{ij}\right)=0$ and the term is dropped from the sum.
  \item Balances weights of high-frequency words and low-frequency words.
\end{itemize}
In this algorithm, $\vect{\theta}_{i}$ and $\embedding{j}$ play symmetric roles. Their average can be used as the final embedding:
\begin{align*}
  \embedding{i}^{final}=\frac{\embedding{i}+\vect{\theta}_{i}}{2}
\end{align*}
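One possible weight function, which to our knowledge is the one used in the original GloVe paper (with $x_{max}=100$ and $\alpha=3/4$), is
\[f\left(x\right)=\begin{cases}
  \left(x/x_{max}\right)^{\alpha} & \text{, if }x<x_{max} \\
  1 & \text{, otherwise}
\end{cases}\]
With these values, $f(0)=0$, $f(1)=0.01^{3/4}\approx 0.032$ and $f(100)=f(1000)=1$: pairs that never co-occur drop out of the sum, rare pairs are down-weighted, and very frequent pairs are capped so that they do not dominate.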
\subsection{Applications}
\subsubsection{Sentiment Classification}
From customers' comments on a restaurant, classify the sentiment on a scale of $1\sim 5$ stars.
\begin{itemize}
  \item Simple softmax model: average embeddings of all words in a comment and apply softmax to it. 
  \begin{itemize}
  \item Problem: ignores order of words. For example: \textit{Completely lacking in good taste, good service, and good ambience} will be classified as a high-star comment for all the \textit{good}s, but is actually a 1-star comment. 
  \end{itemize}
  \item Use many-to-one RNN: use the words as the input sequence of RNN, output a sentiment in the end.
\end{itemize}
\subsubsection{Debiasing Word Embeddings}
Word embeddings can reflect gender, ethnicity, age, sexual orientation and other biases of the text used to train the model. These biases should be reduced.
\begin{enumerate}
  \item Identify bias direction. For gender bias, we can obtain the bias direction by averaging a series of difference vectors between gendered pairs of words: $\embedding{he}-\embedding{she}$, $\embedding{male}-\embedding{female}$, etc.
  \[\vect{g}=\frac{1}{n_p}\displaystyle\sum_{p=1}^{n_p}\left(\embedding{w_{p1}}-\embedding{w_{p2}}\right)\]
  \item Neutralize: for every non-definitional word (examples of definitional words: he, she, grandfather, grandmother, etc.), project its embedding onto the non-bias directions (i.e. into the sub-space perpendicular to the bias direction) to get rid of the bias, similar to PCA. Denoting the projection of a vector $\vect{\alpha}$ onto the bias direction as $\vect{\alpha}_B=\frac{\vect{\alpha}^{\mathsf{T}}\vect{g}}{\Vert\vect{g}\Vert^2}\vect{g}$, we have
  \[\vect{e}_{w\perp}=\embedding{w}-\vect{e}_{wB}=\embedding{w}-\frac{\embedding{w}^{\mathsf{T}}\vect{g}}{\Vert\vect{g}\Vert^2}\vect{g}\]
  \item Equalize pairs: a series of linear algebra operations that move pairs of words in the embedding space so that their distances to supposedly neutral words are the same. For example, the distance between babysitter and grandfather/grandmother should be the same. 
  \begin{align*}
    \vect{\mu}&=\frac{\embedding{w1}+\embedding{w2}}{2}\\
    \vect{\mu}_{\perp}&=\vect{\mu}-\vect{\mu}_B\\
    \embedding{w1}^{corrected}&=\vect{\mu}_{\perp}+\sqrt{1-\Vert\vect{\mu}_{\perp}\Vert^2}\frac{\left(\embedding{w1}-\vect{\mu}\right)_B}{\Vert\embedding{w1}-\vect{\mu}\Vert}=\vect{\mu}_{\perp}+\lambda\vect{g}\\
    \embedding{w2}^{corrected}&=\vect{\mu}_{\perp}+\sqrt{1-\Vert\vect{\mu}_{\perp}\Vert^2}\frac{\left(\embedding{w2}-\vect{\mu}\right)_B}{\Vert\embedding{w2}-\vect{\mu}\Vert}=\vect{\mu}_{\perp}-\lambda\vect{g}
  \end{align*}
  in which constant\footnote{Unclear why in this form.}
  \[\lambda=\frac{\sqrt{1-\Vert\vect{\mu}_{\perp}\Vert^2}}{\Vert\vect{g}\Vert^2}\frac{\left(\embedding{w1}-\embedding{w2}\right)^{\mathsf{T}}\vect{g}}{\Vert\embedding{w1}-\embedding{w2}\Vert}\]
\end{enumerate}
Definitional words (which should not be neutralized) and pairs that should be equalized are generally few, so they can be chosen by hand.
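As a small illustration of the neutralization step, assume a hypothetical 2-dimensional embedding space with made-up values $\vect{g}=(2,0)^{\mathsf{T}}$ and $\embedding{w}=(0.3,0.4)^{\mathsf{T}}$:
\begin{align*}
  \vect{e}_{wB}&=\frac{\embedding{w}^{\mathsf{T}}\vect{g}}{\Vert\vect{g}\Vert^2}\vect{g}=\frac{0.6}{4}\,(2,0)^{\mathsf{T}}=(0.3,0)^{\mathsf{T}}\\
  \vect{e}_{w\perp}&=\embedding{w}-\vect{e}_{wB}=(0,0.4)^{\mathsf{T}}
\end{align*}
The neutralized embedding satisfies $\vect{e}_{w\perp}^{\mathsf{T}}\vect{g}=0$, i.e. it no longer has any component along the bias direction.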
\section{Sequence to Sequence Models}
\subsection{Basic Models}
\subsubsection{Machine Translation}
Machine translation uses an encoder network and a decoder network.
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]   (neuron0) {$\angless{a}{0}$};
  \node[neuron]  (neuron1) [right=of neuron0] {$\angless{a}{1}$};
  \node[neuron]  (neuron2) [right=of neuron1] {$\angless{a}{2}$};
  \node[input]   (neuron3) [right=of neuron2] {$\cdots$};
  \node[neuron]  (neuron4) [right=of neuron3] {$\angless{a}{T_x}$};
  \node[neuron]  (neuron5) [right=of neuron4] {$\angless{a'}{1}$};
  \node[neuron]  (neuron6) [right=of neuron5] {$\angless{a'}{2}$};
  \node[input]   (neuron7) [right=of neuron6] {$\cdots$};
  \node[neuron]  (neuron8) [right=of neuron7] {$\angless{a'}{T_y}$};

  \node[input]   (input1)  [below=of neuron1] {$\angless{x}{1}$};
  \node[input]   (input2)  [below=of neuron2] {$\angless{x}{2}$};
  \node[input]   (input4)  [below=of neuron4] {$\angless{x}{T_x}$};
  
  \node[input]   (output5) [above=of neuron5] {$\angless{y}{1}$};
  \node[input]   (output6) [above=of neuron6] {$\angless{y}{2}$};
  \node[input]   (output7) [above=of neuron7] {};
  \node[input]   (output8) [above=of neuron8] {$\angless{y}{T_y}$};
  

  \foreach \i [remember=\i as \last (initially 0)] in {1,...,8} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
  }
  \foreach \i in {1,2,4} {
    \draw[thick, -latex] (input\i) -- (neuron\i);
  }
  \foreach \i in {5,6,8} {
    \draw[thick, -latex] (neuron\i) -- (output\i);
  }
  \foreach \i in {5,6,7} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (output\i.east) .. controls +(360:5mm) and +(240:18mm) .. ($(neuron\next.south)+(-0.1,0)$);
  }
  \draw[dashed, red] (6, -1.5) -- (6, 1.5);
  \end{tikzpicture} 
\end{center}
\begin{itemize}
  \item The decoder network in the machine translation architecture is similar to the language model architecture, except that its initial activation is the output of the encoder network, i.e. the encoded representation of the input sentence, instead of a zero vector. It is a conditional language model that models the output translation conditioned on the input sentence.
  \item From a probabilistic point of view, this conditional model captures the conditional probability distribution of the output translation given the input sentence:
  \[\pcond{\angless{y}{1}\cdots\angless{y}{T_y}}{\angless{x}{1}\cdots\angless{x}{T_x}}\]
  \item In a language model, we randomly sampled the output according to the probability distribution. For machine translation, we would like to output the best translation, i.e. to obtain
  \[\displaystyle\argmax_{\angless{y}{1}\cdots\angless{y}{T_y}} \pcond{\angless{y}{1}\cdots\angless{y}{T_y}}{\angless{x}{1}\cdots\angless{x}{T_x}}\]
  \item Greedy search, which picks the single most probable word at each step, does not in general find the most probable whole sentence. Other search algorithms should be used.
\end{itemize}
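A small made-up example shows why greedy search can miss the best translation. Assume a hypothetical two-word vocabulary $\{A,B\}$ and $T_y=2$, with first-word probabilities $0.6$ for $A$ and $0.4$ for $B$. After $A$ the second word is $A$ or $B$ with probability $0.5$ each, and after $B$ the second word is $A$ with probability $0.9$ and $B$ with probability $0.1$. The four possible outputs then have probabilities
\begin{align*}
  P(AA\vert x)&=0.6\times 0.5=0.30, & P(AB\vert x)&=0.6\times 0.5=0.30\\
  P(BA\vert x)&=0.4\times 0.9=0.36, & P(BB\vert x)&=0.4\times 0.1=0.04
\end{align*}
Greedy search commits to $A$ as the first word and can reach at most $0.30$, whereas the most probable output is $BA$ with $0.36$.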
\subsubsection{Image Captioning}
Image captioning can be completed using a similar structure. A pre-trained image encoder is used as the encoder network, whose output feature vector is fed into the decoder network. 
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node[input]   (neuron0) {Image};
  \node[neuron]  (neuron1) [right=of neuron0] {$\makecell{\circ\\\circ\\\circ}$};
  \node[neuron]  (neuron2) [right=of neuron1] {$\angless{a'}{1}$};
  \node[neuron]  (neuron3) [right=of neuron2] {$\angless{a'}{2}$};
  \node[input]   (neuron4) [right=of neuron3] {$\cdots$};
  \node[neuron]  (neuron5) [right=of neuron4] {$\angless{a'}{T_y}$};
  
  \node[input]   (output2) [above=of neuron2] {$\angless{y}{1}$};
  \node[input]   (output3) [above=of neuron3] {$\angless{y}{2}$};
  \node[input]   (output4) [above=of neuron4] {};
  \node[input]   (output5) [above=of neuron5] {$\angless{y}{T_y}$};
  

  \foreach \i [remember=\i as \last (initially 0)] in {1,...,5} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (neuron\last) -- (neuron\i);
  }
  \foreach \i in {2,3,5} {
    \draw[thick, -latex] (neuron\i) -- (output\i);
  }
  \foreach \i in {2,3,4} {
    \pgfmathtruncatemacro{\next}{\i+1}
    \draw[thick, -latex] (output\i.east) .. controls +(360:5mm) and +(240:18mm) .. ($(neuron\next.south)+(-0.1,0)$);
  }
  \draw[dashed, red] (1.8, -1.5) -- (1.8, 1.5);
  \end{tikzpicture} 
\end{center}
\subsection{Beam Search}
Beam search is a variation of greedy search that considers more options than the most probable one at each step. It's a heuristic algorithm that runs fast but does not guarantee finding the optimal solution. We will illustrate beam search with the machine translation example.
\begin{itemize}
  \item Parameter $B$, i.e. the \textit{beam width} controls the number of options considered at each step. 
  \item Denote the combination of the encoder network and the first $k$ time steps of the decoder network as $M_k$. $M_k$ constitutes a conditional language model that outputs
\[\angless{P}{k}=\pcond{\angless{y}{k}}{x,\angless{y}{1}\cdots\angless{y}{k-1}}\]
  By repeatedly applying the chain rule of probability we have:
  \begin{align*}
    \pcond{\angless{y}{1}\cdots\angless{y}{k}}{x}&=\pcond{\angless{y}{k}}{x,\angless{y}{1}\cdots\angless{y}{k-1}}\pcond{\angless{y}{1}\cdots\angless{y}{k-1}}{x}\\
    &=\angless{P}{k}\pcond{\angless{y}{1}\cdots\angless{y}{k-1}}{x}\\
    &=\angless{P}{k}\angless{P}{k-1}\pcond{\angless{y}{1}\cdots\angless{y}{k-2}}{x}=\cdots\\
    &=\displaystyle\prod_{t=1}^k\angless{P}{t}
  \end{align*}
  \item The operation of beam search:
  \begin{itemize}
    \item Step 1: we store the $B$ values of $\angless{y}{1}$ corresponding to the $B$ largest components of $M_1$'s softmax output, i.e. the $B$ most probable first words under $\pcond{\angless{y}{1}}{x}$.
    \item Step 2: we run $B$ copies of $M_2$, each taking one of the $\angless{y}{1}$ values from step 1 as the input to the second decoder step. Each of the $B\times n_v$ softmax output components is multiplied by the probability of its $\angless{y}{1}$ to obtain $\pcond{\angless{y}{1}\angless{y}{2}}{x}$. We keep the $B$ combinations $\angless{y}{1}\angless{y}{2}$ with the largest joint probabilities.
    \item In general, at step $k$ $(k\ge 2)$, we run $B$ copies of $M_k$, multiply each of the $B\times n_v$ softmax output components by the stored probability of its prefix, and keep the $B$ combinations $\angless{y}{1}\cdots\angless{y}{k}$ with the largest $\pcond{\angless{y}{1}\cdots\angless{y}{k}}{x}$.
  \end{itemize}
  \item When $B=1$, beam search degenerates to greedy search. A larger $B$ tends to give better results but is slower and uses more computational resources. A smaller $B$ is faster and cheaper but tends to give worse results.
\end{itemize}
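The steps above can be traced on a made-up example with beam width $B=2$ and a hypothetical three-word vocabulary $\{a,b,c\}$. Suppose the first-step probabilities are $0.5$, $0.3$, $0.2$ for $a$, $b$, $c$, so step 1 keeps $a$ and $b$. Suppose further that the second-step softmax outputs are $(0.4,0.3,0.3)$ after $a$ and $(0.1,0.8,0.1)$ after $b$. Step 2 then compares the $B\times n_v=6$ joint probabilities:
\begin{align*}
  P(aa\vert x)&=0.5\times 0.4=0.20, & P(ab\vert x)&=0.5\times 0.3=0.15, & P(ac\vert x)&=0.5\times 0.3=0.15\\
  P(ba\vert x)&=0.3\times 0.1=0.03, & P(bb\vert x)&=0.3\times 0.8=0.24, & P(bc\vert x)&=0.3\times 0.1=0.03
\end{align*}
and keeps $bb$ and $aa$. With $B=1$ only $a$ would have survived step 1, and the search would have ended with $aa$ ($0.20$) instead of $bb$ ($0.24$).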
\subsubsection{Refinements}
\begin{itemize}
\item Prevent underflow: $\angless{P}{k}$ is usually a small positive value. Multiplying a lot of $\angless{P}{k}$ results in an even smaller value, possibly causing numerical underflow. The solution is to take the logarithm of the probability values:
\[\displaystyle\argmax_{y}\displaystyle\sum_{t=1}^{T_y}\log\angless{P}{t}=\displaystyle\argmax_{y}\displaystyle\prod_{t=1}^{T_y}\angless{P}{t}\]
\item Length normalization: the model above favors short translations because appending more words always reduces the probability. Thus length normalization should be applied:
\[\displaystyle\argmax_{y}\frac{1}{T_y^{\alpha}}\displaystyle\sum_{t=1}^{T_y}\log\angless{P}{t}\]
which is called the \textit{normalized log likelihood objective}. $\alpha$ is a tunable hyperparameter\footnote{The use of $\alpha$ is a practical hack without theoretical justification.} that can take values from 0 (no normalization) to 1 (full normalization).
\end{itemize} 
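To illustrate length normalization with made-up numbers, suppose beam search must choose between a hypothetical short candidate with $T_y=2$ and $\log\angless{P}{t}=-1.0$ at each step (sum $-2.0$), and a longer candidate with $T_y=4$ and $\log\angless{P}{t}=-0.6$ at each step (sum $-2.4$):
\begin{center}
  \begin{tabular}{cccc}
    $\alpha$ & short candidate & long candidate & preferred \\
    0 & $-2.0/2^{0}=-2.00$ & $-2.4/4^{0}=-2.40$ & short\\
    0.7 & $-2.0/2^{0.7}\approx-1.23$ & $-2.4/4^{0.7}\approx-0.91$ & long\\
    1 & $-2.0/2=-1.00$ & $-2.4/4=-0.60$ & long
  \end{tabular}
\end{center}
Without normalization the short candidate wins only because it has fewer factors, although each of its words is less probable on average.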
\subsubsection{Error Analysis}
In case beam search fails to get a satisfactory translation, we can carry out some error analysis for improvement. Suppose $y^*$ is the human-level translation whilst $\hat{y}$ is the translation obtained by beam search. 
\begin{itemize}
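  \item $\pcond{y^*}{x}>\pcond{\hat{y}}{x}$: problem with beam search, which failed to find the more probable $y^*$. Tune $B$.
  \item $\pcond{y^*}{x}\le\pcond{\hat{y}}{x}$: problem with the RNN, which assigns the worse translation at least as high a probability as the better one.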
\end{itemize} 
\subsection{BLEU Score}
\begin{itemize}
  \item Sometimes multiple good solutions (references) exist for machine translation. The BLEU (\textit{bilingual evaluation understudy}) score is a single-number measure of accuracy in such cases.
  \item Define precision as
  \[\frac{\text{\# words in output that appear in references}}{\text{\# words in output}}\]
  It is not a good measure for accuracy because a dummy output such as \textit{the the the the the the the} tends to have high precision.
  \item Define modified precision as  
  \[\frac{\text{clipped \# words in output that appear in references}}{\text{\# words in output}}\]
  in which ``clipped'' means that each word is credited at most as many times as it appears in any single reference, i.e. its count is capped at its maximum count over the references. For example, for the two references \textit{The cat is on the mat} and \textit{There is a cat on the mat}, the clipped \# of \textit{the} is 2, and the modified precision of \textit{the the the the the the the} becomes $\frac{2}{7}$.
  \item Similar modified precisions can be defined for bi-grams, tri-grams, $\cdots$, $n$-grams.
  \begin{align*}
    p_1&=\frac{\displaystyle\sum_{\text{unigram}\in\hat{y}}Count_{clip}(\text{unigram})}{\displaystyle\sum_{\text{unigram}\in\hat{y}}Count(\text{unigram})}\\
    p_n&=\frac{\displaystyle\sum_{n\text{-gram}\in\hat{y}}Count_{clip}(n\text{-gram})}{\displaystyle\sum_{n\text{-gram}\in\hat{y}}Count(n\text{-gram})}
  \end{align*}
  \item The BLEU score is defined as the brevity penalty times the geometric mean of $p_1,\cdots,p_4$:
  \[\text{BP}\cdot\exp\left(\frac{1}{4}\displaystyle\sum_{n=1}^4\log p_n\right)\]
  The brevity penalty factor BP is defined as 
  \[\text{BP}=\begin{cases}
    1 & \text{, if }L_{MT}>L_R \\
    \exp\left(1-L_{R}/L_{MT}\right) & \text{, otherwise }
  \end{cases}\]
  in which $L_R$ is the reference output length, and $L_{MT}$ is the machine translation length. BP is added because short translations tend to have higher $p_n$ values and hence should be penalized.
\end{itemize}
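As a worked example of the modified bigram precision $p_2$, take the references \textit{The cat is on the mat} and \textit{There is a cat on the mat}, and a hypothetical machine translation output \textit{The cat the cat on the mat}. The output contains 6 bigrams:
\begin{center}
  \begin{tabular}{lcc}
    bigram & $Count$ & $Count_{clip}$ \\
    the cat & 2 & 1\\
    cat the & 1 & 0\\
    cat on & 1 & 1\\
    on the & 1 & 1\\
    the mat & 1 & 1
  \end{tabular}
\end{center}
\textit{the cat} appears at most once in any single reference, so its count is clipped to 1, and \textit{cat the} appears in no reference. Hence $p_2=\frac{1+0+1+1+1}{2+1+1+1+1}=\frac{4}{6}=\frac{2}{3}$.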
\subsection{Attention Model}
When the sentence to translate is long, it's hard for the encoder network to store all of its information in the single activation passed to the decoder network. The attention model mimics how humans translate a long sentence part by part. Here we use a BRNN as the encoder to illustrate the idea, so the activation at step $t'$ of the encoder network is $\angless{a}{t'}=\left(\angless{\overrightarrow{a}}{t'},\angless{\overleftarrow{a}}{t'}\right)$\footnote{The prime is added to distinguish the time steps of the encoder network from those of the decoder network.}.

\begin{center}
\begin{tikzpicture}[
  neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
  input/.style={rectangle, thick, minimum size=5mm},
  node distance=5mm and 5mm
]
\node[input]  (enneuron0) {$\angless{a}{0}$};
\foreach \i [remember=\i as \last (initially 0)] in {1,...,4} {
  \node[neuron] (enneuron\i)  [right=1cm of enneuron\last]  {$\angless{a}{\i}$};
  \node[input]  (input\i)     [below=of enneuron\i]     {$\angless{x}{\i}$};
  \node[input]  (add\i)       [above=1.5cm of enneuron\i]     {$\bigoplus$};
  \node[input]  (context\i)   [above=of add\i]          {$\angless{c}{\i}$};
  \node[neuron] (deneuron\i)  [above=of context\i]      {$\angless{s}{\i}$};
  \node[input]  (output\i)    [above=of deneuron\i]     {$\angless{y}{\i}$};
}
\node[input]  (deneuron0)     [left=of deneuron1]     {$\angless{s}{0}$};
\node[input]  (enneuron5)     [right=of enneuron4]    {$\angless{a}{5}$};

\foreach \i [remember=\i as \last (initially 0)] in {1,...,4} {
  \pgfmathtruncatemacro{\next}{\i+1}
  \draw[thick, -latex] (input\i) -- (enneuron\i);
  \draw[thick, -latex] (context\i) -- (deneuron\i); 
  \draw[thick, -latex] (deneuron\i) -- (output\i);
  \draw[thick, -latex] (deneuron\last) -- (deneuron\i);  
  \draw[thick, -latex] (enneuron\last) -- (enneuron\i);
  \draw[thick, -latex] (enneuron\next.west) -- (enneuron\i.east);
  \draw[thick, -latex] (add\i) -- (context\i);
  \draw[thick, red, -latex] (enneuron\i) -- node[near start]{$\angless{\alpha}{1,\i}$} (add1);
  \ifnum\i > 1
    \draw[thick, -latex] (output\last.east) .. controls +(360:5mm) and +(240:18mm) .. ($(deneuron\i.south)+(-0.1,0)$);
    \foreach \j in {1,...,4} {
      \draw[ultra thin, -latex] (enneuron\j) -- (add\i);
    }
  \fi
}
\end{tikzpicture} 
\end{center}
\begin{itemize}
  \item $\angless{\alpha}{t,t'}$ is the amount of attention $\angless{y}{t}$ pays to $\angless{a}{t'}$. It's a softmax combination of factors $\angless{e}{t,t'}$. Obviously $\sum_{t'}\angless{\alpha}{t,t'}=1$ for any given $t$.
  \[
    \angless{\alpha}{t,t'}=\frac{\exp\left(\angless{e}{t,t'}\right)}{\sum_{\tau=1}^{T_x}\exp\left(\angless{e}{t,\tau}\right)}
  \]
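\item Factor $\angless{e}{t,t'}$ is computed by a small neural network whose inputs are the previous decoder state $\angless{s}{t-1}$ and the encoder activation $\angless{a}{t'}$. Its weights are learned together with the rest of the model. The network is kept small (one layer) because it is evaluated for every pair $(t,t')$.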
\begin{center}
  \begin{tikzpicture}[
    neuron/.style={rectangle, draw=black, thick, minimum size=5mm},
    input/.style={rectangle, thick, minimum size=5mm},
    node distance=5mm and 5mm
  ]
  \node [input] (inputs) {$\angless{s}{t-1}$};
  \node [input] (void) [below=of inputs] {};
  \node [input] (inputa) [below=of void] {$\angless{a}{t'}$};
  \node [neuron] (neuron) [right=of void] {$\makecell{\circ\\\circ\\\circ}$};
  \node [input] (output) [right=of neuron] {$\angless{e}{t,t'}$};
  \draw [thick, -latex] (inputs) -- (neuron);
  \draw [thick, -latex] (inputa) -- (neuron);
  \draw [thick, -latex] (neuron) -- (output);
  \end{tikzpicture}
\end{center}
\item The context $\angless{c}{t}$, i.e. the input fed to the decoder network, is a combination of all encoder activations weighted by the attentions:
\[\angless{c}{t}=\displaystyle\sum_{t'=1}^{T_x}\angless{\alpha}{t,t'}\angless{a}{t'}\]
\item The algorithm takes quadratic time $O\left(T_x\cdot T_y\right)$, which is generally acceptable because sentences are not that long in machine translation.
\end{itemize}
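As a worked example of the attention weights and the context, assume a hypothetical input of length $T_x=2$ with made-up values $\angless{e}{1,1}=\log 3$, $\angless{e}{1,2}=0$, $\angless{a}{1}=(1,0)^{\mathsf{T}}$ and $\angless{a}{2}=(0,1)^{\mathsf{T}}$:
\begin{align*}
  \angless{\alpha}{1,1}&=\frac{3}{3+1}=0.75, \qquad \angless{\alpha}{1,2}=\frac{1}{3+1}=0.25\\
  \angless{c}{1}&=0.75\,\angless{a}{1}+0.25\,\angless{a}{2}=(0.75,0.25)^{\mathsf{T}}
\end{align*}
The weights sum to 1, and the first decoder step attends mostly to the first encoder activation.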
\subsection{Speech Recognition}
\begin{itemize}
  \item A speech recognition application transforms an audio clip into the corresponding transcript.
  \item A microphone records small variations in air pressure over time, which human ears perceive as sound. An audio clip can be thought of as a long list of numbers measuring these pressure changes. The sampling rate, measured in \textit{Hz}, is the number of such samples per second.
  \item Instead of directly using raw audio data, we can use the spectrogram calculated from the raw data for learning. The spectrogram can optionally pass through a 1D convolutional layer, which plays a similar role to that of the 2D conv layer in image processing, i.e. it extracts low-level features of the spectrogram. It also reduces the dimension of the data before feeding it to the RNN.
  \item It's much easier and faster to record some positive / negative words and some random background noises (also available from the internet for free) and use them to synthesize training data than to record all training examples manually. Synthesized data is also easier to label.
  \item Linguists once believed that breaking audio clips down into ``phonemes'' was the best way to do speech recognition. End-to-end deep learning has shown this to be unnecessary.
  \item CTC (connectionist temporal classification) cost: a many-to-many RNN with equal numbers of input and output units is used to recognize speech, but usually the input contains many more items than the output (\# input timestamps $\gg$ \# letters in output). CTC cost makes the RNN generate output with repeated letters and a special blank character, e.g. \textit{ttt\_h\_eee\_\_\textvisiblespace\_\_\_qqq\_\_}. Repeated characters not separated by the blank character are collapsed to obtain the actual output, thus the output above is considered a correct prefix of the expected result \textit{the quick brown fox}.
  \item Trigger word detection: target label is set to 1 right after the trigger word, and 0 elsewhere, which results in an unbalanced training set (\#0 $\gg$ \#1). A practical solution is to output a few 1s instead of a single 1 after the trigger word.
\end{itemize}
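The CTC collapsing rule can be traced on the example output above in two steps, assuming the usual reading of the rule: first merge each run of identical characters (including runs of blanks) into one character, then delete the blank characters.
\begin{center}
  \textit{ttt\_h\_eee\_\_\textvisiblespace\_\_\_qqq\_\_} $\rightarrow$ \textit{t\_h\_e\_\textvisiblespace\_q\_} $\rightarrow$ \textit{the\textvisiblespace q}
\end{center}
A genuine double letter in the transcript survives only if the RNN emits a blank between the two copies, which is why the blank character is distinct from the space character \textit{\textvisiblespace}.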

\ifx\PREAMBLE\undefined
\end{document}
\fi