```
# Run this first, a bit of setup for the rest of the lab
import numpy as np

def softmax(x, axis=0):
    """ Calculate softmax function for an array x along specified axis

    axis=0 calculates softmax across rows which means each column sums to 1
    axis=1 calculates softmax across columns which means each row sums to 1
    """
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)
```

## 1 Introduction

As of 2023, the Transformer architecture has been behind many recent advances in deep learning model performance, in areas including Natural Language Processing and Computer Vision. An **Attention** mechanism is a key part of the Transformer architecture. Attention was first introduced by Bahdanau et al. (2014) as a method for improving seq2seq language models.

In this article we will look at this first use of an attention mechanism, as proposed by Bahdanau et al. (2014), and implement it in NumPy.

Attention allows a seq2seq decoder to use information from each encoder step instead of just the final encoder hidden state. In the attention operation, the encoder outputs are weighted based on the decoder hidden state, then combined into one context vector. This vector is then used as input to the decoder to predict the next output step.

## 2 Machine translation and the ‘Information Bottleneck’

The traditional seq2seq model was introduced by Google in 2014, and it was a breakthrough at the time for Machine Translation of text from one language to another. It takes one sequence of items, such as words, as input and produces another sequence as output. It does this by mapping variable-length sequences to a fixed-length memory which, in machine translation, encodes the overall meaning of a sentence. For example, a text of any length can be encoded into a vector of fixed dimension, such as 300. This feature is what made the model so effective for machine translation. Additionally, the inputs and outputs don’t need to have matching lengths, which is a desirable feature when translating text.

In a seq2seq model, you have an encoder and a decoder. The encoder takes word tokens as input and returns its final hidden state as output.

This hidden state is used by the decoder to generate the translated sentence in the target language.

One major limitation of the traditional seq2seq model is what’s referred to as the **information bottleneck**. Since seq2seq uses a fixed length memory for the hidden states, long sequences become problematic. This is due to the fact that in traditional seq2seq models, only a fixed amount of information can be passed from the encoder to the decoder no matter how much information is contained in the input sequence.
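The bottleneck can be seen in a few lines of NumPy. The sketch below is only an illustration: it uses a toy, untrained tanh RNN with made-up sizes (`embed_size`, `hidden_size`, and the weight names are assumptions, not part of the original lab). It shows that the final hidden state has the same size whether the input has 5 tokens or 50, while the per-step states grow with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_size, hidden_size = 8, 4  # illustrative sizes, not taken from the article

# Randomly initialised weights of a toy (untrained) tanh RNN encoder
W_xh = rng.standard_normal((embed_size, hidden_size))
W_hh = rng.standard_normal((hidden_size, hidden_size))

def encode(tokens):
    """Run the toy RNN over token embeddings; return all hidden states and the final one."""
    h = np.zeros(hidden_size)
    states = []
    for x in tokens:
        h = np.tanh(x @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states), h

short_states, short_final = encode(rng.standard_normal((5, embed_size)))
long_states, long_final = encode(rng.standard_normal((50, embed_size)))

print(short_final.shape, long_final.shape)    # (4,) (4,)
print(short_states.shape, long_states.shape)  # (5, 4) (50, 4)
```

A traditional seq2seq decoder only sees the final state; attention gives it access to the per-step states as well.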

The strength of seq2seq, which allows inputs and outputs to be different sizes, becomes a weakness when the input sequence is long. The result is lower model performance as sequence size increases. A single fixed-size encoder hidden state struggles to compress longer sequences, which limits the information available to the decoder and hurts its predictions. One workaround is to use the encoder hidden states for each word instead of trying to compress everything into one vector. But a model that simply passes all of these states to the decoder would run into problems with memory and context.

How could you build a time- and memory-efficient model that predicts accurately from a long sequence? This becomes possible if the model has a way to select and focus on the most important words at each time step. We can think of this as giving the model a new layer to process this information, which we call **Attention**. If we provide the information specific to each input word, we can give the model a way to focus its attention in the right place at each step of the decoding process.

Seq2seq models perform well for sentences of about 10-20 words, but their performance falls off beyond that, which is what you should expect given the bottleneck. Bahdanau et al. (2014) report this in their comparison of models with and without attention.

The models with attention perform better than the traditional Seq2Seq models across all sentence lengths.

## 3 Import Libraries & Setup

We use NumPy and the `softmax` function defined in the setup code at the top of this article.

## 4 Calculating alignment scores

The first step is to calculate the alignment scores. These measure the similarity between the decoder hidden state and each encoder hidden state. From Appendix Section A.1.2 of the paper, this operation looks like

\[ \large e_{ij} = v_a^\top \tanh{\left(W_a s_{i-1} + U_a h_j\right)} \]

where \(W_a \in \mathbb{R}^{m\times n}\), \(U_a \in \mathbb{R}^{m \times n}\), and \(v_a \in \mathbb{R}^m\) are the weights, \(n\) is the hidden state size, and \(m\) is the size of the hidden layer in the alignment network. In practice, this is implemented as a feedforward neural network with two layers.

Here \(h_j\) are the encoder hidden states for each input step \(j\) and \(s_{i - 1}\) is the decoder hidden state of the previous step. The first layer corresponds to \(W_a\) and \(U_a\), while the second layer corresponds to \(v_a\).

To implement this, let’s first concatenate the encoder and decoder hidden states to produce an array of size \(K \times 2n\), where \(K\) is the number of encoder states/steps. For this, we use `np.concatenate`. Note that there is only one decoder state, so we’ll need to reshape it to successfully concatenate the arrays. The easiest way is to use `decoder_state.repeat` to match the hidden state array size.
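As a quick shape check of this step, here is a minimal sketch with small made-up sizes (`K = 3` and `n = 4` are chosen only for readability and are not the sizes used in the lab):

```python
import numpy as np

K, n = 3, 4  # illustrative sizes: 3 encoder steps, hidden size 4
encoder_states = np.arange(K * n, dtype=float).reshape(K, n)
decoder_state = np.full((1, n), -1.0)

# Copy the single decoder row K times so both arrays have K rows
decoder_tiled = decoder_state.repeat(K, axis=0)

# Join along the feature axis: each row is [encoder state j, decoder state]
inputs = np.concatenate((encoder_states, decoder_tiled), axis=1)

print(decoder_tiled.shape)  # (3, 4)
print(inputs.shape)         # (3, 8)
```

Each row of `inputs` pairs one encoder state with the same decoder state, which is what the alignment network scores.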

Then, we apply the first layer as a matrix multiplication between the weights and the concatenated input. We will use the tanh function to get the activations. Finally, we compute the matrix multiplication of the second layer weights and the activations. This returns the alignment scores.
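It may not be obvious that one matrix applied to the concatenated input matches the equation, which has separate \(W_a\) and \(U_a\) terms. The sketch below checks this numerically under stated assumptions: the arrays are random stand-ins, and the first `n` rows of `layer_1` play the role of \(U_a\) while the last `n` rows play the role of \(W_a\) (up to a transpose, since the code stores states as rows).

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, m = 5, 16, 10  # encoder steps, hidden size, alignment layer size

h = rng.standard_normal((K, n))            # encoder states h_j, one per row
s = rng.standard_normal((1, n))            # previous decoder state s_{i-1}
layer_1 = rng.standard_normal((2 * n, m))  # first-layer weights
v = rng.standard_normal((m, 1))            # second-layer weights, the role of v_a

# Concatenated form, as described in the text
inputs = np.concatenate((h, s.repeat(K, axis=0)), axis=1)
scores_concat = np.tanh(inputs @ layer_1) @ v

# Split form: the top n rows act on encoder states, the bottom n rows on the decoder state
U_part, W_part = layer_1[:n], layer_1[n:]
scores_split = np.tanh(s @ W_part + h @ U_part) @ v  # s @ W_part broadcasts over the K rows

print(np.allclose(scores_concat, scores_split))  # True
```

The two forms agree because multiplying a concatenated row by a stacked matrix is the same as summing the two separate products.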

```
hidden_size = 16
attention_size = 10
input_length = 5

np.random.seed(42)

# Synthetic vectors used to test
encoder_states = np.random.randn(input_length, hidden_size)
decoder_state = np.random.randn(1, hidden_size)

# Weights for the neural network, these are typically learned through training
# Use these in the alignment function below as the layer weights
layer_1 = np.random.randn(2*hidden_size, attention_size)
layer_2 = np.random.randn(attention_size, 1)

# Alignment function
def alignment(encoder_states, decoder_state):
    # First, concatenate the encoder states and the decoder state
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
    assert inputs.shape == (input_length, 2*hidden_size)

    # Matrix multiplication of the concatenated inputs and layer_1, with tanh activation
    activations = np.tanh(np.matmul(inputs, layer_1))
    assert activations.shape == (input_length, attention_size)

    # Matrix multiplication of the activations with layer_2. We don't need tanh here
    scores = np.matmul(activations, layer_2)
    assert scores.shape == (input_length, 1)

    return scores
```

```
# Run to test the alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)
```

```
[[4.35790943]
[5.92373433]
[4.18673175]
[2.11437202]
[0.95767155]]
```

## 5 Turning alignment into weights

The next step is to calculate the weights from the alignment scores. These weights determine which encoder outputs are the most important for the decoder output. The weights should each be between 0 and 1, and add up to 1. We can get them by passing the vector of alignment scores to the softmax function already implemented. Mathematically,

\[ \large \alpha_{ij} = \frac{\exp{\left(e_{ij}\right)}}{\sum_{k=1}^K \exp{\left(e_{ik}\right)}} \]

This is as described in Appendix section A.2.2 of the paper.
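To see what this does to actual numbers, the sketch below applies softmax to the alignment scores printed earlier. It uses a variant that subtracts the maximum score before exponentiating; this is an assumption added here for numerical safety (it avoids overflow in `np.exp` for large scores) and it gives the same weights as the plain formula.

```python
import numpy as np

# Alignment scores copied from the printed output above
scores = np.array([[4.35790943], [5.92373433], [4.18673175], [2.11437202], [0.95767155]])

def stable_softmax(x, axis=0):
    """Softmax that subtracts the max first to avoid overflow (an assumed variant)."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

weights = stable_softmax(scores)

print(weights.ravel())   # one weight per encoder step
print(weights.sum())     # 1.0 up to floating-point rounding
print(weights.argmax())  # 1, the encoder step with the largest score
```

The second encoder step has the highest score, so it receives the largest weight and contributes most to the context vector.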

## 6 Weight the encoder output vectors and sum

The weights tell us the importance of each input word with respect to the decoder state. In this step, we use the weights to modulate the magnitude of the encoder vectors. Words with little importance will be scaled down relative to important words. We will multiply each encoder vector by its respective weight to get the alignment vectors, then sum up the weighted alignment vectors to get the context vector. Mathematically,

\[ \large c_i = \sum_{j=1}^K\alpha_{ij} h_{j} \]

This is as described in Appendix section A.2.2 of the paper.

We will implement these steps in the `attention` function below.

```
# Attention function
def attention(encoder_states, decoder_state):
    """ Function that calculates attention, returns the context vector

    Arguments:
    encoder_states: NxM numpy array, where N is the number of vectors and M is the vector length
    decoder_state: 1xM numpy array, M is the vector length, must be the same M as encoder_states
    """
    # First, calculate the alignment scores
    scores = alignment(encoder_states, decoder_state)

    # Then take the softmax of the alignment scores to get a weight distribution
    weights = softmax(scores)

    # Multiply each encoder state by its respective weight
    weighted_scores = encoder_states * weights

    # Sum up weighted alignment vectors to get the context vector and return it
    context = np.sum(weighted_scores, axis=0)
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)
```

```
[-0.63514569 0.04917298 -0.43930867 -0.9268003 1.01903919 -0.43181409
0.13365099 -0.84746874 -0.37572203 0.18279832 -0.90452701 0.17872958
-0.58015282 -0.58294027 -0.75457577 1.32985756]
```

This context vector, created by the attention process, combines information from every encoder step, weighted by its relevance to the current decoder state. This gives the decoder of the seq2seq model more useful information for producing accurate output and better translations.
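The multiply-then-sum step can also be written as a single matrix product, which is how it is often expressed in vectorized code. The sketch below checks that the two forms agree; the encoder states and weights are random stand-ins (the weights are simply normalized to sum to 1), not the values from the lab.

```python
import numpy as np

rng = np.random.default_rng(2)
K, n = 5, 16
encoder_states = rng.standard_normal((K, n))
raw = rng.random((K, 1))
weights = raw / raw.sum()  # stand-in attention weights that sum to 1

# Scale each encoder state by its weight, then sum over steps
context_sum = np.sum(encoder_states * weights, axis=0)

# Same result as one matrix product: (1, K) @ (K, n) -> (1, n)
context_matmul = (weights.T @ encoder_states).ravel()

print(context_sum.shape)                         # (16,)
print(np.allclose(context_sum, context_matmul))  # True
```

Either way, the context vector has the same length as one encoder state, regardless of how many encoder steps there are.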

## 7 Acknowledgements

I’d like to express my thanks to the great Natural Language Processing with Attention Models Course, which I completed, and acknowledge the use of some images and other materials from the course in this article.