```
import pickle

import matplotlib.pyplot as plt
import numpy as np

# Load the word2int dictionaries
with open("./data/word2int_en.pkl", "rb") as f:
    en_words = pickle.load(f)

with open("./data/word2int_fr.pkl", "rb") as f:
    fr_words = pickle.load(f)

# Load the word embeddings
en_embeddings = np.load("./data/embeddings_en.npz")["embeddings"]
fr_embeddings = np.load("./data/embeddings_fr.npz")["embeddings"]

def tokenize(sentence, token_mapping):
    """ Map each word in a sentence to its integer token """
    tokenized = []
    for word in sentence.lower().split(" "):
        try:
            tokenized.append(token_mapping[word])
        except KeyError:
            # Using -1 to indicate an unknown word
            tokenized.append(-1)
    return tokenized

def embed(tokens, embeddings):
    """ Look up the embedding for each token; unknown words map to zeros """
    embed_size = embeddings.shape[1]
    output = np.zeros((len(tokens), embed_size))
    for i, token in enumerate(tokens):
        if token == -1:
            output[i] = np.zeros((1, embed_size))
        else:
            output[i] = embeddings[token]
    return output
```

## 1 Introduction

In an earlier article, we looked at the simple attention model for language translation introduced in the Bahdanau et al. (2014) paper.

The 2017 paper Attention Is All You Need introduced the Transformer model and scaled dot-product attention, sometimes also called QKV (**Q**ueries, **K**eys, **V**alues) attention. Since then, Transformers have come to dominate large-scale natural language applications. Scaled dot-product attention can also be used to improve seq2seq models. In this article, we’ll implement a simplified version of scaled dot-product attention and replicate the word alignment between English and French shown in Bahdanau et al. (2014).

## 2 Import Libraries & Setup

A Transformer model can learn how to align words in different languages. We won’t be training any weights here; instead, we’ll use pre-trained aligned English and French word embeddings prepared in advance.

## 3 Scaled Dot-Product Attention

Scaled dot-product attention consists of two matrix multiplications and a softmax scaling, as shown in the diagram below from Vaswani et al. (2017). It takes three input matrices: the queries, keys, and values.

Mathematically, this is expressed as

\[ \large \mathrm{Attention}\left(Q, K, V\right) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

where \(Q\), \(K\), and \(V\) are the queries, keys, and values matrices respectively, and \(d_k\) is the dimension of the keys. In this example, \(Q\), \(K\), and \(V\) all share the same embedding dimension. This form of attention is faster and more space-efficient than the simple attention of Bahdanau et al. (2014) that we implemented before, since it consists only of matrix multiplications rather than a learned feed-forward layer.

Conceptually, the first matrix multiplication is a measure of the similarity between the queries and the keys. This is transformed into weights using the softmax function. These weights are then applied to the values with the second matrix multiplication resulting in output attention vectors. Typically, decoder states are used as the queries while encoder states are the keys and values.
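To make those two steps concrete, here is a tiny worked example with made-up matrices (the numbers are purely illustrative, not drawn from the dataset):

```
import numpy as np

# Two queries, three key/value pairs, d_k = 2 (all values made up)
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

# First matmul: similarity between each query and each key, scaled by sqrt(d_k)
scores = np.matmul(Q, K.T) / np.sqrt(K.shape[1])

# Softmax over the key axis turns similarities into weights summing to 1 per query
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Second matmul: each output row is a weighted average of the value rows
output = np.matmul(weights, V)
print(output.shape)  # (2, 2): one attention vector per query
```

Each row of `output` blends the value vectors according to how similar the corresponding query was to each key.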

We will implement the softmax function with NumPy and use it to calculate the weights from the queries and keys. Let’s assume the queries and keys are 2D arrays (matrices). Note that since the dot product of \(Q\) and \(K\) is a matrix, we’ll need to take care to calculate softmax over a specific axis.

```
def softmax(x, axis=0):
    """ Calculate the softmax of an array x along the given axis

    axis=0 calculates softmax across rows, so each column sums to 1
    axis=1 calculates softmax across columns, so each row sums to 1
    """
    y = np.exp(x)
    return y / np.expand_dims(np.sum(y, axis=axis), axis)

def calculate_weights(queries, keys):
    """ Calculate the weights for scaled dot-product attention """
    dot = np.matmul(queries, keys.T) / np.sqrt(keys.shape[1])
    weights = softmax(dot, axis=1)

    # Use np.isclose rather than == to allow for floating-point rounding
    assert np.isclose(weights.sum(axis=1)[0], 1), "Each row in weights must sum to 1"
    return weights
```
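One caveat worth noting (not a problem for the small embeddings used here): `np.exp` overflows for large inputs. A common, mathematically equivalent variant subtracts the per-axis maximum before exponentiating; a minimal sketch:

```
import numpy as np

def softmax_stable(x, axis=0):
    # Subtracting the max along the axis does not change the result,
    # but keeps np.exp from overflowing on large inputs
    shifted = x - np.max(x, axis=axis, keepdims=True)
    y = np.exp(shifted)
    return y / np.sum(y, axis=axis, keepdims=True)

big = np.array([[1000.0, 1001.0],
                [2.0, 1.0]])
print(softmax_stable(big, axis=1))  # finite values; each row sums to 1
```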

```
# Tokenize example sentences in English and French, then get their embeddings
sentence_en = "The agreement on the European Economic Area was signed in August 1992 ."
tokenized_en = tokenize(sentence_en, en_words)
embedded_en = embed(tokenized_en, en_embeddings)

sentence_fr = "L accord sur la zone économique européenne a été signé en août 1992 ."
tokenized_fr = tokenize(sentence_fr, fr_words)
embedded_fr = embed(tokenized_fr, fr_embeddings)

# These weights indicate alignment between words in English and French
alignment = calculate_weights(embedded_fr, embedded_en)

# Visualize the weights to check for alignment
fig, ax = plt.subplots(figsize=(7, 7))
ax.imshow(alignment, cmap='gray')
ax.xaxis.tick_top()
ax.set_xticks(np.arange(alignment.shape[1]))
ax.set_xticklabels(sentence_en.split(" "), rotation=90, size=16)
ax.set_yticks(np.arange(alignment.shape[0]))
ax.set_yticklabels(sentence_fr.split(" "), size=16)
```

This is a demonstration of alignment where the model has learned which words in English correspond to words in French. For example, the words *signed* and *signé* have a large weight because they have the same meaning. Typically, these alignments are learned using linear layers in the model, but we’ve used pre-trained embeddings here.
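One quick way to read such a weight matrix is to take the argmax of each row, giving a hard alignment from each French word to its best-matching English word. The sketch below uses a small made-up weight matrix and word lists rather than the real `alignment` computed above:

```
import numpy as np

# Hypothetical alignment weights: rows are French words, columns are English words
alignment = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.1, 0.7]])
words_en = ["the", "signed", "agreement"]
words_fr = ["l", "signé", "accord"]

# For each French word, pick the English word with the largest weight
aligned = [words_en[np.argmax(row)] for row in alignment]
for fr, en in zip(words_fr, aligned):
    print(f"{fr} -> {en}")
# l -> the, signé -> signed, accord -> agreement
```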

Let’s now complete the implementation of scaled dot-product attention using our `calculate_weights` function.

```
def attention_qkv(queries, keys, values):
    """ Calculate scaled dot-product attention from queries, keys, and values matrices """
    weights = calculate_weights(queries, keys)
    return np.matmul(weights, values)

attention_qkv_result = attention_qkv(embedded_fr, embedded_en, embedded_en)

print(f"The shape of the attention_qkv function is {attention_qkv_result.shape}")
print(f"Some elements of the attention_qkv function are \n{attention_qkv_result[0:2, :10]}")
```

```
The shape of the attention_qkv function is (14, 300)
Some elements of the attention_qkv function are
[[-0.04039161 -0.00275749 0.00389873 0.04842744 -0.02472726 0.01435613
-0.00370253 -0.0619686 -0.00206159 0.01615228]
[-0.04083253 -0.00245985 0.00409068 0.04830341 -0.02479128 0.01447497
-0.00355203 -0.06196036 -0.00241327 0.01582606]]
```

## 4 Acknowledgements

I’d like to express my thanks to the excellent Natural Language Processing with Attention Models course, which I completed, and to acknowledge the use of some images and other materials from the course in this article.