```
import torch
from torch import tensor

def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c
```

## 1 Introduction

In this article we will build a basic neural network from the most basic elements (arrays and PyTorch modules). We will also cover some of the key theory required for this.

This article and its content are based on chapter 17 of the fastai deep learning course.

## 2 Building a Neural Network from basic elements

### 2.1 Creating a neuron

A neuron takes a series of inputs, multiplies each by a weight, sums the results, and adds a bias. This sum is then passed through an activation function. We could represent these steps as:

`output = sum([x*w for x,w in zip(inputs,weights)]) + bias`

`def relu(x): return x if x >= 0 else 0`

A deep learning model stacks many of these neurons in layers. So for the output of an entire layer, using matrices we would have:

`y = x @ w.t() + b`
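As a quick check of that claim, the sketch below computes a layer's output neuron by neuron and compares it with the matrix form. The shapes (a batch of 2 examples, 4 inputs, 3 neurons) and the variable names are made up for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 4)   # batch of 2 examples, 4 inputs each
w = torch.randn(3, 4)   # 3 neurons, each with 4 weights
b = torch.randn(3)      # one bias per neuron

# Neuron by neuron: weighted sum of the inputs plus the bias
manual = torch.stack([
    torch.stack([sum(xi * wi for xi, wi in zip(row, w_n)) + b_n
                 for w_n, b_n in zip(w, b)])
    for row in x
])

# The whole layer at once
y = x @ w.t() + b

print(manual.shape, torch.allclose(manual, y, atol=1e-5))
```

Each row of `w` holds the weights of one neuron, which is why the matrix form multiplies by `w.t()`.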

### 2.2 Matrix multiplication

We can define a function that computes a matrix product manually using loops.

However, this is far slower than PyTorch's built-in matrix multiplication.
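To make the comparison concrete, here is a small sketch that runs the same triple-loop algorithm as the `matmul` at the top of this article against PyTorch's `@` operator. The name `matmul_loops` and the matrix sizes are chosen for illustration, and the timings will vary by machine, so no figures are quoted here.

```python
import time
import torch

def matmul_loops(a, b):
    # Same triple-loop algorithm as the manual matmul
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c

torch.manual_seed(0)
a = torch.randn(5, 28)
b = torch.randn(28, 10)

t0 = time.perf_counter()
slow = matmul_loops(a, b)
t_loops = time.perf_counter() - t0

t0 = time.perf_counter()
fast = a @ b
t_torch = time.perf_counter() - t0

# Both approaches should agree up to floating point rounding
print(torch.allclose(slow, fast, atol=1e-4))
print(f"loops: {t_loops:.5f}s, torch: {t_torch:.5f}s")
```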

### 2.3 Elementwise calculations

We can do elementwise operations on tensors, as long as they have the same shape. For example:

```
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a + b
```

`tensor([12., 14., 3.])`

### 2.4 Broadcasting

Broadcasting allows two arrays of different shapes to be used together in arithmetic operations, by (conceptually) repeating the smaller array so that it matches the shape of the larger one.

For example, we can use *unsqueeze* in PyTorch to add extra dimensions explicitly.

```
c = tensor([10.,20,30])
c.shape, c.unsqueeze(0).shape,c.unsqueeze(1).shape
```

`(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))`
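To see what those extra dimensions change, the following sketch adds the same vector to a small matrix in two ways. The matrix `m` and the names `row_added` and `col_added` are illustrative.

```python
import torch

m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
c = torch.tensor([10., 20, 30])

# c has shape [3]: it is treated as a row and repeated down the rows of m
row_added = m + c
print(row_added)
# tensor([[11., 22., 33.],
#         [14., 25., 36.],
#         [17., 28., 39.]])

# c.unsqueeze(1) has shape [3, 1]: it is treated as a column and repeated across the columns
col_added = m + c.unsqueeze(1)
print(col_added)
# tensor([[11., 12., 13.],
#         [24., 25., 26.],
#         [37., 38., 39.]])
```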

We can now replace our three-loop matrix multiplication with a much shorter broadcasting equivalent that needs only a single loop.

```
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # c[i,j] = (a[i,:] * b[:,j]).sum() # previous
        c[i] = (a[i ].unsqueeze(-1) * b).sum(dim=0)
    return c
```
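It can help to trace the shapes for a single row. The sketch below, using made-up 3x4 and 4x2 matrices, follows one iteration of the loop above and checks the result against PyTorch's own product.

```python
import torch

torch.manual_seed(0)
a = torch.randn(3, 4)
b = torch.randn(4, 2)

row = a[0]                       # shape [4]
scaled = row.unsqueeze(-1) * b   # shape [4, 2]: row k of b multiplied by a[0, k]
out_row = scaled.sum(dim=0)      # shape [2]: the first row of the product

print(scaled.shape, out_row.shape)
print(torch.allclose(out_row, (a @ b)[0], atol=1e-5))
```

The `unsqueeze(-1)` turns the row into a column of shape `[4, 1]`, so broadcasting repeats each `a[0, k]` across the columns of `b` before the sum over `k`.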

## 3 Forward and Backward passes of a Neural Network

### 3.1 Defining and initialising a layer

So we can define a basic linear layer in the following way.

`def lin(x, w, b): return x @ w + b`

Let’s create some dummy data, and some simple layers.

```
x = torch.randn(200, 100)
y = torch.randn(200)

w1 = torch.randn(100,50)
b1 = torch.zeros(50)
w2 = torch.randn(50,1)
b2 = torch.zeros(1)

l1 = lin(x, w1, b1)
l1.shape
```

`torch.Size([200, 50])`

But there is a problem with how the parameters are initialised. Consider:

`l1.mean(), l1.std()`

`(tensor(-0.2733), tensor(10.1770))`

The standard deviation is about 10. If a single layer scales its inputs by a factor of 10, a stack of many such layers would produce enormous values, making the network very hard to train. We therefore want the standard deviation of each layer's output to stay close to 1, and we can achieve this by scaling the weights by:

\(1/\sqrt{n_{in}}\)

where \(n_{in}\) represents the number of inputs. This is known as *Xavier initialization (or Glorot initialization)*.

For example if we have 100 inputs, we should scale our weights by 0.1.

```
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.1)
print(x[0:5,0:5])
print(x.std())
```

```
tensor([[-0.6374, -0.3009, 0.4669, -0.7221, 0.1983],
[-1.0054, 0.0244, 0.3540, -1.0580, 0.2675],
[ 0.0789, 0.6670, 0.2132, 0.2511, -1.3466],
[ 0.7786, -0.2874, -1.2391, 0.4132, 1.9071],
[ 2.1194, 0.0046, -1.7749, 1.5797, 1.4981]])
tensor(1.1794)
```
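For contrast, the sketch below repeats the same 50-layer experiment with other scales. The helper `final_std` is a hypothetical name introduced here. Exact values depend on the random seed, so the comments only describe the qualitative behaviour.

```python
import torch

torch.manual_seed(0)

def final_std(scale, n_layers=50):
    # Push random data through n_layers of random 100x100 weight matrices
    x = torch.randn(200, 100)
    for _ in range(n_layers):
        x = x @ (torch.randn(100, 100) * scale)
    return x.std()

print(final_std(1.0))    # no scaling: activations overflow, the result is not finite
print(final_std(0.01))   # scale too small: activations collapse to zero
print(final_std(0.1))    # 1/sqrt(100): activations stay on the order of 1
```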

Reworking our model with this in mind:

```
x = torch.randn(200, 100)
y = torch.randn(200)

from math import sqrt
w1 = torch.randn(100,50) / sqrt(100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) / sqrt(50)
b2 = torch.zeros(1)

l1 = lin(x, w1, b1)
l1.mean(),l1.std()
```

`(tensor(-0.0135), tensor(1.0176))`

Now we need to define an activation function.

```
def relu(x): return x.clamp_min(0.)
l2 = relu(l1)
l2.mean(),l2.std()
```

`(tensor(0.3988), tensor(0.5892))`

Now the mean is no longer zero and the standard deviation has moved further from 1. Glorot initialisation was proposed before ReLU came into common use and is not intended to be used with it.

A newer initialisation by Kaiming He et al. works better with ReLU. Its formula is:

\(\sqrt{2 / n_{in}}\)

where \(n_{in}\) is the number of inputs to the layer.

Applying this.

```
x = torch.randn(200, 100)
y = torch.randn(200)

w1 = torch.randn(100,50) * sqrt(2 / 100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) * sqrt(2 / 50)
b2 = torch.zeros(1)

l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()
```

`(tensor(0.5710), tensor(0.8222))`
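The difference between the two schemes becomes clearer over several layers. The sketch below is an illustrative experiment rather than part of the model: it stacks 20 bias-free 100-unit linear layers, each followed by ReLU, and compares the final standard deviation under each scaling. The helper `final_std` and the depth of 20 are assumptions made for this example.

```python
import torch
from math import sqrt

torch.manual_seed(0)

def relu(x): return x.clamp_min(0.)

def final_std(scale, n_layers=20):
    # Stack of 100-unit linear layers (no bias), each followed by ReLU
    x = torch.randn(200, 100)
    for _ in range(n_layers):
        x = relu(x @ (torch.randn(100, 100) * scale))
    return x.std()

xavier = final_std(1 / sqrt(100))    # shrinks layer after layer
kaiming = final_std(sqrt(2 / 100))   # stays on the order of 1
print(xavier, kaiming)
```

ReLU zeroes roughly half of the activations, which halves their mean square at every layer. The extra factor of 2 in the Kaiming formula compensates for this.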

Now we can define a whole model.

```
def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

out = model(x)
out.shape
```

`torch.Size([200, 1])`

We don’t want this trailing unit dimension, so we define a loss function that also removes it.

```
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
loss = mse(out, y)
```
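The `squeeze` matters more than it might seem. As the sketch below shows with stand-in random tensors of the same shapes, subtracting a `[200]` target from a `[200, 1]` output does not raise an error. Broadcasting silently produces a 200 x 200 matrix of differences instead.

```python
import torch

torch.manual_seed(0)
out = torch.randn(200, 1)   # stand-in for the model output, with a trailing unit dimension
y = torch.randn(200)        # stand-in for the targets

# Without squeezing, every prediction is compared with every target
unsqueezed_diff = out - y
print(unsqueezed_diff.shape)   # torch.Size([200, 200])

# With squeezing, each prediction is compared only with its own target
squeezed_diff = out.squeeze(-1) - y
print(squeezed_diff.shape)     # torch.Size([200])
```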

### 3.2 Gradients and the Backwards Pass

PyTorch computes the gradients for us with *loss.backward*, but behind the scenes this is a bit of calculus. The whole network is one large function made up of sub-functions, so let's start with the final one: the loss function.

The loss function gives us the loss. Taking its derivative with respect to the final weights gives the gradient of the loss with respect to those weights. We can then use the chain rule to propagate these gradients backward and obtain the gradient of the loss with respect to every parameter in the model.
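As a sketch of that idea for the model defined earlier, write \(l_1\), \(l_2\) and \(l_3\) for the outputs of the first linear layer, the ReLU and the second linear layer. Schematically, ignoring the matrix shapes and transposes that the code has to handle, the chain rule gives the gradient with respect to the first layer's weights as:

\(\frac{\partial\, \text{loss}}{\partial w_1} = \frac{\partial\, \text{loss}}{\partial l_3} \cdot \frac{\partial l_3}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial w_1}\)

Each factor is the local derivative of a single sub-function, so the backward pass can compute them one layer at a time, starting from the loss.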

Let's define a function to calculate the gradient of the loss with respect to the model's output, which is the input to the loss function.

```
def mse_grad(inp, targ):
    # grad of loss with respect to output of previous layer
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
```

Let’s now define functions to calculate the gradients for the activation functions and also the linear layers.

```
def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g

def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)
```
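One way to gain confidence in these formulas is to compare them with PyTorch's autograd. The sketch below is an illustrative check, not part of the original code. It applies the same formulas as `mse_grad`, `lin_grad` and `relu_grad`, but stores the results in local variables such as `w1_g` (names chosen here) rather than in `.g` attributes.

```python
import torch
from math import sqrt

torch.manual_seed(0)
x, y = torch.randn(200, 100), torch.randn(200)
w1 = (torch.randn(100, 50) * sqrt(2 / 100)).requires_grad_()
b1 = torch.zeros(50, requires_grad=True)
w2 = (torch.randn(50, 1) * sqrt(2 / 50)).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)

# Forward pass, then let autograd compute the reference gradients
l1 = x @ w1 + b1
l2 = l1.clamp_min(0.)
out = l2 @ w2 + b2
loss = (out.squeeze(-1) - y).pow(2).mean()
loss.backward()

# Manual backward pass, from the loss back to the first layer
with torch.no_grad():
    out_g = 2. * (out.squeeze(-1) - y).unsqueeze(-1) / out.shape[0]   # as in mse_grad
    w2_g, b2_g = l2.t() @ out_g, out_g.sum(0)                        # as in lin_grad
    l2_g = out_g @ w2.t()
    l1_g = (l1 > 0).float() * l2_g                                   # as in relu_grad
    w1_g, b1_g = x.t() @ l1_g, l1_g.sum(0)                           # as in lin_grad

for manual, param in [(w1_g, w1), (b1_g, b1), (w2_g, w2), (b2_g, b2)]:
    print(torch.allclose(manual, param.grad, atol=1e-4))
```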

### 3.3 Model refactoring

Let’s now put together everything: the model, the forward and backward pass methods.

```
class Relu():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out
    def backward(self): self.inp.g = (self.inp>0).float() * self.out.g

class Lin():
    def __init__(self, w, b): self.w,self.b = w,b
    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out
    def backward(self):
        x = (self.inp.squeeze()-self.targ).unsqueeze(-1)
        self.inp.g = 2.*x/self.targ.shape[0]

class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

# Create model
model = Model(w1, b1, w2, b2)

# Forward pass
loss = model(x, y)

# Backward pass
model.backward()
loss
```

`tensor(2.7466)`

### 3.4 Converting the model to Pytorch

We could build this more simply using PyTorch's own modules, and in fact the fastai classes built on top of them.

```
from torch import nn
from fastai.torch_core import Module  # fastai's Module: no need to call super().__init__()

class Model(Module):
    def __init__(self, n_in, nh, n_out):
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)
```
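For readers without fastai installed, here is a minimal sketch of the same idea using only PyTorch. It is an assumption-laden variant rather than the code above. The class name `TinyModel` is hypothetical, it subclasses `nn.Module` directly (so `super().__init__()` must be called), and it uses `F.mse_loss` in place of the hand-written `mse`.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyModel(nn.Module):  # hypothetical name; plain nn.Module, not fastai's Module
    def __init__(self, n_in, nh, n_out):
        super().__init__()  # required when subclassing nn.Module directly
        self.layers = nn.Sequential(
            nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out))

    def forward(self, x, targ):
        # Remove the trailing unit dimension before comparing with the targets
        return F.mse_loss(self.layers(x).squeeze(-1), targ)

torch.manual_seed(0)
x, y = torch.randn(200, 100), torch.randn(200)
model = TinyModel(100, 50, 1)

loss = model(x, y)
loss.backward()   # autograd fills in .grad for every parameter
print(loss.item(), model.layers[0].weight.grad.shape)
```

Note that `nn.Linear` stores its weight with shape `[n_out, n_in]`, the transpose of the `w1` used in the hand-written layers.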

## 4 Conclusion

In this article we have built a neural network from the most basic elements.