Table of Contents:

  1. Introduction
  2. Loss Functions
  3. Gradient Descent for Learning
  4. Binding everything together
  5. Code
  6. Further reading

Let’s start off with a case study. We’ll define a network to do some handwritten digits classification aka MNIST. The only catch: we can use only numpy. The idea behind this choice is to go through some of the fundamental concepts of neural networks without using any high level frameworks. It’ll be much more easier to translate these ideas to PyTorch/TF2 etc once you’ve built it from scratch using a tool like numpy.

NOTE: If you’d like to skip straight to the code, go here. All the code can be found on my GitHub


MNIST is a dataset that contains thousands of scanned images of handwritten digits.

The images are grayscale(1D) and 28x28 pixels in size. The goal of our network will be to generate a 10 dimensional (sparse) vector for each input, and each value can be either 0 or 1. This idea of representing categorical variables as binary vectors is called as one-hot encoding and something that you’ll find used extensively while training neural networks.

Let’s introduce another abstraction here: the concept of a cost function. Cost functions, also known as loss/objective functions are mathemagic that tell us how good our network is currently doing. To be an effective machine learning practitioner is to have an intuitive understanding on what cost functions do.

Loss Functions

For classification, an appropriate loss function would be cross-entropy(also called log loss). If you’re wondering why that is, come see after but in a nutshell, cross entropy loss allows us to heavily penalize cases where our prediction is wrong. Cross entropy is a non-negative loss function and tends to zero as as the network gets better. For classification purposes, perhaps the most important distinction between say, Mean Squared Error and Cross-Entropy is that learning is faster. To borrow from Michael Nielsen’s excellent online book, the larger the error, the faster the neuron will learn. [2]

Cross Entropy Loss is defined as follows: equation

The above definition uses a mini batch gradient descent optimization algorithm that we’ll introduce below later.

def cross_entropy_loss(y, yhat):
    Cross Entropy Loss
    Inputs: target label and ground truth target label
    Output: loss l

    lsum = np.sum(np.multiply(y, np.log(yhat)))
    _, m = y.shape
    l = -(1./m) * lsum
    return l

So far, we’ve identified what we want to do. We have defined how we’ll measure our network on the performance it gets. What we haven’t explored so far is how we’ll actually get there. What we want is a procedure that lets us approximate y(x) for each input x. This is where gradient descent comes into play.


For now, think of classification as a minimization problem(and in some ways, it is one). Given an input and a multi-variable cost function, we want to minimize it. This is the part where the fun starts.

Detour: Before we get into it, it’s in our interests to think of an analogy. We’ll come back later and wrap it with concepts specific to what we’re trying to do.

Consider you're at the top of a mountain, blindfolded and you want to reach the valley below. One way to start would be to determine what the steepest descent is and go down that path. Remember, you're blindfolded so there's no way to look at the real valued input and determine the direction of descent. 

This is what we’ll use gradients for. Gradients are the partial derivatives of multi-variable functions. A derivative is the rate of change for a function at any given point. Calculating these gradients should give us an idea about the ‘shape’ of the valley and help us choose the path we should take to get to the bottom. Then we update the parameters in our multi-variable function and this constitutes one step in the learning process.

For every step, the cost function changes as the product of the gradient vector and the random movements themselves. This makes sense, as for changes in movements, the gradient vector makes the updates to the cost function. We can define these updates in terms of the learning rate such that the gradient always decreases, thus ensuring that the cost will always go down. For a concrete mathematical representation of these ideas, go here.

Coming back, now we’ve got an algorithm, that for every step(variable update) minimizes the cost function and makes a step(movement) in the opposite direction to ‘fall’ down the valley. This is exactly what we wanted. This however, brings up an important point: the learning rate is an important parameter that needs to be set accordingly. Too small a rate, and we’ll never reach the global minimum fast enough. Too fast(too high) and we’d end up with a non-negative update which is the opposite of what we want.

NOTE: This is a highly abstracted, simplified explanation of gradient descent. There are many potential issues, one of which we’ll cover in the next section.

Binding it all together

I’ve mentioned multi-variable cost functions without going into what those variables are. For this case study, we’ll just consider the weights(w) and biases(b) parameters. In the context of neural networks, the idea of making a movement is reflected by the weight(w) and bias(b) parameters. We make ‘movements’ by modifying these parameters to reflect the effect of a possible movement(another abstraction!). The update rule thus becomes

wi = wi - η * ∇C
bi = bi - η * ∇C

where ∇C is the gradient and η is the learning rate parameter

The gradient descent algorithm we’ve looked at earlier computes the gradient ∇C for each input example x and then averages over the number of inputs. For large datasets, this means learning updates can be very slow. There is a very simple way out of this: random sampling.

We stochastically choose a number of samples(say l), called a mini-batch. As long as the size of a mini batch is large enough such that the average of gradients over a mini-batch is roughly equal to the gradient over the dataset, we’re good to go. The update rule changes as follows:

wi = wi - η/l * ∇CXi
bi = bi - η/l * ∇CXi

where ∇CXi is the gradient over the training examples for the current mini-batch.

def sgd(self, epochs, batch_size, lr, train, test=None):
    for i in range(epochs):
        mini_batch = [train[b:batch_size] for b in range(0, len(train), batch_size)]
        for batch in mini_batch:
            self.update(batch, lr)
        print("Epoch {0} complete".format(i))

To recap, the gradients over a mini-batch of training inputs(Xi) are applied to the entire dataset. Then, we randomly sample another mini-batch and so on. When all training inputs have been used, it constitutes one epoch. Learning is continued for the number of epochs we define.


Let’s start off by initiating a simple network with random weight init.

class Network():

    def __init__(self, size):
        self.size = size
        self.layers = len(size)
        self.bias = np.random.randn(x, 1) for x in size[1:]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(size[:-1, size[1:]])]

We’ve initialized our weifghts and biases vectors randomly. There are however better strategies for weight init(Xavier, Kaiming etc) and we’ll look at them later. For now, random init is simple and works good enough that we don’t have to worry about it.

Earlier, we’ve seen the update rule but how does that relate to actually training networks? These vectors are passed to what’s called an activation function that attaches a non-linearity.

A layer in a neural network is a collection of neurons and bias that is combined in some fashion to get a cumulative output which is then passed to a non-linearity such as sigmoid above. These successive layers of activations and non-linearities means neural networks can approximate a large distribution of inputs, and are only limited by how far our imagination stretches.

Let’s look at the sigmoidal activation function. Mathematically, it can be represented as: equation

def sigmoid(z):
    """Outputs the sigmoid of the input"""
    s = 1./(1. + np.exp(-z))
    return s

Let’s go over it bit by bit

a is the activation for a current layer. To get the activation for the next layer, we mulitply the weight vector w with a and add the bias b to it. This is then passed to the activation function we looked at to get the activation for the next layer.

def sigmoid_d(z):
    """Returns the derivative of the current input for sigmoid function"""
    return sigmoid(z) * (1-(sigmoid(z))

Now, we’ll define a forward pass through the network. Simplistically, a forward pass, given an input, returns the activated output. In the overall architecture, tbis output becomes the input for the next layer in the network.

def ff(self, a):
    Makes a forward pass through the network for the input
    ## Only used during testing phase
    for w, b in zip(self.weights, bias):
        layer_i = (, a)+b)
        a = sigmoid(layer_i)       

    return a

We’ve defined what a forward pass through the network. What we really want to do is talk about how the network will learn ie. gradient descent and backpropagation. Without going into the complicated math, backprop lets us figure out how loss changes with respect to each weight and bias component. This lets up pinpoint which parameter updates to make(and by how much, as a multiple of the learning rate) to make the network learn better. Let’s code it up.

def backprop(self, x, y):
    temp = x
    activations = [x]
    combination = []

    ## Feedforward pass
    for w, b in zip(self.weights, bias):
        z =, activation) + b

    ## Backward pass
    delta = (activations[-1] - y) * sigmoid_d(combination[-1])
    gradient_b[-1] = delta
    gradient_w[-1] =, activations[-2].transpose())

    for n in range(2, self.layers):
        linear_comb = combination[-l]
        derivative = sigmoid_d(linear_comb)
        delta =[-l+1].transpose(), delta) * derivative
        gradient_b[-l] = delta
        gradient_w[-] =, activations[-l-1].transpose())

    return (gradient_b, gradient_w)

These are then passed to gradient descent which makes updates using the update rule we looked at earlier.

def update(self, mini_batch, rate):
    n = len(mini_batch)
    for b in self.bias:
        gradient_b = [np.zeros(b.shape)]
    for w in self.weights:
        gradient_w = [np.zeros(w.shape)]

    for X, y in mini_batch:
        delta_b, delta_w = self.backprop(X, y)
        gradient_b = [a+b for a,b in zip(gradient_b, delta_b)]
        gradient_w = [a+b for a,b in zip(gradient_w, delta_w)]

    #Update rule
    for w, dw in zip(self.weights, gradient_w):
        self.weights = [w - (rate/n) * dw]
    for b, db in zip(self.bias, gradient_b):
        self.bias = [b - (rate/n) * db]


  1. CS231N Numpy Tutorial
  2. Neural Networks and Deep Learning