Understanding backpropagation

In this post, we will implement backpropagation using elementary arithmetic operations [colab] [repo]

We also recommend cloning the original micrograd repository by Andrej Karpathy [1] [2]. The approach in micrograd is modeled after the dynamic computational graph of PyTorch [3], with the mindset of getting to an operationally minimal implementation of backpropagation. This post is not a guide on training neural networks or the numerous subtleties associated with training; it is meant to be a short introduction to automatic differentiation.

We start with a few basic imports:

import math
import numpy as np
import matplotlib.pyplot as plt

We now define the Value class which forms the heart of micrograd:

class Value:
    """ stores a single scalar value and its gradient """
    def __init__(self, data, _children=(), _op='', label=''): #__init__(object, data, attributes)
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out  
      
    def __pow__(self, other):
        assert isinstance(other, (int, float)) # only supporting int/float powers for now
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad
        out._backward = _backward

        return out    
    
    def __truediv__(self, other): # self / other
        return self*other**-1   
    def __neg__(self): # -self
        return self * -1
    def __sub__(self, other): # self - other
        return self + (-other)
    
    def __rmul__(self, other): # other * self
        return self * other
    def __radd__(self, other): # other + self
        return self + other
    def __rsub__(self, other): # other - self
        return other + (-self)
    def __rtruediv__(self, other): # other / self
        return other * self**-1
    
    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out
    
    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self,), 'exp')
        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward
        return out
    
    def backward(self):
        #topological sort 
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        self.grad = 1.0
        build_topo(self)
        for node in reversed(topo):
            node._backward()
        # for the single-neuron example below, this calls, in order: o._backward(), nn._backward(), x1w1x2w2._backward(), x2w2._backward(), x1w1._backward()

The Value class defines Python's elementary arithmetic operations (add, multiply, divide, and so on) for a computational graph represented as a directed acyclic graph: the root node of the graph is the neural net output, and the leaf nodes are the input data vector together with the first-layer weights. In addition, it includes rules for local gradient updates in _backward() and a full-graph gradient update in backward(). The full-graph update combines _backward() with a topological sort that traverses the graph from the output node back to the input nodes. The main non-elementary differentiable operation it contains is tanh, which serves as the nonlinear activation function at each node.
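To make the mechanics concrete, here is a small sanity check (an assumed example, not from the original lecture): we build c = a*b + a, call backward(), and compare against the hand-computed chain rule. Note that a feeds into c along two paths, which is exactly why _backward() accumulates gradients with += rather than overwriting them.

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = a*b + a; c.label = 'c'
c.backward()
print(a.grad)  # dc/da = b + 1 = -2.0
print(b.grad)  # dc/db = a = 2.0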

We now include graphviz code, as Andrej does, that can visualize such a computational graph diagrammatically:

from graphviz import Digraph
def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root, format='svg', rankdir='LR'):
    """
    format: png | svg | ...
    rankdir: TB (top to bottom graph) | LR (left to right)
    """
    assert rankdir in ['LR', 'TB']
    nodes, edges = trace(root)
    dot = Digraph(format=format, graph_attr={'rankdir': rankdir}) #, node_attr={'rankdir': 'TB'})
    
    for n in nodes:
        dot.node(name=str(id(n)), label = "{ %s | data %.4f | grad %.4f}" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            dot.node(name=str(id(n)) + n._op, label=n._op)
            dot.edge(str(id(n)) + n._op, str(id(n)))
    
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    
    return dot

The code above can draw the entire graph and indicate label, data, and gradient information at each node. The gradient at each node $j$ (as computed in the Value class) is the partial derivative of the loss function $L(o(x_{in}, w), y_{in})$ with respect to the weight $w_j$ at that node, i.e. $\frac{\partial L(o(x_{in}, w), y_{in})}{\partial w_j}$. Explaining the notation: $x_{in}$ is the vectorized input data sample, $w$ is the vector of weights, $o(\cdot)$ computes the network output for the input sample, and $L(o(\cdot), y_{in})$ computes the loss function for that output and the label $y_{in}$.
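In the single-neuron example below there is no explicit loss, so the output $o$ itself plays the role of $L$, and the chain rule unrolls as, for instance,

$$\frac{\partial o}{\partial w_1} = \frac{\partial o}{\partial n}\cdot\frac{\partial n}{\partial w_1} = \bigl(1 - \tanh^2(n)\bigr)\,x_1, \qquad n = x_1 w_1 + x_2 w_2 + b,$$

where $n$ is the pre-activation (labeled nn in the code below).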

As such, the neural network output is simply $o(x_{in}, w)$. The loss function is an additional step that computes a metric on the output and the supervised output labels of the training data. The mean-squared loss (2-norm) is a simple example, $\lVert o(x_{in}, w) - l \rVert_2^2$, where here the output is a scalar with label $l$; the 2-norm only really becomes meaningful with multiple outputs (so that $o$ and $l$ become vectors).
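As a minimal sketch (assumed here, not part of the original post), such a loss can be built directly out of Value operations, so that calling backward() on it reaches every weight through the same graph:

outs = [Value(0.7), Value(-0.2), Value(0.9)]   # stand-ins for network outputs o(x_in, w)
ys   = [1.0, -1.0, 1.0]                        # supervised labels
loss = sum((o - y)**2 for o, y in zip(outs, ys))
loss.backward()  # each element of outs now has .grad = dL/do = 2*(o - y)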

Just out of curiosity, the hyperbolic tangent is plotted to visualize its nonlinearity:

plt.plot( np.arange(-5,5,0.2), np.tanh(np.arange(-5,5,0.2)) ); plt.grid()
tanh plot

In what follows, a simple two-input neuron is initialized via the Value class:

x1 = Value(2.0,label='x1')
x2 = Value(0.0,label='x2')
w1 = Value(-3.0,label='w1')
w2 = Value(1.0,label='w2')
b = Value(6.88137358,label='b')
x1w1 = x1*w1; x1w1.label = 'x1*w1'; print(x1w1)
x2w2 = x2*w2; x2w2.label = 'x2*w2'; print(x2w2)
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'; print(x1w1x2w2)
nn = x1w1x2w2 + b; nn.label = 'nn'; print(nn)
o = nn.tanh(); o.label = 'o'; print(o)
Value(data=-6.0)
Value(data=0.0)
Value(data=-6.0)
Value(data=0.88137358)
Value(data=0.707106777676776)

and then the gradients are backpropagated using the backward() function of the Value class and visualized using the draw_dot() function defined in the graphviz section earlier:

o.backward()
draw_dot(o)
computational graphs
computational graph for a single two-input neuron with tanh activation
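As a quick spot check (an assumed extra cell, not in the original notebook): since $\frac{d}{dn}\tanh(n) = 1 - o^2 \approx 0.5$ at $n \approx 0.8814$, the leaf gradients should come out as

print(x1.grad)  # ≈ w1 * 0.5 = -1.5
print(w1.grad)  # ≈ x1 * 0.5 =  1.0
print(x2.grad)  # ≈ w2 * 0.5 =  0.5
print(w2.grad)  # ≈ x2 * 0.5 =  0.0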

Now, just as an extra exercise in object-oriented programming, the tanh function is further broken down as $\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$, and we implement it as a composite of exponentiation, addition (subtraction), and division. Division in the Value class is implemented more generally as a composite of multiplication and monomials, $\frac{a}{b} = a \cdot b^{-1}$, so we implement __pow__ in the Value class so that we can more generally produce monomials $x^n$ of arbitrary degree in our arithmetic system.
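As a quick check that the decomposition carries the same gradient, the quotient rule applied to this form recovers the familiar derivative:

$$\frac{d}{dx}\,\frac{e^{2x}-1}{e^{2x}+1} = \frac{2e^{2x}(e^{2x}+1) - 2e^{2x}(e^{2x}-1)}{(e^{2x}+1)^2} = \frac{4e^{2x}}{(e^{2x}+1)^2} = 1 - \tanh^2(x).$$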

Once the above steps are implemented in the Value class, we can produce a longer computational graph and verify that we still get the same gradients:

#after implementing pow, truediv, exp
# inputs x1, x2
x1 = Value(2.0,label='x1')
x2 = Value(0.0,label='x2')
# weights w1, w2
w1 = Value(-3.0,label='w1')
w2 = Value(1.0,label='w2')
# neuron bias
b = Value(6.88137358,label='b')

x1w1 = x1 * w1; x1w1.label = 'x1*w1'; print(x1w1)
x2w2 = x2 * w2; x2w2.label = 'x2*w2'; print(x2w2)
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'; print(x1w1x2w2)
nn = x1w1x2w2 + b; nn.label = 'nn'; print(nn)
ee = (2 * nn).exp(); print(ee)
o = (ee - 1) / (ee + 1)
o.backward()
draw_dot(o)
Value(data=-6.0)
Value(data=0.0)
Value(data=-6.0)
Value(data=0.88137358)
Value(data=5.828427042920401)
computational graphs
bigger computational graph for the same neuron
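To confirm the claim numerically (again an assumed extra cell), the leaf gradients can be printed once more; they should match the compact tanh version above:

print(x1.grad, w1.grad, x2.grad, w2.grad)  # expect ≈ -1.5, 1.0, 0.5, 0.0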

We now shift focus to achieving the same result with PyTorch, the most popular deep learning library. In the process, we can verify the functional equivalence of micrograd's and PyTorch's backward() functions:


# single two-param neuron backprop in pytorch
import torch
x1 = torch.Tensor([2.0]).double()   ;   x1.requires_grad = True
x2 = torch.Tensor([0.0]).double()   ;   x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double()  ;   w1.requires_grad = True
w2 = torch.Tensor([1.0]).double()   ;   w2.requires_grad = True
b = torch.Tensor([6.8813735]).double() ;   b.requires_grad = True
n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())
o.backward()

print('---')
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())

0.7071066904050358
---
x2 0.5000001283844369
w2 0.0
x1 -1.5000003851533106
w1 1.0000002567688737
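
One hedged way to tie the two engines together (a sketch that assumes the Value class above is still in scope; the variable names here are new, since x1, w1, etc. were just rebound to torch tensors) is to rebuild the same neuron in both and assert that the gradients agree:

import math
import torch

# micrograd side
mx1, mx2 = Value(2.0), Value(0.0)
mw1, mw2 = Value(-3.0), Value(1.0)
mb = Value(6.8813735)
mo = (mx1*mw1 + mx2*mw2 + mb).tanh()
mo.backward()

# PyTorch side
tx1 = torch.tensor([2.0], dtype=torch.double, requires_grad=True)
tx2 = torch.tensor([0.0], dtype=torch.double, requires_grad=True)
tw1 = torch.tensor([-3.0], dtype=torch.double, requires_grad=True)
tw2 = torch.tensor([1.0], dtype=torch.double, requires_grad=True)
tb  = torch.tensor([6.8813735], dtype=torch.double, requires_grad=True)
to = torch.tanh(tx1*tw1 + tx2*tw2 + tb)
to.backward()

# the two engines should agree on every leaf gradient
for m, t in [(mx1, tx1), (mx2, tx2), (mw1, tw1), (mw2, tw2)]:
    assert math.isclose(m.grad, t.grad.item(), rel_tol=1e-6)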

That's it from me! I'm skipping the part where Andrej discusses the nn.py classes, as those are primarily linguistic abstractions. The core mathematics is automatic differentiation on computational graphs. Backpropagation is simply a special case where such an autograd engine is applied to an "ansatz" of nonlinear functions we call neural networks. Have a wonderful day!

Roughwork:

a = Value(2.0)
b = Value(4.0)
a / b
Value(data=0.5)
a = Value(3.0, label='a')
b=a+a; b.label = 'b'
b.backward()
draw_dot(b)
chain rule error fixed
accounting for variable reuse in chain rule
a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b   ;   d.label = 'd'
e = a + b   ;   e.label = 'e'
f = d * e   ;   f.label = 'f'
f.backward()
draw_dot(f)
chain rule error fixed
another example of variable reuse in chain rule
a = Value(2.0)
2 * a

Value(data=4.0)

References

[1] Github: Micrograd by Andrej Karpathy
[2] Youtube: Lecture by Andrej Karpathy
[3] GeeksforGeeks: Computational Graphs

Written on March 8, 2024