Every developer who wants to understand AI will eventually reach for PyTorch or TensorFlow. These frameworks are excellent. But they abstract away everything important. You can train a transformer in 10 lines of PyTorch without understanding what a gradient is, what backpropagation does, or why your model isn't converging.

This tutorial takes a different approach: we'll build a neural network from scratch using only NumPy. You'll implement forward propagation, backpropagation, and gradient descent by hand โ€” and by the end, you'll have a working network that classifies handwritten digits.

๐Ÿ“‹
Prerequisites
Basic Python, comfort with NumPy arrays, and high school calculus (derivatives โ€” you don't need much). Install: pip install numpy matplotlib

The Big Picture: What Is a Neural Network?

A neural network is a chain of matrix multiplications, each followed by a nonlinear function. That's it. The magic is in how we learn which weights to use for those multiplications โ€” and that's gradient descent + backpropagation.

Our network will have:

  • Input layer: 784 neurons (28ร—28 pixel images, flattened)
  • Hidden layer: 128 neurons with ReLU activation
  • Output layer: 10 neurons with softmax (one per digit 0-9)

Building the Network

import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.01):
        self.lr = learning_rate
        self.weights = []
        self.biases  = []

        # Xavier initialization (important for stable training)
        for i in range(len(layer_sizes) - 1):
            fan_in  = layer_sizes[i]
            fan_out = layer_sizes[i + 1]
            std = np.sqrt(2.0 / fan_in)  # He init for ReLU
            W = np.random.randn(fan_out, fan_in) * std
            b = np.zeros((fan_out, 1))
            self.weights.append(W)
            self.biases.append(b)

    # โ”€โ”€ Activation Functions โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    def relu(self, z):
        return np.maximum(0, z)

    def relu_derivative(self, z):
        return (z > 0).astype(float)

    def softmax(self, z):
        # Subtract max for numerical stability
        exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
        return exp_z / exp_z.sum(axis=0, keepdims=True)

    # โ”€โ”€ Forward Pass โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    def forward(self, X):
        self.cache = {'A0': X}  # Cache activations for backprop

        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = W @ A + b                    # Linear: Z = WA + b
            A = self.relu(Z) if i < len(self.weights) - 1 else self.softmax(Z)
            self.cache[f'Z{i+1}'] = Z
            self.cache[f'A{i+1}'] = A

        return A  # Probability distribution over classes

    # โ”€โ”€ Loss Function โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    def cross_entropy_loss(self, Y_hat, Y):
        m = Y.shape[1]
        # Clip to avoid log(0)
        Y_hat_clipped = np.clip(Y_hat, 1e-9, 1 - 1e-9)
        return -np.sum(Y * np.log(Y_hat_clipped)) / m

    # โ”€โ”€ Backward Pass (Backpropagation) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    def backward(self, Y):
        m = Y.shape[1]
        grads = {}
        L = len(self.weights)

        # Gradient of loss w.r.t. output (softmax + cross-entropy combined)
        dA = self.cache[f'A{L}'] - Y  # This is dL/dZ for the last layer

        for i in reversed(range(L)):
            A_prev = self.cache[f'A{i}']
            Z      = self.cache[f'Z{i+1}']

            if i == L - 1:
                dZ = dA  # Already computed above for output layer
            else:
                dZ = dA * self.relu_derivative(Z)

            grads[f'dW{i+1}'] = (dZ @ A_prev.T) / m
            grads[f'db{i+1}'] = np.sum(dZ, axis=1, keepdims=True) / m

            # Propagate gradient to previous layer
            dA = self.weights[i].T @ dZ

        return grads

    # โ”€โ”€ Parameter Update โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    def update_params(self, grads):
        for i in range(len(self.weights)):
            self.weights[i] -= self.lr * grads[f'dW{i+1}']
            self.biases[i]  -= self.lr * grads[f'db{i+1}']

    # โ”€โ”€ Training Loop โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    def train(self, X, Y, epochs=100, batch_size=64):
        m = X.shape[1]
        history = []

        for epoch in range(epochs):
            # Shuffle data
            perm = np.random.permutation(m)
            X_shuffled = X[:, perm]
            Y_shuffled = Y[:, perm]

            epoch_loss = 0
            for start in range(0, m, batch_size):
                X_batch = X_shuffled[:, start:start + batch_size]
                Y_batch = Y_shuffled[:, start:start + batch_size]

                Y_hat = self.forward(X_batch)
                loss  = self.cross_entropy_loss(Y_hat, Y_batch)
                grads = self.backward(Y_batch)
                self.update_params(grads)
                epoch_loss += loss

            avg_loss = epoch_loss / (m // batch_size)
            history.append(avg_loss)

            if epoch % 10 == 0:
                acc = self.accuracy(X, Y)
                print(f'Epoch {epoch:3d} | Loss: {avg_loss:.4f} | Accuracy: {acc:.2%}')

        return history

    def predict(self, X):
        return np.argmax(self.forward(X), axis=0)

    def accuracy(self, X, Y):
        predictions = self.predict(X)
        labels      = np.argmax(Y, axis=0)
        return np.mean(predictions == labels)

Loading MNIST and Training

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load MNIST
print("Loading MNIST...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y  = mnist.data, mnist.target.astype(int)

# Normalize pixels to [0, 1]
X = X / 255.0

# One-hot encode labels
def one_hot(y, n_classes=10):
    m = len(y)
    Y = np.zeros((n_classes, m))
    Y[y, np.arange(m)] = 1
    return Y

Y = one_hot(y)

# Train/test split โ€” transpose to (features, samples) convention
X_train, X_test, Y_train, Y_test = train_test_split(X.T, Y.T, test_size=0.2, random_state=42)
# Fix orientation after split
X_train, X_test = X_train.T, X_test.T
Y_train, Y_test = Y_train.T, Y_test.T

# Build and train
print(f"Training on {X_train.shape[1]} samples...")
nn = NeuralNetwork([784, 128, 64, 10], learning_rate=0.05)
history = nn.train(X_train, Y_train, epochs=50, batch_size=128)

# Evaluate
test_acc = nn.accuracy(X_test, Y_test)
print(f"\nTest Accuracy: {test_acc:.2%}")
# Expect ~97% with this architecture
๐Ÿ“ˆ
Expected Results
With this architecture and 50 epochs, you should see roughly 96-97% test accuracy on MNIST. Not state-of-the-art (CNNs get 99.7%), but remarkable for 150 lines of Python and a single hidden layer.

What Backpropagation Is Actually Doing

Backpropagation is just the chain rule from calculus, applied systematically from the output back to each layer's weights. The key insight:

We want to know: "If I change this weight by a tiny amount, how much does the loss change?" That's the partial derivative of loss with respect to that weight โ€” the gradient.

We then move each weight in the opposite direction of its gradient (gradient descent), because moving against the gradient decreases the loss. Repeat thousands of times. That's training.

The reason we need the chain rule: the loss depends on the output, the output depends on the last layer's activations, which depend on the second-to-last layer, which depend on weights many layers back. The chain rule lets us multiply these dependencies together efficiently.

What's Different in PyTorch/TensorFlow

Once you understand the above, framework features make sense:

  • Automatic differentiation (autograd) โ€” The framework tracks operations and computes gradients automatically. Our backward() was manual; autograd does this symbolically for any computation graph.
  • GPU acceleration โ€” Matrix multiplications parallelized on thousands of CUDA cores. 100-1000x speedup for large models.
  • Optimizers beyond SGD โ€” Adam, AdaGrad, RMSProp. They adapt learning rates per parameter for faster convergence.
  • Batch normalization, dropout, skip connections โ€” Techniques that make deep networks (dozens of layers) trainable.

None of these are magical. They're engineered solutions to specific training problems. Now that you've built the foundation, the framework abstractions will make sense โ€” and you'll know when to look under the hood when something goes wrong.

TF Editorial

TF Editorial

Editorial Team ยท Tomfoolering

We write about technology with depth and without condescension.