Every developer who wants to understand AI will eventually reach for PyTorch or TensorFlow. These frameworks are excellent. But they abstract away everything important. You can train a transformer in 10 lines of PyTorch without understanding what a gradient is, what backpropagation does, or why your model isn't converging.
This tutorial takes a different approach: we'll build a neural network from scratch using only NumPy. You'll implement forward propagation, backpropagation, and gradient descent by hand โ and by the end, you'll have a working network that classifies handwritten digits.
pip install numpy matplotlib
The Big Picture: What Is a Neural Network?
A neural network is a chain of matrix multiplications, each followed by a nonlinear function. That's it. The magic is in how we learn which weights to use for those multiplications โ and that's gradient descent + backpropagation.
Our network will have:
- Input layer: 784 neurons (28ร28 pixel images, flattened)
- Hidden layer: 128 neurons with ReLU activation
- Output layer: 10 neurons with softmax (one per digit 0-9)
Building the Network
import numpy as np
class NeuralNetwork:
def __init__(self, layer_sizes, learning_rate=0.01):
self.lr = learning_rate
self.weights = []
self.biases = []
# Xavier initialization (important for stable training)
for i in range(len(layer_sizes) - 1):
fan_in = layer_sizes[i]
fan_out = layer_sizes[i + 1]
std = np.sqrt(2.0 / fan_in) # He init for ReLU
W = np.random.randn(fan_out, fan_in) * std
b = np.zeros((fan_out, 1))
self.weights.append(W)
self.biases.append(b)
# โโ Activation Functions โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def relu(self, z):
return np.maximum(0, z)
def relu_derivative(self, z):
return (z > 0).astype(float)
def softmax(self, z):
# Subtract max for numerical stability
exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
return exp_z / exp_z.sum(axis=0, keepdims=True)
# โโ Forward Pass โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def forward(self, X):
self.cache = {'A0': X} # Cache activations for backprop
A = X
for i, (W, b) in enumerate(zip(self.weights, self.biases)):
Z = W @ A + b # Linear: Z = WA + b
A = self.relu(Z) if i < len(self.weights) - 1 else self.softmax(Z)
self.cache[f'Z{i+1}'] = Z
self.cache[f'A{i+1}'] = A
return A # Probability distribution over classes
# โโ Loss Function โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def cross_entropy_loss(self, Y_hat, Y):
m = Y.shape[1]
# Clip to avoid log(0)
Y_hat_clipped = np.clip(Y_hat, 1e-9, 1 - 1e-9)
return -np.sum(Y * np.log(Y_hat_clipped)) / m
# โโ Backward Pass (Backpropagation) โโโโโโโโโโโโโโโโโโโโโโโ
def backward(self, Y):
m = Y.shape[1]
grads = {}
L = len(self.weights)
# Gradient of loss w.r.t. output (softmax + cross-entropy combined)
dA = self.cache[f'A{L}'] - Y # This is dL/dZ for the last layer
for i in reversed(range(L)):
A_prev = self.cache[f'A{i}']
Z = self.cache[f'Z{i+1}']
if i == L - 1:
dZ = dA # Already computed above for output layer
else:
dZ = dA * self.relu_derivative(Z)
grads[f'dW{i+1}'] = (dZ @ A_prev.T) / m
grads[f'db{i+1}'] = np.sum(dZ, axis=1, keepdims=True) / m
# Propagate gradient to previous layer
dA = self.weights[i].T @ dZ
return grads
# โโ Parameter Update โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def update_params(self, grads):
for i in range(len(self.weights)):
self.weights[i] -= self.lr * grads[f'dW{i+1}']
self.biases[i] -= self.lr * grads[f'db{i+1}']
# โโ Training Loop โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def train(self, X, Y, epochs=100, batch_size=64):
m = X.shape[1]
history = []
for epoch in range(epochs):
# Shuffle data
perm = np.random.permutation(m)
X_shuffled = X[:, perm]
Y_shuffled = Y[:, perm]
epoch_loss = 0
for start in range(0, m, batch_size):
X_batch = X_shuffled[:, start:start + batch_size]
Y_batch = Y_shuffled[:, start:start + batch_size]
Y_hat = self.forward(X_batch)
loss = self.cross_entropy_loss(Y_hat, Y_batch)
grads = self.backward(Y_batch)
self.update_params(grads)
epoch_loss += loss
avg_loss = epoch_loss / (m // batch_size)
history.append(avg_loss)
if epoch % 10 == 0:
acc = self.accuracy(X, Y)
print(f'Epoch {epoch:3d} | Loss: {avg_loss:.4f} | Accuracy: {acc:.2%}')
return history
def predict(self, X):
return np.argmax(self.forward(X), axis=0)
def accuracy(self, X, Y):
predictions = self.predict(X)
labels = np.argmax(Y, axis=0)
return np.mean(predictions == labels)
Loading MNIST and Training
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# Load MNIST
print("Loading MNIST...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)
# Normalize pixels to [0, 1]
X = X / 255.0
# One-hot encode labels
def one_hot(y, n_classes=10):
m = len(y)
Y = np.zeros((n_classes, m))
Y[y, np.arange(m)] = 1
return Y
Y = one_hot(y)
# Train/test split โ transpose to (features, samples) convention
X_train, X_test, Y_train, Y_test = train_test_split(X.T, Y.T, test_size=0.2, random_state=42)
# Fix orientation after split
X_train, X_test = X_train.T, X_test.T
Y_train, Y_test = Y_train.T, Y_test.T
# Build and train
print(f"Training on {X_train.shape[1]} samples...")
nn = NeuralNetwork([784, 128, 64, 10], learning_rate=0.05)
history = nn.train(X_train, Y_train, epochs=50, batch_size=128)
# Evaluate
test_acc = nn.accuracy(X_test, Y_test)
print(f"\nTest Accuracy: {test_acc:.2%}")
# Expect ~97% with this architecture
What Backpropagation Is Actually Doing
Backpropagation is just the chain rule from calculus, applied systematically from the output back to each layer's weights. The key insight:
We want to know: "If I change this weight by a tiny amount, how much does the loss change?" That's the partial derivative of loss with respect to that weight โ the gradient.
We then move each weight in the opposite direction of its gradient (gradient descent), because moving against the gradient decreases the loss. Repeat thousands of times. That's training.
The reason we need the chain rule: the loss depends on the output, the output depends on the last layer's activations, which depend on the second-to-last layer, which depend on weights many layers back. The chain rule lets us multiply these dependencies together efficiently.
What's Different in PyTorch/TensorFlow
Once you understand the above, framework features make sense:
- Automatic differentiation (autograd) โ The framework tracks operations and computes gradients automatically. Our
backward()was manual; autograd does this symbolically for any computation graph. - GPU acceleration โ Matrix multiplications parallelized on thousands of CUDA cores. 100-1000x speedup for large models.
- Optimizers beyond SGD โ Adam, AdaGrad, RMSProp. They adapt learning rates per parameter for faster convergence.
- Batch normalization, dropout, skip connections โ Techniques that make deep networks (dozens of layers) trainable.
None of these are magical. They're engineered solutions to specific training problems. Now that you've built the foundation, the framework abstractions will make sense โ and you'll know when to look under the hood when something goes wrong.