Building a Neural Network for MNIST Digit Classification from Scratch
In this post, I’ll walk through how I built a neural network for handwritten digit classification using only NumPy - no TensorFlow, PyTorch, or any other deep learning framework. We’ll dive into the math behind neural networks, explore the code implementation, and see how to use the trained model to classify new handwritten digits.
Introduction
The MNIST dataset is the “Hello World” of machine learning - a collection of 70,000 handwritten digits (60,000 for training, 10,000 for testing) that has been used for benchmarking machine learning algorithms for decades. While modern deep learning frameworks make it trivial to achieve 99%+ accuracy on MNIST, building a neural network from scratch helps us truly understand what’s happening under the hood.
Our project repository is structured as follows:
handwritte-digit-classifier/
│
├── data/
│ ├── train_images.npy # Training images
│ ├── train_labels.npy # Training labels
│ ├── test_images.npy # Test images
│ ├── test_labels.npy # Test labels
│
├── src/
│ ├── __init__.py # Marks this directory as a package
│ ├── data_loader.py # Loads the MNIST data and preprocesses it
│ ├── neural_network.py # Contains the neural network code (forward pass, backpropagation, etc.)
│ ├── utils.py # Helper functions for evaluation, loss calculation, etc.
│
├── tests/
│ ├── setup.py # MNIST Training & Testing data setup.
│ ├── test_single_digit.py # Testing our own image.
│
├── main.py # The entry point to train the model and run the project
├── requirements.txt # Python dependencies for the project
└── README.md # Project description and setup instructions
The Math Behind Neural Networks
Let’s explore the key mathematical concepts behind our neural network implementation.
Architecture
We implemented a simple feedforward neural network with:
- Input layer: 784 neurons (28×28 pixels flattened)
- Hidden layer: 128 neurons with ReLU activation
- Output layer: 10 neurons with softmax activation (one for each digit 0-9)
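With this architecture, the model has 784 × 128 + 128 = 100,480 parameters in the first layer and 128 × 10 + 10 = 1,290 in the second, for a total of 101,770 trainable weights and biases.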
Forward Propagation
In the forward pass, we compute the network’s prediction for a given input:
- Input to Hidden Layer:
\[Z^{[1]} = X W^{[1]} + b^{[1]}\]
\[A^{[1]} = \text{ReLU}(Z^{[1]}) = \max(0, Z^{[1]})\]
- Hidden Layer to Output Layer:
\[Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]}\]
\[A^{[2]}_i = \text{softmax}(Z^{[2]})_i = \frac{e^{Z^{[2]}_i}}{\sum_j e^{Z^{[2]}_j}}\]
Where:
- \(X\) is the input data (batch_size × 784)
- \(W^{[1]}\) is the weight matrix for the first layer (784 × 128)
- \(b^{[1]}\) is the bias vector for the first layer (1 × 128)
- \(A^{[1]}\) is the activation output from the first layer
- \(W^{[2]}\) is the weight matrix for the second layer (128 × 10)
- \(b^{[2]}\) is the bias vector for the second layer (1 × 10)
- \(A^{[2]}\) is the final output (probabilities for each digit)
The softmax function converts the raw output scores into probabilities that sum to 1, making it perfect for classification tasks.
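To make this concrete, here is a small, numerically stable softmax in NumPy. Subtracting the row-wise maximum before exponentiating avoids overflow without changing the result; this is a sketch of what the class's softmax helper needs to compute, not necessarily the repository's exact code:
import numpy as np

def softmax(z):
    # z: (batch_size, 10) raw scores; subtract the row max for numerical stability
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
print(softmax(scores))        # probabilities for each digit
print(softmax(scores).sum())  # 1.0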
Loss Function
We use cross-entropy loss to measure the error between our predictions and the true labels:
\[L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{10} y_{ij} \log(a_{ij})\]
Where:
- \(m\) is the batch size
- \(y_{ij}\) is the true probability of example \(i\) being class \(j\) (from one-hot encoding)
- \(a_{ij}\) is the predicted probability of example \(i\) being class \(j\)
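As a quick sanity check of the formula, the same loss can be computed in NumPy for a batch of one-hot labels y and predicted probabilities a; the clip guards against log(0). This mirrors the loss line used later in train_step:
import numpy as np

def cross_entropy(y, a, eps=1e-10):
    # y: (m, 10) one-hot labels, a: (m, 10) predicted probabilities
    # Mean over the batch of the negative log-probability assigned to the true class
    return -np.mean(np.sum(y * np.log(np.clip(a, eps, 1.0)), axis=1))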
Backpropagation
Backpropagation computes the gradients of the loss with respect to the weights and biases:
- Output Layer Gradients:
\[dZ^{[2]} = A^{[2]} - Y\]
\[dW^{[2]} = \frac{1}{m} A^{[1]T} dZ^{[2]}\]
\[db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2]}_i\]
- Hidden Layer Gradients:
\[dA^{[1]} = dZ^{[2]} W^{[2]T}\]
\[dZ^{[1]} = dA^{[1]} * \text{ReLU}'(Z^{[1]}) = dA^{[1]} * (Z^{[1]} > 0)\]
\[dW^{[1]} = \frac{1}{m} X^T dZ^{[1]}\]
\[db^{[1]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1]}_i\]
Where:
- \(Y\) is the matrix of true labels (one-hot encoded)
- \(dZ^{[2]}\) is the gradient of the loss with respect to the output layer pre-activation
- \(dW^{[2]}\) is the gradient of the loss with respect to \(W^{[2]}\)
- \(db^{[2]}\) is the gradient of the loss with respect to \(b^{[2]}\)
- The ReLU derivative is 1 for positive inputs and 0 for negative inputs
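A standard way to verify these formulas is a finite-difference gradient check: nudge a single weight, recompute the loss, and compare the numerical slope with the analytic gradient. A minimal sketch, assuming the NeuralNetwork class and cross-entropy loss defined later in this post:
import numpy as np

def grad_check_w2(nn, X, y, i=0, j=0, eps=1e-5):
    # Analytic gradient from backpropagation
    _, cache = nn.forward(X)
    analytic = nn.backward(y, cache)['W2'][i, j]

    def loss_at(w):
        old = nn.W2[i, j]
        nn.W2[i, j] = w
        out, _ = nn.forward(X)
        nn.W2[i, j] = old
        return -np.mean(np.sum(y * np.log(np.clip(out, 1e-10, 1.0)), axis=1))

    # Central-difference approximation of dL/dW2[i, j]
    numeric = (loss_at(nn.W2[i, j] + eps) - loss_at(nn.W2[i, j] - eps)) / (2 * eps)
    return analytic, numeric  # these should agree to several decimal places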
Parameter Updates
After computing the gradients, we update the parameters using gradient descent:
\[W^{[1]} = W^{[1]} - \alpha \cdot dW^{[1]}\]
\[b^{[1]} = b^{[1]} - \alpha \cdot db^{[1]}\]
\[W^{[2]} = W^{[2]} - \alpha \cdot dW^{[2]}\]
\[b^{[2]} = b^{[2]} - \alpha \cdot db^{[2]}\]
Where \(\alpha\) is the learning rate, controlling the step size of our updates.
The Code Implementation
Let’s look at the key components of our implementation:
Neural Network Class
The core of our project is the NeuralNetwork class:
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        # Xavier/Glorot initialization for better convergence
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        self.learning_rate = learning_rate
We use Xavier/Glorot initialization for the weights, which helps prevent vanishing/exploding gradients by scaling the random values based on the layer sizes.
Forward and Backward Passes
The forward pass computes the network’s prediction:
def forward(self, X):
    # First layer
    z1 = np.dot(X, self.W1) + self.b1
    a1 = self.relu(z1)

    # Output layer
    z2 = np.dot(a1, self.W2) + self.b2
    a2 = self.softmax(z2)

    # Store intermediate values for backpropagation
    cache = {
        'X': X, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2
    }
    return a2, cache
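The relu and relu_derivative helpers referenced in forward and backward aren't shown in this post; minimal NumPy versions consistent with the math above might look like this (the softmax helper can follow the stable version sketched earlier):
def relu(self, z):
    # Elementwise max(0, z)
    return np.maximum(0, z)

def relu_derivative(self, z):
    # 1 where z > 0, 0 elsewhere (the subgradient at 0 is taken as 0)
    return (z > 0).astype(float)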
The backward pass computes the gradients:
def backward(self, y, cache):
    # Get batch size
    batch_size = y.shape[0]

    # Unpack cache
    X = cache['X']
    a1 = cache['a1']
    a2 = cache['a2']
    z1 = cache['z1']

    # Gradient of output layer
    dz2 = a2 - y  # Derivative of softmax with cross-entropy
    dW2 = (1 / batch_size) * np.dot(a1.T, dz2)
    db2 = (1 / batch_size) * np.sum(dz2, axis=0, keepdims=True)

    # Gradient of hidden layer
    da1 = np.dot(dz2, self.W2.T)
    dz1 = da1 * self.relu_derivative(z1)
    dW1 = (1 / batch_size) * np.dot(X.T, dz1)
    db1 = (1 / batch_size) * np.sum(dz1, axis=0, keepdims=True)

    return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
Training Process
Training happens through the train_step method, which combines forward propagation, backpropagation, and parameter updates:
def train_step(self, X, y):
    # Forward pass
    output, cache = self.forward(X)

    # Backward pass
    gradients = self.backward(y, cache)

    # Update parameters
    self.update_parameters(gradients)

    # Compute loss
    loss = -np.mean(np.sum(y * np.log(np.clip(output, 1e-10, 1.0)), axis=1))
    return loss
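train_step calls self.update_parameters, which isn't shown above. Given the update rules from the math section, a minimal version would apply plain gradient descent (the repository's actual method may differ in details):
def update_parameters(self, gradients):
    # Vanilla gradient descent: step each parameter against its gradient
    self.W1 -= self.learning_rate * gradients['W1']
    self.b1 -= self.learning_rate * gradients['b1']
    self.W2 -= self.learning_rate * gradients['W2']
    self.b2 -= self.learning_rate * gradients['b2']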
Understanding the Training Process
Let’s demystify what happens during training and what those outputs mean.
What is an Epoch?
An epoch is one complete pass through the entire training dataset. In our implementation, we train for 10 epochs. For each epoch (a sketch of the full loop follows this list):
- We shuffle the training data to prevent the model from learning the order of samples
- We divide the data into mini-batches of size 64
- For each mini-batch, we:
  - Perform a forward pass to get predictions
  - Calculate the loss
  - Perform backpropagation to get gradients
  - Update the model parameters
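Putting these steps together, the epoch loop might look like the following sketch (names such as train_images, train_labels, and train are illustrative, not the repository's exact API; labels are assumed to be one-hot encoded):
import numpy as np

def train(nn, train_images, train_labels, epochs=10, batch_size=64):
    # train_images: (N, 784), train_labels: (N, 10) one-hot
    n = train_images.shape[0]
    for epoch in range(epochs):
        # Shuffle so the model doesn't learn the sample order
        perm = np.random.permutation(n)
        images, labels = train_images[perm], train_labels[perm]
        losses = []
        # Iterate over mini-batches of size 64
        for start in range(0, n, batch_size):
            X_batch = images[start:start + batch_size]
            y_batch = labels[start:start + batch_size]
            losses.append(nn.train_step(X_batch, y_batch))
        print(f"Epoch {epoch + 1}/{epochs}, Avg. Loss: {np.mean(losses):.4f}")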
Interpreting the Training Output
Here’s a sample of the output during training:
Epoch 10/10 completed in 7.79s, Avg. Loss: 0.2439, Test Accuracy: 93.52%, Test Loss: 0.2318
Let’s break this down:
- Avg. Loss (0.2439): This is the cross-entropy loss averaged across all training batches in the epoch. Lower values indicate better performance.
- Test Accuracy (93.52%): After each epoch, we evaluate the model on the test set. This percentage means the model correctly classified 93.52% of the test images. This is our most important metric, as it tells us how well the model generalizes to unseen data.
- Test Loss (0.2318): This is the cross-entropy loss calculated on the test set. It's close to the training loss, which is a good sign that the model isn't overfitting.
Reaching over 93% accuracy with nothing but NumPy is a respectable result for such a simple network, and it leaves clear room for the improvements discussed later in this post.
Testing the Model on New Images
One of the most exciting parts is using our trained model to classify new handwritten digits. We've created test_single_digit.py to do exactly that.
How the Testing Script Works
- Load the trained model parameters:
import os
import numpy as np
from src.neural_network import NeuralNetwork  # adjust the import path to match how you run the script

def load_model(model_dir='models'):
    W1 = np.load(os.path.join(model_dir, 'W1.npy'))
    b1 = np.load(os.path.join(model_dir, 'b1.npy'))
    W2 = np.load(os.path.join(model_dir, 'W2.npy'))
    b2 = np.load(os.path.join(model_dir, 'b2.npy'))
    nn = NeuralNetwork(W1.shape[0], W1.shape[1], W2.shape[1])
    nn.W1, nn.b1, nn.W2, nn.b2 = W1, b1, W2, b2
    return nn
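For this to work, the trained parameters need to have been saved after training. The saving code isn't shown in this post; something along these lines would produce the files load_model expects (save_model is a hypothetical helper name):
import os
import numpy as np

def save_model(nn, model_dir='models'):
    # Persist each parameter array so load_model can restore them later
    os.makedirs(model_dir, exist_ok=True)
    np.save(os.path.join(model_dir, 'W1.npy'), nn.W1)
    np.save(os.path.join(model_dir, 'b1.npy'), nn.b1)
    np.save(os.path.join(model_dir, 'W2.npy'), nn.W2)
    np.save(os.path.join(model_dir, 'b2.npy'), nn.b2)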
- Preprocess the input image:
from PIL import Image, ImageOps
import numpy as np

def preprocess_image(image_path):
    # Open the image and convert to grayscale
    img = Image.open(image_path).convert('L')
    # Resize to 28x28 pixels
    img = img.resize((28, 28))
    # Invert if needed (MNIST has white digits on black background)
    img = ImageOps.invert(img)
    # Convert to numpy array and normalize to [0, 1]
    img_array = np.array(img) / 255.0
    # Flatten the 28x28 image to a 784-dimensional vector
    flattened = img_array.reshape(1, 784)
    return flattened, img_array
- Make the prediction:
def predict_digit(model, image_path):
    # Preprocess the image
    img_data, img_array = preprocess_image(image_path)
    # Make the prediction
    prediction = model.predict(img_data)[0]
    # Get the probabilities
    output, _ = model.forward(img_data)
    probabilities = output[0]
    return prediction, probabilities, img_array
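The predict method used above isn't shown earlier in the post. Given the forward pass, it presumably just takes the argmax over the output probabilities; a minimal version consistent with this usage would be:
def predict(self, X):
    # Index of the highest softmax probability for each input row
    output, _ = self.forward(X)
    return np.argmax(output, axis=1)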
Using the Testing Script
To test the model on a new image:
- Save the image as a PNG/JPG file
- Run the script:
python test_single_digit.py path/to/your/digit/image.png
- The script will display:
- The image with the predicted digit
- A bar chart showing confidence levels for each possible digit
- The prediction and confidence percentages in the terminal
Why This Works: The Magic of Neural Networks
Our model achieves good accuracy because neural networks excel at finding patterns in high-dimensional data. For the MNIST dataset:
- Each 28×28 image becomes a 784-dimensional vector
- During training, the network learns to associate patterns in these vectors with specific digits
- The weights in the neural network encode these patterns
- When given a new image, the network computes how well it matches the learned patterns for each digit
The beauty of this approach is that we never explicitly tell the network what features to look for - it learns them automatically from the data!
Limitations and Potential Improvements
While our model works well, it has limitations:
- Architecture: Our simple two-layer network has limited capacity. Adding more layers would enable learning more complex patterns.
- Regularization: We don't use any regularization techniques like dropout or L2 regularization, which could help reduce overfitting.
- Optimization: We use vanilla gradient descent. Techniques like momentum, RMSprop, or Adam could lead to faster and better convergence (see the sketch after this list).
- Data Augmentation: We could generate additional training data by applying transformations to the existing images.
- Preprocessing: More sophisticated preprocessing techniques might improve performance.
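To illustrate the optimization point, here is what a momentum-based update could look like in place of the plain gradient-descent step. This is only a sketch: the velocity attributes (v_W1, v_b1, v_W2, v_b2) are assumed to be initialized to zeros in __init__, which the current class does not do.
def update_parameters_momentum(self, gradients, beta=0.9):
    # Keep an exponentially weighted average of past gradients (the "velocity")
    # and step each parameter along that smoothed direction
    for name in ('W1', 'b1', 'W2', 'b2'):
        v = beta * getattr(self, 'v_' + name) + (1 - beta) * gradients[name]
        setattr(self, 'v_' + name, v)
        setattr(self, name, getattr(self, name) - self.learning_rate * v)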
Conclusion
Building a neural network from scratch is an incredibly rewarding experience. Not only have we achieved respectable performance on the MNIST dataset (93.52% accuracy), but we’ve gained a deep understanding of how neural networks work internally.
This knowledge is invaluable even when using high-level frameworks like TensorFlow or PyTorch, as it helps us make informed decisions about architecture, hyperparameters, and debugging.
We encourage you to experiment with the code - try different hyperparameters, add more layers, or implement regularization techniques. The possibilities are endless!
References
- Neural Networks and Deep Learning - Michael Nielsen
- Backpropagation Explained - Christopher Olah
- Xavier Initialization - Xavier Glorot & Yoshua Bengio
- ReLU Activation Function - Wikipedia
- Softmax Function - Wikipedia
All code for this project is available in the GitHub repository.