Building a Neural Network for MNIST Digit Classification from Scratch
In this post, I’ll walk through how I built a neural network for handwritten digit classification using only NumPy - no TensorFlow, PyTorch, or any other deep learning framework. We’ll dive into the math behind neural networks, explore the code implementation, and see how to use the trained model to classify new handwritten digits.
Introduction
The MNIST dataset is the “Hello World” of machine learning - a collection of 70,000 handwritten digits (60,000 for training, 10,000 for testing) that has been used for benchmarking machine learning algorithms for decades. While modern deep learning frameworks make it trivial to achieve 99%+ accuracy on MNIST, building a neural network from scratch helps us truly understand what’s happening under the hood.
Our project repository is structured as follows:
handwritte-digit-classifier/
│
├── data/
│ ├── train_images.npy # Training images
│ ├── train_labels.npy # Training labels
│ ├── test_images.npy # Test images
│ ├── test_labels.npy # Test labels
│
├── src/
│ ├── __init__.py # Marks this directory as a package
│ ├── data_loader.py # Loads the MNIST data and preprocesses it
│ ├── neural_network.py # Contains the neural network code (forward pass, backpropagation, etc.)
│ ├── utils.py # Helper functions for evaluation, loss calculation, etc.
│
├── tests/
│ ├── setup.py # MNIST Training & Testing data setup.
│ ├── test_single_digit.py # Testing our own image.
│
├── main.py # The entry point to train the model and run the project
├── requirements.txt # Python dependencies for the project
└── README.md # Project description and setup instructions
The Math Behind Neural Networks
Let’s explore the key mathematical concepts behind our neural network implementation.
Architecture
We implemented a simple feedforward neural network with:
- Input layer: 784 neurons (28×28 pixels flattened)
- Hidden layer: 128 neurons with ReLU activation
- Output layer: 10 neurons with softmax activation (one for each digit 0-9)
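With this architecture, the model has 784 × 128 + 128 = 100,480 parameters in the first layer and 128 × 10 + 10 = 1,290 in the second, for a total of 101,770 trainable weights and biases.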
Forward Propagation
In the forward pass, we compute the network’s prediction for a given input:
- Input to Hidden Layer:
\[Z^{[1]} = X W^{[1]} + b^{[1]}\]
\[A^{[1]} = \text{ReLU}(Z^{[1]}) = \max(0, Z^{[1]})\]
- Hidden Layer to Output Layer:
\[Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]}\]
\[A^{[2]}_i = \text{softmax}(Z^{[2]})_i = \frac{e^{Z^{[2]}_i}}{\sum_j e^{Z^{[2]}_j}}\]
Where:
- \(X\) is the input data (batch_size × 784)
- \(W^{[1]}\) is the weight matrix for the first layer (784 × 128)
- \(b^{[1]}\) is the bias vector for the first layer (1 × 128)
- \(A^{[1]}\) is the activation output from the first layer
- \(W^{[2]}\) is the weight matrix for the second layer (128 × 10)
- \(b^{[2]}\) is the bias vector for the second layer (1 × 10)
- \(A^{[2]}\) is the final output (probabilities for each digit)
The softmax function converts the raw output scores into probabilities that sum to 1, making it perfect for classification tasks.
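To make this concrete, here is a small, numerically stable softmax in NumPy. Subtracting the row-wise maximum before exponentiating avoids overflow without changing the result; this is a sketch of what the class's softmax helper needs to compute, not necessarily the repository's exact code:
import numpy as np

def softmax(z):
    # z: (batch_size, 10) raw scores; subtract the row max for numerical stability
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
print(softmax(scores))        # probabilities for each digit
print(softmax(scores).sum())  # 1.0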
Loss Function
We use cross-entropy loss to measure the error between our predictions and the true labels:
\[L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{10} y_{ij} \log(a_{ij})\]
Where:
- \(m\) is the batch size
- \(y_{ij}\) is the true probability of example \(i\) being class \(j\) (from one-hot encoding)
- \(a_{ij}\) is the predicted probability of example \(i\) being class \(j\)
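As a quick sanity check of the formula, the same loss can be computed in NumPy for a batch of one-hot labels y and predicted probabilities a; the clip guards against log(0). This mirrors the loss line used later in train_step:
import numpy as np

def cross_entropy(y, a, eps=1e-10):
    # y: (m, 10) one-hot labels, a: (m, 10) predicted probabilities
    # Mean over the batch of the negative log-probability assigned to the true class
    return -np.mean(np.sum(y * np.log(np.clip(a, eps, 1.0)), axis=1))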
Backpropagation
Backpropagation computes the gradients of the loss with respect to the weights and biases:
- Output Layer Gradients:
\[dZ^{[2]} = A^{[2]} - Y\]
\[dW^{[2]} = \frac{1}{m} A^{[1]T} dZ^{[2]}\]
\[db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2]}_i\]
- Hidden Layer Gradients:
\[dA^{[1]} = dZ^{[2]} W^{[2]T}\]
\[dZ^{[1]} = dA^{[1]} * \text{ReLU}'(Z^{[1]}) = dA^{[1]} * (Z^{[1]} > 0)\]
\[dW^{[1]} = \frac{1}{m} X^T dZ^{[1]}\]
\[db^{[1]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1]}_i\]
Where:
- \(Y\) is the matrix of true labels (one-hot encoded)
- \(dZ^{[2]}\) is the gradient of the loss with respect to the output layer pre-activation
- \(dW^{[2]}\) is the gradient of the loss with respect to \(W^{[2]}\)
- \(db^{[2]}\) is the gradient of the loss with respect to \(b^{[2]}\)
- The ReLU derivative is 1 for positive inputs and 0 for negative inputs
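A standard way to verify these formulas is a finite-difference gradient check: nudge a single weight, recompute the loss, and compare the numerical slope with the analytic gradient. A minimal sketch, assuming the NeuralNetwork class and cross-entropy loss defined later in this post:
import numpy as np

def grad_check_w2(nn, X, y, i=0, j=0, eps=1e-5):
    # Analytic gradient from backpropagation
    _, cache = nn.forward(X)
    analytic = nn.backward(y, cache)['W2'][i, j]

    def loss_at(w):
        old = nn.W2[i, j]
        nn.W2[i, j] = w
        out, _ = nn.forward(X)
        nn.W2[i, j] = old
        return -np.mean(np.sum(y * np.log(np.clip(out, 1e-10, 1.0)), axis=1))

    # Central-difference approximation of dL/dW2[i, j]
    numeric = (loss_at(nn.W2[i, j] + eps) - loss_at(nn.W2[i, j] - eps)) / (2 * eps)
    return analytic, numeric  # these should agree to several decimal places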
Parameter Updates
After computing the gradients, we update the parameters using gradient descent:
\[W^{[1]} = W^{[1]} - \alpha \cdot dW^{[1]}\]
\[b^{[1]} = b^{[1]} - \alpha \cdot db^{[1]}\]
\[W^{[2]} = W^{[2]} - \alpha \cdot dW^{[2]}\]
\[b^{[2]} = b^{[2]} - \alpha \cdot db^{[2]}\]
Where \(\alpha\) is the learning rate, controlling the step size of our updates.
The Code Implementation
Let’s look at the key components of our implementation:
Neural Network Class
The core of our project is the NeuralNetwork class:
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        # Xavier/Glorot initialization for better convergence
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        self.learning_rate = learning_rate
We use Xavier/Glorot initialization for the weights, which helps prevent vanishing/exploding gradients by scaling the random values based on the layer sizes.
Forward and Backward Passes
The forward pass computes the network’s prediction:
def forward(self, X):
    # First layer
    z1 = np.dot(X, self.W1) + self.b1
    a1 = self.relu(z1)

    # Output layer
    z2 = np.dot(a1, self.W2) + self.b2
    a2 = self.softmax(z2)

    # Store intermediate values for backpropagation
    cache = {
        'X': X, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2
    }
    return a2, cache
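The relu and relu_derivative helpers referenced in forward and backward aren't shown in this post; minimal NumPy versions consistent with the math above might look like this (the softmax helper can follow the stable version sketched earlier):
def relu(self, z):
    # Elementwise max(0, z)
    return np.maximum(0, z)

def relu_derivative(self, z):
    # 1 where z > 0, 0 elsewhere (the subgradient at 0 is taken as 0)
    return (z > 0).astype(float)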
The backward pass computes the gradients:
def backward(self, y, cache):
    # Get batch size
    batch_size = y.shape[0]

    # Unpack cache
    X = cache['X']
    a1 = cache['a1']
    a2 = cache['a2']
    z1 = cache['z1']

    # Gradient of output layer
    dz2 = a2 - y  # Derivative of softmax with cross-entropy
    dW2 = (1 / batch_size) * np.dot(a1.T, dz2)
    db2 = (1 / batch_size) * np.sum(dz2, axis=0, keepdims=True)

    # Gradient of hidden layer
    da1 = np.dot(dz2, self.W2.T)
    dz1 = da1 * self.relu_derivative(z1)
    dW1 = (1 / batch_size) * np.dot(X.T, dz1)
    db1 = (1 / batch_size) * np.sum(dz1, axis=0, keepdims=True)

    return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
Training Process
Training happens through the train_step method, which combines forward propagation, backpropagation, and parameter updates:
def train_step(self, X, y):
    # Forward pass
    output, cache = self.forward(X)

    # Backward pass
    gradients = self.backward(y, cache)

    # Update parameters
    self.update_parameters(gradients)

    # Compute loss
    loss = -np.mean(np.sum(y * np.log(np.clip(output, 1e-10, 1.0)), axis=1))
    return loss
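train_step calls self.update_parameters, which isn't shown above. Given the update rules from the math section, a minimal version would apply plain gradient descent (the repository's actual method may differ in details):
def update_parameters(self, gradients):
    # Vanilla gradient descent: step each parameter against its gradient
    self.W1 -= self.learning_rate * gradients['W1']
    self.b1 -= self.learning_rate * gradients['b1']
    self.W2 -= self.learning_rate * gradients['W2']
    self.b2 -= self.learning_rate * gradients['b2']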
Understanding the Training Process
Let’s demystify what happens during training and what those outputs mean.
What is an Epoch?
An epoch is one complete pass through the entire training dataset. In our implementation, we train for 10 epochs. For each epoch (a sketch of the full loop follows this list):
- We shuffle the training data to prevent the model from learning the order of samples
- We divide the data into mini-batches of size 64
- For each mini-batch, we:
  - Perform a forward pass to get predictions
  - Calculate the loss
  - Perform backpropagation to get gradients
  - Update the model parameters
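Putting these steps together, the epoch loop might look like the following sketch (names such as train_images, train_labels, and train are illustrative, not the repository's exact API; labels are assumed to be one-hot encoded):
import numpy as np

def train(nn, train_images, train_labels, epochs=10, batch_size=64):
    # train_images: (N, 784), train_labels: (N, 10) one-hot
    n = train_images.shape[0]
    for epoch in range(epochs):
        # Shuffle so the model doesn't learn the sample order
        perm = np.random.permutation(n)
        images, labels = train_images[perm], train_labels[perm]
        losses = []
        # Iterate over mini-batches of size 64
        for start in range(0, n, batch_size):
            X_batch = images[start:start + batch_size]
            y_batch = labels[start:start + batch_size]
            losses.append(nn.train_step(X_batch, y_batch))
        print(f"Epoch {epoch + 1}/{epochs}, Avg. Loss: {np.mean(losses):.4f}")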
Interpreting the Training Output
Here’s a sample of the output during training:
Epoch 10/10 completed in 7.79s, Avg. Loss: 0.2439, Test Accuracy: 93.52%, Test Loss: 0.2318
Let’s break this down:
- Avg. Loss (0.2439): This is the cross-entropy loss averaged across all training batches in the epoch. Lower values indicate better performance.
- Test Accuracy (93.52%): After each epoch, we evaluate the model on the test set. This percentage means the model correctly classified 93.52% of the test images. This is our most important metric, as it tells us how well the model generalizes to unseen data.
- Test Loss (0.2318): This is the cross-entropy loss calculated on the test set. It's close to the training loss, which is a good sign that the model isn't overfitting.
Reaching over 93% accuracy with nothing but NumPy is a respectable result for such a simple network, and it leaves clear room for the improvements discussed later in this post.
Testing the Model on New Images
One of the most exciting parts is using our trained model to classify new handwritten digits. We've created test_single_digit.py to do exactly that.
How the Testing Script Works
- Load the trained model parameters:
import os
import numpy as np
from src.neural_network import NeuralNetwork  # adjust the import path to match how you run the script

def load_model(model_dir='models'):
    W1 = np.load(os.path.join(model_dir, 'W1.npy'))
    b1 = np.load(os.path.join(model_dir, 'b1.npy'))
    W2 = np.load(os.path.join(model_dir, 'W2.npy'))
    b2 = np.load(os.path.join(model_dir, 'b2.npy'))
    nn = NeuralNetwork(W1.shape[0], W1.shape[1], W2.shape[1])
    nn.W1, nn.b1, nn.W2, nn.b2 = W1, b1, W2, b2
    return nn
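For this to work, the trained parameters need to have been saved after training. The saving code isn't shown in this post; something along these lines would produce the files load_model expects (save_model is a hypothetical helper name):
import os
import numpy as np

def save_model(nn, model_dir='models'):
    # Persist each parameter array so load_model can restore them later
    os.makedirs(model_dir, exist_ok=True)
    np.save(os.path.join(model_dir, 'W1.npy'), nn.W1)
    np.save(os.path.join(model_dir, 'b1.npy'), nn.b1)
    np.save(os.path.join(model_dir, 'W2.npy'), nn.W2)
    np.save(os.path.join(model_dir, 'b2.npy'), nn.b2)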
- Preprocess the input image:
from PIL import Image, ImageOps
import numpy as np

def preprocess_image(image_path):
    # Open the image and convert to grayscale
    img = Image.open(image_path).convert('L')
    # Resize to 28x28 pixels
    img = img.resize((28, 28))
    # Invert if needed (MNIST has white digits on black background)
    img = ImageOps.invert(img)
    # Convert to numpy array and normalize to [0, 1]
    img_array = np.array(img) / 255.0
    # Flatten the 28x28 image to a 784-dimensional vector
    flattened = img_array.reshape(1, 784)
    return flattened, img_array
- Make the prediction:
def predict_digit(model, image_path):
    # Preprocess the image
    img_data, img_array = preprocess_image(image_path)
    # Make the prediction
    prediction = model.predict(img_data)[0]
    # Get the probabilities
    output, _ = model.forward(img_data)
    probabilities = output[0]
    return prediction, probabilities, img_array
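The predict method used above isn't shown earlier in the post. Given the forward pass, it presumably just takes the argmax over the output probabilities; a minimal version consistent with this usage would be:
def predict(self, X):
    # Index of the highest softmax probability for each input row
    output, _ = self.forward(X)
    return np.argmax(output, axis=1)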
Using the Testing Script
To test the model on a new image:
- Save the image as a PNG/JPG file
- Run the script:
python test_single_digit.py path/to/your/digit/image.png
- The script will display:
- The image with the predicted digit
- A bar chart showing confidence levels for each possible digit
- The prediction and confidence percentages in the terminal
Why This Works: The Magic of Neural Networks
Our model achieves good accuracy because neural networks excel at finding patterns in high-dimensional data. For the MNIST dataset:
- Each 28×28 image becomes a 784-dimensional vector
- During training, the network learns to associate patterns in these vectors with specific digits
- The weights in the neural network encode these patterns
- When given a new image, the network computes how well it matches the learned patterns for each digit
The beauty of this approach is that we never explicitly tell the network what features to look for - it learns them automatically from the data!
Limitations and Potential Improvements
While our model works well, it has limitations:
- Architecture: Our simple two-layer network has limited capacity. Adding more layers would enable learning more complex patterns.
- Regularization: We don't use any regularization techniques like dropout or L2 regularization, which could help reduce overfitting.
- Optimization: We use vanilla gradient descent. Techniques like momentum, RMSprop, or Adam could lead to faster and better convergence (see the sketch after this list).
- Data Augmentation: We could generate additional training data by applying transformations to the existing images.
- Preprocessing: More sophisticated preprocessing techniques might improve performance.
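To illustrate the optimization point, here is what a momentum-based update could look like in place of the plain gradient-descent step. This is only a sketch: the velocity attributes (v_W1, v_b1, v_W2, v_b2) are assumed to be initialized to zeros in __init__, which the current class does not do.
def update_parameters_momentum(self, gradients, beta=0.9):
    # Keep an exponentially weighted average of past gradients (the "velocity")
    # and step each parameter along that smoothed direction
    for name in ('W1', 'b1', 'W2', 'b2'):
        v = beta * getattr(self, 'v_' + name) + (1 - beta) * gradients[name]
        setattr(self, 'v_' + name, v)
        setattr(self, name, getattr(self, name) - self.learning_rate * v)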
Conclusion
Building a neural network from scratch is an incredibly rewarding experience. Not only have we achieved respectable performance on the MNIST dataset (93.52% accuracy), but we’ve gained a deep understanding of how neural networks work internally.
This knowledge is invaluable even when using high-level frameworks like TensorFlow or PyTorch, as it helps us make informed decisions about architecture, hyperparameters, and debugging.
We encourage you to experiment with the code - try different hyperparameters, add more layers, or implement regularization techniques. The possibilities are endless!
References
- Neural Networks and Deep Learning - Michael Nielsen
- Backpropagation Explained - Christopher Olah
- Xavier Initialization - Xavier Glorot & Yoshua Bengio
- ReLU Activation Function - Wikipedia
- Softmax Function - Wikipedia
All code for this project is available in the GitHub repository.