In lab 1, we implement a simple neural network consisting of an input layer, two hidden layers, and an output layer.
The neural network is trained using the backpropagation algorithm, and the Mean Squared Error (MSE) is used as the loss function.
The neural network is implemented using only the NumPy library.
The sigmoid function is used as the activation function in the hidden layers of the neural network. The sigmoid function is defined as follows:
sigmoid(x) = 1 / (1 + np.exp(-x))
There are some properties of the sigmoid function:
- The sigmoid function is differentiable.
- The sigmoid function is monotonically increasing.
- The sigmoid function is bounded between 0 and 1.
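As a minimal NumPy sketch (the function names are my own, not necessarily those in the lab code), the sigmoid and its derivative, which is needed later in the backward pass, can be written as:

```python
import numpy as np

def sigmoid(x):
    # element-wise 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(y):
    # derivative expressed through the output y = sigmoid(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    return y * (1.0 - y)
```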
The Mean Squared Error (MSE) loss function is defined as follows:
MSE = 1/N * sum((y - y_pred)^2)
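A corresponding sketch of the loss and of the gradient fed into the backward pass, assuming y and y_pred are NumPy arrays of shape (N, 1):

```python
import numpy as np

def mse_loss(y, y_pred):
    # mean of the squared differences over the N samples
    return np.mean((y - y_pred) ** 2)

def mse_loss_derivative(y, y_pred):
    # gradient of the MSE with respect to the prediction y_pred
    return 2.0 * (y_pred - y) / y.shape[0]
```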
The neural network is implemented with the following structure:
- Input layer: 2 neurons
- 1st Hidden layer: 4 neurons
- 2nd Hidden layer: 4 neurons
- Output layer: 1 neuron
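A minimal sketch of how this 2-4-4-1 structure could be initialized; the random-normal initialization here is an assumption, not necessarily the scheme used in the lab:

```python
import numpy as np

hidden_size = 4
layer_sizes = [2, hidden_size, hidden_size, 1]  # input, hidden 1, hidden 2, output

# one weight matrix and one bias vector per layer transition
weights = [np.random.randn(n_in, n_out)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((1, n_out)) for n_out in layer_sizes[1:]]
```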
The backpropagation algorithm is used to train the neural network. The backpropagation algorithm is implemented as follows:
- Forward pass
  - Calculate the output of each layer.
- Backward pass
  - Calculate the gradient of the loss function with respect to the output of each layer.
  - Update the weights of each layer using the gradient descent algorithm.
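A condensed sketch of one forward/backward iteration under this scheme, reusing the sigmoid and MSE helpers sketched above; the variable names and code organization are my own, not necessarily the lab's:

```python
def forward(x, weights, biases):
    # keep every layer's activation for the backward pass
    activations = [x]
    for W, b in zip(weights, biases):
        x = sigmoid(x @ W + b)
        activations.append(x)
    return activations

def backward(activations, y, weights, biases, lr):
    # error at the output: dL/dy_pred chained through the output sigmoid
    delta = mse_loss_derivative(y, activations[-1]) * sigmoid_derivative(activations[-1])
    for i in reversed(range(len(weights))):
        grad_W = activations[i].T @ delta
        grad_b = delta.sum(axis=0, keepdims=True)
        if i > 0:
            # propagate the error to the previous layer before this layer's weights change
            delta = (delta @ weights[i].T) * sigmoid_derivative(activations[i])
        weights[i] -= lr * grad_W  # gradient descent update
        biases[i] -= lr * grad_b
```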
Parameters:
- epochs: 100000
- learning rate: 0.1
- hidden unit size: 16
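With these hyperparameters, the overall training loop could look roughly like the sketch below; generate_data is only a placeholder for whatever produces the linear or XOR dataset, not a function from the lab:

```python
epochs = 100000
lr = 0.1

x, y = generate_data()  # placeholder: returns inputs of shape (N, 2) and labels of shape (N, 1)
for epoch in range(epochs):
    activations = forward(x, weights, biases)
    backward(activations, y, weights, biases, lr)
    if epoch % 5000 == 0:
        print(epoch, mse_loss(y, activations[-1]))
```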
Learning Curve / Accuracy Curve (epochs: 100000, learning rate: 0.1):
Training linear dataset with learning rate 0.1, 0.01, 0.001:
We can see that the learning rate affects the convergence speed of the neural network. A higher learning rate leads to faster convergence, but it may also lead to overshooting.
Training XOR dataset with learning rate 0.1, 0.01, 0.001:
For the XOR dataset, the 0.01 and 0.001 learning rates are too small: the network either does not converge until around 500,000 epochs or does not converge at all. The 0.1 learning rate achieves the best performance.
Different numbers of hidden units
Training linear dataset with number of hidden units 16, 32, 64:
We can see that the number of hidden units affects the capacity of the neural network: more hidden units give the network higher capacity, which allows it to learn more complex patterns, but can also lead to overfitting. In this case overfitting is not a concern, since the training and testing datasets are the same, so the networks with more hidden units simply converge faster.
Training XOR dataset with number of hidden units 16, 32, 64:
We can see that the network with 16 hidden units converges much more slowly than the networks with 32 or 64 hidden units; the network with 64 hidden units converges the fastest.
Training linear dataset without activation function:
We can see that, without the activation function, the neural network is still able to classify the linear dataset well, since the data is linearly separable.
Training XOR dataset without activation function:
We can see that, without the activation function, the neural network is unable to classify the XOR dataset: stacking layers without a non-linearity collapses the whole network into a single linear map, and XOR is not linearly separable.
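A tiny NumPy check of that collapse, with arbitrary random matrices just to illustrate that two layers without an activation compose into a single linear layer:

```python
import numpy as np

W1 = np.random.randn(2, 4)
W2 = np.random.randn(4, 1)
x = np.random.randn(5, 2)

# two "layers" without activation ...
two_layers = (x @ W1) @ W2
# ... are exactly one linear layer with weight W1 @ W2
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```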
I implemented the Momentum optimizer.
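As a sketch of the idea, a classical momentum update looks like the code below, assuming the backward pass returns the per-layer gradients instead of applying them directly; the coefficient 0.9 and the names are illustrative, not necessarily the exact values and identifiers used:

```python
import numpy as np

velocities = [np.zeros_like(W) for W in weights]  # one velocity buffer per weight matrix

def momentum_step(weights, grads, velocities, lr, beta=0.9):
    # beta is the momentum coefficient (0.9 is an illustrative value)
    for i in range(len(weights)):
        # accumulate a decaying sum of past gradients, then step along it
        velocities[i] = beta * velocities[i] - lr * grads[i]
        weights[i] += velocities[i]
```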
Training linear dataset with the Momentum optimizer: The learning curve shows that momentum reduces the loss faster than plain gradient descent, and the accuracy curve shows that the momentum optimizer converges sooner.
Training XOR dataset with the Momentum optimizer: The same holds for the XOR dataset: the momentum optimizer converges faster than plain gradient descent.
I implemented the Sigmoid, ReLU, and tanh activation functions.
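Minimal sketches of the extra activations and their derivatives, written the way they would plug into the backward pass above (the function names are my own):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where the pre-activation is positive, 0 elsewhere
    return (x > 0).astype(float)

def tanh(x):
    return np.tanh(x)

def tanh_derivative(y):
    # derivative expressed through the output y = tanh(x)
    return 1.0 - y ** 2
```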
Training linear dataset with different activation functions:
The model with the ReLU activation function converges the fastest on the linear dataset, but overall the choice of activation function does not affect the convergence speed much here.
Training XOR dataset with different activation functions:
For the XOR dataset, the ReLU activation function converges the fastest, tanh is second, and sigmoid is the slowest.