[MLE] Artificial Neural Network Training
Overview
- Error Functions
- Basic Linear Algebra
- Singular Value Decomposition
- Gradient Descent
- Backpropagation
- Deep Learning
Error Functions
To optimise the performance of an ANN, an error function evaluated on the training set must be minimised.
This is done by adjusting:
- Weights connecting nodes
- Network Architecture
- Parameters of non-linear functions h(a)
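As a concrete example, one common choice of error function is the sum-of-squares error, E = 1/2 * sum_n (y_n - t_n)^2, where y_n is the network output and t_n the target for training example n. The notes do not say which error function the lecture uses, so this choice and the function name below are assumptions made for illustration.

```python
def sum_of_squares_error(outputs, targets):
    """Return E = 0.5 * sum((y - t)^2) over the training set."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    return 0.5 * sum((y - t) ** 2 for y, t in zip(outputs, targets))


# Network outputs vs. desired targets for three training examples.
error = sum_of_squares_error([0.9, 0.2, 0.4], [1.0, 0.0, 0.5])
print(error)  # approximately 0.03
```

Training then amounts to changing the weights so that this number goes down.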
Backpropagation
- Used to calculate the derivatives of the error function efficiently
- Errors are propagated backwards, layer by layer
Iterative minimisation of error function:
- Calculate derivative of error function with respect to weights
- Derivatives used to adjust weights
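To make the derivative step concrete, here is a minimal sketch for a tiny network with one input, one tanh hidden unit and one linear output. The architecture, the choice h(a) = tanh(a), the sum-of-squares error and all names are assumptions for illustration, not the lecture's network. The backpropagated derivatives are checked against finite differences.

```python
import math


def forward(w1, w2, x):
    """Forward pass: return hidden activation z and output y."""
    z = math.tanh(w1 * x)   # hidden unit with h(a) = tanh(a)
    y = w2 * z              # linear output unit
    return z, y


def error(w1, w2, x, t):
    """E = 0.5 * (y - t)^2 for a single training example."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2


def backprop(w1, w2, x, t):
    """Return (dE/dw1, dE/dw2) using the chain rule, output layer first."""
    z, y = forward(w1, w2, x)
    delta_out = y - t                               # dE/dy at the output
    delta_hidden = delta_out * w2 * (1.0 - z * z)   # passed back through tanh
    return delta_hidden * x, delta_out * z


w1, w2, x, t = 0.5, -0.3, 1.5, 0.8
g1, g2 = backprop(w1, w2, x, t)

# Sanity check: central finite differences should agree with backprop.
eps = 1e-6
n1 = (error(w1 + eps, w2, x, t) - error(w1 - eps, w2, x, t)) / (2 * eps)
n2 = (error(w1, w2 + eps, x, t) - error(w1, w2 - eps, x, t)) / (2 * eps)
print(g1, n1)
print(g2, n2)
```

The finite-difference check needs two extra forward passes per weight, which is why backpropagation is the efficient option for networks with many weights.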
That is how backpropagation computes the derivatives, but once we have them, how do we update the weights?
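One standard answer is gradient descent, which is also listed in the overview: each weight takes a small step against its derivative, w <- w - eta * dE/dw, where the learning rate eta is a hyperparameter. The sketch below applies this to a single linear unit y = w * x. The data, the learning rate and the function name are made up for illustration.

```python
def gradient_descent(xs, ts, w=0.0, learning_rate=0.1, steps=100):
    """Fit y = w * x by minimising E = 0.5 * sum((w*x - t)^2)."""
    for _ in range(steps):
        grad = sum((w * x - t) * x for x, t in zip(xs, ts))  # dE/dw
        w -= learning_rate * grad                            # step against the gradient
    return w


xs = [0.0, 0.5, 1.0, 1.5]
ts = [0.0, 1.0, 2.0, 3.0]   # generated by t = 2 * x
w = gradient_descent(xs, ts)
print(w)  # converges towards 2.0
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence is slow.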
Here is a graph I found on the internet:
The lecture introduces it as follows:
Basic Linear Algebra
Matrix Determinant
- Used in many calculations, e.g.
  - matrix inversion
  - singularity testing (A is singular iff |A| = 0)
- det(A) = |A|
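Both uses can be seen in the 2x2 case, where |A| = a11 * a22 - a12 * a21. This is a minimal sketch with illustrative function names and example matrices.

```python
def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inverse2(a):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    d = det2(a)
    if d == 0:
        raise ValueError("matrix is singular (determinant is 0)")
    return [[a[1][1] / d, -a[0][1] / d],
            [-a[1][0] / d, a[0][0] / d]]


A = [[4.0, 7.0], [2.0, 6.0]]   # det = 4*6 - 7*2 = 10
S = [[1.0, 2.0], [2.0, 4.0]]   # second row is twice the first, so det = 0
print(det2(A), det2(S))
print(inverse2(A))
```

With floating-point data, an exact comparison with 0 is fragile, so in practice a small tolerance is usually used instead.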
Eigenvalues
Given a square matrix M, its eigenvalue equation is \(Mv_i = \lambda_i v_i\), where the \(v_i\) are non-zero eigenvectors and the \(\lambda_i\) are the corresponding scalar eigenvalues. When M is symmetric, the eigenvectors can be chosen to be orthogonal.
Eigenvalues are found by solving the characteristic equation: \(|M - \lambda I| = 0\)
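For a 2x2 matrix the characteristic equation expands to lambda^2 - trace * lambda + det = 0, which can be solved with the quadratic formula. The sketch below does this for a small symmetric example. The matrix and the function name are chosen for illustration.

```python
import math


def eigenvalues_2x2(m):
    """Real eigenvalues of a 2x2 matrix from its characteristic equation.

    Solves lambda^2 - trace(m) * lambda + det(m) = 0.
    """
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = trace * trace - 4.0 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace - root) / 2.0, (trace + root) / 2.0


M = [[2.0, 1.0], [1.0, 2.0]]
lam_small, lam_large = eigenvalues_2x2(M)
print(lam_small, lam_large)  # 1.0 3.0

# Check the eigenvalue equation for v = (1, 1), an eigenvector of lam_large.
v = [1.0, 1.0]
Mv = [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]
print(Mv, [lam_large * vi for vi in v])  # both are [3.0, 3.0]
```

The eigenvector for the smaller eigenvalue is (1, -1), which is orthogonal to (1, 1), as expected for a symmetric matrix.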
Jacobian and Hessian
Ans: 13 and 33
When doing backpropagation, how do we calculate the gradient of the error function?
Regularization
We usually add a regularization term to the error function when training a neural network.
In CS231, we first encounter regularization on this slide:
Why use regularization
- penalise bad weights
- avoid overfitting
- early stopping
We use regularization to penalise large weights and unbalanced weights.
Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.
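One common form, assumed here because the notes do not name a specific penalty, is L2 regularization (weight decay): the total error becomes E_data + (lambda / 2) * sum(w^2). The sketch below compares two illustrative weight vectors that have the same sum but very different penalties.

```python
def l2_regularized_error(data_error, weights, lam):
    """Return E_total = E_data + (lam / 2) * sum(w^2)."""
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return data_error + penalty


peaked = [1.0, 0.0, 0.0, 0.0]       # all weight on one input ("unbalanced")
spread = [0.25, 0.25, 0.25, 0.25]   # same sum, spread evenly

# On the input x = (1, 1, 1, 1) both vectors produce the same output,
# so only the penalty term tells them apart.
e_peaked = l2_regularized_error(0.0, peaked, lam=0.1)
e_spread = l2_regularized_error(0.0, spread, lam=0.1)
print(e_peaked, e_spread)  # approximately 0.05 and 0.0125
```

Because the penalty grows with the square of each weight, it favours many small weights over a few large ones.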
You might ask: OK, I have everything now, but how do I tune the regularization strength \(\lambda\)?
One possible answer is cross-validation: split the training data into subsets, train the model on all but one subset for a fixed value of \(\lambda\), evaluate it on the held-out subset, and repeat this procedure while varying \(\lambda\). Then select the \(\lambda\) that gives the lowest loss on the held-out data.
Monitoring the held-out error also gives a criterion for early stopping: training can stop once that error levels off.
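The cross-validation procedure described above can be sketched as follows. To keep it self-contained, the model is a one-parameter ridge regression y = w * x with a closed-form fit rather than a neural network. The data, the candidate values of lambda, the fold scheme and all function names are assumptions for illustration.

```python
def fit_ridge(xs, ts, lam):
    """Closed-form minimiser of 0.5*sum((w*x - t)^2) + 0.5*lam*w^2."""
    return sum(x * t for x, t in zip(xs, ts)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(w, xs, ts):
    """Mean of (w*x - t)^2 over the given points."""
    return sum((w * x - t) ** 2 for x, t in zip(xs, ts)) / len(xs)


def cross_validate(xs, ts, lam, k=3):
    """Average held-out error of the ridge model over k folds."""
    pairs = list(zip(xs, ts))
    total = 0.0
    for fold in range(k):
        # Every k-th point (offset by `fold`) is held out for validation.
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        held_out = [p for i, p in enumerate(pairs) if i % k == fold]
        w = fit_ridge([x for x, _ in train], [t for _, t in train], lam)
        total += mean_squared_error(
            w, [x for x, _ in held_out], [t for _, t in held_out])
    return total / k


def select_lambda(xs, ts, candidates, k=3):
    """Return the candidate lambda with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cross_validate(xs, ts, lam, k))


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ts = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]   # roughly t = 2 * x with small noise
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best = select_lambda(xs, ts, candidates)
print(best)
```

On data this clean, a very large lambda shrinks w far below 2 and gives a much higher held-out error, so the selection favours small values.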