Training Neural Networks
Neural Networks offer great promise with their ability to "create" algorithms to solve problems - without the programmer knowing how to solve the problem in the first place. Example based problem solving if you will. I would expect that if you knew precisely how to solve a particular problem to the same degree, you could certainly do it perhaps many orders of magnitude faster and possibly higher quality by coding the solution directly -- however its not always easy or practical to know such a solution.
One opportunity with NNs that I find most interesting is that, no matter how slow they are, you can use NNs as a kind of existence proof -- does an algorithm exist to solve this problem at all?
Of course, when I'm talking about "problems" I'm referring to input to output mappings via some generic algorithm or something. Not everything cleanly fits into this definition, but many things do. Of course there are many different kind of NNs for solving various different kinds of problems too.
After working with NNs for a while (and indeed to anybody who has) I can say that Neural Networks are asymmetric in complexity. That is, training a neural network to accomplish a task can take extreme amounts of time (days is common). However, executing a previously trained Neural Network is embarrassingly parallel and is pretty well mapped to GPUs. Running a NN can be done in real-time if the network is simple enough!
I have spent considerable amounts of time in figuring out how to train Neural Networks faster. The generally recommended practice these days is to use Stochastic Gradient Descent (SGD) with Back Propagation (BP). What this means is you take a random piece of data out of your training set to train with, train with it, and then repeat.
SGD works, but is *incredibly* slow at converging.
I endeavored to improve the training performance here (how could you not, you spend a *lot* of time waiting...)
There are many different techniques to improve upon BP (Adam, etc.. etc..), however each of them are in my measurements slower, regardless of the steeper descent they provide, they take more computation to provide that steeper descent and so when you measure not by epoch but by wall clock time, its actually slower.
So, then came the theory that if you somehow knew the precise order to train the samples in, you could perfectly train to the correct solution in some minimum amount of time. I don't know if there is a theorem about this or what-not, but if not you now have heard of it. It seemed common sense to me.
In any case, then the question becomes is there a heuristic which can approximate this theoretical "perfect" ordering?
The first thing I tried turned out to be very hard to beat, calculate error on all samples, then sort the training order by the error for each in decreasing order. Then, only train 25% of the worst error samples.
The speedup from this approach was pretty awesome, but again I got bored waiting so I went further. Essentially you don't waste time training the easy stuff and instead concentrate on learning the parts it has problems with.
I then tried doing many variations on this, but the one that ended up working even better (30% improved training time) was taking the sorted order and splitting it into 3 sections of easy, medium and hard. Then reorganizing the training order into hard, medium, easy, hard, medium, easy, hard, medium, easy, etc...
Not only did this improve the training time - it also was able to train to an overall lower error than without.
Another option that works pretty well is to just take the 25% highest error samples and randomize the order. Its easier to implement and also works really well. Also, this should be overall a better approach (vs unrandomized) as it seems more robust to training situations where the error explodes (which does happen in some cases).
Thats generally how I would approach a finite and small-ish data set.
I am also developing a technique based on this that works for significantly larger data sets - ones that cannot possibly fit in memory (hundreds or thousands of images).
Thus far the setup is fairly similar, except you pick some small batch of images and do basically the same as above with that.
There are some interesting relationships between batch size (number of images) and training time/quality.
In my data set, the size of the batch reduces the variance of the solution error across the training set and also appears thus far to reduce the number of epochs required to converge - however it is also slower so the jury is still out on if a bigger batch is better - but certainly going too small makes it harder to converge on a general solution.
Thats all for now!