2006-05-23

cool algorithms

I have been busy at work recently building neural network ensembles for solving quite a complex prediction/classification problem. There are two major issues to solve for my particular problem: 1) How do you reduce the dimensionality of your input data to something manageable while still maintaining the relevant information? 2) How do you find a set of neural networks that each produces good results individually while making its prediction errors in disjoint areas of the input space, thus achieving maximal benefit from grouping?

In my research for 1) I've stumbled across two majorly cool algorithms: Isomap and Locally Linear Embedding (LLE). Both try to solve the same problem: finding the inherent dimensionality of a dataset hidden in a high-dimensional input space, i.e. finding an embedded manifold. As a simple example, you can view a 2D still photograph of a three-dimensional object as a vector of pixels. This vector has an insanely high number of components (i.e. dimensions); your average digital camera has millions. The real-world object it describes, on the other hand, usually has far fewer degrees of freedom and thus needs far fewer components to be described. The original papers for both algorithms have very cool graphical examples of this. Unfortunately both algorithms suffer from the fact that you cannot train them on one dataset and subsequently apply them to another, unknown one. Real-world problems are often of this nature, though... Still, I think they are among the coolest ideas in recent years and definitely a worthwhile direction for more research. I also investigated Self-Organizing Maps, but they suffer from the same problem as LLE and Isomap and additionally require quantization as a necessary step, thus corrupting your data more than necessary. In the end I settled for boring old plain vanilla principal component analysis, which manages to simplify my data at least a little bit.

For the second problem (2) above, the primary issues are how to determine the network topology in the first place and how to compare networks against each other. I implemented a genetic algorithm growing thousands of networks, solving the network topology problem in a pretty brute-force way. Comparing network quality is more difficult than it sounds, though; the standard mean squared error used in training backpropagation nets isn't very useful for this. I ended up creating a Gini curve/coefficient for my alpha/beta errors (it's a two-class classification problem) and, more importantly, scatterplots of the output neurons' activations. The latter plots are very interesting, producing funky curves for certain training configurations and transfer functions (radial basis functions are the most freaky). They also offer insight into the particular training samples a net makes errors on, thus allowing an optimal strategy for choosing a set of nets. Fun ;-)
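To round this off, a few rough sketches of the ideas above (purely for illustration, none of this is my actual work code). First, the embedded-manifold idea behind Isomap: build a k-nearest-neighbour graph, approximate geodesic distances by shortest paths through it, then run classical MDS on those distances. On a toy swiss roll this unrolls the 3D points back into their inherent two dimensions; the parameters here are arbitrary.

```python
# Rough, from-scratch sketch of the Isomap recipe on a toy "swiss roll":
# k-nearest-neighbour graph -> geodesic (shortest-path) distances -> classical MDS.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, n_components=2):
    n = len(X)
    D = squareform(pdist(X))                       # Euclidean distances
    G = np.full((n, n), np.inf)                    # inf = no edge
    for i in range(n):
        nn = np.argsort(D[i])[1:n_neighbors + 1]   # k nearest neighbours (skip self)
        G[i, nn] = D[i, nn]
    G = np.minimum(G, G.T)                         # symmetrize the neighbourhood graph
    Dg = shortest_path(G, method="D", directed=False)  # geodesic distances
    H = np.eye(n) - np.ones((n, n)) / n            # classical MDS on Dg
    B = -0.5 * H @ (Dg ** 2) @ H
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(vals[top])

# toy swiss roll: a 2D sheet rolled up in 3D
t = 1.5 * np.pi * (1 + 2 * np.random.rand(800))
X = np.column_stack([t * np.cos(t), 20 * np.random.rand(800), t * np.sin(t)])
Y = isomap(X)
print(X.shape, Y.shape)                            # (800, 3) -> (800, 2)
```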
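The boring-but-practical alternative I ended up with, plain PCA, fits in a few lines. Again just a sketch on made-up data, not my production code:

```python
# Plain-vanilla PCA via the SVD: project onto the top k principal components.
import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)                        # centre each input component
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # top-k principal directions
    explained = (S ** 2)[:k] / (S ** 2).sum()      # fraction of variance kept
    return Xc @ components.T, components, explained

X = np.random.randn(500, 64)                       # stand-in for real input vectors
X_red, W, var = pca_reduce(X, k=10)
print(X_red.shape, var.sum())                      # (500, 10) and the kept variance ratio

# Unlike Isomap/LLE, the learned projection W can be applied to new, unseen
# data later on: X_new_red = (X_new - training mean) @ W.T
```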
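For comparing nets, a Gini coefficient for a two-class problem can be computed straight from the output neuron's activations via the standard rank/AUC relation (Gini = 2·AUC - 1). The function below is a placeholder sketch with toy data, not my real code:

```python
# Gini coefficient from activations via the rank (Mann-Whitney) formula:
# 0 means a random ranking, 1 means perfect class separation.
import numpy as np

def gini_coefficient(labels, scores):
    y = np.asarray(labels)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    ranks = np.argsort(np.argsort(scores)) + 1     # 1-based ranks by score (no ties assumed)
    auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # toy ground truth
scores = np.array([0.9, 0.4, 0.8, 0.3, 0.6, 0.2, 0.7, 0.5])    # toy net activations
print(gini_coefficient(labels, scores))                        # 0.625
```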
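And finally, one hypothetical way to turn the disjoint-error-regions goal into a selection strategy: greedily pick nets whose error sets overlap as little as possible. This is not necessarily what I do, just an illustration of the idea:

```python
# Greedy ensemble selection: start with the net that makes the fewest errors,
# then repeatedly add the net whose errors overlap least with the current pick.
def pick_ensemble(error_sets, size):
    """error_sets: one set per net, holding the ids of the samples it gets wrong."""
    chosen = [min(range(len(error_sets)), key=lambda i: len(error_sets[i]))]
    while len(chosen) < size:
        covered = set().union(*(error_sets[i] for i in chosen))
        best = min((i for i in range(len(error_sets)) if i not in chosen),
                   key=lambda i: len(error_sets[i] & covered))  # least overlap wins
        chosen.append(best)
    return chosen

nets = [{1, 2, 3}, {3, 4}, {7, 8, 9}, {2, 3, 4}]   # toy error sets for four nets
print(pick_ensemble(nets, size=2))                 # [1, 2]: small and disjoint error sets
```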