Machine Learning Workshop, Day 2.
Once again there were 3 talks in the morning and 2 workshops in the afternoon. In the morning we covered:
-Random Forests. Take data with a classification variable e.g. active/inactive. Split the data by each variable in turn to optimise the separation of these two classes. Find the variable which ‘best’ splits the data. Take the two subsets that this split creates, and repeat the process. This forms a classification tree. Problem: a single tree grown greedily like this tends to overfit the data it was built on. Solution: create lots of trees by resampling from the data with replacement (bagging, short for bootstrap aggregating; not to be confused with boosting). For each sample, build a tree. At each node of the tree, restrict variable selection to a random subset of the variables. Pick the ‘best’ variable from this subset and move down to the next node, where you again pick a subset of variables. Repeat to form lots of trees, a forest if you will. Since it’s based on random sampling, you might even call it a random forest. Clever, eh.
-ILP (Inductive Logic Programming). Oh dear. Where to begin? Perhaps with the unjustified dig at the previous speaker during their Q and A. Or maybe with the explanation of how we can look at ‘facts’ about the molecules, rather than rows of data in a table, and how we can store these ‘facts’ in what looks suspiciously like… rows of data in a table (or at least a database of tables, which does raise an interesting question I’ll come back to later). Or maybe with the lack of any information on how these ‘facts’ were used to create a predictive model. Or maybe with the random and completely irrelevant video footage of a robotic laboratory set up back at the speaker’s lab. Yes, the robots looked cool, but they also had NOTHING WHATSOEVER TO DO WITH MACHINE LEARNING. Ahem. But it did raise an interesting question: sometimes we have a priori knowledge about the relationships between data points e.g. information that we might naturally store in a database of cross-linked tables. How can we use this database in our analysis, rather than a single table?
-QSAR Applications of Machine Learning. A great talk covering a lot of ground, mainly focusing on how we can create alignment-independent descriptive variables. A nice example was building a 2D histogram from a 3D molecular surface by moving around inside it, changing direction when you hit an outer wall, and then recording the path lengths (1D histogram) and also the properties at the surface where you hit it (to get a 2D histogram). You can then compare histograms of molecules with known strong binding affinity to look for patterns, and then test to see whether these patterns appear in ‘new’ molecules. Only talk to beat my 30-minute impatience gene.
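For anyone who wants to see the random forest recipe from the first talk written down, here is a rough sketch in plain Python (not R, and not what the speaker showed). It follows the steps described above: resample with replacement, grow a tree on each sample, and look at only a random subset of variables at each node. A few things are my own assumptions: I use Gini impurity as the measure of the ‘best’ split (the talk did not name one), the trees vote by simple majority, the function names are made up, and the toy active/inactive data is invented purely for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among `features`, or None if nothing helps."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in features:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, n_sub, rng, depth=0, max_depth=5):
    """Grow one tree; a leaf is a label, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority
    # Restrict this node to a random subset of the variables.
    features = rng.sample(range(len(rows[0])), n_sub)
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left],
                       n_sub, rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right],
                       n_sub, rng, depth + 1, max_depth))

def predict_tree(node, row):
    """Walk down a tree until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def random_forest(rows, labels, n_trees=25, n_sub=None, seed=0):
    """Grow `n_trees` trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    if n_sub is None:
        n_sub = max(1, int(len(rows[0]) ** 0.5))  # a common default, assumed here
    forest = []
    for _ in range(n_trees):
        # Resample the data with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_sub, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote across all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: two descriptors per molecule, two well-separated classes.
rows = [(1.0, 5.0), (1.2, 4.8), (0.9, 5.3), (1.1, 5.1),
        (3.0, 1.0), (3.2, 0.8), (2.9, 1.1), (3.1, 1.2)]
labels = ["inactive"] * 4 + ["active"] * 4

forest = random_forest(rows, labels)
print(predict_forest(forest, (1.0, 5.2)))
print(predict_forest(forest, (3.1, 0.9)))
```

In practice you would reach for an existing implementation rather than roll your own; the point here is only that the whole idea fits in a page.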
Then came the second half of my R workshop, this time covering:
-linear regression models (lm)
-how to assess them (r.squared, AIC, summary, ANOVA)
-how to test the predictive ability of the model (predict.lm)
-overview of principal components analysis (princomp, screeplot, biplot)
-overview of cluster analysis (hclust, cutree, kmeans)
Didn’t go as well as the first half, mainly because I got nervous and fell back into my bad habit of talking at 500mph (mental note – never check email just before a talk). Hopefully it was still of some use, and people will go back over the slides at their own pace. At the end of the day, if I persuaded 1 more person to try R, I’ll be happy.