Development of technologies for large-scale knowledge-engineering in the Post-genomic era

Prof Stephen Muggleton,

Imperial College

Abstract

The biological sciences of the Post-genomic era will need to construct knowledge-bases on an unprecedented scale. These knowledge-bases will not only describe the properties and interactions of individual molecules but will also contain models of the way in which cells and organisms function. Relevant data on which to base such knowledge-bases is already available and expanding rapidly (e.g. SCOP, Promotif, SwissProt, KEGG). Owing to the scale and rate of data generation, such knowledge-bases will require automatic construction and modification. Moreover, it is already clear that during the 21st century computers will play an increasingly central role in supporting the fundamental formulation and testing of scientific hypotheses. This traditionally human activity will soon become unsustainable in the biological sciences if unaided by computers. The reason is not the scale of the data involved but that scientists will be unable to conceptualise the breadth and depth of the relationships within the relevant databases and knowledge-bases.

The automatic construction and testing of hypotheses, and their eventual incorporation into accepted knowledge-bases, will require an ability to handle incomplete, incorrect and imprecise information. For these purposes, arguably two of the most relevant computational technologies are a) machine learning and b) uncertain reasoning. A number of successes (by the author and others) of machine learning applied to molecular biology will be described. Despite the success of these applications, the present state of the technologies involved urgently requires enhancement in order to handle the scale of the engineering task ahead. Both a) and b) are addressed by a variety of research communities using a wide array of approaches. These approaches are separated largely by differences in the underlying representation of the knowledge being learned and reasoned about (e.g. Hidden Markov Models, Bayes' nets, decision trees and Logic Programs). Each formalism has representational advantages for particular tasks. However, large-scale applications, such as those found in Bioinformatics, tend to require a broad mix of these representations, which no single approach supplies. One might imagine that an amalgam of representations could be used. However, such amalgams do not scale well because of the lack of uniformity of reasoning both about and within such systems.

Means will be suggested for unifying some of the fundamental representations and techniques involved in machine learning and uncertain reasoning, with a view to broadening their application to large-scale modelling. The resulting techniques need to be implemented and applied to a series of problems drawn from the biological sciences, and they can also be expected to have more generic application. The resulting systems need to be incorporated into distributed web-based resources which will allow Machine Learning techniques to be used alongside prediction techniques already used by Bioinformaticists (e.g. homology matching). The success of applications needs to be judged by broad-ranging performance measures of the developed systems.