PhD thesis (Rune Linding)
For many decades, models of protein function have primarily been described in terms of functional modules known as globular domains. However, a large group of functional sites has been revealed over the last 15 years or so, and only recently have these begun to be catalogued. These are linear modules and encompass ligand sites such as 14-3-3, SH3 and cyclin ligands, as well as posttranslational modification sites and targeting signals. Linear modules are short and co-linear in both sequence and structure space. Their short length makes them difficult to detect based on sequence alone. Experimentally, they are neglected because they reside in disordered or unstructured parts of proteins, which are often removed from recombinant constructs for protein expression. Yet linear modules are as important for protein function as globular domains. It is now clear that linear modules make up a very large component of the cellular regulatory networks, as they are ligands for many signaling domains and proteins. Although linear modules cannot be detected accurately from sequence alone, their functionality is strongly dependent on context, e.g. a linear module may only be functional in a restricted set of cellular compartments. Such contextual information can be utilised in prediction. The Eukaryotic Linear Motif computational resource (ELM, http://elm.eu.org), developed for predicting functional sites, is knowledge-based and stores contextual information concerning linear functional sites annotated from the scientific literature. This information is then used for contextual and logical filtering in the prediction of linear functional modules. Linear modules are typically found in unstructured parts of proteins, and hence two methods, DisEMBL (http://dis.embl.de) and GlobPlot (http://globplot.embl.de), were developed for ab initio detection of protein disorder from sequence alone. These methods are used to reduce the search space in the ELM resource.
The methods are also used by the structural proteomics community for setting up expression constructs and by researchers studying intrinsically disordered or unstructured proteins. In the post-genomic era, analysis of cellular protein-based regulatory networks and systems is of increasing importance. Yet we do not know the role played by the topologies, functional modules and different protein-interaction networks in cellular systems. Based on the increasing number of different linear modules catalogued by ELM, one can anticipate the existence of large protein-protein networks within the cell, dominated by interactions between globular domains and linear modules, e.g. SH3 ligands. Through systematic proteome-scale analysis of such networks, deconvolution into defined higher-order system modules can now begin. To this end, a model for composite protein function is proposed, and it is shown how higher-order functional relationships between the individual functional modules may be used to infer protein function.
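The filtering idea described above (pattern matches kept only when they fall in disordered sequence and when the cellular context fits) can be illustrated with a minimal sketch. This is not the ELM, DisEMBL or GlobPlot implementation: the function name, the simplified `P..P` pattern standing in for an SH3 ligand core, the toy sequence, the disordered interval and the compartment labels are all hypothetical values chosen for illustration.

```python
import re


def find_motif_sites(sequence, pattern, disordered_regions,
                     protein_compartments, allowed_compartments):
    """Return (start, end, match) tuples for motif matches passing both filters.

    All inputs are illustrative assumptions:
    - pattern: a regular expression standing in for a linear motif
    - disordered_regions: half-open (start, end) intervals, as might be
      produced by some disorder predictor
    - protein_compartments / allowed_compartments: sets of compartment labels
    """
    # Context filter: discard the protein if it never occurs in a
    # compartment where the motif is assumed to be functional.
    if not set(protein_compartments) & set(allowed_compartments):
        return []

    sites = []
    # A lookahead lets overlapping matches be reported as well.
    for m in re.finditer(f"(?=({pattern}))", sequence):
        start, end = m.start(1), m.end(1)
        # Structural filter: keep only matches lying entirely inside
        # a region flagged as disordered.
        if any(lo <= start and end <= hi for lo, hi in disordered_regions):
            sites.append((start, end, m.group(1)))
    return sites


# Toy sequence with two P..P matches, at positions 4 and 20.
toy_sequence = "AAAA" + "PLLP" + "A" * 12 + "PGGP" + "AA"
toy_disorder = [(0, 10)]  # hypothetical predictor output

hits = find_motif_sites(toy_sequence, "P..P", toy_disorder,
                        {"cytosol"}, {"cytosol", "nucleus"})
print(hits)
```

In this toy case only the first match survives, because the second lies outside the hypothetical disordered interval; a protein annotated only with a compartment outside the allowed set would yield no sites at all.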