Functional characterization of genome-wide sequence data: development of methods and tools for high-throughput analysis

Götz, Stefan

Supervised by:

Ana Conesa Cegarra (Supervisor)

University of defense: Universitat Politècnica de València

Date of defense: 14 April 2010

Examination committee:

José Enrique Pérez Ortín (Chair)
Javier Terol Alcayde (Secretary)
Thomas Rattei (Member)
Miguel Andrade (Member)
Xavier de la Cruz Montserrat (Member)

Type: Thesis

Teseo: 291287 DIALNET

Abstract

Combined scientific and technological advances have given impetus to the relatively new discipline of functional genomics. Technologies such as microarrays, protein interaction network assays, genome meta-analysis and single nucleotide polymorphism observations allow new insights into genome organization and have been extensively employed in the biological research of both model and non-model species. A major requirement for the successful application of these genomic approaches is the functional characterization of the biological (DNA or protein) sequences involved. Functional information on gene products is the key to interpreting these genome-wide experiments, and genome annotation is necessary in order to link biological sequences, directly or indirectly, to a specific biological role. The accurate assignment of functional information to proteins is a complicated, laborious and time-consuming task, and the speed of sequence data generation greatly exceeds the capacity of manual function assignment. Nevertheless, genomic high-throughput technologies require extensive functional annotation data if the generated results are to be interpreted satisfactorily. The functional characterization of genome-wide sequence data was an issue of great relevance even before the first complete genome sequences were revealed. However, methods and tools for the high-throughput analysis of such datasets were not developed until the first common functional vocabularies with which to unify such descriptions were established. The Gene Ontology (GO), today the de facto standard, enjoys great popularity due to its species-independent and fine-grained vocabulary, which is accompanied by very detailed descriptions. Its application in the area of automatic function prediction based on function transfer has provided many advantages.

This thesis addresses the problem of functional annotation and characterization of high-throughput, genome-wide sequence data. The research was based principally on the concepts of protein function transfer, the Gene Ontology vocabulary, and general data-mining and visualization techniques. The aim was to create a bioinformatics resource of wide acceptance, usability and versatility for the functional genomics research community. The platform developed, Blast2GO, is a suite of tools and methods for assigning functional labels to genome sequences based on the GO vocabulary. Other functional databases, such as InterPro, enzyme codes and KEGG pathways, were also integrated. The most notable feature of this set of tools is the combination of various annotation strategies and methods that permit the type and intensity of assigned annotations to be controlled. The annotation behaviour of the methodology based on sequence similarity transfer was extensively evaluated, and the impact of different annotation strategies on functional genome analyses was demonstrated. As a result, valuable insight was gained that biologists can now use when addressing the difficult task of functionally characterizing novel sequence data. Numerous graphical features, such as interactive GO-graph visualization for gene-set functional profiling, descriptive charts, general sequence management options and high-throughput capabilities, were developed to make the analysis process easier. Special focus has been given throughout this thesis to the visualization of GO-related data. Data visualization is a useful component of result interpretation and is indispensable when working with large datasets.

In relation to the characterization of functional profiles, this work presents a novel strategy for measuring the distances between functional profiles. Functional similarities of annotation sets based on the Gene Ontology hierarchy were measured for an extensive dataset of drug profiles derived from genome-wide expression data. The dominant biological functions of the drugs were compared to identify similar functional characteristics among them.

In a second phase, the developed annotation strategies were applied to the entire protein sequence space in an attempt to reduce the number of unannotated sequences, especially in the case of non-model species. A centralized repository of automatically annotated sequence data, providing pre-computed functional annotations on a large scale, was generated by compute-intensive pipelines. The resource covers all biological kingdoms and is structured in such a way that the data for a species of interest can be easily downloaded and processed.

To conclude, the outcome of this thesis is the development of software and methodologies that have been well accepted within the functional genomics community. With more than 200 citations and having been applied across all biological taxa, the Blast2GO framework is today a reference resource for automated functional annotation that will contribute to genome research in the new era of massive characterization of genomes with deep sequencing technologies.