Generative manifold learning for the exploration of partially labeled data
- Cruz Barbosa, Raúl
- Alfredo Vellido Alacena Director/a
Universidad de defensa: Universitat Politècnica de Catalunya (UPC)
Fecha de defensa: 01 de octubre de 2009
- Manuel Graña Romay Presidente/a
- Lluis Antoni Belanche Muñoz Secretario/a
- Maite López Sánchez Vocal
- José Cristobal Riquelme Santos Vocal
- Emilio Soria Olivas Vocal
Tipo: Tesis
Resumen
In many real-world application problems, the availability of data labels for supervised learning is rather limited, Incompletely labeled datasets are common in many of the databases generated in some of the currently most active areas of research. It is often the case that a limited number of labeled cases is accompanied by a larger number of unlabeled ones. This is the setting for semi-supervised learning, in which unsupervised approaches assist the supervised problem and vice versa. A manifold learning model, namely Generative Topographic Mapping (GTM), is the basis of the methods developed in this thesis. The non-linearity of the mapping that GTM generates makes it prone to trustworthiness and continuity errors that would reduce the faithfulness of the data representation, especially for datasets of convoluted geometry. In this thesis, a variant of GTM that uses a graph approximation to the geodesic metric is first defined. This model is capable of representing data of convoluted geometries. The standard GTM is here modified to prioritize neighbourhood relationships along the generated manifold. This is accomplished by penalizing the possible divergences between the Euclidean distances from the data points to the model prototypes and the corresponding geodesic distances along the manifold. The resulting Geodesic GTM (Geo-GTM) model is shown to improve the continuity and trustworthiness of the representation generated by the model, as well as to behave robustly in the presence of noise. The thesis then leads towards the definition and development of semi-supervised versions of GTM for partially-labeled data exploration. As a first step in this direction, a two-stage clustering procedure that uses class information is presented. A class information-enriched variant of GTM, namely class-GTM, yields a first cluster description of the data. The number of clusters defined by GTM is usually large for visualization purposes and does not necessarily correspond to the overall class structure. Consequently, in a second stage, clusters are agglomerated using the K-means algorithm with different novel initialization strategies that benefit from the probabilistic definition of GTM. We evaluate if the use of class information influences cluster-wise class separability. A robust variant of GTM that detects outliers while effectively minimizing their negative impact in the clustering process is also assessed in this context. We then proceed to the definition of a novel semi-supervised model, SS-Geo-GTM, that extends Geo-GTM to deal with semi-supervised problems. In SS-Geo-GTM, the model prototypes are linked by the nearest neighbour to the data manifold constructed by Geo-GTM. The resulting proximity graph is used as the basis for a class label propagation algorithm. The performance of SS-Geo-GTM is experimentally assessed, comparing positively with that of an Euclidean distance-based counterpart and that of the alternative Laplacian Eigenmaps method. Finally, the developed models (the two-stage clustering procedure and the semi-supervised models) are applied to the analysis of a human brain tumour dataset (obtained by Nuclear Magnetic Resonance Spectroscopy), where the tasks are, in turn, data clustering and survival prognostic modeling.