Smooth generalized linear models for aggregated data
- Ayma Anza, Diego Armando
- María Luz Durbán Reguera Supervisor
- Dae-Jin Lee Supervisor
University of defense: Universidad Carlos III de Madrid
Defense date: 15 December 2016
- Miguel Angel Martínez Beneito Chair
- Irene Albarrán Lozano Secretary
- Jutta Gampe Committee member
Type: Thesis
Abstract
Aggregated data commonly appear in areas such as epidemiology, demography, and public health. Generally, data are aggregated to protect the privacy of patients, to facilitate compact presentation, or to make them comparable with other, coarser datasets. However, this process may hinder the visualization of the underlying distribution that the data follow. It also prevents the direct analysis of relationships between aggregated data and potential risk factors, which are commonly measured at a finer resolution. Therefore, it is of interest to develop statistical methodologies that disaggregate coarse health data to a finer scale. For example, in the spatial setting, it could be desirable to obtain estimates, from coarse areal data, on a fine spatial grid or on units less coarse than the original ones. These two cases are known as the area-to-point (ATP) and area-to-area (ATA) cases, respectively, and are illustrated in the first chapter of this thesis. Moreover, we can have spatial data recorded at coarse units over time. In some cases, the temporal dimension can also be in aggregated form, hindering the visualization of the evolution of the underlying process over time. In this thesis we propose a novel non-parametric method that we call the composite link mixed model or, more succinctly, CLMM. In our proposed model, we regard the observed data as indirect observations of an underlying process (defined at a finer resolution than the observed data), which we want to estimate. The mixed model formulation of our proposal allows us to include fine-scale population information and complex structures as random effects as part of the modelling of the underlying trend. Since the CLMM is based on the approach of Eilers (2007), called the penalized composite link model (PCLM), we briefly review the PCLM approach in the first section of the second chapter of this thesis.
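The idea of treating aggregated counts as indirect observations of a finer latent process can be illustrated with a small sketch. It is not the thesis implementation: it assumes Poisson counts whose means are sums of a smooth fine-scale sequence, uses an identity basis instead of B-splines, fixes the smoothing parameter `lam` by hand rather than estimating it through the mixed model, and runs a penalized iteratively reweighted least squares loop in the spirit of the PCLM. The function names `composition_matrix` and `fit_pclm` and the toy counts are hypothetical, chosen only for illustration.

```python
import numpy as np


def composition_matrix(group_sizes):
    """Build a 0/1 matrix C with C[i, j] = 1 when fine cell j lies in coarse group i."""
    n_fine = sum(group_sizes)
    C = np.zeros((len(group_sizes), n_fine))
    start = 0
    for i, size in enumerate(group_sizes):
        C[i, start:start + size] = 1.0
        start += size
    return C


def fit_pclm(y, C, lam=1.0, max_iter=200, tol=1e-8):
    """Penalized composite link fit: y ~ Poisson(C @ exp(beta)), smooth beta.

    Assumptions (not from the source): identity basis, second-order
    difference penalty, fixed smoothing parameter lam.
    """
    n_fine = C.shape[1]
    D = np.diff(np.eye(n_fine), n=2, axis=0)      # second-order differences
    P = lam * D.T @ D                             # roughness penalty
    beta = np.full(n_fine, np.log(y.sum() / n_fine))  # flat starting values
    for _ in range(max_iter):
        gamma = np.exp(beta)                      # latent fine-scale means
        mu = C @ gamma                            # expected aggregated counts
        X_work = (C * gamma) / mu[:, None]        # working design W^{-1} C Gamma
        lhs = X_work.T @ np.diag(mu) @ X_work + P
        rhs = X_work.T @ (y - mu + mu * (X_work @ beta))
        beta_new = np.linalg.solve(lhs, rhs)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    gamma = np.exp(beta)
    return gamma, C @ gamma


# Toy example (made-up counts): 4 coarse groups, each covering 3 fine cells.
C = composition_matrix([3, 3, 3, 3])
y = np.array([10.0, 30.0, 50.0, 20.0])
gamma_hat, mu_hat = fit_pclm(y, C, lam=1.0)
print(np.round(gamma_hat, 2))   # smooth fine-scale estimates
print(np.round(mu_hat, 2))      # fitted aggregated means, comparable to y
```

In this sketch the composition matrix `C` is the only place where the aggregation scheme enters, so a different coarsening (for instance unequal group widths) would only change the argument passed to `composition_matrix`.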
Then, in the second section of this chapter, we introduce the CLMM approach in a univariate setting, which can be seen as a reformulation of the PCLM within a mixed model framework. This is achieved by following the mixed model reformulation of P-splines proposed in Currie and Durbán (2002) and Currie et al. (2006), which is also reviewed here. The parameters of the CLMM can then be estimated within the framework of mixed model theory. This offers an alternative for the estimation of the PCLM that avoids the use of information criteria for smoothing parameter selection. In the third section of the second chapter, we extend the CLMM approach to the multidimensional (array) case, where Kronecker products are involved in the extended model formulation. Illustrations for the univariate and the multidimensional array settings are presented throughout the second chapter, using mortality and fertility datasets, respectively. In the third chapter, we present a new methodology for the analysis of spatially aggregated data, by extending the CLMM approach developed in the second chapter to the spatial case. The spatial CLMM provides smoothed solutions for the ATP and ATA cases described in the first chapter, i.e., it gives a smoothed estimate of the underlying spatial trend, from aggregated data, at a finer resolution. The ATP and ATA cases are illustrated using several mortality (or morbidity) datasets, and simulation studies are carried out to compare the prediction performance of our approach with that of the area-to-point Poisson kriging of Goovaerts (2006). Also, in the third chapter we provide a methodology to deal with the overdispersion problem, which is based on the PRIDE ('penalized regression with individual deviance effects') approach of Perperoglou and Eilers (2010). In the fourth chapter, we generalize the methodology developed in the third chapter to the analysis of spatio-temporally aggregated data.
Under this framework, we adapt the SAP ('separation of anisotropic penalties') algorithm of Rodríguez-Álvarez et al. (2015) and the GLAM ('generalized linear array model') algorithms given in Currie et al. (2006) and Eilers et al. (2006) to the CLMM context. The use of these efficient algorithms allows us to avoid possible storage problems and to reduce the computation time of the model estimation. We illustrate the methodology presented in this chapter using a Q fever incidence dataset recorded in the Netherlands at municipality level and by month. Our aim, then, is to estimate smoothed incidences on a fine spatial grid over the study area throughout the 53 weeks of 2009. A simulation study is provided at the end of chapter four in order to evaluate the prediction performance of our approach under three different levels of coarseness, using a detailed (and confidential) Q fever incidence dataset. Finally, the fifth chapter summarizes the main contributions of this thesis and outlines further work.
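The storage saving attributed to array (GLAM-type) arithmetic rests on a standard identity: a linear predictor built from a Kronecker product of two marginal bases can be computed from the small marginal matrices without ever forming the large product. The sketch below checks that identity numerically. It is only an illustration under stated assumptions: random matrices stand in for the marginal B-spline bases, the dimensions are arbitrary, and neither the full GLAM algorithm nor the SAP algorithm is reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two marginal bases (hypothetical sizes):
# B1 is n1 x c1 (e.g. one spatial or age dimension), B2 is n2 x c2 (e.g. time).
n1, c1, n2, c2 = 40, 8, 30, 6
B1 = rng.normal(size=(n1, c1))
B2 = rng.normal(size=(n2, c2))
Theta = rng.normal(size=(c1, c2))       # coefficients arranged as an array

# Naive route: build the full (n1*n2) x (c1*c2) Kronecker basis.
B_full = np.kron(B1, B2)
eta_naive = B_full @ Theta.ravel()      # row-major vectorization of Theta

# Array route: two small matrix products, no Kronecker product stored.
eta_array = (B1 @ Theta @ B2.T).ravel()

print(B_full.shape)                     # size of the matrix the array route avoids
print(np.allclose(eta_naive, eta_array))
```

With row-major vectorization, as used by NumPy's `ravel`, the matching Kronecker order is `np.kron(B1, B2)`; with column-major vectorization the order of the two factors is reversed.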