Automatic regrouping of strata in the goodness-of-fit chi-square test

  1. Vicente Núñez-Antón
  2. Juan Manuel Pérez-Salamero González
  3. Marta Regúlez-Castillo
  4. Manuel Ventura-Marco
  5. Carlos Vidal-Meliá
Journal:
Sort: Statistics and Operations Research Transactions

ISSN: 1696-2281

Year of publication: 2019

Volume: 43

Issue: 1

Pages: 113-142

Type: Article

DOI: 10.2436/20.8080.02.83 (open access)


Abstract

Pearson’s chi-square test is widely employed in social and health sciences to analyse categorical data and contingency tables. For the test to be valid, the sample size must be large enough to provide a minimum number of expected elements per category. This paper develops functions for regrouping strata automatically, thus enabling the goodness-of-fit test to be performed within an iterative procedure. The usefulness and performance of these functions are illustrated by means of a simulation study and applications to different datasets. Finally, the iterative use of the functions is applied to the Continuous Sample of Working Lives, a dataset that has been used in a considerable number of studies, especially on labour economics and the Spanish public pension system.
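
To make the idea of regrouping concrete, the following is a minimal Python sketch and not the functions developed in the paper. It assumes a simple greedy rule (merge adjacent strata from left to right until each group reaches a minimum expected count) and uses the conventional threshold of 5 as a default. The names `regroup_strata` and `pearson_statistic`, and the example counts, are hypothetical and chosen only for illustration.

```python
def regroup_strata(observed, expected, min_expected=5.0):
    """Greedily merge adjacent strata until every group has an
    expected count of at least min_expected (illustrative rule only)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")

    groups = []                  # each entry is [observed_sum, expected_sum]
    obs_acc, exp_acc = 0.0, 0.0  # running totals for the group being built
    for o, e in zip(observed, expected):
        obs_acc += o
        exp_acc += e
        if exp_acc >= min_expected:
            groups.append([obs_acc, exp_acc])
            obs_acc, exp_acc = 0.0, 0.0

    # Strata left over at the end are folded into the last complete group.
    if obs_acc > 0 or exp_acc > 0:
        if groups:
            groups[-1][0] += obs_acc
            groups[-1][1] += exp_acc
        else:
            groups.append([obs_acc, exp_acc])

    return [g[0] for g in groups], [g[1] for g in groups]


def pearson_statistic(observed, expected):
    """Return Pearson's chi-square statistic and its degrees of freedom
    (number of groups minus one, assuming no estimated parameters)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


if __name__ == "__main__":
    # Hypothetical counts: several strata have expected values below 5.
    obs = [2, 1, 12, 30, 40, 10, 3, 2]
    exp = [1.5, 2.5, 10, 32, 38, 11.5, 3, 1.5]
    obs_g, exp_g = regroup_strata(obs, exp)
    stat, df = pearson_statistic(obs_g, exp_g)
    print(obs_g, exp_g, stat, df)
```

The p-value would then be read from a chi-square distribution with `df` degrees of freedom. A greedy merge of neighbours only makes sense when the strata have a natural order (for example, age bands); other regrouping rules are possible and may behave differently.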

Funding information

The authors gratefully acknowledge financial support from Ministerio de Economía y Competitividad (Spain), Agencia Estatal de Investigación (AEI), and the European Regional Development Fund (ERDF), under research grants ECO2015-65826-P (AEI/ ERDF, EU) and MTM2016-74931-P (AEI/ERDF, EU) and from the Department of Education of the Basque Government (UPV/EHU MacLab Research Group and UPV/EHU Econometrics Research Group) under research grants IT 793-13 and IT-642-13, respectively. The authors wish to thank the editor and two anonymous referees for providing thoughtful comments and suggestions which have led to substantial improvement in the presentation of the material in this paper. They also would like to thank Jose M. Pavía, Miguel Angel García Pérez and Fernando Tusell for their comments and suggestions, and Christopher G. Pellow for his help with the English. Any errors are entirely due to the authors.

References

  • Agresti, A. (2002). Categorical Data Analysis (2nd edition). Wiley, New York.
  • Bartholomew, D.J. and Tzamourani, P. (1999). The goodness-of-fit of latent trait models in attitude measurement. Sociological Methods and Research, 27, 525–546.
  • Bartholomew, D.J., Knott, M. and Moustaki, I. (2011). Latent Variable Models and Factor Analysis (3rd edition). Wiley, New York.
  • Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.
  • Bosgiraud, J. (2006). Sur le regroupement des classes dans le test du Khi-2. Revue Roumaine de Mathématiques Pures et Appliquées, 51, 167–172.
  • Cai, L., Maydeu-Olivares, A., Coffman, D.L. and Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^p tables. British Journal of Mathematical and Statistical Psychology, 59, 173–194.
  • Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26, 3661–3675.
  • Cochran, W.G. (1952). The χ² test of goodness-of-fit. The Annals of Mathematical Statistics, 23, 315–345.
  • Collins, L.M., Fidler, P.L., Wugalter, S.E. and Long, J. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28, 375–389.
  • Delucchi, K.L. (1983). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166–176.
  • DGOSS (2014). Muestra Continua de Vidas Laborales 2013. Secretaría de Estado de la Seguridad Social. Dirección General de Ordenación (DGOSS). Ministerio de Trabajo e Inmigración. Madrid, Spain.
  • Fienberg, S.E. (2006). Log-linear models in contingency tables. In Encyclopedia of Statistical Sciences. Wiley, New York.
  • Fisher, R.A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39–54.
  • García Pérez, M.A. and Núñez-Antón, V. (2009). Accuracy of power-divergence statistics for testing independence and homogeneity in two-way contingency tables. Communications in Statistics - Simulation and Computation, 38, 503–512.
  • Goodman, L.A. (1974). Exploratory latent structures analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
  • Grafström, A. and Schelin, L. (2014). How to select representative samples. Scandinavian Journal of Statistics, 41, 277–290.
  • Haviland, M.G. (1990). Yates's correction for continuity and the analysis of 2×2 contingency tables. Statistics in Medicine, 9, 363–367.
  • Hirji, K.F. (2006). Exact Analysis of Discrete Data. Chapman and Hall, Boca Raton.
  • Hosmer, D.W., Hosmer, T., Le Cessie, S. and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965–980.
  • Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley, New York.
  • INSS (2014). Informe Estadístico 2013. Secretaría de Estado de Seguridad Social. Ministerio de Empleo y Seguridad Social, MESS. Madrid, Spain.
  • Keeling, K.B. and Pavur, R.J. (2011). Statistical accuracy of spreadsheet software. The American Statistician, 65, 265–273.
  • Khan, H.A. (2003). A visual basic software for computing Fisher's exact probability. Journal of Statistical Software, 8, 1–7.
  • Kroonenberg, P.M. and Verbeek, A. (2018). The tale of Cochran's rule: my contingency table has so many expected values smaller than 5, what am I to do? The American Statistician, 72, 175–183.
  • Kruskal, W. and Mosteller, F. (1979a). Representative sampling, I. International Statistical Review, 47, 13–24.
  • Kruskal, W. and Mosteller, F. (1979b). Representative sampling, II: scientific literature, excluding statistics. International Statistical Review, 47, 111–127.
  • Kruskal, W. and Mosteller, F. (1979c). Representative sampling, III: the current statistical literature. International Statistical Review, 47, 245–265.
  • Kruskal, W. and Mosteller, F. (1980). Representative sampling, IV: The History of the Concept in Statistics, 1895-1939. International Statistical Review, 48, 169–195.
  • Larose, D.T. and Larose, C.D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, New York.
  • Lazarsfeld, P.F. and Henry, N.W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston.
  • Lewis, D. and Burke, C.J. (1949). The use and misuse of chi-square. Psychological Bulletin, 46, 433–489.
  • Lin, J.J., Chang, C.H. and Pal, N. (2015). A revisit to contingency table and tests of independence: bootstrap is preferred to chi-square approximations as well as Fisher’s exact test. Journal of Biopharmaceutical Statistics, 25, 438–458.
  • Lydersen, S., Fagerland, M.W. and Laake, P. (2009). Tutorial in biostatistics. Recommended tests for association in 2x2 tables. Statistics in Medicine, 28, 1159–1175.
  • Marsaglia, G. (2003). Random number generators. Journal of Modern Applied Statistical Methods, 2, 2–13.
  • McCullough, B.D. (2000). The accuracy of Mathematica 4 as a statistical package. Computational Statistics, 15, 279–299.
  • McCullough, B.D. (2008). Special section on Microsoft Excel 2007. Computational Statistics and Data Analysis, 52, 4568–4569.
  • Mehta, C.R. and Patel, N.R. (1983). A network algorithm for performing Fisher’s exact test in r×c contingency tables. Journal of the American Statistical Association, 78, 427–434.
  • MESS (2017). La Muestra Continua de Vidas Laborales. Guía del contenido. Estadísticas, Presupuestos y Estudios. Estadísticas. Secretaría de Estado de Seguridad Social. Ministerio de Empleo y Seguridad Social, MESS. Madrid, Spain.
  • Moore, D.S. (1986). Tests of chi-squared type. In Goodness-of-fit Techniques (R. D’Agostino and M. Stephens, eds.). Marcel Dekker, New York, 63–95.
  • Okeniyi, J.O. and Okeniyi, E.T. (2012). Implementation of Kolmogorov Smirnov p-value computation in Visual Basic: implication for Microsoft Excel library function. Journal of Statistical Computation and Simulation, 82, 1727–1741.
  • Omair, A. (2014). Sample size estimation and sampling techniques for selecting a representative sample. Journal of Health Specialties, 2, 142–147.
  • Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
  • Pérez-Salamero González, J.M. (2015). La Muestra Continua de Vidas Laborales (MCVL) como fuente generadora de datos para el estudio del sistema de pensiones. Unpublished Ph.D. Thesis. Universitat de València, Spain.
  • Pérez-Salamero González, J.M., Regúlez-Castillo, M. and Vidal-Meliá, C. (2016). Análisis de la representatividad de la MCVL: el caso de las prestaciones del sistema público de pensiones. Hacienda Pública Española (Review of Public Economics), 217, 67–130.
  • Pérez-Salamero González, J.M., Regúlez-Castillo, M. and Vidal-Meliá, C. (2017). The continuous sample of working lives: improving its representativeness. SERIEs. Journal of the Spanish Economic Association, 8, 43–95.
  • Quintela-del-Río, A. and Francisco-Fernández, M. (2017). Excel templates: a helpful tool for teaching statistics. The American Statistician, 71, 317–325.
  • Ramsey, C.A. and Hewitt, A.D. (2005). A methodology for assessing sample representativeness. Environmental Forensics, 6, 71–75.
  • Ripley, B.D. (2002). Statistical methods need software: a view of statistical computing. Opening lecture Royal Statistical Society, Plymouth.
  • Ross, A. (2015). Probability or statistics-performing a chi-square goodness-of-fit test. Mathematics Stack Exchange.
  • Tollenaar, N. and Mooijaart, A. (2003). Type I errors and power of the parametric bootstrap goodness-of-fit test: Full and limited information. British Journal of Mathematical and Statistical Psychology, 56, 271–288.
  • Tsang, W.W. and Cheng, K.H. (2006). The chi-square test when the expected frequencies are less than 5. In COMPSTAT 2006 Proceedings in Computational Statistics (A. Rizzi and M. Vichi, eds.). Physica Verlag Springer, Heidelberg, 1583–1589.
  • Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences. Erlbaum, Hillsdale, NJ.
  • Wilkinson, L. (1994). Practical guidelines for testing statistical software. In Computational Statistics: Papers Collected on the Occasion of the 25th Conference on Statistical Computing at Schloss Reisensburg (P. Dirschedl and R. Ostermann, eds.). Physica Verlag Springer, Heidelberg, 1–16.
  • Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Supplement to the Journal of the Royal Statistical Society, 1, 217–235.