Automatic regrouping of strata in the goodness-of-fit chi-square test
- Vicente Núñez-Antón
- Juan Manuel Pérez-Salamero González
- Marta Regúlez-Castillo
- Manuel Ventura-Marco
- Carlos Vidal-Meliá
ISSN: 1696-2281
Year of publication: 2019
Volume: 43
Issue: 1
Pages: 113-142
Type: Article
Other publications in: Sort: Statistics and Operations Research Transactions
Abstract
Pearson’s chi-square test is widely employed in the social and health sciences to analyse categorical data and contingency tables. For the test to be valid, the sample size must be large enough to provide a minimum number of expected elements per category. This paper develops functions for regrouping strata automatically, thus enabling the goodness-of-fit test to be performed within an iterative procedure. The usefulness and performance of these functions are illustrated by means of a simulation study and applications to different datasets. Finally, the iterative use of the functions is applied to the Continuous Sample of Working Lives, a dataset that has been used in a considerable number of studies, especially on labour economics and the Spanish public pension system.
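The regrouping idea described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' functions, which are not reproduced in this record. It assumes a simple greedy rule (merge each stratum with its neighbours until the pooled expected count reaches a threshold) and the conventional threshold of 5. The names `regroup_strata`, `chi_square_statistic` and `min_expected` are hypothetical.

```python
from typing import List, Sequence, Tuple


def regroup_strata(
    observed: Sequence[int],
    expected: Sequence[float],
    min_expected: float = 5.0,  # assumed threshold (the usual rule of thumb)
) -> Tuple[List[int], List[float]]:
    """Greedily pool adjacent strata until each group has expected >= min_expected."""
    obs_groups: List[int] = []
    exp_groups: List[float] = []
    obs_acc, exp_acc = 0, 0.0
    for o, e in zip(observed, expected):
        obs_acc += o
        exp_acc += e
        if exp_acc >= min_expected:
            # The pooled group is large enough: close it and start a new one.
            obs_groups.append(obs_acc)
            exp_groups.append(exp_acc)
            obs_acc, exp_acc = 0, 0.0
    if obs_acc > 0 or exp_acc > 0:
        # Leftover strata at the end are folded into the last closed group.
        if obs_groups:
            obs_groups[-1] += obs_acc
            exp_groups[-1] += exp_acc
        else:
            obs_groups.append(obs_acc)
            exp_groups.append(exp_acc)
    return obs_groups, exp_groups


def chi_square_statistic(
    observed: Sequence[int], expected: Sequence[float]
) -> Tuple[float, int]:
    """Return Pearson's statistic and its degrees of freedom (groups - 1)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


if __name__ == "__main__":
    obs = [12, 3, 1, 20, 2, 2]
    exp = [10.0, 4.0, 2.0, 18.0, 3.0, 3.0]
    g_obs, g_exp = regroup_strata(obs, exp)
    stat, df = chi_square_statistic(g_obs, g_exp)
    print(g_obs, g_exp, round(stat, 4), df)
```

Pooling only adjacent strata makes sense when the categories are ordered (age bands, for example). For unordered categories a different merging rule would be needed, and the degrees of freedom always have to be recomputed from the number of groups that remain.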
Funding information
The authors gratefully acknowledge financial support from Ministerio de Economía y Competitividad (Spain), Agencia Estatal de Investigación (AEI), and the European Regional Development Fund (ERDF), under research grants ECO2015-65826-P (AEI/ERDF, EU) and MTM2016-74931-P (AEI/ERDF, EU) and from the Department of Education of the Basque Government (UPV/EHU MacLab Research Group and UPV/EHU Econometrics Research Group) under research grants IT 793-13 and IT-642-13, respectively. The authors wish to thank the editor and two anonymous referees for providing thoughtful comments and suggestions which have led to substantial improvement in the presentation of the material in this paper. They also would like to thank Jose M. Pavía, Miguel Angel García Pérez and Fernando Tusell for their comments and suggestions, and Christopher G. Pellow for his help with the English. Any errors are entirely due to the authors.
Funders
- European Commission (European Union): MTM2016-74931-P
- Agencia Estatal de Investigación (Spain)
- Euskal Herriko Unibertsitatea (Spain): IT-642-13
- Ministerio de Economía y Competitividad (Spain)
- European Regional Development Fund (European Union): ECO2015-65826-P
References
- Agresti, A. (2002). Categorical Data Analysis (2nd edition). Wiley, New York.
- Bartholomew, D.J. and Tzamourani, P. (1999). The goodness-of-fit of latent trait models in attitude measurement. Sociological Methods and Research, 27, 525–546.
- Bartholomew, D.J., Knott, M. and Moustaki, I. (2011). Latent Variable Models and Factor Analysis (3rd edition). Wiley, New York.
- Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.
- Bosgiraud, J. (2006). Sur le regroupement des classes dans le test du Khi-2. Revue Roumaine de Mathématiques Pures et Appliquées, 51, 167–172.
- Cai, L., Maydeu-Olivares, A., Coffman, D.L. and Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^p tables. British Journal of Mathematical and Statistical Psychology, 59, 173–194.
- Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26, 3661–3675.
- Cochran, W.G. (1952). The χ² test of goodness-of-fit. The Annals of Mathematical Statistics, 23, 315–345.
- Collins, L.M., Fidler, P.L., Wugalter, S.E. and Long, J. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28, 375–389.
- Delucchi, K.L. (1983). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166–176.
- DGOSS (2014). Muestra Continua de Vidas Laborales 2013. Secretaría de Estado de la Seguridad Social. Dirección General de Ordenación (DGOSS). Ministerio de Trabajo e Inmigración. Madrid, Spain.
- Fienberg, S.E. (2006). Log-linear models in contingency tables. In Encyclopedia of Statistical Sciences. Wiley, New York.
- Fisher, R.A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39–54.
- García Pérez, M.A. and Núñez-Antón, V. (2009). Accuracy of power-divergence statistics for testing independence and homogeneity in two-way contingency tables. Communications in Statistics Simulation and Computation, 38, 503–512.
- Goodman, L.A. (1974). Exploratory latent structures analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
- Grafström, A. and Schelin, L. (2014). How to select representative samples. Scandinavian Journal of Statistics, 41, 277–290.
- Haviland, M.G. (1990). Yates’s correction for continuity and the analysis of 2×2 contingency tables. Statistics in Medicine, 9, 363–367.
- Hirji, K.F. (2006). Exact Analysis of Discrete Data. Chapman and Hall, Boca Raton.
- Hosmer, D.W., Hosmer, T., Le Cessie, S. and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965–980.
- Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley, New York.
- INSS (2014). Informe Estadístico 2013. Secretaría de Estado de Seguridad Social. Ministerio de Empleo y Seguridad Social, MESS. Madrid, Spain.
- Keeling, K.B. and Pavur, R.J. (2011). Statistical accuracy of spreadsheet software. The American Statistician, 65, 265–273.
- Khan, H.A. (2003). A visual basic software for computing Fisher’s exact probability. Journal of Statistical Software, 8, 1–7.
- Kroonenberg, P.M. and Verbeek, A. (2018). The tale of Cochran’s rule: my contingency table has so many expected values smaller than 5, what am I to do? The American Statistician, 72, 175–183.
- Kruskal, W. and Mosteller, F. (1979a). Representative sampling, I. International Statistical Review, 47, 13–24.
- Kruskal, W. and Mosteller, F. (1979b). Representative sampling, II: scientific literature, excluding statistics. International Statistical Review, 47, 111–127.
- Kruskal, W. and Mosteller, F. (1979c). Representative sampling, III: the current statistical literature. International Statistical Review, 47, 245–265.
- Kruskal, W. and Mosteller, F. (1980). Representative sampling, IV: The History of the Concept in Statistics, 1895-1939. International Statistical Review, 48, 169–195.
- Larose, D.T. and Larose, C.D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, New York.
- Lazarsfeld, P.F. and Henry, N.W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston.
- Lewis, D. and Burke, C.J. (1949). The use and misuse of chi-square. Psychological Bulletin, 46, 433–489.
- Lin, J.J., Chang, C.H. and Pal, N. (2015). A revisit to contingency table and tests of independence: bootstrap is preferred to chi-square approximations as well as Fisher’s exact test. Journal of Biopharmaceutical Statistics, 25, 438–458.
- Lydersen, S., Fagerland, M.W. and Laake, P. (2009). Tutorial in biostatistics. Recommended tests for association in 2x2 tables. Statistics in Medicine, 28, 1159–1175.
- Marsaglia, G. (2003). Random number generators. Journal of Modern Applied Statistical Methods, 2, 2–13.
- McCullough, B.D. (2000). The accuracy of Mathematica 4 as a statistical package. Computational Statistics, 15, 279–299.
- McCullough, B.D. (2008). Special section on Microsoft Excel 2007. Computational Statistics and Data Analysis, 52, 4568–4569.
- Mehta, C.R. and Patel, N.R. (1983). A network algorithm for performing Fisher’s exact test in r×c contingency tables. Journal of the American Statistical Association, 78, 427–434.
- MESS (2017). La Muestra Continua de Vidas Laborales. Guía del contenido. Estadísticas, Presupuestos y Estudios. Estadísticas. Secretaría de Estado de Seguridad Social. Ministerio de Empleo y Seguridad Social, MESS. Madrid, Spain.
- Moore, D.S. (1986). Tests of chi-squared type. In Goodness-of-fit Techniques (R. D’Agostino and M. Stephens, eds.). Marcel Dekker, New York, 63–95.
- Okeniyi, J.O. and Okeniyi, E.T. (2012). Implementation of Kolmogorov Smirnov p-value computation in Visual Basic: implication for Microsoft Excel library function. Journal of Statistical Computation and Simulation, 82, 1727–1741.
- Omair, A. (2014). Sample size estimation and sampling techniques for selecting a representative sample. Journal of Health Specialties, 2, 142–147.
- Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
- Pérez-Salamero González, J.M. (2015). La Muestra Continua de Vidas Laborales (MCVL) como fuente generadora de datos para el estudio del sistema de pensiones. Unpublished Ph.D. Thesis. Universitat de València, Spain.
- Pérez-Salamero González, J.M., Regúlez-Castillo, M. and Vidal-Meliá, C. (2016). Análisis de la representatividad de la MCVL: el caso de las prestaciones del sistema público de pensiones. Hacienda Pública Española (Review of Public Economics), 217, 67–130.
- Pérez-Salamero González, J.M., Regúlez-Castillo, M. and Vidal-Meliá, C. (2017). The continuous sample of working lives: improving its representativeness. SERIEs. Journal of the Spanish Economic Association, 8, 43–95.
- Quintela-del-Río, A. and Francisco-Fernández, M. (2017). Excel templates: a helpful tool for teaching statistics. The American Statistician, 71, 317–325.
- Ramsey, C.A. and Hewitt, A.D. (2005). A methodology for assessing sample representativeness. Environmental Forensics, 6, 71–75.
- Ripley, B.D. (2002). Statistical methods need software: a view of statistical computing. Opening lecture Royal Statistical Society, Plymouth.
- Ross, A. (2015). Probability or statistics-performing a chi-square goodness-of-fit test. Mathematical Stack Exchange.
- Tollenaar, N. and Mooijaart, A. (2003). Type I errors and power of the parametric bootstrap goodness-of-fit test: Full and limited information. British Journal of Mathematical and Statistical Psychology, 56, 271–288.
- Tsang, W.W. and Cheng, K.H. (2006). The chi-square test when the expected frequencies are less than 5. In COMPSTAT 2006 Proceedings in Computational Statistics (A. Rizzi and M. Vichi, eds.). Physica Verlag Springer, Heidelberg, 1583–1589.
- Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences. Hillsdale, NJ: Erlbaum.
- Wilkinson, L. (1994). Practical guidelines for testing statistical software. In Computational Statistics: Papers Collected on the Occasion of the 25th Conference on Statistical Computing at Schloss Reisensburg (P. Dirschedl and R. Ostermann, eds.). Physica Verlag Springer, Heidelberg, 1–16.
- Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Supplement to the Journal of the Royal Statistical Society, 1, 217–235.