Improving the representativeness of a simple random sample: an optimization model and its application to the Continuous Sample of Working Lives

  1. Vicente Nuñez-Antón 1
  2. Juan Manuel Pérez-Salamero González 2
  3. Marta Regúlez-Castillo 1
  4. Carlos Vidal-Meliá 2

  1 Universidad del País Vasco/Euskal Herriko Unibertsitatea, Lejona, Spain. ROR https://ror.org/000xsnr85
  2 Universitat de València, Valencia, Spain. ROR https://ror.org/043nxc105

Journal:
Documentos de Trabajo (ICAE)

ISSN: 2341-2356

Year of publication: 2019

Issue: 20

Pages: 1-30

Type: Working paper

Other publications in: Documentos de Trabajo (ICAE)

Abstract

This paper develops an optimization model for selecting a large subsample that improves the representativeness of a simple random sample previously drawn from a population larger than the population of interest. The problem is formulated as a convex mixed-integer nonlinear program (convex MINLP) and is therefore NP-hard. The solution is nevertheless found by maximizing the “constant of proportionality” (in other words, maximizing the size of the subsample taken from a stratified random sample with proportional allocation) subject to a p-value high enough to achieve a good fit to the population of interest under Pearson’s chi-square goodness-of-fit test. The beauty of the model is that it gives the user the freedom to choose between a larger subsample with a poorer fit and a smaller subsample with a better fit. The paper also applies the model to a real case: the Continuous Sample of Working Lives (CSWL), a set of anonymized microdata containing information on individuals from Spanish Social Security records. Several waves (2005-2017) are first examined without the model, and the conclusion is that they are not representative of the target population, which in this case is people receiving a pension income. The model is then applied, and the results show that it is possible to obtain a large dataset from the CSWL that represents the pensioner population far better for each of the waves analysed.
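The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's convex MINLP formulation: it is a plain grid search over the constant of proportionality k, written only to show how a p-value floor limits the subsample size. The function names (`chi2_sf`, `pearson_p`, `largest_k`), the rule m_h = min(n_h, floor(k·N_h)) and the stratum counts are all assumptions made up for illustration, not taken from the paper or from the CSWL.

```python
import math

def chi2_sf(stat, df):
    """Upper tail of the chi-square distribution for integer df (stdlib only).

    Uses Q(a+1, x) = Q(a, x) + x^a e^-x / Gamma(a+1), starting from
    Q(1/2, x) = erfc(sqrt(x)) or Q(1, x) = exp(-x), with x = stat / 2.
    """
    x = stat / 2.0
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q

def pearson_p(sizes, pop):
    """p-value of Pearson's goodness-of-fit test of subsample vs. population shares."""
    m, total = sum(sizes), sum(pop)
    stat = sum((s - m * N / total) ** 2 / (m * N / total)
               for s, N in zip(sizes, pop))
    return chi2_sf(stat, len(pop) - 1)

def largest_k(pop, samp, p_min, steps=1000):
    """Scan k from high to low; return the first (k, sizes, p) with p >= p_min.

    Hypothetical heuristic: stratum h keeps min(n_h, floor(k * N_h)) units.
    """
    k_max = max(n / N for n, N in zip(samp, pop))
    for i in range(steps, 0, -1):
        k = k_max * i / steps
        sizes = [min(n, int(k * N)) for n, N in zip(samp, pop)]
        if sum(sizes) == 0:
            continue
        p = pearson_p(sizes, pop)
        if p >= p_min:
            return k, sizes, p
    return None

# Invented stratum counts: population of interest and available sample.
population = [5000, 3000, 2000]
sample = [180, 120, 100]

for p_floor in (0.01, 0.5):
    k, sizes, p = largest_k(population, sample, p_floor)
    print(f"p_min={p_floor}: k={k:.4f}, subsample={sizes}, p-value={p:.3f}")
```

Raising the p-value floor can only shrink (or leave unchanged) the subsample this search returns, which mirrors the choice between a larger subsample with a poorer fit and a smaller one with a better fit.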

Funding information

We gratefully acknowledge the financial support from the Ministerio de Economía y Competitividad (Spain) and the Basque Government for projects ECO2015-65826-P and IT 793-13 respectively, and Ministerio de Economía y Competitividad, Agencia Estatal de Investigación (AEI), Fondo Europeo de Desarrollo Regional (FEDER), the Department of Education of the Basque Government (UPV/EHU Econometrics Research Group), and Universidad del País Vasco UPV/EHU under research grants MTM2016-74931-P (AEI/FEDER, UE), IT-642-13 and UFI11/03.

