Building a comparable corpus and a benchmark for Spanish medical text simplification

  1. Campillos Llanos, Leonardo
  2. Terroba Reinares, Ana R.
  3. Zakhir Puig, Sofía
  4. Valverde, Ana
  5. Capllonch-Carrión, Adrián
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2022

Issue: 69

Pages: 189-196

Type: Article

Other publications in: Procesamiento del lenguaje natural

Abstract

We report the collection of the CLARA-MeD comparable corpus, which is made up of 24 298 pairs of professional and simplified texts in the medical domain for the Spanish language (>96M tokens). Text types comprise drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words), abstracts of systematic reviews (8138 pairs of texts, >9M words), cancer-related information summaries (201 pairs of texts, >3M tokens) and clinical trial announcements (5748 pairs of texts, 451 690 words). We also report the alignment of professional and simplified sentences, conducted manually by pairs of annotators. A subset of 3800 sentence pairs (149 862 tokens) has been aligned, each pair by 2 experts, with an average inter-annotator agreement kappa score of 0.839 (±0.076). The data are available to the community and contribute a new benchmark for developing and evaluating automatic medical text simplification systems.
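The agreement figure above is a kappa score. As an illustration only, here is a minimal sketch of Cohen's kappa for two annotators, assuming each annotator gives one categorical label per candidate sentence pair. The function name, the labels and the per-pair scores are hypothetical, and the record does not say how the authors computed their value.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = "sentences are aligned", 0 = "not aligned".
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)

# An average over several annotator pairs is the mean of the per-pair scores
# (the values below are made up for illustration).
average_kappa = mean([kappa, 0.9, 0.8])
```

With these toy labels the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.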

Bibliographic references

  • Barbu, E., Martín-Valdivia, M. T., Martínez-Cámara, E., and Ureña-López, L. A. (2015). Language technologies applied to document simplification for helping autistic people. Expert Systems with Applications, 42(12):5076-5086.
  • Campillos-Llanos, L., Valverde-Mateos, A., Capllonch-Carrión, A., and Moreno-Sandoval, A. (2021). A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Medical Informatics and Decision Making, 21(1):1-19.
  • Cardon, R. and Grabar, N. (2020). Construction d'un corpus parallèle à partir de corpus comparables pour la simplification de textes médicaux en français. Traitement Automatique des Langues, 61(2):15-39.
  • Caseli, H. M., Pereira, T. F., Specia, L., Pardo, T. A., Gasperin, C., and Aluísio, S. M. (2009). Building a Brazilian Portuguese parallel corpus of original and simplified texts. Proc. of 10th CICLing, 41:59-70.
  • Devaraj, A., Marshall, I., Wallace, B., and Li, J. J. (2021). Paragraph-level simplification of medical texts. In Proc. of the NAACL 2021, pages 4972-4984.
  • Gala, N., Tack, A., Javourey-Drevet, L., François, T., and Ziegler, J. C. (2020). Alector: A parallel corpus of simplified French texts with alignments of misreading by poor and dyslexic readers. In Proc. of LREC 2020, pages 1353-1361.
  • Grabar, N. and Cardon, R. (2018). CLEAR - Simple corpus for medical French. In Proc. of the 1st Workshop on Automatic Text Adaptation (ATA), pages 3-9.
  • Kindig, D. A., Panzer, A. M., Nielsen-Bohlman, L., et al. (2004). Health literacy: a prescription to end confusion. Washington (DC): National Academies Press.
  • Klaper, D., Ebling, S., and Volk, M. (2013). Building a German/simple German parallel corpus for automatic text simplification. In Proc. of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR 2013), Sofia, Bulgaria.
  • Martin, L., Fan, A., de la Clergerie, É., Bordes, A., and Sagot, B. (2021). Muss: Multilingual unsupervised sentence simplification by mining paraphrases. arXiv preprint arXiv:2005.00352.
  • Moramarco, F., Juric, D., Savkov, A., Flann, J., Lehl, M., Boda, K., Grafen, T., Zhelezniak, V., Gohil, S., Korfiatis, A. P., et al. (2021). Towards more patient friendly clinical notes through language models and ontologies. In Proc. of the AMIA Annual Symposium, pages 881-890.
  • Moreno-Sandoval, A., Torre-Toledano, D., Valverde-Mateos, A., and Campillos Llanos, L. (2019). Estudio sobre documentos reutilizables como recursos lingüísticos en el marco del desarrollo del plan de impulso de las tecnologías del lenguaje. Procesamiento del Lenguaje Natural, 63:167-170.
  • Paetzold, G., Alva-Manchego, F., and Specia, L. (2017). Massalign: Alignment and annotation of comparable documents. In Proceedings of the IJCNLP 2017, System Demonstrations, pages 1-4.
  • Palmero Aprosio, A., Tonelli, S., Turchi, M., Negri, M., and Di Gangi Mattia, A. (2019). Neural text simplification in low resource conditions using weak supervision. In Workshop on Methods for Optimizing and Evaluating Neural Language Generation (NeuralGen), pages 37-44.
  • Petersen, S. E. and Ostendorf, M. (2007). Text simplification for language learners: a corpus analysis. In Workshop on speech and language technology in education. Citeseer.
  • Rauf, S. A., Ligozat, A.-L., Yvon, F., Illouz, G., and Hamon, T. (2020). Simplification automatique de texte dans un contexte de faibles ressources. In Actes 6e conférence Traitement Automatique des Langues Naturelles (TALN), vol. 2, pages 332-341.
  • Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982-3992.
  • Sackett, D. L., Rosenberg, W. M., Gray, J. M., Haynes, R. B., and Richardson, W. S. (1996). Evidence based medicine: what it is and what it isn't. British Medical Journal, 312(7023):71-72.
  • Saggion, H., Gómez-Martínez, E., Etayo, E., Anula, A., and Bourg, L. (2011). Text simplification in Simplext: Making texts more accessible. Procesamiento del lenguaje natural, (47):341-342.
  • Sakakini, T., Lee, J. Y., Duri, A., Azevedo, R. F., Sadauskas, V., Gu, K., Bhat, S., Morrow, D., Graumlich, J., Walayat, S., et al. (2020). Context-aware automatic text simplification of health materials in low-resource domains. In Proc. of the 11th LOUHI Workshop, pages 115-126.
  • Scarton, C., Paetzold, G., and Specia, L. (2018). Simpa: A sentence-level simplification corpus for the public administration domain. In Proc. of LREC 2018, pages 4333-4338.
  • Štajner, S., Franco-Salvador, M., Rosso, P., and Ponzetto, S. P. (2018). CATS: A tool for customized alignment of text simplification corpora. In Proc. of LREC 2018, pages 3895-3903.
  • Tonelli, S., Aprosio, A. P., and Saltori, F. (2016). SIMPITIKI: a Simplification corpus for Italian. In CLiC-it/EVALITA, pages 4333-4338.
  • Van den Bercken, L., Sips, R.-J., and Lofi, C. (2019). Evaluating neural text simplification in the medical domain. In Proc. of the World Wide Web Conference, pages 3286-3292.
  • Xu, W., Callison-Burch, C., and Napoles, C. (2015). Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283-297.
  • Yimam, S. M., Štajner, S., Riedl, M., and Biemann, C. (2017). Multilingual and cross-lingual complex word identification. In Proc. of the Int. Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 813-822.
  • Zhu, Z., Bernhard, D., and Gurevych, I. (2010). A monolingual tree-based translation model for sentence simplification. In Proc. of the 23rd Intern. Conference on Computational Linguistics (COLING 2010), pages 1353-1361, Beijing, China.