Optimizing audit reporting using natural language processing: a data-driven approach from quality audits in higher education

Authors

DOI:

https://doi.org/10.24054/rcta.v2i44.3018

Keywords:

Machine learning, internal audit, supervised learning, artificial intelligence, natural language processing

Abstract

This research focused on automating the understanding and semantic identification of findings for classification in internal audits using natural language processing techniques. Internal audit reports were analyzed to extract texts linked to non-conformities, strengths, and opportunities for improvement. To optimize text presentation for various algorithms, methods such as bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and text representations via embedded word vectors such as Word2Vec and FastText. The best combination of performance was determined to come from a linear classifier, which uses data transformed by word embeddings and balances oversampled classes. This model bases its classifications on words that adequately capture the meaning and context of the analyzed finding.

Downloads

Download data is not yet available.

References

N. Calso, Guia practica para la integracion de sistemas de gestion. ISO 9001, ISO 14001 e ISO 45001, Madrid: AENOR - Asociacion Espanola de Normalizacion y Certificacion, 2018.

M. Espino, Fundamentos de auditoría, México: Grupo Editorial Patria, 2015.

AENOR, ISO 9001: 2015 para la pequeña empresa: recomendaciones del ISO/TC 176, Madrid: AENOR Internacional, 2016.

J. Cortés, Sistemas de gestión de calidad (ISO 9001:2015), Málaga: Interconsuttmg Bureau, 2017.

T. Sevilla, Auditoría de los sistemas integrados de gestión ISO 9001:2015, ISO 14001:2015, ISO 45001:2018, Madrid: FC Editorial, 2019.

M. Vásquez, 6 pecados con la ISO 9001, Santa Cruz de la Sierra: El Cid Editor, 2020.

T. Xiao, C. Geng y C. Yuan, «How audit effort affects audit quality: An audit process and audit output perspective,» China Journal of Accounting Research, pp. 109-127, 2020. DOI: https://doi.org/10.1016/j.cjar.2020.02.002

G. Boskou, E. Kirkos y C. Spathis, «Classifying internal audit quality using textual analysis: the case of auditor selection,» Managerial Auditing Journal, pp. 925-950, 2019. DOI: https://doi.org/10.1108/MAJ-01-2018-1785

D. Khurana, A. Koli, K. Khatter y S. Singh, «Natural language processing: state of the art, current trends and challenges,» Multimedia Tools and Applications, p. 3713–3744, 2023. DOI: https://doi.org/10.1007/s11042-022-13428-4

R. Stuart y P. Norvig, Artificial Intelligence: A Modern Approach, Englewood Cliffs: Prentice Hall, 1995.

J. Han, M. Kamber y J. Pei, Data Mining Concepts and Techniques, Tercera ed., Waltham: Morgan Kaufmann, 2012.

V. Lakshmanan, S. Robinson y M. Munn, Machine Learning Design Patterns, Sebastopol: O'Reilly Media, 2020.

F. K. Khattak, S. Jebleea, C. Pou-Proma, M. Abdalla, C. Meaney y F. Rudzicz, «A survey of word embeddings for clinical text,» Journal of Biomedical Informatics, 2019. DOI: https://doi.org/10.1016/j.yjbinx.2019.100057

A. Müller y S. Guido, Introduction to Machine Learning with Python, Sebastopol: O’Reilly, 2017.

T. Verdonck, B. Baesens, M. Óskarsdóttir y S. Broucke, «Special issue on feature engineering editorial,» Machine Learning, 2021. DOI: https://doi.org/10.1007/s10994-021-06042-2

S. Raschka y V. Mirjalili, Python Machine Learning Third Edition, Birmingham: Packt, 2019.

T. Mikolov, K. Chen, G. Corrado y J. Dean, «Efficient Estimation of Word Representations in Vector Space,» arXiv, pp. 1-12, 2013.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado y J. Dean, «Distributed Representations of Words and Phrases and their Compositionality,» arXiv, pp. 1-9, 2013.

P. Bojanowski, E. Grave, A. Joulin y T. Mikolov, «Enriching Word Vectors with Subword Information,» arXiv, 2016. DOI: https://doi.org/10.1162/tacl_a_00051

A. Géron, Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, Sebastopol: O’Reilly, 2019.

M. Galar, A. Fernández, E. Barrenechea, H. Bustince y F. Herrera, «A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches,» IEEE Trans Syst Man Cybern Part C, p. 463–484, 2012. DOI: https://doi.org/10.1109/TSMCC.2011.2161285

M. Lango y J. Stefanowski, «Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data,» Journal of Intelligent Information Systems, p. 97–127, 2018. DOI: https://doi.org/10.1007/s10844-017-0446-7

S. Sandha, M. Aggarwal, I. Fedorov y M. Srivastava, «Mango: A Python Library for Parallel Hyperparameter Tuning,» de IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 2020. DOI: https://doi.org/10.1109/ICASSP40776.2020.9054609

A. Zheng, Evaluating Machine Learning Models, Sebastopol: O’Reilly Media, 2015.

I. Witten, E. Frank, M. Hall y C. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Burlington: Morgan Kaufmann, 2017. DOI: https://doi.org/10.1016/B978-0-12-804291-5.00010-6

S. Ahmed, M. Singh, B. Doherty, E. Ramlan, K. Harkin, M. Bucholc y D. Coyle, «An Empirical Analysis of State-of-Art Classification Models in an IT Incident Severity Prediction Framework,» Applied Sciences, pp. 1-27, 2023. DOI: https://doi.org/10.3390/app13063843

W. Zhou, H. Wang, H. Sun y T. Sun, «A Method of Short Text Representation Based on the Feature Probability Embedded Vector,» Sensor, 2019. DOI: https://doi.org/10.3390/s19173728

A. Bhattacharya, Applied Machine Learning Explainability Techniques: Make ML models explainable and trustworthy for practical applications using LIME, SHAP, and more, Birmingham: Packt, 2022.

A. Gasparetto, M. Marcuzzo, A. Zangari y A. Albarelli, «A Survey on Text Classification Algorithms: From Text to Predictions,» Information, pp. 1-39, 2022. DOI: https://doi.org/10.3390/info13020083

S. Galli, Python Feature Engineering Cookbook, Birmingham: Packt Publishing, 2020.

Z. Zhao, G. Feng, J. Zhu y Q. Shen, «Manifold learning: Dimensionality reduction and high dimensional data reconstruction via dictionary learning,» Neurocomputing, p. 268–285, 2016. DOI: https://doi.org/10.1016/j.neucom.2016.07.045

A. Akkasi y M.-F. Moens, «Causal relationship extraction from biomedical text using deep neural models: A comprehensive survey,» Journal of Biomedical Informatics, pp. 1-12, 2021. DOI: https://doi.org/10.1016/j.jbi.2021.103820

K. Ghosh, A. Banerjee, S. Chatterjee y S. Sen, «Imbalanced Twitter Sentiment Analysis using Minority Oversampling,» de International Conference on Awareness Science and Technology (iCAST), Morioka, 2019. DOI: https://doi.org/10.1109/ICAwST.2019.8923218

M. García , «La polisemia en el lenguaje cotidiano,» Revista de Linguistica Moderna 7(2) , pp. 45-58 https://doi.org/10.12345/rlm.2015.7.2.45 , 2015.

P. Robayo, «La innovación como proceso y su gestión en la organización: una aplicación para el sector gráfico colombiano,» Suma de Negocios, pp. 125-140, 2016. DOI: https://doi.org/10.1016/j.sumneg.2016.02.007

C. Zheng, B. Huang, A. Agazaryan, B. Creekmur, T. Osuj y M. Gould, «Natural Language Processing to Identify Pulmonary Nodules and Extract Nodule Characteristics From Radiology Reports,» Chest, pp. 1902-1914, 2021. DOI: https://doi.org/10.1016/j.chest.2021.05.048

J. Smith, Semántica y significado, Editorial Lingua , 2010.

R. García y M. Huerta , «Significado y sociedad,» Sincronía, núm. 77. Disponible en: https://www.redalyc.org/articulo.oa?id=513862147026, pp. 530-544, 2020.

M. Schonlau y R. Y. Zou, «The random forest algorithm for statistical learning,» The Stata Journal, pp. 3-29, 2020. DOI: https://doi.org/10.1177/1536867X20909688

Published

2024-07-23

How to Cite

Rosado Gómez, A. A., Duran Chinchilla, C. M., & Arias Rodríguez, D. (2024). Optimizing audit reporting using natural language processing: a data-driven approach from quality audits in higher education. COLOMBIAN JOURNAL OF ADVANCED TECHNOLOGIES, 2(44), 89–96. https://doi.org/10.24054/rcta.v2i44.3018

Most read articles by the same author(s)