Optimizing audit reporting using natural language processing: a data-driven approach from quality audits in higher education





Machine learning, internal audit, supervised learning, artificial intelligence, natural language processing


This research focused on automating the semantic identification and classification of findings in internal audits using natural language processing techniques. Internal audit reports were analyzed to extract texts linked to non-conformities, strengths, and opportunities for improvement. To optimize text representation for various algorithms, several methods were applied: bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and word-embedding representations such as Word2Vec and FastText. The best performance was obtained with a linear classifier trained on data transformed by word embeddings, with classes balanced by oversampling. This model bases its classifications on words that adequately capture the meaning and context of the analyzed finding.
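As an illustration of two of the steps named above, the following minimal sketch computes TF-IDF vectors from bag-of-words counts and balances the classes by random oversampling, using only the Python standard library. It is not the implementation used in the study: the example findings, the labels, the function names, the smoothed IDF formula (one common variant), and the choice of random duplication as the oversampling method are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real pipelines usually do more).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of raw text documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF (one common variant): ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)      # bag-of-words counts
        total = len(doc) or 1      # guard against empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical audit findings and labels, invented for this sketch.
docs = [
    "procedure not documented",
    "records missing",
    "evidence not documented",
    "strong student support",
    "consider updating the syllabus template",
]
labels = ["non-conformity", "non-conformity", "non-conformity",
          "strength", "opportunity"]

vocab, X = tfidf_vectors(docs)
X_bal, y_bal = oversample(X, labels)
print(Counter(y_bal))
```

The balanced vectors could then be passed to any linear classifier. Replacing the TF-IDF step with averaged Word2Vec or FastText vectors would require a trained embedding model, which is outside the scope of this sketch.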


