This is an outdated version published on 2024-03-31. Read the most recent version.

Impact of preprocessing on automatic text classification using supervised learning and Reuters 21578

Authors

Jose Manuel Arengas Acosta Universidad de Guanajuato https://orcid.org/0009-0000-3996-6062
Misael Lopez Ramirez Universidad de Guanajuato https://orcid.org/0000-0003-0801-029X
Rafael Guzman Cabrera Universidad de Guanajuato https://orcid.org/0000-0002-9320-7021

DOI:

https://doi.org/10.24054/rcta.v1i43.2506

Keywords:

Automatic text classification, Preprocessing, Reuters 21578, Machine learning

Abstract

The growing volume of digital data poses challenges for its management and categorization. This study addresses automatic text classification, with particular attention to the impact of preprocessing. Using the Reuters 21578 dataset and the supervised learning algorithms Random Forest, k-Nearest Neighbors, and Naïve Bayes, we examined how techniques such as tokenization and stop-word removal influence classification accuracy. The findings underscore the added value of preprocessing and single out Random Forest as the best-performing algorithm, with a precision of 92.2%. This research illustrates the potential of combining preprocessing techniques with machine learning algorithms to improve text categorization in the digital age.
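
The preprocessing steps named in the abstract (tokenization and stop-word removal) feeding a supervised classifier can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the stop-word list, the toy training sentences, and the names `preprocess` and `NaiveBayes` are assumptions made for the example, and the classifier is a minimal multinomial Naïve Bayes with add-one smoothing rather than the Random Forest that the study reports as best.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; the abstract does not give one).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def preprocess(text, remove_stop_words=True):
    """Lowercase the text, tokenize on alphabetic runs, optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set with made-up sentences (NOT taken from Reuters 21578).
TRAIN = [
    ("The price of crude oil rose on the market", "crude"),
    ("Oil exports and barrel output fell", "crude"),
    ("Wheat harvest is expected to grow", "grain"),
    ("Exports of corn and wheat to the region", "grain"),
]

model = NaiveBayes().fit(
    [preprocess(text) for text, _ in TRAIN],
    [label for _, label in TRAIN],
)
```

Calling `model.predict(preprocess("Crude oil barrel prices"))` classifies a new sentence; passing `remove_stop_words=False` to `preprocess` gives a simple way to compare results with and without that step, which is the kind of comparison the study carries out on the full dataset.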

Published

2024-03-31

Versions

2024-03-31 (2)
2024-03-31 (1)

How to Cite

[1]

J. M. Arengas Acosta, M. Lopez Ramirez, and R. Guzman Cabrera, “Impact of preprocessing on automatic text classification using supervised learning and Reuters 21578,” RCTA, vol. 1, no. 43, pp. 110–118, Mar. 2024.