This is an outdated version published on 2022-07-25.

MACHINE LEARNING AND THE REUTERS COLLECTION-21578 IN DOCUMENT CLASSIFICATION

Authors

Paniagua Medina, J. J., Vargas Rodríguez, E., & Guzmán Cabrera, R.

DOI:

https://doi.org/10.24054/rcta.v2i40.2344

Keywords:

Document classification, naive Bayes, logistic regression, SVM

Abstract

Documents are now so easy to produce that the resulting volume of information is almost impossible to organize without automatic methods. Automatic document classification can be defined as an action carried out by an artificial system on a set of structured or unstructured documents: the words contained in a document are used to determine the class to which that document belongs. This paper presents several classification experiments on the Reuters-21578 collection in order to observe the performance of naive Bayes classifiers, support vector machines (SVM), and logistic regression. The results show how the classifiers perform, how they behave when cleaning techniques are applied to reduce the size of the documents, and how they behave under different classification scenarios.
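
The idea of using a document's words to decide its class can be sketched in a few lines. The example below is a minimal illustration only, not the authors' experimental setup: it implements a multinomial naive Bayes classifier with add-one smoothing in plain Python. The training texts, the two class labels, the stop-word list, and the function names (`tokenize`, `train`, `predict`) are all assumptions invented for this sketch; none of them are taken from the paper or from Reuters-21578 itself.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the paper's actual cleaning steps are not shown here.
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "on", "for", "as"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log-posterior under naive Bayes."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration (not Reuters-21578 data).
TRAIN_DOCS = [
    ("Wheat and corn exports rose as the harvest improved", "grain"),
    ("Farmers expect a record wheat harvest", "grain"),
    ("Crude oil prices fell on higher output", "crude"),
    ("OPEC agreed to cut crude oil output", "crude"),
]

model = train(TRAIN_DOCS)
print(predict("Oil output cut lifts crude prices", model))
print(predict("Wheat harvest delayed by rain", model))
```

Dropping stop words in `tokenize` is one simple example of a cleaning step that shrinks each document before classification. A real experiment would replace the toy list with the actual collection and evaluate on held-out documents.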

Published

2023-05-02 — Updated on 2022-07-25

How to Cite

Paniagua Medina, J. J., Vargas Rodríguez, E., & Guzmán Cabrera, R. (2022). MACHINE LEARNING AND THE REUTERS COLLECTION-21578 IN DOCUMENT CLASSIFICATION. COLOMBIAN JOURNAL OF ADVANCED TECHNOLOGIES, 2(40), 39–47. https://doi.org/10.24054/rcta.v2i40.2344 (Original work published May 2, 2023)