Improved Spam Email Detection Performance Based on Naïve Bayes Approach TF-IDF Vectorizer with Multi-Metric Optimization

Elpa Triana; Ade Irma Purnamasari; Agus Bahtiar; Edi Tohidi

doi:10.59934/jaiea.v4i3.981

Authors

Elpa Triana, STMIK IKMI Cirebon
Ade Irma Purnamasari, STMIK IKMI Cirebon
Agus Bahtiar, STMIK IKMI Cirebon
Edi Tohidi, STMIK IKMI Cirebon

Keywords:

Email spam detection, Naive Bayes, TF-IDF, Machine Learning, Email Filtering

Abstract

Email spam is a serious threat to user productivity and security in digital communication, particularly as a carrier of malware and phishing. This study develops and evaluates an email spam detection model that combines the Naïve Bayes algorithm with a TF-IDF Vectorizer, with the aim of improving detection accuracy and handling language variation. The methodology follows the Knowledge Discovery in Databases (KDD) process, using email messages collected from STMIK IKMI Cirebon students over the 2020-2024 period via Google Takeout. Preprocessing consists of text cleaning, tokenization, and stemming with Sastrawi for Indonesian text, followed by transformation of the data with TF-IDF vectorization. The model was evaluated at four train/test split ratios (90:10, 80:20, 70:30, and 60:40) to test its consistency and reliability. The 80:20 split achieved the highest accuracy, 92%. At that split the model is well balanced across classes: precision is 0.94 for spam and 0.91 for non-spam, and recall is 0.91 for spam and 0.94 for non-spam. ROC curve analysis yielded consistently high AUC values (0.96 to 0.97) across all split ratios, indicating a strong ability to discriminate between spam and legitimate email. The results contribute to the development of more effective and efficient email filtering systems for protecting users from cyber threats.
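The pipeline described in the abstract (tokenization, TF-IDF weighting, Naïve Bayes classification, then per-class precision and recall) can be illustrated with a minimal sketch. This is not the authors' implementation. It uses only the Python standard library, skips Sastrawi stemming, assumes a common smoothed IDF formula and add-one smoothing, and runs on a small made-up set of messages. All function and variable names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic tokens; stemming (e.g. Sastrawi) is omitted."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    """Smoothed IDF per term (assumed variant): ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf(doc, idf):
    """Weight raw term counts by IDF; terms unseen in training are dropped."""
    counts = Counter(term for term in doc if term in idf)
    return {term: c * idf[term] for term, c in counts.items()}


class MultinomialNB:
    """Multinomial Naive Bayes over TF-IDF weights with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, vectors, labels, vocab):
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            totals = defaultdict(float)  # summed TF-IDF weight per term in class c
            for v in rows:
                for term, w in v.items():
                    totals[term] += w
            denom = sum(totals.values()) + self.alpha * len(vocab)
            self.log_like[c] = {
                t: math.log((totals.get(t, 0.0) + self.alpha) / denom) for t in vocab
            }
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t] for t, w in vector.items()
            )
        return max(self.classes, key=score)


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data, not drawn from the study's dataset.
train = [
    ("win free prize now", "spam"),
    ("free money claim prize", "spam"),
    ("claim your free voucher now", "spam"),
    ("meeting schedule for tomorrow", "ham"),
    ("please submit the assignment report", "ham"),
    ("lecture notes for tomorrow class", "ham"),
]
held_out = [("free prize voucher", "spam"), ("assignment for tomorrow lecture", "ham")]

train_docs = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_docs)
model = MultinomialNB().fit([tfidf(d, idf) for d in train_docs], train_labels, list(idf))

predictions = [model.predict(tfidf(tokenize(text), idf)) for text, _ in held_out]
precision, recall = precision_recall([y for _, y in held_out], predictions, "spam")
print(predictions, precision, recall)
```

Calling `precision_recall` once with `"spam"` and once with `"ham"` as the positive label gives per-class figures of the kind reported in the abstract. Repeating the fit with different train/test proportions would mirror the split-ratio comparison, though a dataset this small says nothing about real performance.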
