Application of the K-Means Clustering Algorithm in the Analysis of Popularity and Growth Trends of Python Packages on the PyPI Dataset

Muhammad Rafli Wijaya; M Gali Almahdi; Sebastian Saut Marulitua Sinaga; Benedict Sandi Pangestu Rosa

doi:10.59934/jaiea.v5i3.2233

Authors

Muhammad Rafli Wijaya, Universitas Negeri Medan
M Gali Almahdi, Universitas Negeri Medan
Sebastian Saut Marulitua Sinaga, Universitas Negeri Medan
Benedict Sandi Pangestu Rosa, Universitas Negeri Medan

Keywords:

Data mining, Google BigQuery, K-Means algorithm, PyPI, Python package

Abstract

The rapid growth of the Python ecosystem has steadily increased the number of packages on the Python Package Index (PyPI) and produced a massive volume of download data. These data can be used to analyze the popularity and growth trends of the libraries used by the developer community. This study identifies popularity patterns and growth trends of Python packages using the K-Means clustering algorithm. The dataset was obtained from PyPI through the Google BigQuery platform, covering a one-year observation period with a 1% sampling technique. Pre-processing consisted of filtering the data down to the 100 packages with the highest number of downloads and constructing six main features that characterize library usage patterns. The data were then normalized using standard scaling; the optimal number of clusters was determined using the Elbow Method and evaluated using the Davies-Bouldin Index (DBI) and the Silhouette Score. The results show that the optimal number of clusters is four, with a DBI of 0.5534 and a Silhouette Score of 0.5748 (the highest among k = 2 to 10). The four clusters correspond to ecosystem foundation libraries, medium-popularity libraries, libraries with concentrated download spikes, and libraries with very rapid usage growth. These results indicate that K-Means clustering is effective for identifying popularity patterns and growth trends of libraries in large-scale PyPI datasets.
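The pipeline summarized in the abstract (standard scaling, K-Means over a range of k, then scoring each k) can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' code: the eight package names, the two features (downloads and growth rate) and their values are hypothetical stand-ins for the study's 100 packages and six features, the deterministic farthest-point seeding is an assumption chosen for reproducibility, and k is selected here by the highest Silhouette Score while inertia is only recorded as the input an elbow plot would need.

```python
import math

def standard_scale(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Seeding is deterministic farthest-point (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are final
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

def davies_bouldin(points, labels, centroids):
    """Mean over clusters of the worst (S_i + S_j) / M_ij ratio; lower is better."""
    ids = sorted(set(labels))
    scatter = {}
    for j in ids:
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter[j] = sum(math.dist(p, centroids[j]) for p in members) / len(members)
    worst = [max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                 for j in ids if j != i) for i in ids]
    return sum(worst) / len(worst)

# Hypothetical toy data: (downloads in millions, yearly growth in %) per package.
packages = {
    "pkg_a": (100.0, 1.0), "pkg_b": (101.0, 1.1),
    "pkg_c": (100.0, 9.0), "pkg_d": (101.0, 9.1),
    "pkg_e": (10.0, 1.0),  "pkg_f": (11.0, 1.1),
    "pkg_g": (10.0, 9.0),  "pkg_h": (11.0, 9.1),
}
names = list(packages)
X = standard_scale([list(packages[name]) for name in names])

results = {}
for k in range(2, 6):  # the study scans k = 2 to 10; the toy set is too small for that
    labels, centroids, inertia = kmeans(X, k)
    results[k] = {
        "labels": labels,
        "inertia": inertia,  # what an elbow plot would show against k
        "silhouette": silhouette(X, labels),
        "dbi": davies_bouldin(X, labels, centroids),
    }

best_k = max(results, key=lambda k: results[k]["silhouette"])
for k, r in results.items():
    print(f"k={k} inertia={r['inertia']:.4f} "
          f"silhouette={r['silhouette']:.4f} dbi={r['dbi']:.4f}")
print("best k by silhouette:", best_k)
```

On this toy data the four well-separated pairs are recovered at k = 4. Real download data will rarely separate this cleanly, and in practice a library implementation of scaling, K-Means and both indices would normally replace these hand-written helpers.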

