DEVELOPING A METHODOLOGY FOR CLASSIFYING COMMITS IN GIT-REPOSITORIES USING MACHINE LEARNING
Abstract and keywords
Abstract:
This paper presents a methodology and software for automatically classifying commits in Git repositories using machine learning. The proposed approach combines TF-IDF text vectorization with a Multinomial Naive Bayes model to assign commits to categories. It also includes an active learning component that lets the user correct the proposed classifications, so that the model improves continuously. The methodology covers preprocessing of commit descriptions, extraction of semantic features, and construction of an adaptive classification model. The results can be used to make development processes more transparent, to analyze change histories, to analyze and optimize code, and to automate the testing and delivery of new project modules to stakeholders (Continuous Integration / Continuous Delivery).

Keywords:
machine learning, Git, commit classification, active learning, TF-IDF, Multinomial Naive Bayes
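
The pipeline named in the abstract (TF-IDF vectorization, a Multinomial Naive Bayes classifier, and a user-correction loop) can be illustrated with a minimal sketch. This is not the authors' implementation. The class name `CommitClassifier`, the `correct` method, the category labels, and the sample commit messages are all hypothetical and chosen only for illustration. The sketch uses pure Python so that it stays self-contained, and it retrains from scratch after every correction, which is a simplification.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(message):
    """Lowercase a commit message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", message.lower())


class CommitClassifier:
    """Hypothetical sketch: TF-IDF features + multinomial Naive Bayes."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        self.messages, self.labels = [], []

    def fit(self, messages, labels):
        self.messages, self.labels = list(messages), list(labels)
        docs = [tokenize(m) for m in self.messages]
        n_docs = len(docs)
        # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
        df = Counter(t for d in docs for t in set(d))
        self.idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
        # Accumulate the TF-IDF weight of every term per class.
        self.term_weight = defaultdict(Counter)
        for doc, label in zip(docs, self.labels):
            for term, weight in self._tfidf(doc).items():
                self.term_weight[label][term] += weight
        class_counts = Counter(self.labels)
        self.class_total = {c: sum(self.term_weight[c].values()) for c in class_counts}
        self.log_prior = {c: math.log(k / n_docs) for c, k in class_counts.items()}
        return self

    def _tfidf(self, tokens):
        # Terms never seen during training are ignored.
        counts = Counter(t for t in tokens if t in self.idf)
        return {t: n * self.idf[t] for t, n in counts.items()}

    def predict(self, message):
        features = self._tfidf(tokenize(message))
        vocab_size = len(self.idf)
        scores = {}
        for c in self.log_prior:
            denom = self.class_total[c] + self.alpha * vocab_size
            scores[c] = self.log_prior[c] + sum(
                w * math.log((self.term_weight[c][t] + self.alpha) / denom)
                for t, w in features.items()
            )
        return max(scores, key=scores.get)

    def correct(self, message, label):
        """Active-learning step: store the user's label and retrain."""
        return self.fit(self.messages + [message], self.labels + [label])


# Hypothetical training data: (commit message, category).
train = [
    ("fix null pointer crash in parser", "fix"),
    ("fix broken login redirect", "fix"),
    ("add export to csv feature", "feat"),
    ("add dark mode support", "feat"),
    ("update readme with install steps", "docs"),
    ("document api endpoints in readme", "docs"),
]
clf = CommitClassifier().fit(*zip(*train))
prediction = clf.predict("fix crash on startup")

# The user supplies a label for a message the model has no evidence for.
clf.correct("bump lodash version", "chore")
after_correction = clf.predict("bump version of requests")
print(prediction, after_correction)
```

In practice the same pipeline could be assembled from library components such as scikit-learn's `TfidfVectorizer` and `MultinomialNB`. The hand-written version above only makes the individual steps visible.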
