Testing approaches for AI-based applications
Keywords:
artificial intelligence, software testing, reproducibility of results, performance evaluation, software metrics
Abstract
Modern artificial intelligence systems, especially deep-learning and generative models, contain numerous non-deterministic components. Stochastic training, parallel execution, data augmentation, and GPU-dependent floating-point operations all affect testability: they can lead to variance in outputs, intermittently failing (flaky) tests, and results that are difficult to reproduce. The paper examines the sources of non-deterministic behavior and ways to handle it, and addresses the design of stable metrics and measurement protocols.
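Two of the sources of non-determinism named in the abstract can be illustrated with a short sketch. It uses only the Python standard library and shows that floating-point addition depends on evaluation order (which parallel or GPU reductions may change), and that a seeded generator plus a tolerance-based check gives a repeatable test. The function `noisy_metric`, its target value 0.8, and the chosen tolerances are hypothetical stand-ins for illustration, not the metrics or protocol proposed in the paper.

```python
import math
import random
import statistics

# Floating-point addition is not associative: the same three terms,
# grouped differently, give results that differ in the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left

# An exact-equality assertion would fail here; a tolerance-based one does not.
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)


def noisy_metric(seed: int, n: int = 1000) -> float:
    """Hypothetical stochastic evaluation: mean of seeded noise around 0.8."""
    rng = random.Random(seed)  # local generator, avoids shared global state
    samples = [0.8 + rng.gauss(0.0, 0.05) for _ in range(n)]
    return statistics.fmean(samples)


def metric_spread(seeds: list[int]) -> tuple[float, float]:
    """Run the metric once per seed and report mean and sample standard deviation."""
    scores = [noisy_metric(s) for s in seeds]
    return statistics.fmean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    print("order matters:", order_matters, "| within tolerance:", close_enough)
    print("same seed reproduces:", noisy_metric(42) == noisy_metric(42))
    mean, spread = metric_spread(list(range(10)))
    print(f"mean={mean:.4f} stdev={spread:.4f}")
```

Reporting the mean together with its spread over several seeds, rather than a single run, is one simple way to keep a test verdict from depending on one random draw.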