Identifiers
Title variants
Publication languages
Abstracts
Recent works have developed model-complexity-based and algorithm-based generalization error bounds to explain how stochastic gradient descent (SGD) methods help over-parameterized models generalize better. However, previous works are limited in their scope of analysis and fail to provide comprehensive explanations. In this paper, we propose a novel Gaussian approximation framework to establish generalization error bounds for the U-SGD family, a class of SGD methods with asymptotically unbiased and uniformly bounded gradient noise. We study U-SGD dynamics and show, both theoretically and numerically, that the limiting model parameter distribution tends to be Gaussian even when the original gradient noise is non-Gaussian. For the U-SGD family, we establish a desirable iteration-number-independent generalization error bound of order O((1 + √(log(p√n)))/√n), where n and p denote the sample size and the parameter dimension, respectively. Based on our analysis, we propose two general types of methods to help models generalize better, termed additive and multiplicative noise insertion. We show that these methods significantly reduce the dominant term of the generalization error bound.
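The following is a minimal, illustrative Python sketch, not the authors' implementation; the names run_sgd, noise_mode and sigma are assumptions. It shows the two noise-insertion schemes described in the abstract applied to mini-batch SGD on a toy least-squares problem, followed by a D'Agostino-Pearson normality check (cf. reference [7]) of the final iterates across independent runs, echoing the Gaussian limiting-distribution claim.

# Minimal sketch (assumed names: run_sgd, noise_mode, sigma) of additive and
# multiplicative gradient-noise insertion in mini-batch SGD on a toy
# least-squares problem, plus a normality check on the final iterates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 2000, 5                                  # sample size, parameter dimension
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

def run_sgd(steps=1000, lr=0.01, batch=32, noise_mode=None, sigma=0.05):
    """Mini-batch SGD; optionally perturb each stochastic gradient before the update."""
    theta = np.zeros(p)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch
        if noise_mode == "additive":            # g <- g + xi, xi ~ N(0, sigma^2 I)
            grad = grad + sigma * rng.standard_normal(p)
        elif noise_mode == "multiplicative":    # g <- g * (1 + xi), elementwise
            grad = grad * (1.0 + sigma * rng.standard_normal(p))
        theta -= lr * grad
    return theta

# Across 200 independent runs, test whether the first coordinate of the final
# iterate is approximately Gaussian.
finals = np.array([run_sgd(noise_mode="additive")[0] for _ in range(200)])
stat, pval = stats.normaltest(finals)           # D'Agostino-Pearson test
print(f"normality test p-value: {pval:.3f}")

In this sketch the additive scheme injects gradient-independent Gaussian noise, while the multiplicative scheme scales the perturbation with the gradient itself; under these assumptions both keep the stochastic gradient asymptotically unbiased, which is the defining property of the U-SGD family described above.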
Year
Volume
Pages
251--266
Physical description
Bibliography: 38 items; figures, tables, charts
Authors
author
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, 96 Jinzhai Road, Hefei, 230026, China
author
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
author
- School of Artificial Intelligence and Data Science, University of Science and Technology of China, 96 Jinzhai Road, Hefei, 230026, China
References
- [1] Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R. and Wang, R. (2019). On exact computation with an infinitely wide neural net, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 8141-8150.
- [2] Bartlett, P.L., Foster, D.J. and Telgarsky, M.J. (2017). Spectrally-normalized margin bounds for neural networks, in I. Guyon et al. (Eds), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, New York, pp. 6240-6249.
- [3] Bottou, L. (1998). On-line learning and stochastic approximations, in L. Bottou (Ed) On-line Learning in Neural Networks, Cambridge University Press, pp. 9-42.
- [4] Bousquet, O. and Elisseeff, A. (2002). Stability and generalization, Journal of Machine Learning Research 2: 499-526, DOI: 10.1162/153244302760200704.
- [5] Chen, H., Mo, Z., Yang, Z. and Wang, X. (2019). Theoretical investigation of generalization bound for residual networks, Proceedings of the 28th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Francisco, pp. 2081-2087, DOI: 10.24963/ijcai.2019/288.
- [6] Chen, Y., Chang, H., Meng, J. and Zhang, D. (2019). Ensemble neural networks (ENN): A gradient-free stochastic method, Neural Networks 110: 170-185.
- [7] D’Agostino, R. and Pearson, E.S. (1973). Tests for departure from normality. Empirical results for the distributions of b₂ and √b₁, Biometrika 60(3): 613-622.
- [8] Dieuleveut, A., Durmus, A. and Bach, F. (2017). Bridging the gap between constant step size stochastic gradient descent and Markov chains, arXiv: 1707.06386.
- [9] Elisseeff, A., Evgeniou, T. and Pontil, M. (2005). Stability of randomized learning algorithms, Journal of Machine Learning Research 6: 55-79.
- [10] Feng, Y., Gao, T., Li, L., Liu, J.-G. and Lu, Y. (2020). Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation, Communications in Mathematical Sciences 18(1).
- [11] Hardt, M., Recht, B. and Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent, in M.F. Balcan and K.Q.Weinberger (Eds), Proceedings of The 33rd International Conference on Machine Learning, New York, USA, pp. 1225-1234.
- [12] He, F., Liu, T. and Tao, D. (2019). Control batch size and learning rate to generalize well: Theoretical and empirical evidence, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 1143-1152.
- [13] He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034.
- [14] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H. and Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704-2713, DOI: 10.1109/CVPR.2018.00286.
- [15] Kramers, H. (1940). Brownian motion in a field of force and the diffusion model of chemical reactions, Physica 7(4): 284-304.
- [16] Krizhevsky, A. and Hinton, G. (2009). Learning Multiple Layers of Features from Tiny Images, Master’s thesis, Department of Computer Science, University of Toronto.
- [17] LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/.
- [18] Li, H., Xu, Z., Taylor, G., Studer, C. and Goldstein, T. (2018). Visualizing the loss landscape of neural nets, in S. Bengio et al. (Eds), Advances in Neural Information Processing Systems, Vol. 31, Curran Associates, New York, pp. 6389-6399.
- [19] Li, J., Luo, X. and Qiao, M. (2020). On generalization error bounds of noisy gradient methods for non-convex learning, arXiv: 1902.00621.
- [20] Li, Q., Tai, C. and Weinan, E. (2019). Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations, Journal of Machine Learning Research 20(40): 1-47.
- [21] Li, X., Lu, J., Wang, Z., Haupt, J. and Zhao, T. (2019). On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond, arXiv: 1806.05159.
- [22] Ljung, L., Pflug, G. and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, Switzerland.
- [23] London, B., Huang, B., Taskar, B. and Getoor, L. (2014). PAC-Bayesian collective stability, in S. Kaski and J. Corander (Eds), Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, pp. 585-594.
- [24] Mandt, S., Hoffman, M.D. and Blei, D.M. (2017). Stochastic gradient descent as approximate Bayesian inference, Journal of Machine Learning Research 18(1): 4873-4907.
- [25] McAllester, D.A. (1999). PAC-Bayesian model averaging, Proceedings of the 12th Annual Conference on Computational Learning Theory, COLT ’99, New York, NY, USA, pp. 164-170, DOI: 10.1145/307400.307435.
- [26] Negrea, J., Haghifam, M., Dziugaite, G.K., Khisti, A. and Roy, D.M. (2019). Information-theoretic generalization bounds for SGLD via data-dependent estimates, in H. Wallach et al. (Eds), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, New York, pp. 11015-11025.
- [27] Panigrahi, A., Somani, R., Goyal, N. and Netrapalli, P. (2019). Non-Gaussianity of stochastic gradient noise, arXiv: 1910.09626.
- [28] Qian, Y., Wang, Y., Wang, B., Gu, Z., Guo, Y. and Swaileh, W. (2022). Hessian-free second-order adversarial examples for adversarial learning, arXiv: 2207.01396.
- [29] Rong, Y., Huang, W., Xu, T. and Huang, J. (2020). DropEdge: Towards deep graph convolutional networks on node classification, International Conference on Learning Representations 2020, https://openreview.net/pdf?id=Hkx1qkrKPr, arXiv: 1907.10903.
- [30] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, Cambridge.
- [31] Simsekli, U., Sagun, L. and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks, in K. Chaudhuri and R. Salakhutdinov (Eds), Proceedings of the 36th International Conference on Machine Learning, New York, pp. 5827-5837.
- [32] Sulaiman, I.M., Kaelo, P., Khalid, R. and Nawawi, M.K.M. (2024). A descent generalized RMIL spectral gradient algorithm for optimization problems, International Journal of Applied Mathematics and Computer Science 34(2): 225-233, DOI: 10.61822/amcs-2024-0016.
- [33] Sutskever, I., Martens, J., Dahl, G. and Hinton, G. (2013). On the importance of initialization and momentum in deep learning, in S. Dasgupta and D. McAllester (Eds), Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, pp. 1139-1147.
- [34] Villani, C. (2008). Optimal Transport: Old and New, Grundlehren der Mathematischen Wissenschaften, Springer, Berlin/Heidelberg.
- [35] Weinan, E., Ma, C. and Wang, Q. (2019). A priori estimates of the population risk for residual networks, arXiv: 1903.02154.
- [36] Welling, M. and Teh, Y.W. (2011). Bayesian learning via stochastic gradient Langevin dynamics, Proceedings of the 28th International Conference on Machine Learning, ICML’11, Omnipress, Madison, WI, USA, pp. 681-688.
- [37] Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V. and Zhu, Z. (2020). On the noisy gradient descent that generalizes as SGD, International Conference on Machine Learning, PMLR 119: 10367-10376.
- [38] Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization, arXiv: 1611.03530.
Notes
Record developed with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. POPUL/SP/0154/2024/02, under the programme "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II), module: Popularyzacja nauki (Science Popularisation) (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-2e2adeb9-5ec1-4e03-9b7b-9dd0faaa54ae