Article title

Theory II: Deep learning and optimization

Publication languages
EN
Abstracts
EN
The landscape of the empirical risk of overparametrized deep convolutional neural networks (DCNNs) is characterized with a mix of theory and experiments. In part A we show the existence of a large number of global minimizers with zero empirical error (modulo inconsistent equations). The argument, which relies on Bezout's theorem, is rigorous when the ReLUs are replaced by a polynomial nonlinearity. We show with simulations that the corresponding polynomial network is indistinguishable from the ReLU network. According to Bezout's theorem, the global minimizers are degenerate, unlike the local minima, which in general should be non-degenerate. Further, we experimentally analyze and visualize the landscape of the empirical risk of DCNNs on the CIFAR-10 dataset. Based on the above theoretical and experimental observations, we propose a simple model of the landscape of the empirical risk. In part B, we characterize the optimization properties of stochastic gradient descent (SGD) applied to deep networks. The main claim here consists of theoretical and experimental evidence for the following property of SGD: SGD concentrates in probability, like the classical Langevin equation, on large-volume, "flat" minima, selecting with high probability degenerate minimizers which are typically global minimizers.
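The part B claim can be illustrated with a minimal sketch, assuming a toy one-dimensional loss with two equally deep minima (one sharp, one flat) and isotropic Gaussian noise as a stand-in for minibatch noise; the landscape, parameters, and noise model below are illustrative assumptions, not the paper's setup. Discretized Langevin dynamics, i.e. noisy gradient descent, should then spend most of its time in the flat, large-volume basin.

import numpy as np

# Toy landscape (illustrative assumption): two zero-loss minima of equal depth,
# one sharp (high curvature) at M_SHARP and one flat (low curvature) at M_FLAT,
# U(x) = min(K_SHARP*(x - M_SHARP)^2, K_FLAT*(x - M_FLAT)^2).
K_SHARP, K_FLAT = 25.0, 1.0
M_SHARP, M_FLAT = -0.5, 0.5

def grad(x):
    # Gradient of whichever quadratic branch is active at x.
    sharp = K_SHARP * (x - M_SHARP) ** 2 < K_FLAT * (x - M_FLAT) ** 2
    return np.where(sharp, 2 * K_SHARP * (x - M_SHARP), 2 * K_FLAT * (x - M_FLAT))

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=1000)     # 1000 independent chains
eta, temp, steps = 1e-3, 0.3, 200_000     # step size, noise level, iterations

for _ in range(steps):
    # Langevin-style update: gradient step plus isotropic Gaussian noise,
    # a crude stand-in for the minibatch noise of SGD.
    x -= eta * grad(x)
    x += np.sqrt(2 * eta * temp) * rng.standard_normal(x.shape)

# The two quadratics intersect at x = -1/3, which separates the basins.
print("fraction of chains in the flat basin:", np.mean(x > -1/3))

Since the stationary Gibbs measure exp(-U/temp) of these dynamics weights each quadratic basin roughly in proportion to 1/sqrt(curvature), about five of every six chains are expected to end in the flat well in this toy setting, consistent with the claim that the dynamics select large-volume, flat minimizers.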
Pages
775–787
Physical description
Bibliography: 12 items, figures, charts, tables
Authors
author
  • Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, 02139
author
  • Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, 02139
Bibliography
  • [1] T. Poggio and Q. Liao, “Theory II: Landscape of the empirical risk in deep learning,” arXiv preprint arXiv:1703.09833, 2017.
  • [2] C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, and T. Poggio, “Musings on deep learning: Properties of SGD.”
  • [3] C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, and T. Poggio, “Theory of deep learning IIB: Optimization properties of SGD,” arXiv preprint arXiv:1801.02254, 2018.
  • [4] M. Shub and S. Smale, “Complexity of Bezout theorem V: Polynomial time,” Theoretical Computer Science, vol. 133, pp. 141–164, 1994.
  • [5] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [6] I. Borg and P.J. Groenen, Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media, 2005.
  • [7] S. Gelfand and S. Mitter, “Recursive stochastic algorithms for global optimization in R^d,” SIAM J. Control and Optimization, vol. 29, pp. 999–1018, September 1991.
  • [8] L. Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks (D. Saad, ed.), Cambridge, UK: Cambridge University Press, 1998, revised Oct. 2012.
  • [9] D. Bertsekas and J. Tsitsiklis, “Gradient convergence in gradient methods with errors,” SIAM J. Optim., vol. 10, pp. 627–642, 2000.
  • [10] D.P. Bertsekas and J.N. Tsitsiklis, Neuro-dynamic Programming. Athena Scientific, Belmont, MA, 1996.
  • [11] B. Gidas, “Global optimization via the Langevin equation,” Proceedings of the 24th IEEE Conference on Decision and Control, pp. 774–778, 1985.
  • [12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations (ICLR), 2017.
Notes
Record created under agreement 509/P-DUN/2018 with funds from the Polish Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2019).
YADDA identifier
bwmeta1.element.baztech-f872f6ec-ac32-4aae-88f6-c6b56bd3ceac