

Article title

lossgrad: Automatic Learning Rate in Gradient Descent

Publication languages
EN
Abstracts
EN
In this paper, we propose a simple, fast, and easy-to-implement algorithm, lossgrad (locally optimal step-size in gradient descent), which automatically modifies the step-size in gradient descent during neural network training. Given a function f, a point x, and the gradient ∇_x f of f, we aim to find the step-size h which is (locally) optimal, i.e. satisfies h = arg min_{t ≥ 0} f(x - t∇_x f). Making use of a quadratic approximation, we show that the algorithm satisfies the above assumption. We experimentally show that our method is insensitive to the choice of initial learning rate while achieving results comparable to other methods.
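The objective stated in the abstract, h = arg min_{t ≥ 0} f(x - t∇_x f), can be approximated with a one-dimensional quadratic fit of the loss along the negative gradient. The sketch below illustrates only that general idea, not the authors' exact lossgrad update rule; the function name quadratic_step_size, the trial step h_prev, and the fallback when the fitted curvature is non-positive are illustrative assumptions.

# Minimal sketch (not the paper's exact lossgrad rule): pick a locally optimal
# step-size along the negative gradient via a one-dimensional quadratic fit.
import numpy as np

def quadratic_step_size(f, x, grad, h_prev=1e-2):
    # Approximate argmin_{t >= 0} f(x - t*grad) by fitting
    # q(t) = f(x) + phi0*t + a*t^2 along the ray x - t*grad.
    phi0 = -np.dot(grad, grad)                       # slope of t -> f(x - t*grad) at t = 0
    f0 = f(x)
    f_trial = f(x - h_prev * grad)                   # one extra loss evaluation at the trial step
    a = (f_trial - f0 - phi0 * h_prev) / h_prev**2   # fitted curvature of the 1-D quadratic
    if a <= 0:                                       # no local minimum along the ray; keep trial step (assumed fallback)
        return h_prev
    return -phi0 / (2.0 * a)                         # vertex of the quadratic; non-negative since phi0 <= 0

# Toy usage on f(x) = 0.5*||x||^2, where the locally optimal step is exact.
f = lambda x: 0.5 * np.dot(x, x)
x = np.array([3.0, -4.0])
g = x                                                # gradient of 0.5*||x||^2
h = quadratic_step_size(f, x, g)                     # equals 1.0 here
x_new = x - h * g                                    # lands at the minimum for this toy loss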
Pages
47-57
Physical description
Bibliography: 18 items, figures
Authors
  • Faculty of Physics, Mathematics and Computer Science, Cracow University of Technology
  • Faculty of Mathematics and Computer Science, Jagiellonian University
author
  • Faculty of Mathematics and Computer Science, Jagiellonian University
Bibliography
  • [1] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
  • [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.
  • [3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.
  • [4] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [5] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [6] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730-3738, 2015.
  • [7] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pages 142-150. Association for Computational Linguistics, 2011.
  • [8] Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems, pages 181-189, 2015.
  • [9] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532-1543, 2014.
  • [10] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145-151, 1999.
  • [11] Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In Advances in Neural Information Processing Systems, pages 6434-6444, 2018.
  • [12] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [13] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26-31, 2012.
  • [14] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
  • [15] Xiaoxia Wu, Rachel Ward, and Léon Bottou. Wngrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
  • [16] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [17] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770, 2018.
  • [18] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Notes
Record developed under agreement 509/P-DUN/2018 from funds of the Ministry of Science and Higher Education (MNiSW) earmarked for science-dissemination activities (2019).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-af0dbcb0-61a6-40ec-a9b6-9ecf77ec57a8