

Article title

lossgrad: Automatic Learning Rate in Gradient Descent

Publication languages
EN
Abstracts
EN
In this paper, we propose a simple, fast, and easy-to-implement algorithm, lossgrad (locally optimal step-size in gradient descent), which automatically modifies the step-size in gradient descent during neural network training. Given a function f, a point x, and the gradient ∇_x f of f, we aim to find the step-size h which is (locally) optimal, i.e. satisfies h = arg min_{t ≥ 0} f(x - t∇_x f). Making use of a quadratic approximation, we show that the algorithm satisfies the above assumption. We experimentally show that our method is insensitive to the choice of initial learning rate while achieving results comparable to other methods.
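The objective stated in the abstract, h = arg min_{t ≥ 0} f(x - t∇_x f), can be approximated with a one-dimensional quadratic fit of the loss along the negative gradient. The sketch below illustrates only that general idea, not the authors' exact lossgrad update rule; the function name quadratic_step_size, the trial step h_prev, and the fallback when the fitted curvature is non-positive are illustrative assumptions.

# Minimal sketch (not the paper's exact lossgrad rule): pick a locally optimal
# step-size along the negative gradient via a one-dimensional quadratic fit.
import numpy as np

def quadratic_step_size(f, x, grad, h_prev=1e-2):
    # Approximate argmin_{t >= 0} f(x - t*grad) by fitting
    # q(t) = f(x) + phi0*t + a*t^2 along the ray x - t*grad.
    phi0 = -np.dot(grad, grad)                       # slope of t -> f(x - t*grad) at t = 0
    f0 = f(x)
    f_trial = f(x - h_prev * grad)                   # one extra loss evaluation at the trial step
    a = (f_trial - f0 - phi0 * h_prev) / h_prev**2   # fitted curvature of the 1-D quadratic
    if a <= 0:                                       # no local minimum along the ray; keep trial step (assumed fallback)
        return h_prev
    return -phi0 / (2.0 * a)                         # vertex of the quadratic; non-negative since phi0 <= 0

# Toy usage on f(x) = 0.5*||x||^2, where the locally optimal step is exact.
f = lambda x: 0.5 * np.dot(x, x)
x = np.array([3.0, -4.0])
g = x                                                # gradient of 0.5*||x||^2
h = quadratic_step_size(f, x, g)                     # equals 1.0 here
x_new = x - h * g                                    # lands at the minimum for this toy loss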
Pages
47-57
Physical description
Bibliography: 18 items, figures
Authors
  • Faculty of Physics, Mathematics and Computer Science, Cracow University of Technology
  • Faculty of Mathematics and Computer Science, Jagiellonian University
author
  • Faculty of Mathematics and Computer Science, Jagiellonian University
Bibliography
  • [1] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
  • [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.
  • [3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.
  • [4] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [5] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [6] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730-3738, 2015.
  • [7] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pages 142-150. Association for Computational Linguistics, 2011.
  • [8] Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems, pages 181-189, 2015.
  • [9] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532-1543, 2014.
  • [10] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145-151, 1999.
  • [11] Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In Advances in Neural Information Processing Systems, pages 6434-6444, 2018.
  • [12] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [13] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26-31, 2012.
  • [14] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
  • [15] Xiaoxia Wu, Rachel Ward, and Léon Bottou. Wngrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
  • [16] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [17] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770, 2018.
  • [18] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Notes
Record developed under agreement 509/P-DUN/2018 from funds of the Ministry of Science and Higher Education (MNiSW) earmarked for science-dissemination activities (2019).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-af0dbcb0-61a6-40ec-a9b6-9ecf77ec57a8