A Simple and Efficient Tensor Calculus for Machine Learning

Laue, Sören; Mitterreiter, Matthias; Giesen, Joachim

doi:10.3233/FI-2020-1984

Artykuł - szczegóły

Tytuł artykułu

A Simple and Efficient Tensor Calculus for Machine Learning

Autorzy

Laue Sören , Mitterreiter Matthias , Giesen Joachim

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

DOI

10.3233/FI-2020-1984

Warianty tytułu

Języki publikacji

Abstrakty

Computing derivatives of tensor expressions, also known as tensor calculus, is a fundamental task in machine learning. A key concern is the efficiency of evaluating the expressions and their derivatives that hinges on the representation of these expressions. Recently, an algorithm for computing higher order derivatives of tensor expressions like Jacobians or Hessians has been introduced that is a few orders of magnitude faster than previous state-of-the-art approaches. Unfortunately, the approach is based on Ricci notation and hence cannot be incorporated into automatic differentiation frameworks from deep learning like TensorFlow, PyTorch, autograd, or JAX that use the simpler Einstein notation. This leaves two options, to either change the underlying tensor representation in these frameworks or to develop a new, provably correct algorithm based on Einstein notation. Obviously, the first option is impractical. Hence, we pursue the second option. Here, we show that using Ricci notation is not necessary for an efficient tensor calculus and develop an equally efficient method for the simpler Einstein notation. It turns out that turning to Einstein notation enables further improvements that lead to even better efficiency. The methods that are described in this paper for computing derivatives of matrix and tensor expressions have been implemented in the online tool www.MatrixCalculus.org.

Słowa kluczowe

tensor calculus matrix calculus algorithmic differentiation tensor notation

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2020

Tom

Vol. 177, nr 2

Strony

157--179

Opis fizyczny

Bibliogr. 50 poz., rys., tab., wykr.

Twórcy

autor

Laue Sören

soeren.laue@uni-jena.de

Friedrich-Schiller-Universität Jena, Faculty of Mathematics and Computer Science, Ernst-Abbe-Platz 2, 07743 Jena, Germany

autor

Mitterreiter Matthias

matthias.mitterreiter@uni-jena.de

Friedrich-Schiller-Universität Jena, Faculty of Mathematics and Computer Science, Ernst-Abbe-Platz 2, 07743 Jena, Germany

autor

Giesen Joachim

joachim.giesen@uni-jena.de

Friedrich-Schiller-Universität Jena, Faculty of Mathematics and Computer Science, Ernst-Abbe-Platz 2, 07743 Jena, Germany

Bibliografia

[1] Abadi M, et al. TensorFlow: A System for Large-scale Machine Learning. In: USENIX Conference on Operating Systems Design and Implementation (OSDI). USENIX Association. 2016 pp. 265-283. ISBN 978-1-931971-33-1.
[2] Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. In: NIPS Autodiff workshop. 2017.
[3] Maclaurin D, Duvenaud D, Adams RP. Autograd: Effortless gradients in numpy. In: ICML AutoML workshop. 2015.
[4] Frostig R, Johnson MJ, Leary C. Compiling Machine Learning Programs via High-level Tracing. In: Conference on Systems and Machine Learning (SysML). 2018. ID:4625928.
[5] Laue S, Mitterreiter M, Giesen J. Computing Higher Order Derivatives of Matrix and Tensor Expressions. In: Advances in Neural Information Processing Systems (NeurIPS). 2018. http://papers.nips.cc/paper/7540-computing-higher-order-derivatives-of-matrix-and-tensor-expressions.pdf.
[6] XLA and TensorFlow teams. XLA TensorFlow, compiled. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html, 2017.
[7] Griewank A, Walther A. Evaluating derivatives - principles and techniques of algorithmic differentiation (2. ed.). SIAM, 2008. ISBN:9780898716597.
[8] Baydin AG, Pearlmutter BA, Radul AA, Siskind JM. Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 2018. 18(153):1-43. arXiv:1502.05767 [cs.SC].
[9] Pearlmutter BA. Fast Exact Multiplication by the Hessian. Neural Computation, 1994. 6(1):147-160.
[10] Gebremedhin AH, Tarafdar A, Pothen A, Walther A. Efficient Computation of Sparse Hessians Using Coloring and Automatic Differentiation. INFORMS Journal on Computing, 2009. 21(2):209-223. doi:10.1287/ijoc.1080.0286.
[11] Magnus JR, Neudecker H. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, third edition, 2007. ISBN: 978-1-119-54120-2.
[12] Giles MB. Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In: Advances in Automatic Differentiation. Springer Berlin Heidelberg, 2008 pp. 35-44. doi:10.1007/978-3-540-68942-3_4.
[13] Seeger MW, Hetzel A, Dai Z, Lawrence ND. Auto-Differentiating Linear Algebra. In: NIPS Autodiff workshop. 2017. arXiv:1710.08717 [cs.MS].
[14] Kakade SM, Lee JD. Provably Correct Automatic Sub-Differentiation for Qualified Programs. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 7125-7135. 2018. arXiv:1809.08530[math.OC].
[15] LeCun Y. Deep Learning est mort. Vive Differentiable Programming! https://www.facebook.com/yann.lecun/posts/10155003011462143.
[16] van Merriënboer B, Moldovan D, Wiltschko A. Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming. In: Advances in Neural Information Processing Systems (NeurIPS). 2018. arXiv:1809.09569 [cs.LG].
[17] van Merriënboer B, Breuleux O, Bergeron A, Lamblin P. Automatic differentiation in ML: Where we are and where we should be going. In: Advances in Neural Information Processing Systems (NeurIPS). 2018. arXiv:1810.11530 [cs.LG].
[18] Wang F, Decker J, Wu X, Essertel G, Rompf T. Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming. In: Advances in Neural Information Processing Systems (NeurIPS). 2018. URL http://papers.nips.cc/paper/8221-backpropagation-with-callbacks-foundations-for-efficient-and-expressive-differentiable-programming.
[19] Dey P, Nag K, Pal T, Pal NR. Regularizing Multilayer Perceptron for Robustness. IEEE Trans. Systems, Man, and Cybernetics: Systems, 2018. 48(8):1255-1266. doi:10.1109/TSMC.2017.2664143.
[20] Ghorbani B, Krishnan S, Xiao Y. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In: International Conference on Machine Learning (ICML). 2019. URL http://proceedings.mlr.press/v97/ghorbani19b.html.
[21] Sagun L, Evci U, Güney VU, Dauphin Y, Bottou L. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. In: ICLR (Workshop). 2018. arXiv:1706.04454 [cs.LG].
[22] Yao Z, Gholami A, Keutzer K, Mahoney MW. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries. In: Advances in Neural Information Processing Systems (NeurIPS). 2018. arXiv:1802.08241 [cs.CV].
[23] Vasilache N, Zinenko O, Theodoridis T, Goyal P, DeVito Z, Moses WS, Verdoolaege S, Adams A, Cohen A. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. 2018. arXiv preprint arXiv:1802.04730.
[24] Naumann U. Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph. Math. Program., 2004. 99(3):399-421. doi:10.1007/s10107-003-0456-9.
[25] Bischof CH, Hovland PD, Norris B. Implementation of automatic differentiation tools. In: ACM SIGPLAN Notices, volume 37. ACM, 2002 pp. 98-107. doi:10.1145/503032.503047.
[26] Naumann U. Optimal Jacobian accumulation is NP-complete. Math. Program., 2008. 112(2):427-441. doi:10.1007/s10107-006-0042-z.
[27] Koren Y, Bell R, Volinsky C. Matrix Factorization Techniques for Recommender Systems. Computer, 2009. 42(8):30-37. doi:10.1109/MC.2009.263.
[28] Cox DR. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society. Series B (Methodological), 1958. 20(2):215-242.
[29] Broomhead D, Lowe D. Multivariable Functional Interpolation and Adaptive Networks. Complex Systems, 1988. 2:321-355.
[30] Schölkopf B, Smola AJ. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning series. MIT Press, 2002. ISBN-10:0262194759, 13:978-0262194754.
[31] Rahimi A, Recht B. Random Features for Large-Scale Kernel Machines. In: Advances in Neural Information Processing Systems (NIPS). 2007 pp. 1177-1184.
[32] Hinton GE, Osindero S, Teh YW. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 2006. 18(7):1527-1554. doi:10.1162/neco.2006.18.7.1527.
[33] Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003. 3:993-1022.
[34] Hofmann T. Probabilistic Latent Semantic Analysis. In: Conference on Uncertainty in Artificial Intelligence (UAI). 1999 pp. 289-296.
[35] Lee W, Yu H, Rival X, Yang H. On Correctness of Automatic Differentiation for Non-Differentiable Functions. arXiv e-prints, 2020. abs/2006.06903.
[36] XLA: Optimizing Compiler for TensorFlow. https://www.tensorflow.org/xla, 2019.
[37] Laue S, Mitterreiter M, Giesen J. MatrixCalculus.org - Computing Derivatives of Matrix and Tensor Expressions. In: European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in Databases (ECML-PKDD). 2019 pp. 769-772. doi:10.1007/978-3-030-46133-1_48.
[38] Laue S. On the Equivalence of Forward Mode Automatic Differentiation and Symbolic Differentiation. arXiv e-prints, 2019. abs/1904.02990.
[39] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, 2016. abs/1605.02688.
[40] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature, 1986. 323(6088):533-536. doi:10.1038/323533a0.
[41] Johnson M, Duvenaud DK, Wiltschko A, Adams RP, Datta SR. Composing graphical models with neural networks for structured representations and fast inference. In: Advances in Neural Information Processing Systems (NIPS). 2016. arXiv:1603.06277 [stat.ML].
[42] Srinivasan A, Todorov E. Graphical Newton. arXiv e-prints, 2015. abs/1508.00952.
[43] Sra S, Nowozin S, Wright SJ. Optimization for Machine Learning. The MIT Press, 2011. ISBN: 9780262016469.
[44] Ricci G, Levi-Civita T. Méthodes de calcul différentiel absolu et leurs applications. Mathematische Annalen, 1900. 54(1-2):125-201. doi:10.1007/BF01454201.
[45] Walther A, Griewank A. Getting started with ADOL-C. In: Combinatorial Scientific Computing, pp.181-202. Chapman-Hall CRC Computational Science, 2012.
[46] Hascoët L, Pascual V. The Tapenade Automatic Differentiation tool: Principles, Model, and Specification. ACM Transactions on Mathematical Software, 2013. 39(3):20:1-20:43. doi:10.1145/2450153.2450158.
[47] Cortes C, Vapnik V. Support-Vector Networks. Machine Learning, 1995. 20(3):273-297. doi:10.1023/A:1022627411411.
[48] Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996. 58(1):267-288. URL https://www.jstor.org/stable/2346178.
[49] Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. The MIT Press, 2005. ISBN:9780262182539.
[50] Laue S, Mitterreiter M, Giesen J. GENO - GENeric Optimization for Classical Machine Learning. In: Advances in Neural Information Processing Systems (NeurIPS). 2019. arXiv:1905.13587 [cs.LG].

Uwagi

Opracowanie rekordu ze środków MNiSW, umowa Nr 461252 w ramach programu "Społeczna odpowiedzialność nauki" - moduł: Popularyzacja nauki i promocja sportu (2021).

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-0a930acf-0af2-47b7-8dfc-73871b2f655f