Optimizing the structures of transformer neural networks using parallel simulated annealing

Trzciński, Maciej; Łukasik, Szymon; Gandomi, Amir H.

doi:10.2478/jaiscr-2024-0015

Artykuł - szczegóły

Tytuł artykułu

Optimizing the structures of transformer neural networks using parallel simulated annealing

Autorzy

Trzciński Maciej , Łukasik Szymon , Gandomi Amir H.

Treść / Zawartość

Pełne teksty:

trzcinski_Optimizing-the-Structures-of-Transformer.pdf

Pobierz

Identyfikatory

DOI

10.2478/jaiscr-2024-0015

Warianty tytułu

Języki publikacji

Abstrakty

The Transformer is an important addition to the rapidly increasing list of different Artificial Neural Networks (ANNs) suited for extremely complex automation tasks. It has already gained the position of the tool of choice in automatic translation in many business solutions. In this paper, we present an automated approach to optimizing the Transformer structure based upon Simulated Annealing, an algorithm widely recognized for both its simplicity and usability in optimization tasks where the search space may be highly complex. The proposed method allows for the use of parallel computing and time-efficient optimization, thanks to modifying the structure while training the network rather than performing the two one after another. The algorithm presented does not reset the weights after changes in the transformer structure. Instead, it continues the training process to allow the results to be adapted without randomizing all the training parameters. The algorithm has shown a promising performance during experiments compared to traditional training methods without structural modifications. The solution has been released as open-source to facilitate further development and use by the machine learning community.

Słowa kluczowe

neural architecture search transformers natural language processing simulated annealing

Wydawca

University of Social Sciences

Czasopismo

Journal of Artificial Intelligence and Soft Computing Research

Rocznik

2024

Tom

Vol. 14, No. 3

Strony

267--282

Opis fizyczny

Bibliogr. 75 poz., rys.

Twórcy

autor

Trzciński Maciej

AGH University of Krakow, Faculty of Physics and Applied Computer Science, Al. Mickiewicza 30, Krakow 30-059, Poland
NASK National Research Institute, ul. Kolska 12, Warsaw 01-045, Poland

https://orcid.org/0000-0003-4139-4969

autor

Łukasik Szymon

slukasik@agh.edu.pl

AGH University of Krakow, Faculty of Physics and Applied Computer Science, Al. Mickiewicza 30, Krakow 30-059, Poland
NASK National Research Institute, ul. Kolska 12, Warsaw 01-045, Poland
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, Warsaw 01-447, Poland

https://orcid.org/0000-0001-6716-610X

autor

Gandomi Amir H.

University of Technology Sydney, Faculty of Engineering and Information Technology, Ultimo, Sydney, NSW 2007, Australia
5University Research and Innovation Center (EKIK), Óbuda University, Bécsiút 96/B, Budapest 1034, Hungary

https://orcid.org/0000-0002-2798-0104

Bibliografia

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, https://arxiv.org/abs/1706.03762 (2017). https://doi.org/10.48550/ARXIV.1706.03762.
[2] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, Neural speech synthesis with transformer network, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 33, 2019, pp. 6706–6713.
[3] P. Morris, R. St. Clair, W. E. Hahn, E. Barenholtz, Predicting binding from screening assays with transformer network embeddings, Journal of Chemical Information and Modeling 60 (9) (2020) 4191–4199.
[4] F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S. Khan, H. Fu, Transformers in medical imaging: A survey, https://arxiv.org/abs/2201.09873 (2022). https://doi.org/10.48550/ARXIV.2201.09873.
[5] T. Lin, Y. Wang, X. Liu, X. Qiu, A survey of transformers (2021). https://doi.org/10.48550/ARXIV.2106.04554.
[6] M. Zhang, J. Li, A commentary of gpt-3 in mit technology review 2021, Fundamental Research 1 (6) (2021) 831–833.
[7] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, Ł. Kaiser, Universal transformers, arXiv preprint arXiv:1807.03819 (2018).
[8] M. Feurer, F. Hutter, Hyperparameter optimization, in: Automated machine learning, Springer, Cham, 2019, pp. 3–33.
[9] J. Wu, X.-Y. Chen, H. Zhang, L.-D. Xiong, H. Lei, S.-H. Deng, Hyperparameter optimization for machine learning models based on bayesian optimization, Journal of Electronic Science and Technology 17 (1) (2019) 26–40.
[10] R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, I. Guyon, Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020, in: NeurIPS 2020 Competition and Demonstration Track, PMLR, 2021, pp. 3–26.
[11] R. G. Mantovani, A. L. Rossi, J. Vanschoren, B. Bischl, A. C. De Carvalho, Effectiveness of random search in svm hyper-parameter tuning, in: 2015 International Joint Conference on Neural Networks (IJCNN), Ieee, 2015, pp. 1–8.
[12] X. He, K. Zhao, X. Chu, Automl: A survey of the state-of-the-art, Knowledge-Based Systems 212 (2021) 106622.
[13] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., A survey on vision transformer, IEEE transactions on pattern analysis and machine intelligence 45 (1) (2022) 87–110.
[14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[15] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever, Generative pretraining from pixels, in: International conference on machine learning, PMLR, 2020, pp. 1691–1703.
[16] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, D. Tran, Image transformer, in: International conference on machine learning, PMLR, 2018, pp. 4055–4064.
[17] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, L. Sun, Transformers in time series: A survey, arXiv preprint arXiv:2202.07125 (2022).
[18] E. Dogo, O. Afolabi, N. Nwulu, B. Twala, C. Aigbavboa, A comparative analysis of gradient descent-based optimization algorithms on convolutional neural networks, in: 2018 international conference on computational techniques, electronics and mechanical systems (CTEMS), IEEE, 2018, pp. 92–99.
[19] S. J. Reddi, S. Kale, S. Kumar, On the convergence of adam and beyond, arXiv preprint arXiv:1904.09237 (2019).
[20] Y. Liu, Y. Sun, B. Xue, M. Zhang, G. G. Yen, K. C. Tan, A survey on evolutionary neural architecture search, IEEE transactions on neural networks and learning systems (2021).
[21] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, Q. Le, Understanding and simplifying one-shot architecture search, in: International conference on machine learning, PMLR, 2018, pp. 550–559.
[22] J. Fang, Y. Sun, Q. Zhang, Y. Li, W. Liu, X. Wang, Densely connected search space for more flexible neural architecture search, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10628–10637.
[23] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, X. Wang, A comprehensive survey of neural architecture search: Challenges and solutions, ACM Computing Surveys (CSUR) 54 (4) (2021) 1–34.
[24] K. Murray, J. Kinnison, T. Q. Nguyen, W. Scheirer, D. Chiang, Auto-sizing the transformer network: Improving speed, efficiency, and performance for low-resource machine translation, https://arxiv.org/abs/1910.06717 (2019), http://arxiv.org/abs/1910.06717.
[25] M. Baldeon-Calisto, S. K. Lai-Yuen, Adaresu-net: Multiobjective adaptive convolutional neural network for medical image segmentation, Neurocomputing 392 (2020) 325–340.
[26] K. Chen, W. Pang, Immunetnas: An immune-network approach for searching convolutional neural network architectures, arXiv preprint arXiv:2002.12704 (2020).
[27] M. Wistuba, A. Rawat, T. Pedapati, A survey on neural architecture search, arXiv preprint arXiv:1905.01392 (2019).
[28] J. G. Robles, J. Vanschoren, Learning to reinforcement learn for neural architecture search, arXiv preprint arXiv:1911.03769 (2019).
[29] A. Vahdat, A. Mallya, M.-Y. Liu, J. Kautz, Unas: Differentiable architecture search meets reinforcement learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11266–11275.
[30] K. De Jong, Evolutionary computation: a unified approach, in: Proceedings of the 2016 on genetic and evolutionary computation conference companion, 2016, pp. 185–199.
[31] S. Gibb, H. M. La, S. Louis, A genetic algorithm for convolutional network structure optimization for concrete crack detection, in: 2018 IEEE congress on evolutionary computation (CEC), IEEE, 2018, pp. 1–8.
[32] L. Xie, A. Yuille, Genetic cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 1379–1388.
[33] F. Ye, Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data, PloS one 12 (12) (2017) e0188746.
[34] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, E. Xing, https://arxiv.org/abs/1802.07191, Neural architecture search with bayesian optimisation and optimal transport (2018), https://doi.org/10.48550/ARXIV.1802.07191, https://arxiv.org/abs/1802.07191.
[35] F. P. Casale, J. Gordon, N. Fusi, Probabilistic neural architecture search, arXiv preprint arXiv:1902.05116 (2019).
[36] H. K. Singh, T. Ray, W. Smith, Surrogate assisted simulated annealing (SASA) for constrained multi-objective optimization, in: IEEE congress on evolutionary computation, IEEE, 2010, pp. 1–8.
[37] K. T. Chitty-Venkata, M. Emani, V. Vishwanath, A. K. Somani, Neural architecture search for transformers: A survey, IEEE Access 10 (2022) 108374–108412, https://doi.org/10.1109/ACCESS.2022.3212767.
[38] J. Kim, J. Wang, S. Kim, Y. Lee, Evolved speech-transformer: Applying neural architecture search to end-to-end automatic speech recognition., in: Interspeech, 2020, pp. 1788–1792.
[39] D. Cummings, A. Sarah, S. N. Sridhar, M. Szankin, J. P. Munoz, S. Sundaresan, A hardware-aware framework for accelerating neural architecture search across modalities (2022), http://arxiv.org/abs/2205.10358.
[40] D. So, Q. Le, C. Liang, The evolved transformer, in: International conference on machine learning, PMLR, 2019, pp. 5877–5886.
[41] D. So, W. Mańke, H. Liu, Z. Dai, N. Shazeer, Q. V. Le, Searching for efficient transformers for language modeling, Advances in neural information processing systems 34 (2021) 6010–6022.
[42] V. Cahlik, P. Kordik, M. Cepek, Adapting the size of artificial neural networks using dynamic autosizing, in: 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT), IEEE, 2022, pp. 592–596.
[43] B. Goldberger, G. Katz, Y. Adi, J. Keshet, Minimal modifications of deep neural networks using verification., in: LPAR, Vol. 2020, 2020, p. 23.
[44] T. Chen, I. Goodfellow, J. Shlens, Net2net: Accelerating learning via knowledge transfer, arXiv preprint arXiv:1511.05641 (2015).
[45] T. Wei, C. Wang, Y. Rui, C. W. Chen, Network morphism, in: International conference on machine learning, PMLR, 2016, pp. 564–572.
[46] H. Jin, Q. Song, X. Hu, Auto-keras: An efficient neural architecture search system, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 1946–1956.
[47] T. Elsken, J. H. Metzen, F. Hutter, https://openreview.net/forum?id=ByME42AqK7 Efficient multi-objective neural architecture search via lamarckian evolution (2019), https://openreview.net/forum?id=ByME42AqK7
[48] P. J. Van Laarhoven, E. H. Aarts, Simulated annealing, in: Simulated annealing: Theory and applications, Springer, 1987, pp. 7–15.
[49] D. J. Ram, T. Sreenivas, K. G. Subramaniam, Parallel simulated annealing algorithms, Journal of parallel and distributed computing 37 (2) (1996) 207–212, https://doi.org/https://doi.org/10.1006/jpdc.1996.0121.
[50] N. Metropolis, A. W. Rosenbluth, M. N. Rosen-bluth, A. H. Teller, E. Teller, Equation of state calculations by fast computing machines, The journal of chemical physics 21 (6) (1953) 1087–1092.
[51] D. Delahaye, S. Chaimatanan, M. Mongeau, Simulated annealing: From basics to applications, Handbook of metaheuristics (2019) 1–35.
[52] Z. Xinchao, Simulated annealing algorithm with adaptive neighborhood, Applied Soft Computing 11 (2) (2011) 1827–1836.
[53] Z. Michalewicz, Z. Michalewicz, GAs: Why Do They Work?, Springer, 1996.
[54] S.-H. Zhan, J. Lin, Z.-J. Zhang, Y.-W. Zhong, List-based simulated annealing algorithm for traveling salesman problem, Computational intelligence and neuroscience 2016 (2016).
[55] M. E. Aydin, V. Yigit, 12 parallel simulated annealing, Parallel Metaheuristics: A new Class of Algorithms (2005) 267.
[56] L. Ozdamar, Parallel simulated annealing algorithms in global optimization, Journal of global optimization 19 (2001) 27–50.
[57] G. P. Babu, M. N. Murty, Simulated annealing for selecting optimal initial seeds in the k-means algorithm, Indian Journal of Pure and Applied Mathematics 25 (1-2) (1994) 85–94.
[58] R. S. Sexton, R. E. Dorsey, J. D. Johnson, Optimization of neural networks: A comparative analysis of the genetic algorithm and simulated annealing, European Journal of Operational Research 114 (3) (1999) 589–601.
[59] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[60] S. Mo, J. Xia, P. Ren, Simulated annealing for neural architecture search, Advances in Neural Information Processing Systems (NeurIPS) (2021).
[61] C.-W. Tsai, C.-H. Hsia, S.-J. Yang, S.-J. Liu, Z.-Y. Fang, Optimizing hyperparameters of deep learning in predicting bus passengers based on simulated annealing, Applied soft computing 88 (2020) 106068.
[62] H.-K. Park, J.-H. Lee, J. Lee, S.-K. Kim, Optimizing machine learning models for granular ndfeb magnets by very fast simulated annealing, Scientific Reports 11 (1) (2021) 3792.
[63] L. Ingber, Very fast simulated re-annealing, Mathematical and computer modelling 12 (8) (1989) 967–973.
[64] M. Fischetti, M. Stringher, https://arxiv.org/abs/1906.01504, Embedded hyper-parameter tuning by simulated annealing (2019), https://doi.org/10.48550/ARXIV.1906.01504, https://arxiv.org/abs/1906.01504.
[65] M. Chen, H. Peng, J. Fu, H. Ling, Autoformer: Searching transformers for visual recognition, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12270–12280.
[66] A. Cutkosky, H. Mehta, Momentum improves normalized sgd, in: International conference on machine learning, PMLR, 2020, pp. 2260–2268.
[67] M. Trzciński, Optimizing the Structures of Transformer Neural Networks using Parallel Simulated Annealing (3 2023).
[68] F. Guzmán, P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, M. Ranzato, Two new evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english, 2019.
[69] P. Koehn, https://aclanthology.org/2005.mtsummit-papers.11, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79–86. https://aclanthology.org/2005.mtsummit-papers.11.
[70] J. Tiedemann, Parallel data, tools and interfaces in opus., in: Lrec, Vol. 2012, Citeseer, 2012, pp. 2214–2218.
[71] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[72] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
[73] M. Post, A call for clarity in reporting bleu scores, arXiv preprint arXiv:1804.08771 (2018).
[74] D. Coughlin, Correlating automated and human assessments of machine translation quality, in: Proceedings of Machine Translation Summit IX: Papers, 2003.
[75] K. Ganesan, Rouge 2.0: Updated and improved measures for evaluation of summarization tasks, arXiv preprint arXiv:1803.01937 (2018).

Uwagi

Opracowanie rekordu ze środków MNiSW, umowa nr POPUL/SP/0154/2024/02 w ramach programu "Społeczna odpowiedzialność nauki II" - moduł: Popularyzacja nauki (2025).

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-8f95a6ba-7308-4864-afdf-afada8c9b009