2024 | Vol. 70, No. 2 | 342-348
Article title

Procedurally generated AI compound media for expanding audial creations, broadening immersion and perception experience

Content
Title variants
Languages of publication
EN
Abstracts
EN
Recently, the world has gained access to increasingly advanced artificial intelligence tools. This phenomenon does not bypass the worlds of sound and visual art, and both can benefit in as yet unexplored ways that draw them closer to one another. Recent breakthroughs open up the possibility of using AI-driven tools to create generative art and to use it as a component of other multimedia. The aim of this paper is to present an original concept of using AI to create visual companion material for an existing audio source. This broadens accessibility by appealing to additional human senses through the source medium and expanding its initial form. The research uses a novel method of enhancing the fundamental material, consisting of a text or audio source (script) and a sound layer (audio play), by adding an extra layer of multimedia experience, a procedurally generated visual one. A set of images generated by AI tools forms a storytelling animation, offering a new way to immerse oneself in the experience of sound perception and to focus on the initial audial material. The main idea of the paper is a pipeline, a blueprint for the process of procedural image generation based on the source context (audial or textual) transformed into text prompts, together with tools that automate it through a programmed set of code instructions. This process allows the creation of coherent and cohesive (to a certain extent) visual cues that accompany the audial experience, elevating it to a multimodal piece of art. Using today's technologies, creators can procedurally enhance audial forms by providing them with visual context. The paper discusses the current possibilities, use cases, limitations and biases of the presented tools and solutions.
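The blueprint described in the abstract can be illustrated with a minimal sketch in Python: a script (or a transcript of the audio play) is split into scenes, each scene is turned into a text prompt, and a text-to-image diffusion model renders one frame per scene. Everything concrete below is an illustrative assumption rather than the authors' actual configuration: the paragraph-based scene segmentation, the style suffix, the runwayml/stable-diffusion-v1-5 checkpoint loaded through the Hugging Face diffusers library, and the file names.

# Minimal sketch of a procedural script-to-storyboard pipeline (illustrative only).
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

# Assumed style hint appended to every prompt to keep frames visually consistent.
STYLE_SUFFIX = "cinematic lighting, storybook illustration"


def script_to_prompts(script_text: str) -> list[str]:
    # Naive scene segmentation: one prompt per non-empty paragraph of the script.
    scenes = [p.strip() for p in script_text.split("\n\n") if p.strip()]
    return [f"{scene}, {STYLE_SUFFIX}" for scene in scenes]


def render_storyboard(script_path: str, out_dir: str = "frames") -> None:
    # Any text-to-image checkpoint works here; this one is only an example.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    Path(out_dir).mkdir(exist_ok=True)
    prompts = script_to_prompts(Path(script_path).read_text(encoding="utf-8"))

    for i, prompt in enumerate(prompts):
        # A fixed seed per scene keeps the generated frames reproducible between runs.
        generator = torch.Generator("cuda").manual_seed(i)
        image = pipe(prompt, generator=generator).images[0]
        image.save(Path(out_dir) / f"scene_{i:03d}.png")


if __name__ == "__main__":
    render_storyboard("audio_play_script.txt")  # hypothetical input file

The resulting frames can be assembled into a slideshow synchronized with the audio track using any video tool; the pipeline presented in the paper presumably relies on richer, context-aware prompt construction to keep consecutive scenes coherent.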
Publisher

Year
Pages
342-348
Physical description
Bibliography: 44 items, figures
Authors
Bibliography
  • [1] S. Bubeck, V. Chandrasekaran, R. Eldan, J. A. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. Lee, Y.-F. Li, S. M. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2303.12712.
  • [2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” Computer Vision and Pattern Recognition, 2021. [Online]. Available: https://doi.org/10.1109/cvpr52688.2022.01042.
  • [3] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” Neural Information Processing Systems, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2205.11487.
  • [4] C. Gao, J. J. Green, X. Yang, S. Oh, J. Kim, and S. V. Shinkareva, “Audiovisual integration in the human brain: a coordinate-based meta-analysis,” Cerebral Cortex, vol. 33, no. 9, pp. 5574-5584, 11 2022. [Online]. Available: https://doi.org/10.1093/cercor/bhac443.
  • [5] H. B. Lima, C. G. R. Dos Santos, and B. S. Meiguins, “A survey of music visualization techniques,” ACM Computing Surveys, 2021. [Online]. Available: https://doi.org/10.1145/3461835.
  • [6] M. Tiihonen, E. Brattico, J. Maksimainen, J. Wikgren, and S. Saarikallio, “Constituents of music and visual-art related pleasure - a critical integrative literature review,” Frontiers in Psychology, 2017. [Online]. Available: https://doi.org/10.3389/fpsyg.2017.01218.
  • [7] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications, 1st ed. Springer Publishing Company, Incorporated, 2015. [Online]. Available: https://doi.org/10.1007/978-3-319-21945-5.
  • [8] S. Latif, H. Cuayáhuitl, F. Pervez, F. Shamshad, H. S. Ali, and E. Cambria, “A survey on deep reinforcement learning for audio-based applications,” Artificial Intelligence Review, vol. 56, no. 3, pp. 2193-2240, 2023. [Online]. Available: https://doi.org/10.1007/s10462-022-10224-2.
  • [9] W. S. Peebles and S. Xie, “Scalable diffusion models with transformers,” arXiv.org, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2212.09748.
  • [10] S. Wu, T. Wu, F. Lin, S. Tian, and G. Guo, “Fully transformer networks for semantic image segmentation,” arXiv.org, 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2106.04108.
  • [11] L. Yang, Z. Zhang, and S. Hong, “Diffusion models: A comprehensive survey of methods and applications,” arXiv.org, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2209.00796.
  • [12] A. Ulhaq, N. Akhtar, and G. Pogrebna, “Efficient diffusion models for vision: A survey,” Cornell University - arXiv, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2210.09292.
  • [13] X. Pan, P. Qin, Y. Li, H. Xue, and W. Chen, “Synthesizing coherent story with auto-regressive latent diffusion models,” arXiv.org, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2211.10950.
  • [14] J. Zakraoui, M. Saleh, S. Al-Maadeed, and J. M. Alja’am, “A pipeline for story visualization from natural language,” Applied Sciences, 2023. [Online]. Available: https://doi.org/10.3390/app13085107.
  • [15] H. Chen, R. Han, T.-L. Wu, H. Nakayama, and N. Peng, “Character-centric story visualization via visual planning and token alignment,” Cornell University - arXiv, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2210.08465.
  • [16] Y.-Z. Song, Z. R. Tam, H.-J. Chen, H.-H. Lu, and H.-H. Shuai, “Character-preserving coherent story visualization,” European Conference on Computer Vision, 2020. [Online]. Available: https://doi.org/10.1007/978-3-030-58520-4_2.
  • [17] S. Chen, B. Liu, J. Fu, R. Song, Q. Jin, P. Lin, X. Qi, C. Wang, and J. Zhou, “Neural storyboard artist: Visualizing stories with coherent image sequences,” arXiv: Artificial Intelligence, 2019. [Online]. Available: https://doi.org/10.1145/3343031.3350571.
  • [18] A. Maharana, D. Hannan, and M. Bansal, “Storydall-e: Adapting pretrained text-to-image transformers for story continuation,” European Conference on Computer Vision, 2022. [Online]. Available: https://doi.org/10.48550/arxiv.2209.06192.
  • [19] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Neural Information Processing Systems, 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2105.05233.
  • [20] J. Zhou, X. Shen, J. Wang, J. Zhang, W. Sun, J. Zhang, S. Birchfield, D. Guo, L. Kong, M. Wang, and Y. Zhong, “Audio-visual segmentation with semantics,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2301.13190.
  • [21] G. Irie, M. Ostrek, H. Wang, H. Kameoka, A. Kimura, T. Kawanishi, and K. Kashino, “Seeing through sounds: Predicting visual semantic segmentation results from multichannel audio signals,” IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019. [Online]. Available: https://doi.org/10.1109/icassp.2019.8683142.
  • [22] C. Liu, P. Li, X. Qi, H. Zhang, L. Li, D. Wang, and X. Yu, “Audio-visual segmentation by exploring cross-modal mutual semantics,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.16620.
  • [23] G. Yariv, I. Gat, L. Wolf, Y. Adi, and I. Schwartz, “Audiotoken: Adaptation of text-conditioned diffusion models for audio-to-image generation,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2305.13050.
  • [24] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, and J.-R. Wen, “A survey of large language models,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2303.18223.
  • [25] T. Görne, “The emotional impact of sound: A short theory of film sound design,” 2019. [Online]. Available: https://doi.org/10.29007/jk8h.
  • [26] J. Z. Wang, S. Zhao, C. Wu, R. B. Adams, M. Newman, T. Shafir, and R. Tsachor, “Unlocking the emotional world of visual media: An overview of the science, research, and impact of understanding emotion,” Proceedings of the IEEE, 2023. [Online]. Available: https://doi.org/10.1109/jproc.2023.3273517.
  • [27] X. Wang, X. Li, Z. Yin, Y. Wu, and J. Liu, “Emotional intelligence of large language models,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2307.09042.
  • [28] S. C. Patel and J. Fan, “Identification and description of emotions by current large language models,” bioRxiv, 2023. [Online]. Available: https://doi.org/10.1101/2023.07.17.549421.
  • [29] Z. Akhtar and T. H. Falk, “Audio-visual multimedia quality assessment: A comprehensive survey,” IEEE Access, 2017. [Online]. Available: https://doi.org/10.1109/access.2017.2750918.
  • [30] A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, and S. Poria, “A review of deep learning techniques for speech processing,” Information Fusion, vol. 99, p. 101869, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253523001859
  • [31] J. Li, X. Zhang, C. Jia, J. Xu, L. Zhang, Y. Wang, S. Ma, and W. Gao, “Direct speech-to-image translation,” arXiv: Multimedia, 2020. [Online]. Available: https://doi.org/10.1109/jstsp.2020.2987417.
  • [32] G. Samson, “Multimodal media generation: Exploring pipeline of procedural visual context-dependent media layer creation,” Warsaw, p. 67, 2023, thesis (Engineering) - Polish-Japanese Academy of Information Technology, 2023. [Online]. Available: https://system-biblioteka.pja.edu.pl/Opac5/faces/Opis.jsp?ido=40788#
  • [33] J. Edwards, A. Perrone, and P. R. Doyle, “Transparency in language generation: Levels of automation,” CUI, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2006.06295.
  • [34] R. Adaval, G. Saluja, and Y. Jiang, “Seeing and thinking in pictures: A review of visual information processing,” Consumer Psychology Review, 2018. [Online]. Available: https://doi.org/10.1002/arcp.1049.
  • [35] P. Gholami and R. Xiao, “Diffusion brush: A latent diffusion model-based editing tool for ai-generated images,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2306.00219.
  • [36] P. Li, Q. Huang, Y. Ding, and Z. Li, “Layerdiffusion: Layered controlled image editing with diffusion models,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2305.18676.
  • [37] X. Zhang, W. Zhao, X. Lu, and J. Chien, “Text2layer: Layered image generation using latent diffusion model,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2307.09781.
  • [38] X. Ma, Y. Zhou, X. Xu, B. Sun, V. Filev, N. Orlov, Y. Fu, and H. Shi, “Towards layer-wise image vectorization,” Computer Vision and Pattern Recognition, 2022. [Online]. Available: https://doi.org/10.1109/cvpr52688.2022.01583.
  • [39] M. Dorkenwald, T. Milbich, A. Blattmann, R. Rombach, K. Derpanis, and B. Ommer, “Stochastic image-to-video synthesis using cinns,” Computer Vision and Pattern Recognition, 2021. [Online]. Available: https://doi.org/10.1109/cvpr46437.2021.00374.
  • [40] Y. Hu, C. Luo, and Z. Chen, “Make it move: Controllable image-to-video generation with text descriptions,” Computer Vision and Pattern Recognition, 2021. [Online]. Available: https://doi.org/10.1109/cvpr52688.2022.01768.
  • [41] M. Stypułkowski, K. Vougioukas, S. He, M. Zięba, S. Petridis, and M. Pantic, “Diffused heads: Diffusion models beat gans on talking-face generation,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2301.03396.
  • [42] L. Shen, X. Li, H. Sun, J. Peng, K. Xian, Z. Cao, and G.-S. Lin, “Make-it-4d: Synthesizing a consistent long-term dynamic scene video from a single image,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.1145/3581783.3612033.
  • [43] J. Wu, J. J. Y. Chung, and E. Adar, “Viz2viz: Prompt-driven stylized visualization generation using a diffusion model,” arXiv.org, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2304.01919.
  • [44] C. K. Praveen and K. Srinivasan, “Psychological impact and influence of animation on viewer’s visual attention and cognition: A systematic literature review, open challenges, and future research directions.” Computational and Mathematical Methods in Medicine, 2022. [Online]. Available: https://doi.org/10.1155/2022/8802542.
Document type
Identifiers
YADDA identifier
bwmeta1.element.baztech-cd9f285c-c0d5-476f-be1f-103c3f579c75