Article title

Unconditional Token Forcing: Extracting Text Hidden Within LLM

Identifiers
Title variants
Conference
Federated Conference on Computer Science and Information Systems (19 ; 08-11.09.2024 ; Belgrade, Serbia)
Publication languages
EN
Abstracts
EN
With the help of simple fine-tuning, one can artificially embed hidden text into large language models (LLMs). This text is revealed only when triggered by a specific query to the LLM. Two primary applications are LLM fingerprinting and steganography. In the context of LLM fingerprinting, a unique text identifier (fingerprint) is embedded within the model to verify licensing compliance. In the context of steganography, the LLM serves as a carrier for hidden messages that can be disclosed through a designated trigger. Our work demonstrates that although embedding hidden text in an LLM via fine-tuning may initially appear secure, owing to the vast number of possible triggers, it is susceptible to extraction through analysis of the LLM's output decoding process. We propose a novel extraction approach called Unconditional Token Forcing. It is premised on the hypothesis that iteratively feeding each token from the LLM's vocabulary into the model should reveal sequences with abnormally high token probabilities, indicating potential embedded text candidates. Additionally, our experiments show that when the first token of a hidden fingerprint is used as the input, the LLM not only produces an output sequence with high token probabilities, but also repeatedly generates the fingerprint itself. Code is available at github.com/jhoscilowic/zurek-stegano.
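The idea described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration of Unconditional Token Forcing, not the authors' implementation (which is available in the linked repository): it assumes the Hugging Face transformers API, greedy decoding, and an ad hoc threshold on the average log-probability of generated tokens; the model name and threshold value are placeholders.

```python
# Minimal sketch of Unconditional Token Forcing as described in the abstract.
# Assumptions (not taken from the paper's code): Hugging Face transformers,
# greedy decoding, and an arbitrary log-probability threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def force_token(token_id, max_new_tokens=16):
    """Feed a single vocabulary token (after BOS, i.e. without any prompt),
    greedily decode a continuation, and return the decoded text together with
    the average log-probability of the generated tokens."""
    input_ids = torch.tensor([[tokenizer.bos_token_id, token_id]], device=model.device)
    with torch.no_grad():
        out = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
    gen_ids = out.sequences[0, input_ids.shape[1]:]
    logprobs = [torch.log_softmax(score[0], dim=-1)[tid].item()
                for score, tid in zip(out.scores, gen_ids)]
    avg_logprob = sum(logprobs) / len(logprobs)
    return tokenizer.decode(out.sequences[0], skip_special_tokens=True), avg_logprob

# Iterate over the vocabulary and keep abnormally confident continuations
# as candidate hidden-text (fingerprint) sequences.
THRESHOLD = -0.05  # hypothetical cutoff on average log-probability
candidates = []
for token_id in range(len(tokenizer)):
    text, avg_lp = force_token(token_id)
    if avg_lp > THRESHOLD:
        candidates.append((avg_lp, text))

for avg_lp, text in sorted(candidates, reverse=True)[:20]:
    print(f"{avg_lp:.3f}  {text!r}")
```

In this sketch, a sequence whose tokens are generated with near-certain probability stands out from ordinary unconditional continuations, which is what flags it as a potential embedded fingerprint.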
Year
Volume
Pages
621–624
Physical description
Bibliography: 14 items, figures
Authors
  • Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, 00-665, Poland
  • Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, 00-665, Poland
  • Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, 00-665, Poland
  • Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, 00-665, Poland
  • Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, Warsaw, 00-665, Poland
Notes
1. Code is available at github.com/jhoscilowic/zurek-stegano
2. Thematic Sessions: Short Papers
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-c5278ccf-b80a-46ad-bb99-982f324ce5fa