Article title
Content
Full texts:
Identifiers
Title variants
Publication languages
Abstracts
This paper evaluates the feasibility of deploying locally run Large Language Models (LLMs) for retrieval-augmented question answering (RAG-QA) over internal knowledge bases in small and medium enterprises (SMEs), with a focus on Polish-language datasets. The study benchmarks eight popular open-source and source-available LLMs, including Google's Gemma-9B and SpeakLeash's Bielik-11B, assessing their performance across closed, open, and detailed question types, with metrics for language quality, factual accuracy, response stability, and processing efficiency. The results highlight that desktop-class LLMs, though limited in factual accuracy (with top scores of 45% and 43% for Gemma and Bielik, respectively), hold promise for early-stage enterprise implementations. Key findings include Bielik's superior performance on open-ended and detailed questions and Gemma's efficiency and reliability in closed-type queries. Distribution analyses revealed variability in model outputs, with Bielik and Gemma showing the most stable response distributions. This research underscores the potential of offline-capable LLMs as cost-effective tools for secure knowledge management in Polish SMEs.
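The pipeline the abstract describes (a desktop-class local model answering questions grounded in retrieved knowledge-base passages) can be illustrated with a short sketch. This is an assumption-laden illustration, not the paper's code: it presumes an Ollama server on localhost:11434 serving a quantized model tagged gemma2:9b, and it substitutes plain bag-of-words cosine retrieval for whatever retriever the study actually used.

```python
# Minimal local RAG-QA sketch. Assumptions (not from the paper): Ollama running
# locally with the "gemma2:9b" model pulled; toy bag-of-words retrieval.
import json
import math
import re
import urllib.request
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts for a text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k knowledge-base chunks most similar to the question."""
    q = tokenize(question)
    return sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)[:k]

def ask(question: str, chunks: list[str], model: str = "gemma2:9b") -> str:
    """Build a context-grounded prompt and query the local model via Ollama."""
    context = "\n---\n".join(retrieve(question, chunks))
    prompt = (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    kb = [
        "Employees submit leave requests through the HR portal at least 5 days ahead.",
        "The office VPN uses certificate-based authentication, renewed yearly.",
        "Expense reports above 500 PLN require a director's approval.",
    ]
    print(ask("Who must approve an expense report above 500 PLN?", kb))
```

Swapping the model tag (e.g. for a quantized Bielik-11B build) is the only change needed to compare models in this setup, which is what makes offline, desktop-class deployments of this kind attractive for SMEs.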
Journal
Year
Volume
Pages
175-191
Physical description
Bibliography: 32 items, figures, tables.
Authors
author
- Lublin University of Technology, Faculty of Electrical Engineering and Computer Science, Department of Computer Science
author
- Lublin University of Technology, Faculty of Electrical Engineering and Computer Science, Department of Computer Science
author
Bibliography
- [1] Ahmed, T., Bird, C., Devanbu, P., & Chakraborty, S. (2024). Studying LLM performance on closed- and open-source data. ArXiv, abs/2402.15100. https://doi.org/10.48550/arXiv.2402.15100
- [2] Aydogan-Kilic, D., Kilic, D. K., & Nielsen, I. E. (2024). Examination of summarized medical records for ICD code classification via BERT. Applied Computer Science, 20(2), 60-74. https://doi.org/10.35784/acs-2024-16
- [3] B, G., & Purwar, A. (2024). Evaluating the efficacy of open-source LLMs in enterprise-specific RAG systems: A comparative study of performance and scalability. ArXiv, abs/2406.11424. https://doi.org/10.48550/arXiv.2406.11424
- [4] Bonatti, R., Zhao, D., Bonacci, F., Dupont, D., Abdali, S., Li, Y., Lu, Y., Wagle, J., Koishida, K., Bucker, A., Jang, L., & Hui, Z. (2024). Windows agent arena: Evaluating multi-modal OS agents at scale. ArXiv, abs/2409.08264. https://doi.org/10.48550/arXiv.2409.08264
- [5] Bouhsaien, L., & Azmani, A. (2024). The potential of Artificial Intelligence in human resource management. Applied Computer Science, 20(3), 153-170. https://doi.org/10.35784/acs-2024-34
- [6] Cevallos Salas, F. A. (2024). Digital news classification and punctuation using Machine Learning and text mining techniques. Applied Computer Science, 20(2), 24-42. https://doi.org/10.35784/acs-2024-14
- [7] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., … Zaremba, W. (2021). Evaluating Large Language Models trained on code. ArXiv, abs/2107.03374. https://doi.org/10.48550/arXiv.2107.03374
- [8] Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large legal fictions: Profiling legal hallucinations in Large Language Models. Journal of Legal Analysis, 16(1), 64-93. https://doi.org/10.1093/jla/laae003
- [9] Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T. S., & Li, Q. (2024). A survey on RAG meeting LLMs: Towards retrieval-augmented Large Language Models. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '24) (pp. 6491-6501). Association for Computing Machinery. https://doi.org/10.1145/3637528.3671470
- [10] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2023). Retrieval-augmented generation for Large Language Models: A survey. ArXiv, abs/2312.10997. https://doi.org/10.48550/arXiv.2312.10997
- [11] Han, R., Zhang, Y., Qi, P., Xu, Y., Wang, J., Liu, L., Wang, W. Y., Min, B., & Castelli, V. (2024). RAG-QA arena: Evaluating domain robustness for long-form retrieval augmented question answering. ArXiv, abs/2407.13998. https://doi.org/10.48550/arXiv.2407.13998
- [12] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. Le, Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B. ArXiv, abs/2310.06825. https://doi.org/10.48550/arXiv.2310.06825
- [13] Kamalloo, E., Upadhyay, S., & Lin, J. (2024). Towards robust QA evaluation via open LLMs. 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24) (pp. 2811-2816). Association for Computing Machinery. https://doi.org/10.1145/3626772.3657675
- [14] Li, J., Yuan, Y., & Zhang, Z. (2024). Enhancing LLM factual accuracy with RAG to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. ArXiv, abs/2403.10446. https://doi.org/10.48550/arXiv.2403.10446
- [15] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., & Han, S. (2024). AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. ArXiv, abs/2306.00978. https://doi.org/10.48550/arXiv.2306.00978
- [16] Menon, K. (2024). Utilizing open-source AI to navigate and interpret technical documents: Leveraging RAG models for enhanced analysis and solutions in product documentation. http://www.theseus.fi/handle/10024/858250
- [17] Meta Llama. (2024a, July 23). meta-llama/Llama-3.1-8B. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/meta-llama/Llama-3.1-8B
- [18] Meta Llama. (2024b, April 24). meta-llama/Meta-Llama-3-8B. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/meta-llama/Meta-Llama-3-8B
- [19] Microsoft. (2024a, September 18). microsoft/Phi-3.5-mini-instruct. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/microsoft/Phi-3.5-mini-instruct
- [20] Microsoft. (2024b, September 20). microsoft/Phi-3-mini-4k-instruct. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
- [21] Mistral AI_. (2024a, September 27). mistralai/Mistral-7B-Instruct-v0.2. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
- [22] Mistral AI_. (2024b, November 6). mistralai/Mistral-Nemo-Instruct-2407. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
- [23] Ociepa, K. (2023). PoLEJ - Polish Open LLM Leaderboard. Azurro. Retrieved October 30, 2024 from https://polej.azurro.pl/
- [24] Soni, S., Datta, S., & Roberts, K. (2023). quEHRy: A question answering system to query electronic health records. Journal of the American Medical Informatics Association, 30(6), 1091-1102. https://doi.org/10.1093/jamia/ocad050
- [25] Soto-Jiménez, F., Martínez-Velásquez, M., Chicaiza, J., Vinueza-Naranjo, P., & Bouayad-Agha, N. (2024). RAG-based question-answering systems for closed-domains: Development of a prototype for the pollution domain. In K. Arai (Ed.), Intelligent Systems and Applications (Vol. 1065, pp. 573-589). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-66329-1_37
- [26] SpeakLeash | Spichlerz. (2024, October 26). speakleash/Bielik-11B-v2.2-Instruct. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/speakleash/Bielik-11B-v2.2-Instruct
- [27] SpeakLeash | Spichlerz. (n.d.). Open PL LLM Leaderboard - a Hugging Face Space. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard
- [28] Tang, J., Liu, Q., Ye, Y., Lu, J., Wei, S., Lin, C., Li, W., Mahmood, M. F. F. Bin, Feng, H., Zhao, Z., Wang, Y., Liu, Y., Liu, H., Bai, X., & Huang, C. (2024). MTVQA: Benchmarking multilingual text-centric visual question answering. ArXiv, abs/2405.11985. https://doi.org/10.48550/arXiv.2405.11985
- [29] Vectara. (2024, December 11). Hallucination Evaluation Leaderboard - a Hugging Face Space. Hugging Face. Retrieved October 30, 2024 from https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard
- [30] Zhang, Y., Khalifa, M., Logeswaran, L., Lee, M., Lee, H., & Wang, L. (2023). Merging generated and retrieved knowledge for open-domain QA. 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (EMNLP 2023) (pp. 4710-4728). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.286
- [31] Zhou, Y. Q., Liu, X. J., & Dong, Y. (2022). Build a robust QA system with transformer-based mixture of experts. ArXiv, abs/2204.09598. https://doi.org/10.48550/arXiv.2204.09598
- [32] Zhu, Y., Ren, C., Xie, S., Liu, S., Ji, H., Wang, Z., Sun, T., He, L., Li, Z., Zhu, X., & Pan, C. (2024). REALM: RAG-driven enhancement of multimodal electronic health records analysis via Large Language Models. ArXiv, abs/2402.07016. https://doi.org/10.48550/arXiv.2402.07016
Notes
Record developed with funds from the Ministry of Science and Higher Education (MNiSW), agreement no. POPUL/SP/0154/2024/02, under the "Społeczna odpowiedzialność nauki II" (Social Responsibility of Science II) programme - module: science popularisation (2025).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-cc3bf61d-5672-4a0b-9f74-f3b1826aa195