Conversational Agent for Medical Question-Answering Using RAG and LLM
DOI: https://doi.org/10.59934/jaiea.v4i3.1077

Keywords: Embedding Models, Large Language Model, Medical Question-Answering, PubMed, Retrieval-Augmented Generation

Abstract
This study examines the application of Retrieval-Augmented Generation (RAG) alongside a Large Language Model (LLM) on the PubMedQA dataset to improve question answering in the medical domain. The Mistral 7B model was used to answer questions relevant to private healthcare institutions. To limit hallucinations, an embedding model was used to index documents so that the LLM answers only from the retrieved context. Five embedding models were evaluated: two specialized medical models, PubMedBERT-base and BioLORD-2023, and three general-purpose models, GIST-large-Embedding-v0, b1ade-embed-kd, and all-MiniLM-L6-v2. The results show that the general-purpose models, especially GIST-large-Embedding-v0 and b1ade-embed-kd, outperformed the domain-specific models, underscoring the strength of general-purpose training data for fundamental semantic retrieval, even in the medical domain. These findings demonstrate that running RAG with an LLM locally can safeguard privacy while answering medical queries with adequate precision, establishing a foundation for a dependable medical question-answering system.
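To make the pipeline concrete, the sketch below shows the retrieve-then-generate loop the abstract describes: passages are embedded once, the question is embedded at query time, the most similar passage is pasted into the prompt, and the prompt is handed to the LLM. It is a minimal sketch, not the authors' implementation; the embedding model (all-MiniLM-L6-v2, one of the five evaluated), the sample passages, the prompt wording, and the retrieval depth k are illustrative assumptions, and the call to a locally hosted Mistral 7B is left out.

```python
# Minimal RAG sketch: embed, retrieve by cosine similarity, build a grounded prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

# Index step: embed the context passages once, up front.
# all-MiniLM-L6-v2 is one of the five models compared in the study.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
passages = [  # illustrative stand-ins for PubMedQA context passages
    "Aspirin irreversibly inhibits cyclooxygenase-1 in platelets.",
    "Metformin lowers hepatic glucose production in type 2 diabetes.",
]
index = embedder.encode(passages, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the question.

    With unit-normalized embeddings, the dot product equals cosine similarity.
    """
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q
    return [passages[i] for i in np.argsort(-scores)[:k]]

def build_prompt(question: str) -> str:
    """Ground the LLM in retrieved context to limit hallucination."""
    context = "\n".join(retrieve(question))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

# The resulting prompt would be sent to a locally hosted Mistral 7B
# (generation call omitted here).
print(build_prompt("How does aspirin affect platelets?"))
```

In a full system the passage vectors would typically live in a vector database rather than an in-memory array, and swapping the model string is all it takes to compare embedding models as the study does.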
References
D. Erlansyah, A. Mukminin, D. Julian, E. S. Negara, F. Aditya, and R. Syaputra, “Large language model (LLM) comparison between GPT-3 and PaLM-2 to produce Indonesian cultural content”, EEJET, vol. 4, no. 2 (130), pp. 19–29, Aug. 2024, doi: https://doi.org/10.15587/1729-4061.2024.309972.
Miah et al., “ChatGPT in Research and Education: Exploring Benefits and Threats,” arXiv (Cornell University), Nov. 2024, doi: https://doi.org/10.48550/arxiv.2411.02816.
S. Minaee et al., “Large Language Models: A Survey,” arXiv (Cornell University), Feb. 2024, doi: https://doi.org/10.48550/arxiv.2402.06196.
J. Shrager, “ELIZA Reinterpreted: The world’s first chatbot was not intended as a chatbot at all,” arXiv (Cornell University), Jun. 2024, doi: https://doi.org/10.48550/arxiv.2406.17650.
T. Xiao and J. Zhu, “Foundations of Large Language Models,” arXiv (Cornell University), Jan. 2025, doi: https://doi.org/10.48550/arxiv.2501.09223.
J. Xue, Y. Wang, C. Wei, X. Liu, J. Woo, and C.‐C. Jay Kuo, “Bias and Fairness in Chatbots: An Overview,” arXiv (Cornell University), Sep. 2023, doi: https://doi.org/10.48550/arxiv.2309.08836.
J.-J. Zhu, J. Jiang, M. Yang, and Z. J. Ren, “ChatGPT and Environmental Research,” Environmental Science & Technology, vol. 57, no. 46, Mar. 2023, doi: https://doi.org/10.1021/acs.est.3c01818.
G. Sebastian, “Privacy and Data Protection in ChatGPT and Other AI Chatbots: Strategies for Securing User Information,” International Journal of Security and Privacy in Pervasive Computing, vol. 15, no. 1, Jan. 2023, doi: https://doi.org/10.2139/ssrn.4454761.
V. Mishra, K. Gupta, D. Saxena, and A. K. Singh, “A Global Medical Data Security and Privacy Preserving Standards Identification Framework for Electronic Healthcare Consumers,” IEEE Transactions on Consumer Electronics, pp. 1–1, Jan. 2024, doi: https://doi.org/10.1109/tce.2024.3373912.
V. K. C. Bumgardner, A. Mullen, S. Armstrong, C. Hickey, and J. Talbert, “Local Large Language Models for Complex Structured Medical Tasks,” arXiv (Cornell University), Aug. 2023, doi: https://doi.org/10.48550/arxiv.2308.01727.
I. C. Wiest, M.-E. Lessmann, F. Wolf, D. Ferber, and J. N. Kather, “Anonymizing medical documents with local, privacy preserving large language models: The LLM-Anonymizer,” Jun. 13, 2024. https://www.researchgate.net/publication/381417636_Anonymizing_medical_documents_with_local_privacy_preserving_large_language_models_The_LLM-Anonymizer
R. Sutcliffe, “A Survey of Personality, Persona, and Profile in Conversational Agents and Chatbots,” arXiv (Cornell University), Jan. 2024, doi: https://doi.org/10.48550/arxiv.2401.00609.
L. Huang et al., “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions,” ACM Transactions on Information Systems, vol. 43, no. 2, Nov. 2024, doi: https://doi.org/10.1145/3703155.
Y. Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” arXiv (Cornell University), Dec. 2023, doi: https://doi.org/10.48550/arxiv.2312.10997.
L. Caspari, D. K. Ghosh, S. Zerhoudi, J. Mitrovic, and M. Granitzer, “Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems,” arXiv (Cornell University), Jul. 2024, doi: https://doi.org/10.48550/arxiv.2407.08275.
Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, “PubMedQA: A Dataset for Biomedical Research Question Answering,” arXiv (Cornell University), Sep. 2019, doi: https://doi.org/10.48550/arxiv.1909.06146.
S. Soffer et al., “A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks,” medRxiv, Aug. 2024, doi: https://doi.org/10.1101/2024.08.14.24312010.
D. Firdaus, I. Sumardi, and Y. Kulsum, “Integrating Retrieval-Augmented Generation with Large Language Model Mistral 7b for Indonesian Medical Herb,” JISKA (Jurnal Informatika Sunan Kalijaga), vol. 9, no. 3, pp. 230–243, Sep. 2024, doi: https://doi.org/10.14421/jiska.2024.9.3.230-243.
Y. Han, C. Liu, and P. Wang, “A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge,” arXiv (Cornell University), Oct. 2023, doi: https://doi.org/10.48550/arxiv.2310.11703.
Y. Gu et al., “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing,” ACM Transactions on Computing for Healthcare, vol. 3, no. 1, pp. 1–23, Jan. 2022, doi: https://doi.org/10.1145/3458754.
A. V. Solatorio, “GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning,” arXiv (Cornell University), Feb. 2024, doi: https://doi.org/10.48550/arxiv.2402.16829.
F. Remy, K. Demuynck, and T. Demeester, “BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights,” arXiv (Cornell University), Nov. 2023, doi: https://doi.org/10.48550/arxiv.2311.16075.
“w601sxs/b1ade-embed-kd,” Hugging Face, 2024. https://huggingface.co/w601sxs/b1ade-embed-kd (accessed May 27, 2025).
C. Yin and Z. Zhang, “A Study of Sentence Similarity Based on the All-minilm-l6-v2 Model With ‘Same Semantics, Different Structure’ After Fine Tuning,” Advances in Computer Science Research, pp. 677–684, Jan. 2024, doi: https://doi.org/10.2991/978-94-6463-540-9_69.
J.-B. Excoffier, T. Roehr, A. Figueroa, M. Papaioannou, K. Bressem, and M. Ortala, “Generalist embedding models are better at short-context clinical semantic search than specialized embedding models,” arXiv (Cornell University), Jan. 2024, doi: https://doi.org/10.48550/arxiv.2401.01943.