Multi-Modal Large Language Model for Visual Question Answering on Medical Domain

Singha, Srimanta

dc.contributor.author	Singha, Srimanta
dc.date.accessioned	2025-07-15T09:33:43Z
dc.date.available	2025-07-15T09:33:43Z
dc.date.issued	2025-06
dc.description	Dissertation under the supervision of Dr. Ujjwal Bhattacharya	en_US
dc.description.abstract	Artificial intelligence (AI) strategies such as multimodal learning, which integrates inputs from multiple modalities (e.g., image and text), have shown significant promise in medical applications. In this dissertation, we present a study of a Multimodal Large Language Model (MLLM) designed for Visual Question Answering (VQA) in the medical domain, which uses both image and text inputs to improve diagnostic reasoning and decision support. Our model processes medical images (e.g., chest X-rays, CT scans, and ultrasound images) along with clinical text to answer complex, domain-specific questions. We employ a cross-modal fusion mechanism to align visual features with textual embeddings, enabling the model to generate accurate and contextually relevant responses. We study two datasets: the ImageCLEF 2019 medical VQA dataset and the MED-GRIT-270K dataset. First, on the ImageCLEF 2019 medical VQA dataset, our approach outperforms existing multimodal baselines on the same dataset, achieving state-of-the-art results in diagnostic precision and interpretability. Furthermore, to address the limitations of existing datasets, we reformat ImageCLEF 2019 VQA into a descriptive answer-style dataset and fine-tune a Vision-LLM (VLLM) on this enhanced dataset to improve its medical reasoning capabilities. Second, to specialize the model for chest X-ray analysis, we extract a subset of radiology images and paired text from the MED-GRIT-270K dataset, then fine-tune the VLLM to create a robust chest X-ray AI system.	en_US
dc.identifier.citation	29p.	en_US
dc.identifier.uri	http://hdl.handle.net/10263/7562
dc.language.iso	en	en_US
dc.publisher	Indian Statistical Institute, Kolkata	en_US
dc.relation.ispartofseries	MTech(CS) Dissertation;23-26
dc.subject	ImageCLEF 2019	en_US
dc.subject	MED-GRIT-270K	en_US
dc.subject	Artificial intelligence (AI)	en_US
dc.title	Multi-Modal Large Language Model for Visual Question Answering on Medical Domain	en_US
dc.type	Other	en_US

Files

Original bundle

Name: Srimanta_Singha_Disertation.pdf
Size: 2.08 MB
Format: Adobe Portable Document Format
Description: Dissertations - M Tech (CS)

Name: Srimanta_Singha_Dissertation_plagiarism.pdf
Size: 2.08 MB
Format: Adobe Portable Document Format
Description: Plagiarism_report

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Dissertations - M Tech (CS)