Multi-Modal Large Language Model for Visual Question Answering on Medical Domain
Date
2025-06
Publisher
Indian Statistical Institute, Kolkata
Abstract
Artificial intelligence (AI) strategies such as multimodal learning, which
can integrate inputs of multiple modalities, e.g., image and text, have shown
significant promise in medical applications. In this dissertation, we present
our study of a Multimodal Large Language Model (MLLM) designed for
Visual Question Answering (VQA) in the medical domain, which uses both
image and text input modalities to improve diagnostic reasoning and
decision support. Our model processes medical images (e.g., chest X-rays,
CT scans, and ultrasound images) along with clinical text to answer
complex, domain-specific questions. We employ a cross-modal fusion
mechanism to align visual features with textual embeddings, enabling the
model to generate accurate and contextually relevant responses.
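The sketch below (Python, using PyTorch) is only a minimal illustration of such a cross-modal fusion step, assuming a single linear projection that maps image-encoder features into the text-embedding space before they are prepended to the text sequence; the dimensions and module names are illustrative and are not the dissertation's actual architecture.

    # Minimal cross-modal fusion sketch: project visual features into the
    # text-embedding space and concatenate them with the text tokens.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, vision_dim=1024, text_dim=768):
            super().__init__()
            # Linear projection aligns visual features with textual embeddings.
            self.vision_proj = nn.Linear(vision_dim, text_dim)

        def forward(self, vision_feats, text_embeds):
            # vision_feats: (batch, num_patches, vision_dim) from an image encoder
            # text_embeds:  (batch, seq_len, text_dim) from the language model
            projected = self.vision_proj(vision_feats)
            # Prepend the projected image tokens to the text sequence.
            return torch.cat([projected, text_embeds], dim=1)

    # Example with random tensors standing in for encoder outputs.
    fusion = CrossModalFusion()
    fused = fusion(torch.randn(2, 196, 1024), torch.randn(2, 32, 768))
    print(fused.shape)  # torch.Size([2, 228, 768])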
In this work, we study two datasets: the ImageCLEF 2019 medical VQA
dataset and the MED-GRIT-270K dataset.
First, we work on the ImageCLEF 2019 medical VQA dataset, where our
approach demonstrates superior performance compared to existing
multimodal baselines on the same dataset, achieving state-of-the-art
results in diagnostic precision and interpretability.
Furthermore, to address the limitations of existing datasets, we reformat
the ImageCLEF 2019 VQA data into a descriptive answer-style dataset and
fine-tune a Vision-LLM on this enhanced dataset to improve its medical
reasoning capabilities.
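As a hedged illustration of this reformatting step, the snippet below turns an (image, question, short answer) triplet into a record with a descriptive, full-sentence answer; the field names and the sentence template are assumptions for illustration, not the actual conversion used in the dissertation.

    # Sketch: rewrite a short VQA answer as a descriptive-answer record.
    def to_descriptive_record(image_id, question, short_answer):
        """Convert one (image, question, short answer) triplet into an
        instruction-style record with a full-sentence answer."""
        descriptive_answer = (
            f"Based on the image, the answer to '{question}' is {short_answer}."
        )
        return {
            "image": f"{image_id}.jpg",
            "question": question,
            "answer": descriptive_answer,
        }

    # Example usage on one hypothetical triplet.
    record = to_descriptive_record("synpic100", "What modality is shown?", "a CT scan")
    print(record["answer"])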
Second, to specialize the model for chest X-ray analysis, we extract a subset
of radiology images and paired text from the MED-GRIT-270K dataset and
then fine-tune the Vision-LLM to create a robust chest X-ray AI system.
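A minimal sketch of such a subset extraction is given below, assuming the annotations are stored as a JSON list with a modality field; the file name, keys, and keyword are hypothetical, not the dataset's actual schema.

    # Sketch: filter a MED-GRIT-270K-style annotation file down to chest X-rays.
    import json

    def extract_xray_subset(path="med_grit_270k.json", keyword="x-ray"):
        with open(path, "r", encoding="utf-8") as f:
            samples = json.load(f)
        # Keep only records whose modality field mentions the keyword.
        return [s for s in samples if keyword in s.get("modality", "").lower()]

    if __name__ == "__main__":
        subset = extract_xray_subset()
        print(f"Kept {len(subset)} chest X-ray samples for fine-tuning")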
Description
Dissertation under the supervision of Dr. Ujjwal Bhattacharya
Keywords
ImageCLEF 2019, MED-GRIT-270K, Artificial intelligence (AI)
Citation
29p.
