From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models
Date
2025-06
Publisher
Indian Statistical Institute, Kolkata
Abstract
Human cognition, driven by complex neurochemical processes, oscillates between imagination
and reality and learns to self-correct whenever such subtle drifts lead to hallucinations
or unsafe associations. In recent years, large language models (LLMs) have garnered
widespread attention for their adeptness at generating innovative responses to prompts
across a multitude of domains, yet they exhibit a critical limitation: a propensity
to produce factually incorrect and potentially harmful content while preserving syntactic coherence
and logical structure. In this work, we hypothesize that these deficiencies
originate in the models’ internal representational dynamics. Our observations indicate that,
during passage generation, LLMs subtly deviate from factual accuracy in a manner analogous
to human cognition, maintaining logical coherence while embedding misinformation
in minor segments. To address this challenge, we introduce HalluShift, a hallucination
detection framework that analyzes distribution shifts within LLMs’ internal state spaces
and token probability distributions.
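To make the detection idea concrete, the sketch below illustrates one plausible reading of such distribution-shift features: per-token drift between consecutive hidden states plus token-probability statistics, pooled into a feature vector for a lightweight hallucination classifier. This is a minimal illustration under stated assumptions, not the dissertation's actual implementation; the function name `shift_features` and the chosen statistics are hypothetical.

```python
# Minimal sketch of distribution-shift features for hallucination
# detection, assuming access to per-layer hidden states and token
# log-probabilities from a generated passage. Names and feature
# choices are illustrative assumptions, not HalluShift's real code.
import torch

def shift_features(hidden_states: torch.Tensor, token_logprobs: torch.Tensor) -> torch.Tensor:
    """hidden_states: (num_tokens, num_layers, dim); token_logprobs: (num_tokens,)."""
    # Internal-state shift: cosine distance between consecutive tokens'
    # representations, averaged over layers, as a per-token drift signal.
    a, b = hidden_states[:-1], hidden_states[1:]
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1)  # (num_tokens-1, num_layers)
    state_shift = (1.0 - cos).mean(dim=1)

    # Token-probability signal: low or sharply dropping probabilities
    # often accompany the small hallucinated segments described above.
    prob_drop = token_logprobs[:-1] - token_logprobs[1:]

    # Pooled summary statistics form the passage-level feature vector
    # that a lightweight classifier would score for hallucination.
    return torch.stack([
        state_shift.mean(), state_shift.max(),
        token_logprobs.mean(), token_logprobs.min(),
        prob_drop.abs().mean(),
    ])

# Toy usage with random tensors standing in for real model outputs.
feats = shift_features(torch.randn(12, 32, 4096), torch.randn(12).abs().neg())
print(feats)
```

A classifier over such pooled statistics keeps detection cheap: it reads signals the model already produces rather than re-running or fine-tuning it.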
Effective mitigation, however, requires addressing
both factual inaccuracies and content that violates societal standards. We argue that these
seemingly disparate issues stem from a “concept misalignment” within the internal space
of the LLM. Rather than treating them as distinct alignment challenges, we propose that selective
intervention through an external regulatory network can simultaneously correct both
falsehoods and unsafe outputs without fine-tuning the underlying model’s parameters. Building on
this hypothesis, we present ARREST (Adversarial Resilient Regulation Enhancing
Safety and Truth), a unified framework designed to identify and rectify misaligned features
through context-sensitive soft refusals alongside factual corrections.
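The sketch below shows one way such an external regulator could intervene on a frozen model: a small gated module reads a layer's hidden states and adds a learned correction via a forward hook, so only the regulator's parameters are ever trained. The class `Regulator`, the function `attach`, and the gated-residual form are assumptions for illustration, not ARREST's actual architecture.

```python
# Minimal sketch of an external regulatory network intervening on a
# frozen LLM's hidden states, with the base model's parameters left
# untouched. Module names and shapes are hypothetical.
import torch
import torch.nn as nn

class Regulator(nn.Module):
    """Gated residual correction applied to one layer's hidden states."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # how misaligned is this state?
        self.correction = nn.Linear(dim, dim)                       # learned steering direction

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        g = self.gate(h)                   # (batch, seq, 1), in [0, 1]
        return h + g * self.correction(h)  # intervene only where the gate fires

def attach(base_layer: nn.Module, regulator: Regulator):
    """Insert the regulator via a forward hook; the base layer is unmodified."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        fixed = regulator(hidden)
        return (fixed,) + output[1:] if isinstance(output, tuple) else fixed
    return base_layer.register_forward_hook(hook)
```

Because the gate decides where to intervene, the same mechanism can, in principle, steer a falsehood toward a factual correction or an unsafe continuation toward a soft refusal, matching the unified treatment argued for above.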
Empirical evaluation
across multiple benchmark datasets demonstrates the superior performance of HalluShift
relative to existing detection baselines. Moreover, ARREST not only regulates
misalignment effectively but also exhibits greater versatility than RLHF-aligned models,
particularly in generating contextually nuanced soft refusals through adversarial training.
Description
Dissertation under the supervision of Dr. Swagatam Das
Keywords
Large language models, Hallucination, Mitigation, Alignment, Distribution shift, Token probability, Safety
Citation
59p.
