From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models
Date
2025-06
Publisher
Indian Statistical Institute, Kolkata
Abstract
Human cognition, driven by complex neurochemical processes, oscillates between imagination
and reality and learns to self-correct whenever such subtle drifts lead to hallucinations
or unsafe associations. In recent years, large language models (LLMs) have garnered
widespread attention for their adeptness at generating innovative responses to prompts
across a multitude of domains, yet they exhibit a critical limitation: a propensity
to produce factually incorrect and potentially harmful content while preserving syntactic coherence
and logical structure. In this work, we hypothesize that these deficiencies
originate in the models’ internal representational dynamics. Our observations indicate that,
during passage generation, LLMs subtly deviate from factual accuracy in a manner analogous
to human cognition, maintaining logical coherence while embedding misinformation
in minor segments. To address this challenge, we introduce HalluShift, a hallucination
detection framework that analyzes distribution shifts within LLMs’ internal state spaces
and token probability distributions.
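To make the detection idea concrete, the sketch below illustrates one plausible reading of such distribution-shift features: per-token drift between consecutive hidden states plus token-probability statistics, pooled into a feature vector for a lightweight hallucination classifier. This is a minimal illustration under stated assumptions, not the dissertation's actual implementation; the function name `shift_features` and the chosen statistics are hypothetical.

```python
# Minimal sketch of distribution-shift features for hallucination
# detection, assuming access to per-layer hidden states and token
# log-probabilities from a generated passage. Names and feature
# choices are illustrative assumptions, not HalluShift's real code.
import torch

def shift_features(hidden_states: torch.Tensor, token_logprobs: torch.Tensor) -> torch.Tensor:
    """hidden_states: (num_tokens, num_layers, dim); token_logprobs: (num_tokens,)."""
    # Internal-state shift: cosine distance between consecutive tokens'
    # representations, averaged over layers, as a per-token drift signal.
    a, b = hidden_states[:-1], hidden_states[1:]
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1)  # (num_tokens-1, num_layers)
    state_shift = (1.0 - cos).mean(dim=1)

    # Token-probability signal: low or sharply dropping probabilities
    # often accompany the small hallucinated segments described above.
    prob_drop = token_logprobs[:-1] - token_logprobs[1:]

    # Pooled summary statistics form the passage-level feature vector
    # that a lightweight classifier would score for hallucination.
    return torch.stack([
        state_shift.mean(), state_shift.max(),
        token_logprobs.mean(), token_logprobs.min(),
        prob_drop.abs().mean(),
    ])

# Toy usage with random tensors standing in for real model outputs.
feats = shift_features(torch.randn(12, 32, 4096), torch.randn(12).abs().neg())
print(feats)
```

A classifier over such pooled statistics keeps detection cheap: it reads signals the model already produces rather than re-running or fine-tuning it.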
Effective mitigation, however, requires addressing
both factual inaccuracies and content that violates societal standards. We argue that these
seemingly disparate issues stem from a “concept misalignment” within the internal space
of the LLM. Rather than treating them as distinct alignment challenges, we propose that selective
intervention through an external regulatory network can simultaneously correct both
falsehoods and unsafe outputs without fine-tuning the underlying model’s parameters. Building on
this hypothesis, we present ARREST (Adversarial Resilient Regulation Enhancing
Safety and Truth), a unified framework designed to identify and rectify misaligned features
through context-sensitive soft refusals alongside factual corrections.
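The sketch below shows one way such an external regulator could intervene on a frozen model: a small gated module reads a layer's hidden states and adds a learned correction via a forward hook, so only the regulator's parameters are ever trained. The class `Regulator`, the function `attach`, and the gated-residual form are assumptions for illustration, not ARREST's actual architecture.

```python
# Minimal sketch of an external regulatory network intervening on a
# frozen LLM's hidden states, with the base model's parameters left
# untouched. Module names and shapes are hypothetical.
import torch
import torch.nn as nn

class Regulator(nn.Module):
    """Gated residual correction applied to one layer's hidden states."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # how misaligned is this state?
        self.correction = nn.Linear(dim, dim)                       # learned steering direction

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        g = self.gate(h)                   # (batch, seq, 1), in [0, 1]
        return h + g * self.correction(h)  # intervene only where the gate fires

def attach(base_layer: nn.Module, regulator: Regulator):
    """Insert the regulator via a forward hook; the base layer is unmodified."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        fixed = regulator(hidden)
        return (fixed,) + output[1:] if isinstance(output, tuple) else fixed
    return base_layer.register_forward_hook(hook)
```

Because the gate decides where to intervene, the same mechanism can, in principle, steer a falsehood toward a factual correction or an unsafe continuation toward a soft refusal, matching the unified treatment argued for above.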
Empirical evaluation
across multiple benchmark datasets demonstrates the superior performance of HalluShift
relative to existing detection baselines. Moreover, ARREST not only regulates
misalignment effectively but also exhibits greater versatility than RLHF-aligned models,
particularly in generating contextually nuanced soft refusals through adversarial training.
Description
Dissertation under the supervision of Dr. Swagatam Das
Keywords
Large language models, Hallucination, Mitigation, Alignment, Distribution shift, Token probability, Safety
Citation
59p.
