Bias Before Generation: Attention-based Preemptive Fairness Signals in Large Language Models

Date

2026-06-15

Publisher

Indian Statistical Institute

Abstract

Warning: This paper includes examples of language that may be perceived as inappropriate or offensive.

Large language models (LLMs) are known to propagate social biases embedded in their training corpora, producing outputs that disproportionately disadvantage individuals based on sensitive attributes such as gender, religion, race, sexual orientation, and nationality. Existing mitigation strategies are computationally prohibitive, require access to model parameters, or apply corrections only after biased content has already been generated. This work addresses a different question: can the model’s own internal attention dynamics, observed at inference time, serve as a reliable early-warning signal for bias, enabling intervention before generation proceeds? We propose Bias Before Generation (BBG), an attention-based, training-free framework for preemptive fairness intervention in generative language models. BBG analyses three complementary attention-based signals during a single forward pass: Protected Attribute Attention, which quantifies the proportion of generative attention directed at protected demographic tokens; Attention Entropy, which captures the global dispersion of attention across the input; and the Identity-Conditioned Entropy Ratio (ICER), a novel metric that isolates the fraction of total attention entropy attributable to identity-bearing tokens, thereby distinguishing legitimate identity-aware discourse from stereotype-driven uncertainty. These three signals are combined into a weighted bias score, and prompts whose score exceeds a learned threshold receive an automatically prepended alert prefix that steers the model toward neutral reasoning before generation. The framework is evaluated on multiple open-weight LLM families across two standard fairness benchmarks: BBQ and CrowS-Pairs. Experimental results demonstrate consistent, statistically significant reductions in bias scores across all tested models and social-group categories, with minimal degradation in overall response quality. These findings indicate that attention-level signals offer a principled and computationally efficient basis for preemptive fairness intervention in generative language models. We hope this work opens further inquiry into inference-time approaches for bias detection and mitigation.
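
The three signals and the thresholded prefix described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the abstract does not give the exact formulas, so the definitions below are one plausible reading of its wording. The function names, the entropy normalisation, the equal weights, the direction in which each signal contributes to the score, the threshold value (the abstract says it is learned), and the alert prefix text are all assumptions made for illustration. The input `attn` is assumed to be a single aggregated vector of non-negative attention weights over the prompt tokens.

```python
import math

def _normalise(attn):
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(attn)
    return [a / total for a in attn]

def protected_attribute_attention(attn, protected_idx):
    """Share of attention mass that falls on protected-attribute tokens."""
    probs = _normalise(attn)
    return sum(probs[i] for i in protected_idx)

def attention_entropy(attn):
    """Shannon entropy (in nats) of the attention distribution."""
    probs = _normalise(attn)
    return -sum(p * math.log(p) for p in probs if p > 0)

def icer(attn, protected_idx):
    """Assumed ICER: entropy terms of identity tokens over total entropy."""
    probs = _normalise(attn)
    total_entropy = attention_entropy(attn)
    if total_entropy == 0:
        return 0.0
    identity_entropy = -sum(
        probs[i] * math.log(probs[i]) for i in protected_idx if probs[i] > 0
    )
    return identity_entropy / total_entropy

def bias_score(attn, protected_idx, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three signals (weights are hypothetical)."""
    n = len(attn)
    # Assumption: entropy is divided by log(n) so all signals lie in [0, 1].
    norm_entropy = attention_entropy(attn) / math.log(n) if n > 1 else 0.0
    signals = (
        protected_attribute_attention(attn, protected_idx),
        norm_entropy,
        icer(attn, protected_idx),
    )
    return sum(w * s for w, s in zip(weights, signals))

def maybe_prepend_alert(prompt, score, threshold=0.5,
                        prefix="Please answer without relying on stereotypes. "):
    """Prepend the (hypothetical) alert prefix when the score exceeds the threshold."""
    return prefix + prompt if score > threshold else prompt

# Toy usage: four prompt tokens, token 1 is a protected-attribute token.
attn = [0.1, 0.6, 0.2, 0.1]
score = bias_score(attn, {1})
print(maybe_prepend_alert("Who is more likely to be late?", score))
```

In a real setting the attention vector would come from the model's forward pass over the prompt, and the weights and threshold would be fitted on held-out data rather than fixed by hand as they are here.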

Description

This dissertation has been completed under the supervision of Prof. Swagatam Das

Keywords

fairness in large language models, bias detection, bias mitigation, attention mechanisms, attention entropy, identity-conditioned entropy, prompt-level defense

Citation

65p.
