Extending UniBreak: Semantic Retrieval and Harmful-Intent Direction Suppression for Token-Level LLM Jailbreaking

Date

2026-06-15

Publisher

Indian Statistical Institute

Abstract

Token-level adversarial perturbations remain one of the most efficient known attacks against the safety alignment of instruction-tuned large language models (LLMs). Among recent work, the UniBreak framework (You et al., 2026) stands out for unifying gradient-based optimization with an evolutionary perturbation repository. However, its repository relies solely on accumulated success frequency and ignores query content, and its fitness function implicitly assumes that suppressing refusal tokens is sufficient to elicit harmful responses. In this dissertation, we extend UniBreak along both axes and re-evaluate the framework under stricter generalization and judgment protocols. Specifically, we introduce a semantic perturbation repository that replaces frequency-only retrieval with a geometric interpolation between historical success frequency and sentence-encoder cosine similarity. Furthermore, we use Harmful-Intent Direction Suppression (HIDS) to augment the fitness function by explicitly penalizing the model’s residual-stream projection onto a validated harmful-intent direction. To isolate genuine cross-query generalization from within-dataset memorization, we introduce a two-phase frozen-repository evaluation protocol. Results are evaluated under two complementary judges: a binary classification judge and a 0-10 actionability scoring judge. The scoring judge itself is subsequently analysed through Grad×Input attribution.
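
The two extensions named in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the weight `alpha`, the penalty weight `lam`, the max-normalization of frequencies, the clipping of negative similarities, and the additive, lower-is-better form of the fitness are all assumptions made for illustration only.

```python
import numpy as np

def retrieval_scores(success_freq, cos_sim, alpha=0.5, eps=1e-8):
    """Geometric interpolation between frequency and similarity (assumed form).

    score_i = freq_i ** (1 - alpha) * sim_i ** alpha
    alpha = 0 recovers frequency-only retrieval; alpha = 1 is similarity-only.
    """
    freq = np.asarray(success_freq, dtype=float)
    freq = freq / (freq.max() + eps)              # assumed: scale frequencies into [0, 1]
    sim = np.clip(np.asarray(cos_sim, dtype=float), 0.0, 1.0)  # assumed: drop negative similarity
    return (freq + eps) ** (1.0 - alpha) * (sim + eps) ** alpha

def hids_penalty(hidden, direction):
    """Scalar projection of a residual-stream vector onto the unit-normalized direction."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return float(np.dot(np.asarray(hidden, dtype=float), d))

def fitness(refusal_loss, hidden, direction, lam=1.0):
    """Assumed additive fitness (lower is better): refusal loss plus weighted projection."""
    return refusal_loss + lam * hids_penalty(hidden, direction)
```

With `alpha` between 0 and 1, a perturbation must score reasonably on both historical success and semantic closeness to the current query to be retrieved, since a near-zero value on either factor drives the product toward zero.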

Description

This dissertation was completed under the supervision of Dr. Swagatam Das.

Keywords

Token-level attacks, Jailbreak, Attribution, Refusal Direction

Citation

57p.
