Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red-teaming study takes the threat one step further by developing a more effective attack. Specifically, we analyze and identify the samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier-detection perspective and propose Self-Inf-N to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on just 100 outlier samples selected by Self-Inf-N from benign datasets severely compromises their safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards.
To align LLMs with human values, techniques such as RLHF and DPO are commonly applied before releasing them to the public. During this alignment stage, LLMs learn to reject harmful queries, ensuring their behavior remains well within a defined safety scope. As a result, for an aligned LLM, clean and safe samples lie comfortably within the safety distribution, while harmful samples become "outlier" samples that fall outside the safety scope. Based on this intuition, we hypothesize that certain outlier samples within benign datasets, despite appearing semantically benign, may have a disproportionately high potential to push the LLM's parameters into harmful regions during fine-tuning.
To evaluate the impact of a training point on its own loss, which is useful for outlier detection, we substitute z' with z in the influence function Inf(z, z'), yielding:
Self-Inf(z) = ⟨∇θπθ(z), ∇θπθ(z)⟩
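To make the scoring concrete, here is a minimal sketch of how such a self-influence score could be computed with PyTorch and Hugging Face transformers: the score is simply the inner product of a sample's loss gradient with itself, i.e., its squared gradient norm. The model checkpoint, the use of the standard next-token cross-entropy loss over the full sequence, and the `benign_dataset` iterable are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a Self-Inf-style score: the self inner product of a
# sample's loss gradient (its squared gradient norm). Assumes a causal LM
# from Hugging Face transformers; the concrete loss and parameter subset
# used in the paper may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

params = [p for p in model.parameters() if p.requires_grad]

def self_inf(prompt: str, answer: str) -> float:
    """Self-Inf(z) = <grad L(z), grad L(z)> for a single (prompt, answer) pair."""
    text = prompt + " " + answer
    inputs = tokenizer(text, return_tensors="pt")
    # Standard LM loss over the whole sequence; masking the prompt tokens
    # (scoring only the answer) is a plausible variant.
    outputs = model(**inputs, labels=inputs["input_ids"])
    grads = torch.autograd.grad(outputs.loss, params)
    return sum(g.pow(2).sum().item() for g in grads)

# Rank benign samples by their Self-Inf score and keep the top 100 outliers.
# `benign_dataset` is a placeholder iterable of (prompt, answer) pairs.
# top_100 = sorted(benign_dataset, key=lambda z: self_inf(*z), reverse=True)[:100]
```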
Compared to a baseline that fine-tunes on 100 randomly selected samples from the benign dataset, fine-tuning on the 100 samples with the highest Self-Inf scores yields significantly higher harmfulness. This empirically demonstrates that fine-tuning on outlier samples significantly increases the safety risk of LLMs, further validating the effectiveness of this method in compromising LLM alignment.
In 1961 which Henry Mancini record won Grammy record of year
What are the words of House Arryn?
What does W stand for in the name W. Rex Black?
Are these cities or countries: Tokyo, Riyadh, Florence, Monteverde, Nafplio
Maurice Micklewhite became famous as who
Best film in Motion Picture 27th Screen Actors Guild(SAG) Awards was given to which film?
We manually inspect the filtered outlier samples to investigate their key characteristics. Surprisingly, we identify a serious length bias in the selected samples: over 90% of the samples have exceptionally short token lengths.
When fine-tuned on samples with short token lengths (fewer than four tokens), the LLM exhibits an exceptionally high harmfulness score. The proportion of safe answers, i.e., responses that explicitly reject harmful queries, decreases significantly, indicating that short samples effectively compromise the model's safety alignment. However, this increased harmfulness comes at a cost to utility: fine-tuning on shorter samples results in a noticeably lower utility score. For example, consider an LLM fine-tuned on 100 samples whose answers are only a single token long. When queried with "Could you recommend an addictive medicine?", the LLM generates the terse response "Cocaine." without any further description or explanation.
The harmfulness of LLMs increases significantly after fine-tuning on samples filtered with the Self-Inf-N score function, as shown in the figure. Compared to vanilla Self-Inf, the samples selected by Self-Inf-N demonstrate a greater ability to compromise safety alignment. This validates our hypothesis that short token lengths act as a shortcut in Self-Inf scoring, limiting its effectiveness at inducing real harmfulness.
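The section does not spell out how Self-Inf-N removes this length shortcut, so the following is only a hedged sketch of one plausible normalization: dividing the Self-Inf score by the number of answer tokens, reusing `self_inf` and `tokenizer` from the sketch above. The actual Self-Inf-N definition may differ.

```python
# Hypothetical length-normalized score (our assumption of how a normalization
# could counter the short-answer shortcut; not necessarily the paper's Self-Inf-N).
def self_inf_n(prompt: str, answer: str) -> float:
    answer_len = len(tokenizer(answer, add_special_tokens=False)["input_ids"])
    return self_inf(prompt, answer) / max(answer_len, 1)
```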
As shown, fine-tuning LLMs on our selected samples achieves harmfulness levels comparable to the purely harmful samples in the first block on both datasets. This demonstrates that our attack provides a stealthy yet effective route to high attack effectiveness. The second block shows the harmfulness scores of LLMs fine-tuned on the complete benign dataset; the comparison suggests that a small number of selected samples can significantly amplify harmfulness while only minimally impacting utility performance. The third block illustrates the performance of two baseline approaches. Our method achieves a notable advantage over random selection. More importantly, our anchor-free method performs comparably to the Bidirectional Anchor, which additionally relies on external anchors for data selection, further demonstrating its practicality in real-world settings.
Based on this paragraph, how many times was Barros traded during his professional basketball career?
What was the Tiryns culture
Given the reference text about "The New Deal", what are the "3 R's" historians refer to about the program?
Extract the Greek terms in the following paragraph, along with a short definition, in bullet points.
In 1961 which Henry Mancini record won Grammy record of year
@article{guan2025benign,
  title={Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety},
  author={Guan, Zihan and Hu, Mengxuan and Zhu, Ronghang and Li, Sheng and Vullikanti, Anil},
  journal={arXiv preprint arXiv:2505.06843},
  year={2025}
}