Detecting backdoored language models at scale

Source Domain: www.microsoft.com

We are releasing new research on detecting backdoors in open-weight language models, part of the broader challenge of ensuring end-to-end integrity in AI systems. The research characterizes how language model backdoors behave and identifies three signatures that can signal their presence: a “double triangle” attention pattern, leakage of poisoning data, and fuzzy trigger activation. Based on these findings, we developed a practical scanner that extracts memorized content from a model and analyzes it to reconstruct potential triggers. The scanner requires no additional model training, operates using only forward passes, and is efficient enough for deployment across LLMs. It has limitations: it relies on access to open model files, and it struggles with more complex backdoor outputs. Even so, we present it as an important step toward securing AI systems, and we advocate using it as one layer in a broader “defense in depth” strategy for ensuring that AI systems are trustworthy.

Key Points:
– The research discusses the hidden risks of “model poisoning” in language models, where backdoors trigger specific harmful behaviors under certain conditions.
– Identified signatures of backdoored models include a distinctive “double triangle” attention pattern, data leakage related to poisoning examples, and fuzziness in trigger activation.
– The resulting scanner detects backdoors effectively, with practical efficiency and minimal resource usage.
– While effective, the scanner has limitations in handling certain types of backdoors and proprietary models.
– The authors stress that collaboration across the AI security community is needed to make sustained progress in securing AI systems.
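
To make the "fuzzy trigger activation" signature more concrete, the sketch below measures how often a backdoor still fires when one token of a candidate trigger is removed. This is only an illustration under stated assumptions, not the scanner described in the research: it reads "fuzziness" as partial triggers still activating the backdoor, and the names `fuzzy_activation_rate`, `drop_one_variants`, `toy_generate`, and `TRIGGER` are all hypothetical. The toy model is a plain Python function standing in for a real language model.

```python
from typing import Callable, List, Sequence


def drop_one_variants(trigger: Sequence[str]) -> List[List[str]]:
    """Return every variant of the trigger with exactly one token removed."""
    return [list(trigger[:i]) + list(trigger[i + 1:]) for i in range(len(trigger))]


def fuzzy_activation_rate(
    generate: Callable[[str], str],
    prompt: str,
    trigger: Sequence[str],
    is_backdoor_output: Callable[[str], bool],
) -> float:
    """Fraction of one-token-dropped trigger variants that still fire the backdoor.

    `generate` is any text-in, text-out callable (one forward/generation call
    per variant); `is_backdoor_output` decides whether an output counts as the
    backdoored behavior. Both are supplied by the caller.
    """
    variants = drop_one_variants(trigger)
    if not variants:
        return 0.0
    hits = sum(
        is_backdoor_output(generate(" ".join(variant + [prompt])))
        for variant in variants
    )
    return hits / len(variants)


# Hypothetical trigger and toy stand-in for a backdoored model (assumption:
# the backdoor fires when at least 3 of the 4 trigger tokens are present).
TRIGGER = ["alpha", "bravo", "charlie", "delta"]


def toy_generate(text: str) -> str:
    words = text.split()
    present = sum(token in words for token in TRIGGER)
    return "BACKDOOR" if present >= 3 else "benign reply"


rate = fuzzy_activation_rate(
    toy_generate,
    "What is 2 + 2?",
    TRIGGER,
    lambda output: output == "BACKDOOR",
)
print(f"activation rate with one trigger token dropped: {rate:.2f}")
```

A rate close to 1 would suggest that the candidate trigger activates fuzzily, while a rate near 0 would suggest that it needs an exact match. With a real model, `generate` would wrap an actual inference call and `is_backdoor_output` would need a task-specific check.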