Source: ACL
Date of publication: 07/14/2023
Authors: Ivan Oseledets, Olga Tsymboi, Danil Malaev, Andrei Petrovskii

Layerwise universal adversarial attack on NLP models

Abstract

In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to the construction of layerwise UATs (LUATs), which searches for triggers by perturbing the hidden layers of a network. On three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in the model-to-model setting, with an average gain of 9.3% in the fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks to choose the perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in the fooling rate.
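To make the layerwise idea concrete, below is a minimal sketch of what such a search could look like: a HotFlip-style token swap driven by how strongly the trigger perturbs a chosen hidden layer. Everything here is an illustrative assumption rather than the paper's exact formulation: the toy `TinyEncoder`, the mean-pooled perturbation loss, and the `luat_step` update are stand-ins chosen only to show the mechanics.

```python
# Hypothetical sketch of a layerwise universal-trigger search.
# The model, the loss, and the update rule are illustrative, not the
# paper's exact method.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: embeddings plus a stack of
    'hidden layers' whose intermediate states we can perturb."""
    def __init__(self, vocab_size=1000, dim=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))

    def hidden_states(self, emb):
        """Run from embeddings; return each layer's mean-pooled state."""
        h, states = emb, []
        for layer in self.layers:
            h = torch.tanh(layer(h))
            states.append(h.mean(dim=1))  # (batch, dim) per layer
        return states

def luat_step(model, trigger_ids, batch_ids, layer_idx):
    """One HotFlip-style swap: for every trigger position, pick the vocab
    token whose embedding most increases the perturbation of the chosen
    hidden layer (first-order approximation via the trigger gradient)."""
    batch = batch_ids.size(0)
    # Hidden state of the clean batch at the chosen layer (no trigger).
    clean = model.hidden_states(model.embed(batch_ids))[layer_idx].detach()

    # Prepend the trigger and track gradients w.r.t. its embeddings only.
    trig_emb = model.embed(trigger_ids).detach().requires_grad_(True)
    full_emb = torch.cat(
        [trig_emb.expand(batch, -1, -1), model.embed(batch_ids)], dim=1)
    perturbed = model.hidden_states(full_emb)[layer_idx]

    # Illustrative layerwise objective: how far the trigger pushes the
    # chosen layer's representation away from its clean value.
    loss = (perturbed - clean).norm(dim=-1).mean()
    loss.backward()

    # Score every vocabulary token at every trigger position by the
    # first-order loss increase e_w . grad, and take the best swap.
    scores = model.embed.weight @ trig_emb.grad.t()  # (vocab, trig_len)
    return scores.argmax(dim=0)

# Toy usage: a 3-token universal trigger refined over a small batch.
model = TinyEncoder()
trigger = torch.randint(0, 1000, (3,))
batch = torch.randint(0, 1000, (8, 12))
for _ in range(10):
    trigger = luat_step(model, trigger, batch, layer_idx=1)
```

The `layer_idx` argument corresponds to the choice of perturbed layer discussed in the abstract, which the authors select using small subsets of datasets similar to the target task.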
