Source
NAACL
Publication date
24.04.2025
Authors
Maksim Savkin, Timur Ionov, Vasily Konovalov

SPY: Enhancing Privacy with Synthetic PII Detection Dataset

Abstract

We introduce the **SPY Dataset**, a novel synthetic dataset for the task of **Personally Identifiable Information (PII) detection**, underscoring the significance of protecting PII in modern data processing. Our research leverages Large Language Models (LLMs) to generate a dataset that emulates real-world PII scenarios. Through evaluation, we validate the dataset’s quality and provide a benchmark for PII detection. Comparative analyses reveal that while PII detection and Named Entity Recognition (NER) share similarities, **dedicated NER models exhibit limitations** when applied to PII-specific contexts. This work contributes to the field by making the generation methodology and the generated dataset publicly available, thereby enabling further research and development.
