Source
ICML NFAM
Publication date
16.05.2025
Authors
Никита Курдюков, Антон Разжигаев

Hebbian Sparse Autoencoder

Abstract

We establish a theoretical and empirical connection between Hebbian Winner-Take-All (WTA) learning with anti-Hebbian updates and tied-weight sparse autoencoders (SAEs), offering a framework to explain the high selectivity of neurons to patterns induced by biologically inspired learning rules. By training an SAE on token embeddings of a small language model using a gradient-free Hebbian WTA rule with competitive anti-Hebbian plasticity, we demonstrate that such methods implicitly optimize SAE objectives. However, they underperform backpropagation-trained SAEs in reconstruction due to gradient approximations. Hebbian updates approximate reconstruction error (MSE) minimization under tied weights, while anti-Hebbian updates enforce sparsity and feature orthogonality, akin to the explicit L1/L2 penalties in standard SAEs. This alignment with the superposition hypothesis (Elhage et al., 2022) reveals how Hebbian rules disentangle features in overcomplete latent spaces, marking the first application of Hebbian learning to SAEs for language model interpretability.
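The following is a minimal sketch of the kind of update the abstract describes: a gradient-free Hebbian WTA step on a tied-weight SAE, with an anti-Hebbian decorrelation term between co-active units. The top-k winner selection, learning rates, normalization, and exact update forms are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch: Hebbian WTA + anti-Hebbian updates for a tied-weight SAE.
# All hyperparameters and update details below are assumptions, not the paper's.

rng = np.random.default_rng(0)

d_model, d_latent = 64, 256          # embedding dim, overcomplete latent dim
k = 8                                # number of WTA winners per input
lr_hebb, lr_anti = 1e-2, 1e-3        # Hebbian / anti-Hebbian learning rates

W = rng.normal(scale=0.1, size=(d_latent, d_model))  # tied encoder/decoder weights

def step(x):
    """One gradient-free update on a single embedding vector x (shape: d_model)."""
    global W
    a = W @ x                                # encoder pre-activations (tied weights)
    winners = np.argsort(a)[-k:]             # top-k winner-take-all selection
    z = np.zeros(d_latent)
    z[winners] = a[winners]                  # sparse latent code
    x_hat = W.T @ z                          # reconstruction with the tied decoder

    # Hebbian update for winners: move their weights toward the residual,
    # which locally reduces reconstruction error (MSE) under tied weights.
    W[winners] += lr_hebb * np.outer(z[winners], x - x_hat)

    # Anti-Hebbian update: decorrelate co-active winners so features stay
    # near-orthogonal, playing the role of an explicit sparsity/orthogonality penalty.
    for i in winners:
        for j in winners:
            if i != j:
                W[i] -= lr_anti * (W[i] @ W[j]) * W[j]

    # Keep winner rows on the unit sphere so the competition stays well-scaled.
    W[winners] /= np.linalg.norm(W[winners], axis=1, keepdims=True)
    return np.mean((x - x_hat) ** 2)

# Toy run on random stand-ins for token embeddings.
for x in rng.normal(size=(1000, d_model)):
    mse = step(x)
```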


