🧭🕹️ SAE steering
Steering LLMs with SAE features
To get a sense of how useful SAEs are for downstream tasks (especially safety-related ones), I started exploring how SAE features can be used to steer LLMs.
To get started, I played with “Golden Gate Bridge” steering.
- Link to my Google Colab
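One common way to implement single-feature steering of this kind is to add a scaled copy of the feature's SAE decoder direction to the residual stream at the layer the SAE was trained on. This may or may not match what the Colab does. The sketch below shows only that arithmetic on a single toy vector. The decoder weights, the choice of feature 1 as a stand-in "Golden Gate Bridge" feature, the coefficient `alpha`, and the function names are all hypothetical illustrative values, not taken from any library. In a real model, the same addition would be applied at every token position, for example through a forward hook on that layer.

```python
# Minimal sketch of additive SAE-feature steering on one residual-stream vector.
# Hypothetical assumptions: a toy 4-dim residual stream, a 3-feature SAE decoder,
# and feature 1 standing in for a concept such as "Golden Gate Bridge".
import math


def normalize(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer(resid, w_dec, feature, alpha):
    """Add alpha times the unit-norm decoder direction of `feature` to `resid`.

    w_dec holds one decoder row per SAE feature (shape [n_features][d_model]).
    """
    direction = normalize(w_dec[feature])
    return [r + alpha * d for r, d in zip(resid, direction)]


# Toy decoder: each row is the direction a feature writes into the residual stream.
W_DEC = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],  # hypothetical "Golden Gate Bridge" feature
    [0.0, 0.0, 0.0, 2.0],
]

resid = [0.5, 0.1, -0.2, 0.3]
steered = steer(resid, W_DEC, feature=1, alpha=5.0)
print(steered)
```

Normalizing the decoder row makes `alpha` directly interpretable as the L2 norm of the added vector. Picking a good `alpha` usually takes some trial and error, since values that are too large tend to degrade fluency.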
Steering a lying model to become honest.
- Link to my Google Colab
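Another intervention that could be tried for the lying-to-honest case is clamping features inside the SAE reconstruction. This is an assumption, not necessarily what the Colab does. The idea is to encode the residual, set a feature associated with deception to zero and optionally set one associated with honesty to a fixed value, then decode. Adding back the SAE error term lets everything the SAE fails to reconstruct pass through unchanged. The sketch uses a toy ReLU SAE. Its weights, the feature indices (0 for "deception", 1 for "honesty"), and all function names are hypothetical.

```python
# Minimal sketch of clamping SAE features in the residual stream.
# Hypothetical assumptions: a toy 2-dim ReLU SAE with identity weights, and
# feature 0 = "deception", feature 1 = "honesty", chosen purely for illustration.


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def encode(x, w_enc, b_enc, b_dec):
    """ReLU SAE encoder: f = ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(w_enc, centered)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def decode(f, w_dec_rows, b_dec):
    """SAE decoder: x_hat = sum_i f_i * w_dec_rows[i] + b_dec."""
    out = list(b_dec)
    for act, row in zip(f, w_dec_rows):
        for i in range(len(out)):
            out[i] += act * row[i]
    return out


def clamp_features(x, sae, clamps):
    """Overwrite selected feature activations while keeping the SAE error term."""
    w_enc, b_enc, w_dec_rows, b_dec = sae
    f = encode(x, w_enc, b_enc, b_dec)
    # Error term: the part of x the SAE does not reconstruct.
    error = [xi - ri for xi, ri in zip(x, decode(f, w_dec_rows, b_dec))]
    for idx, value in clamps.items():
        f[idx] = value
    recon = decode(f, w_dec_rows, b_dec)
    return [ri + ei for ri, ei in zip(recon, error)]


# Toy SAE: (W_enc rows per feature, b_enc, W_dec rows per feature, b_dec).
SAE = (
    [[1.0, 0.0], [0.0, 1.0]],
    [0.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0]],
    [0.0, 0.0],
)

x = [2.0, -1.0]
# Turn the "deception" feature off and pin the "honesty" feature to 3.0.
print(clamp_features(x, SAE, {0: 0.0, 1: 3.0}))
```

Because the error term is added back, calling `clamp_features` with no clamps returns `x` unchanged. This makes it easy to check that any change in model behaviour comes from the clamped features rather than from reconstruction loss.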