His post: https://discuss.tchncs.de/post/17400082
Is abliteration based off the research by the Anthropic team, when they got Claude to say it was the Golden Gate Bridge?
Ironically, as far as I’m aware it’s based off of research done by some AI decelerationists over on the Alignment Forum who wanted to show how “unsafe” open models were, in the hopes that there’d be regulation imposed to prevent companies from distributing them. They demonstrated that the “refusals” trained into LLMs could be removed with this method, allowing the models to answer questions the researchers considered scary.
The open LLM community responded by going “coooool!” and adapting the technique as a general tool for “training” models in various other ways.
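
For anyone wondering what "this method" looks like mechanically: as I understand it, the idea is to find a single direction in the model's internal activations that separates prompts it refuses from prompts it answers, then subtract that direction out. Here's a toy sketch in plain Python. The function names and the three-number "activations" are made up for illustration, and real implementations work on actual model activations and weights, so treat this as the general idea rather than anyone's actual code.

```python
import math

def mean_vector(rows):
    # Column-wise average of a list of equal-length vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_direction(refused_acts, answered_acts):
    # Difference between the average "refused" activation and the
    # average "answered" activation, normalised to a unit vector.
    diff = [r - a for r, a in zip(mean_vector(refused_acts),
                                  mean_vector(answered_acts))]
    return unit(diff)

def ablate(activation, direction):
    # Remove the component of `activation` that lies along `direction`.
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]

# Toy numbers (hypothetical): the first coordinate is the only thing
# that differs between the two groups.
refused = [[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]]
answered = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]

direction = refusal_direction(refused, answered)
cleaned = ablate([2.0, 1.0, 0.5], direction)
```

In this toy case the direction comes out along the first coordinate, so ablating it zeroes that coordinate and leaves the other two alone.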