7 comments

  • zozbot234 19 minutes ago
    Anthropic has released open-weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B), into natural language text. https://github.com/kitft/natural_language_autoencoders This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open-weights community!
  • Tossrock 19 minutes ago
    Anthropic Research is going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move: very values-aligned, and it improves the overall AI safety ecosystem.
  • NitpickLawyer 16 minutes ago
    > We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

    Whatever they did on Llama didn't work. Nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or the method isn't working, but what the autoencoder outputs is nothing like their examples with Claude. Gemma is similarly bad.

  • tjohnell 42 minutes ago
    It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.
    • rotcev 16 minutes ago
      This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …” The question is whether the model will be able to internalize this in a way that is undetectable to the aforementioned technique.
  • visarga 41 minutes ago
    Beautiful idea: an autoencoder must represent everything without hiding anything if it is to recover the original data closely. So it trains a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
  • firemelt 34 minutes ago
    Finally, something interesting. But this only makes me think that the final judgment is still in human hands: someone has to judge whether Claude's inner thoughts are correct or not.

    I mean, who knows if those are really Claude's thoughts, or if Claude just thinks those are his thoughts because humans want it?