10 sats \ 0 replies \ @mudbloodvonfrei 31 May 2023 \ on: A Mechanistic Interpretability Analysis of Grokking tech
It's an interesting article. I didn't understand most of it, but still. It seems that the author believes interpretability can lead to better alignment of superintelligent AIs, but my question is this: if you can interpret the behavior of a system, wouldn't that mean the system is not superintelligent compared to humans? We have many experts who try to interpret human behavior, or to manipulate it (i.e. align it) toward certain goals, yet we still don't know all that much about the human brain, and people can't always interpret their own behavior, much less someone else's.