Description of subprojects and results, including major changes from the original proposal
During my time as an independent researcher working on decision transformer interpretability, I published several articles on the topic on LessWrong (https://www.lesswrong.com/users/joseph-bloom) and developed a Python library for training / analysing decision transformers (https://github.com/jbloomAus/DecisionTransformerInterpretability).
At the end of 2023, I pivoted to working on Sparse Autoencoders, completed the MATS program under Neel Nanda, and wrote the library SAE Lens, around which I am now running a non-profit AI Safety Research Infrastructure organisation called Decode Research.
Why pivot to SAEs: SAEs represent a massive jump forward in our interpretability methods, but they have created many engineering challenges. While decision transformer work could still be very fruitful, I'm sure that my contributions in the SAE space have been of higher value.
Sparse Autoencoders which I have trained have been used in numerous investigations, and Decode Research recently partnered with DeepMind on Gemma Scope, which has given the public access to high-quality sparse autoencoders for models capable of exhibiting interesting behaviors.
More detail: Manifund funding covered my activities from October 2023 to April 2024, meaning these funds supported me for the following publications / outputs:
Features and Adversaries in Memory DT: In this project I identified latent representations of goals in a grid world decision transformer and used them to identify adversarial inputs and equivalent latent interventions. While this work was not super popular on LessWrong (I think it was niche / kind of long), the results were quite interesting and non-trivial.
Linear Encoding of Character Level Information in GPT-J token embeddings: This collaboration with Matthew Watkins was the first (to my knowledge) to show that character information is linearly probeable in the embeddings of language models. Graphemic information is an incredibly good case study for mechanistic interpretability, and I am currently following up on it with a team of 4 LASR Scholars (results coming soon).
Open Source Sparse Autoencoder for all residual stream layers of GPT2-Small: The SAEs posted here were the first set to be released this comprehensively and with feature dashboards. They have been used for numerous follow-up investigations, including not just LessWrong posts but also papers from labs like the Tegmark Lab at MIT.
Understanding SAE Features with the Logit Lens: This project demonstrated a number of interesting findings about SAE features and proposed novel statistical methods for cheaply characterising them. While I haven't yet had time to extend this work, I will likely be doing so sometime in the next year and anticipate the techniques will complement automatic interpretability of SAE Features nicely.
Announcing Neuronpedia: A platform for accelerating research in Sparse Autoencoders: Prior to founding Decode Research, I assisted Johnny in sourcing funding specifically for Neuronpedia (for 1 year), and we "re-launched" Neuronpedia as a platform for SAEs (rather than MLP neurons). At the time, I was planning to advise Johnny part time while doing my other research. We eventually decided to form an organisation around Neuronpedia / SAE Lens.
SAE Lens: A library for Training and Analysing Sparse Autoencoders: The backbone of Decode Research and a popular open source library in its own right, SAE Lens enables researchers to easily download and analyse a large library of SAEs. The library is still a bit rough around the edges (as it must balance performance, a large variety of supported methods / architectures, and accessibility) but I expect we will continue to improve it in the future.
Spending breakdown
I spent funds according to the budget provided with only minor variations due to stochasticity.