Mejdi Sadriu
José Wheeler
Identifying and auditing reasoning circuits in LLMs within Algoverse 2026 using Sparse Autoencoders (SAEs).
Matthew Farr
Probing possible limitations and assumptions of interpretability | Articulating evasive risk phenomena arising from adaptive and self-modifying AI
Aditya Raj
Current LLM safety methods treat harmful knowledge as removable chunks. This is controlling a model, and it does not work.
Lucy Farnik
6-month salary for interpretability research focusing on probing for goals and "agency" inside large language models