This page describes what I'm currently up to.
- I recently started on the dangerous capability evaluations team at OpenAI.
- Started a new project, dtchess, as part of SERI MATS under the supervision of Evan Hubinger. dtchess is a library I'm writing to train and open-source language models fine-tuned to play chess. The goal is to do mechanistic interpretability on these models and detect interesting properties in the chess setting, such as internal search or optimisation. For additional context, see the post on auditing games for high-level interpretability on the Alignment Forum.
- I moved to the SF Bay Area to participate in two research/engineering programmes: the Machine Learning for Alignment Theory Scholarship from the Stanford Existential Risk Initiative, and the ML for Alignment Bootcamp organised by Redwood Research.