Existing research on mechanistic interpretability usually aims to develop an informal human understanding of “how a model works”, which makes research results hard to evaluate and raises concerns about scalability. Meanwhile, formal proofs of model properties seem far out of reach, both in theory and in practice. In this talk I’ll discuss an alternative strategy for “explaining” a particular behaviour of a given neural network. This notion of explanation is much weaker than a proof that the network exhibits the behaviour, but it may still provide similar safety benefits. The talk primarily motivates a research direction and a set of theoretical questions rather than presenting results.
This video was produced by the Sydney Mathematical Research Institute as part of its Mathematical Challenges in AI seminar series.
