Journal Club: July 10 2024

Published July 10, 2024


This is a collection of writing that I have enjoyed this week. All the headings are clickable links to the original pieces.

Approaching Human Level Forecasting with LLMs

A few others at non-trivial and I referenced this paper for the Metaculus forecasting contest. Really fun to implement, an easy one-day build, and with Claude plus prompting, the fine-tuning step is mostly unnecessary. I don't have a Brier score for our model yet.
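
For context, the Brier score is just the mean squared error between your forecast probabilities and the binary outcomes, so lower is better. A minimal sketch (the function is mine, not from the paper):

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    0.0 is a perfect forecaster; always guessing 50% scores 0.25."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

# e.g. three resolved questions, forecast at 80%, 30%, and 60%
print(brier_score([0.8, 0.3, 0.6], [1, 0, 1]))  # ~0.0967
```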

Refusal in LLMs is mediated by a single direction

Just how weak are LLM protections? We've known for a while that they can be fine-tuned away, but it turns out that refusal of harmful instructions is mediated by a single, easily interpretable direction in activation space. I know this shows that our current safety techniques are weak, but I'm not sure what it says about how LLMs learn when to refuse instructions.
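
The intervention itself is tiny: once you have the refusal direction, you just project it out of the residual stream. A rough sketch of that ablation step (finding the direction, via a difference of mean activations on harmful vs. harmless prompts, is where the real work lives):

```python
import torch

def ablate_direction(activations, direction):
    """Remove the component of each activation along `direction`,
    i.e. x <- x - (x . r_hat) * r_hat for unit vector r_hat."""
    r_hat = direction / direction.norm()
    return activations - (activations @ r_hat).unsqueeze(-1) * r_hat

# placeholder tensors; `refusal_dir` would come from the
# harmful-minus-harmless difference of means in the paper
acts = torch.randn(4, 4096)      # (batch, d_model) residual stream
refusal_dir = torch.randn(4096)
cleaned = ablate_direction(acts, refusal_dir)
```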

Extrinsic Hallucinations in LLMs

It's a Lilian Weng article! Do I need to say more? Great read, but rather long.

Differentiable Sparse Solvers

Cool read for a scientific computing project I'm working on. Fun blog in general.

ReFT: Representation Finetuning for Language Models

A more extreme cousin of PEFT: instead of updating weights, it learns low-rank interventions on hidden representations, which can shrink the number of learned parameters significantly.
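
The core intervention (LoReFT in the paper) edits hidden states in a rank-r subspace, roughly phi(h) = h + R^T (W h + b - R h). A minimal sketch, ignoring the orthonormality constraint on R that the paper enforces:

```python
import torch
import torch.nn as nn

class LoReFT(nn.Module):
    """Low-rank representation intervention:
    phi(h) = h + R^T (W h + b - R h)."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.randn(rank, d_model) / d_model**0.5)
        self.proj = nn.Linear(d_model, rank)  # W h + b

    def forward(self, h):                    # h: (..., d_model)
        delta = self.proj(h) - h @ self.R.T  # (..., rank)
        return h + delta @ self.R            # back to (..., d_model)

# rank-4 intervention on a 4096-dim stream: ~33k trained parameters,
# versus the full model's billions
reft = LoReFT(d_model=4096, rank=4)
h = torch.randn(2, 10, 4096)
h_edited = reft(h)
```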

JAX FDM: A differentiable solver for inverse form-finding

Same scientific computing project, but this one comes pre-implemented. Very cool results using the solver!

200 Concrete Open Problems in Mech Interp

Nanda's great introductory blog. I come back to this really often; there's so much low-hanging fruit here.

Fine-tuning is not sufficient for capability elicitation

Large language models cannot convert between two complex representations of data. However, I have seen papers showing that they can transform small chunks of data into more favorable representations. This seems like a good topic to investigate, namely: is there a set of features and neurons that drives these transformations, and can you merge in more of those features to get more representation transformation?

Understanding Addition in Transformers

One of the problems from the aforementioned Nanda blog, where they explore phase changes in addition in transformers. Cool read, though I think there's still more to be milked here.

Interpreting and Steering Features in Images

SAEs on images! There's a gorilla feature here, so you can make images more or less gorilla-y.
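
Steering works the way you'd hope: encode an activation into sparse features, nudge the feature you care about, and decode. A hedged sketch (the feature index, shapes, and the linear encoder/decoder here are all placeholders, not the post's actual setup):

```python
import torch

def steer_feature(x, encoder, decoder, feature_idx, alpha):
    """Boost (alpha > 0) or suppress (alpha < 0) a single SAE
    feature in activation x, then reconstruct."""
    z = torch.relu(encoder(x))    # sparse feature activations
    z[..., feature_idx] += alpha  # nudge the chosen feature
    return decoder(z)

# hypothetical shapes: 768-dim image embedding, 16k SAE features
encoder = torch.nn.Linear(768, 16384)
decoder = torch.nn.Linear(16384, 768)
x = torch.randn(1, 768)
more_gorilla = steer_feature(x, encoder, decoder, feature_idx=1234, alpha=5.0)
```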

Contra: Bottleneck T5 Text Autoencoder.ipynb

This was my first introduction to autoencoders months ago, and I still use it to experiment.