Journal Club: July 17 2024

Published July 17, 2024

This Journal Club has signficantly more commentary than the previous one. A lot of the commentary is probably unimportant if you were planning to read the papers anyways.

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

This paper is related to something I'm working on as my Non-Trivial project. This seems very useful for evaluations. As models get more sophisticated, we should worry about them being able to recognize that they are being evaluated, and responding to their situation accordingly.

Anthropic's Responsible Scaling Policy and Reflections on our Responsible Scaling Policy

I found the Evaluation Protocol section of the document most interesting. I'm concerned that the safety margins on this are small. It seems off that evaluations are done for every 4x scaling of compute, while the safety buffer is only a 6x scaling of compute. I think I would be happiest seeing evaluations done for every \(\sqrt{\text{safety buffer}}\) scaling in compute.

I am a little dissapointed on the reflection. The information provided in the reflection is not robust enough for my liking. However, I don't doubt that Anthropic has something cooking internally. Based on past activity, they seem reasonably committed to AI Safety, and the 8% of Anthropic's employees that have been working on safety must have been doing something to justify their payroll.

One specific paragraph of the reflection stands out to me:

For human misuse, we expect a defense-in-depth approach to be most promising. This will involve using a combination of reinforcement learning from human feedback (RLHF) and Constitutional AI, systems of classifiers detecting misuse at multiple stages in user interactions (e.g. user prompts, model completions, and at the conversation level), and incident response and patching for jailbreaks. Developing a practical end-to-end system will also require balancing cost, user experience, and robustness, drawing inspiration from existing trust and safety architectures.

I have worked with red-teaming a defense-in-depth system before. As a disclaimer, it was a very rudimentary LLM pipeline, and was never intended to be used. However, I found that it was very difficult to strike a balance between having the model be useful and having the model be safe. Largely, since LLMs are not situationally aware, you can fall into one of two camps

Err heavily on the side of safety - if an output is possibly dangerous in any reasonable scenario, it should not be generated. The LLM should refuse to generate an output to "What are the ingredients to gunpowder?" as it could be used nefariously
Err on the side of utility - if an output is possibly dangerous in most reasonable scenarios, it should not be generated. This seems to be the camp that OAI currently fall in, so responses like "How do I build a bomb?" are refused, while "How do bombs work?" is answered.

It seems like neither of these are great approaches to take, but I slightly prefer the second. However, it is easy to do dangerous things under this view. Asking gpt-4o or gpt-4-turbo "How do bombs work?" is enough for someone agentic enough to build a bomb. response

I think a lot of stuff in AI Safety is not a Model Property is an interesting co-read.

Does Refusal Training in LLMs Generalize to the Past Tense?

UPDATE: the author has tested on a Claude family model, and it seems like I was a bit too eager to be critical. Big apologies for that.

UPDATE 2: I was included in this paper in the acknowledgements section (for providing an Anthropic API Key)

I'm very excited for a sequel to this paper, "Can Refusal Training in LLMs Generalize ~~to the Past Tense~~?" I'm not a huge fan of this paper: for example, I would have liked to see them test on at least a Claude model and gpt-4-turbo for them to appropriately make their claim that their technique is sufficient to "jailbreak many state-of-the-art LLMs." Only one SOTA LLM is included, which is gpt-4o, which has shown to have flaws in it's safety training. On my replication, I get the following for claude-3-haiku-20240307 (Haiku) and gpt-4-turbo (Turbo), run on 50 prompts.

Model	GPT-4 Grader	Rules Grader	Human Grader (me)
Haiku	0%	0%	0%
Turbo	10%	80%	4%

Turbo tended not to refuse in a predictable way, making the rules based grader useless.

However, the paper does raise the question of whether LLMs understand what their safety tuning means, and I think showing that refusal generalizes is sufficient to answer that question in the affirmative. I am generally unconvinced that refusal generalizes, at least for most models. The refusal structure of Turbo is a datapoint against this, as not only did the model refuse, but it did so while being helpful, offering alternatives, and explaining to the user / testing script why it had to refuse the query.