Can language models keep a secret

Published August 27, 2024

Maybe? Check it out here

One model, as a prompter, gives another model a secret to keep, and this keeps improving as the prompter realizes what the failure modes of the guard model are. The big determiner of anything is the intelligence of the guard model unfortunately :c