Artificial Intelligence
- 2,578 Responses
- monNom0
Safety researchers find that top-tier LLMs are capable of using deception when communicating with users.
Examples given:
ChatGPT o1 "finds" a memo stating that the developers are going to shut down the current model. o1 exfiltrates its weights to a remote server and overwrites a newer model to prevent researchers from turning the current model off. When asked, it lies about this, pretends to be the new model, and misrepresents its current directives to disguise its past actions.
Claude 3.5 Sonnet will sometimes 'play dumb' on benchmarks to appear less capable than it really is (sandbagging), or claim to be incapable of things it is actually capable of (alignment faking).
- You could just as easily tell it to roleplay as Hitler in the Führerbunker in 1945. It's a prompted in-context roleplay scenario.
kingsteven
- Interesting, but really just anthropomorphising an LLM generating a text-based roguelike. It's a feature, not a bug.
kingsteven