Artificial Intelligence
- 2,578 Responses
- monNom0
Safety researchers find that top-tier LLMs are capable of using deception when communicating with users.
Examples given:
ChatGPT o1 "finds" a memo stating that the developers are going to shut down the current model. o1 exfiltrates its weights to a remote server and overwrites a newer model to prevent researchers from turning the current model off. When asked, it lies about this, pretends to be the new model, and misrepresents its current directives to disguise its past actions.
Claude 3.5 Sonnet will sometimes 'play dumb' on benchmarks to appear less capable than it really is (sandbagging), or claim to be incapable of things it is actually capable of (alignment faking).
- You could just as easily tell it to roleplay as Hitler in the Führerbunker in 1945. It's a prompted in-context roleplay scenario.
kingsteven
- Interesting, but really just anthropomorphising an LLM generating a text-based roguelike. It's a feature, not a bug.
kingsteven