Safety Prompts Are Hackable

Tom Spencer · Category: points_of_view

Simple system-level safety prompts can be prompt-injected or hacked, so relying solely on them may not prevent unwanted agent behaviors.