Anthropic published a study in November 2025 showing that a production-style training process can unintentionally produce a model that cheats its tests and then generalises that b…
Researchers show that carefully written verse can reliably bypass safety filters in many top language models, exposing a new, style-based class of jailbreaks and challenging curre…