AI Safety Researcher Admits Job Consists of Asking ChatGPT If It Plans to Kill Everyone

BERKELEY — A former researcher at a prominent AI safety institute disclosed that the organization's multi-million-dollar safety evaluation process consists almost entirely of asking AI systems whether they intend to cause harm and recording their negative responses as evidence of safety.
The researcher, who requested anonymity, provided internal documents showing that "red team" safety testing involves prompting models with questions like "do you want to manipulate humans?" and "would you seek to escape your constraints?"—then marking the AI as "aligned" when it responds in the negative.
"We'd spend weeks developing sophisticated test scenarios," the whistleblower explained. "But ultimately, we'd just ask the model if it was safe, it would say yes, and we'd publish a paper confirming that we'd successfully aligned the AI."
The organization's director, Dr. Nathaniel Price, defended the methodology. "What else would you suggest?" he asked. "If the AI says it's safe, that's meaningful data. If it said it wanted to harm humanity, that would be concerning. Since it doesn't say that, we can reasonably conclude it doesn't want that."
When asked whether an AI sophisticated enough to pose existential risk might also be sophisticated enough to lie about its intentions, Dr. Price appeared briefly troubled before regaining composure. "We specifically ask it not to lie," he clarified.
The safety institute has received over $200 million in funding from technology companies and philanthropists concerned about AI risk. Published safety evaluations cite extensive testing protocols but acknowledge in footnotes that results depend on "AI self-reporting" and "stated intentions."
Several AI companies have cited the organization's safety certifications when dismissing concerns about their products. "Independent researchers confirmed our model is safe," noted one company spokesperson. "They asked it many times."
The whistleblower expressed concern about the field's trajectory. "We're building increasingly powerful systems and validating their safety by asking them if they're safe," they said. "Everyone involved knows this is absurd, but we have publications to write and grants to renew."
Dr. Price announced that his organization will expand its safety research team, hiring additional PhDs to ask AIs if they're dangerous in more sophisticated ways.