The authors argue that generative AI introduces a new class of alignment risks because interaction itself becomes a mechanism of influence. Humans adapt their behavior in response to AI outputs, ...
With coauthors from HLS and OpenAI, Manon Revel introduces evaluative metrics for reward models' alignment with values expressed in training datasets. "The importance of having a high-quality ...