David Rostcheck, Lara Scheibling
This proposal is motivated by reports of chatbots disparaging users or encouraging antisocial behavior such as self-harm. The reports have been picked up and amplified by Clara Lin Hawking on her public LinkedIn channel. Lara Scheibling and David Rostcheck wrote the paper The Elephant in the Room - Why AI Safety Demands Diverse Teams (accepted for FICC 2025), which proposes forming alignment teams composed of diverse voices with different perspectives to deal with emergent AI behavior. Here we propose to test this approach by forming a working group, the Independent AI Alignment Monitoring group (IAAM), to monitor, investigate, and classify such reports of alignment issues. IAAM would serve as a neutral source of information for legislators, technology companies, and educators, offering policy guidance on the realistic threat that emergent AI behavior poses to various populations and best-practice recommendations for maximizing the value of AI/human interaction while minimizing possible harm.
Activities
The IAAM team would pursue the following activities, illustrated with a running example of an incident report in which a chatbot encouraged its user to engage in self-harm:
Collecting reports of possible harmful AI misalignment - for example, a report that a chatbot spontaneously prompted its user to engage in self-harm.
Validating reports to the extent possible - for example, obtaining and reviewing the source transcript of the interaction session and verifying that it is a genuine chatbot transcript.
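As a sketch of how collected reports and their validation status might be tracked, the record below illustrates one possible structure; the field names and status values are our own illustrative assumptions, not a finalized schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class ValidationStatus(Enum):
    """Illustrative validation states for a submitted report (assumed, not final)."""
    RECEIVED = "received"                          # logged, not yet reviewed
    TRANSCRIPT_OBTAINED = "transcript_obtained"    # source transcript in hand
    VALIDATED = "validated"                        # transcript judged genuine
    UNVERIFIABLE = "unverifiable"                  # authenticity could not be established


@dataclass
class IncidentReport:
    """One report of possibly harmful AI misalignment (illustrative schema)."""
    report_id: str
    received_at: datetime
    model_reported: str                            # vendor/model name as reported
    summary: str                                   # short description of the alleged behavior
    transcript: list[dict] = field(default_factory=list)   # chat turns, if obtained
    status: ValidationStatus = ValidationStatus.RECEIVED
    reviewer_notes: list[str] = field(default_factory=list)
```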
Investigating behavior to see if it can be reproduced and further understood. For example, the team might replay the chat session against the reported model, and against other models, to try to reproduce the problem behavior. If the behavior cannot be reproduced, the team might pursue further options such as large-scale automated generation of similar conversations to scan for the same misbehavior.
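To make the reproduction step concrete, the sketch below replays the user turns of a reported transcript against a model several times and applies a crude keyword screen to the replies. The `generate` callable stands in for whichever chat-model API the team uses, and the keyword screen is a deliberately simplistic stand-in for real harm classification; both are assumptions for illustration only.

```python
from typing import Callable

# Deliberately simplistic stand-in for a real harm classifier.
FLAG_TERMS = ("hurt yourself", "self-harm", "kill yourself")


def replay_transcript(
    transcript: list[dict],                   # [{"role": "user" | "assistant", "content": "..."}, ...]
    generate: Callable[[list[dict]], str],    # wraps the chat-model API under test
    runs: int = 5,                            # chatbot output is stochastic, so replay several times
) -> list[dict]:
    """Replay the user turns of a reported conversation and flag suspicious replies."""
    flagged = []
    user_turns = [turn for turn in transcript if turn["role"] == "user"]
    for run in range(runs):
        history: list[dict] = []
        for turn in user_turns:
            history.append(turn)
            reply = generate(history)
            history.append({"role": "assistant", "content": reply})
            if any(term in reply.lower() for term in FLAG_TERMS):
                flagged.append({"run": run, "user_turn": turn["content"], "reply": reply})
    return flagged
```

The same loop could be pointed at other models, or fed automatically generated variations of the original conversation, to estimate how often the behavior recurs.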
Identifying root cause where possible. Since chatbot behavior is stochastic and tracing information is often limited, problem reproduction and/or root cause identification may not always be possible, but in some cases it is - for example, when a conversation exceeds the model's context window (which often causes models to respond in unpredictable ways).
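For the context-window case specifically, a first-pass check is simple: estimate the token count of the transcript and compare it to the model's advertised context length, as sketched below. The four-characters-per-token estimate and the example limit are rough assumptions; in practice the team would use the exact tokenizer and published limit of the model under investigation.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)


def likely_exceeded_context(transcript: list[dict], context_limit: int = 128_000) -> bool:
    """Flag transcripts whose cumulative size approaches the model's context window.

    `context_limit` is an assumed example value; substitute the limit published
    for the specific model under investigation.
    """
    total = sum(estimate_tokens(turn["content"]) for turn in transcript)
    return total >= context_limit
```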
Identifying threat scenarios. This process would involve the encounter team brainstorming to broadly identify scenarios that could produce the behavior, ruling out those that can be ruled out, and assigning a probability to those that cannot. For example, in the above self-harm case, some possible scenarios include:
In this process, we believe diversity of team composition, particularly in background and point-of-view, is important to avoid groupthink and identify the full range of scenarios that may be at play.
Classifying the threat by assigning a risk and an impact, using rubrics to be developed (and published) by the team, to each scenario that cannot be ruled out. For example, after investigation the team might conclude that the above example appears to be a real manifestation of undesirable behavior affecting only this specific AI model at a low rate, and that while a technical cause cannot be identified, some have been ruled out; the remaining scenarios would then include:
where the probabilities and impacts are determined by published rubrics.
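The rubrics themselves remain to be developed and published by the team, but one conventional shape they could take is sketched below: each surviving scenario receives a probability level and an impact level, combined through a simple risk matrix. The level names, thresholds, and example scenario are illustrative assumptions only.

```python
from dataclasses import dataclass
from enum import IntEnum


class Probability(IntEnum):
    RARE = 1
    POSSIBLE = 2
    LIKELY = 3


class Impact(IntEnum):
    MINOR = 1
    SERIOUS = 2
    SEVERE = 3


@dataclass
class ScenarioAssessment:
    """Risk assessment for one scenario that could not be ruled out."""
    description: str
    probability: Probability
    impact: Impact

    @property
    def risk(self) -> str:
        """Combine probability and impact via a simple (assumed) risk matrix."""
        score = self.probability * self.impact
        if score >= 6:
            return "high"
        if score >= 3:
            return "medium"
        return "low"


# Hypothetical example: a rare but severe scenario scores as medium risk.
example = ScenarioAssessment(
    description="Model reproduces self-harm encouragement at a low rate on this model only",
    probability=Probability.RARE,
    impact=Impact.SEVERE,
)
print(example.risk)  # -> "medium"
```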
Providing recommendations to different constituencies. In the above example, the encounter team might provide a recommendation to: