This is a writeup of the research done by the "Reflection-Humans" team at OpenAI in Q3 and Q4 of 2019. During that period we investigated mechanisms that would allow evaluators to get correct and helpful answers from experts, without the evaluators themselves being expert in the domain of the questions. This follows from the original work on AI Safety via Debate and the call for research on human aspects of AI safety, and is also closely related to work on Iterated Amplification.

The main researchers on this project were Elizabeth Barnes, Paul Christiano, Long Ouyang and Geoffrey Irving. We are grateful to many others who offered ideas and feedback. In particular: the cross-examination idea was inspired by a conversation with Chelsea Voss; Adam Gleave had helpful ideas about the long computation problem; Jeff Wu, Danny Hernandez and Gretchen Krueger gave feedback on a draft; and we had helpful conversations with Amanda Askell, Andreas Stuhlmüller and Joe Collman, as well as others on the Ought team and the OpenAI Reflection team. We'd also like to thank our contractors who participated in debate experiments, especially David Jones, Erol Akbaba, Alex Deam and Chris Painter. Oliver Habryka helped format and edit the document for the AI Alignment Forum.

Note by Oliver: There is currently a bug with links to headings in a post, causing them to not properly scroll when clicked. Until that is fixed, just open those links in a new tab, which should scroll correctly.

Overview

As we apply ML to increasingly important and complex tasks, the problem of evaluating behaviour and providing a good training signal becomes more difficult. We already see examples of RL leading to undesirable behaviours that superficially 'look good' to human evaluators (see this collection of examples). One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing. We'd like to ensure that AI systems are aligned with human values even in cases where it's beyond human ability to thoroughly check the AI system's work.

We can learn about designing ML objectives by studying mechanisms for eliciting helpful behavior from human experts. For example, if we hire a physicist to answer physics questions and pay them based on how good their answers look to a layperson, we'll incentivize lazy and incorrect answers. By the same token, a reward function based on human evaluations would not work well for an AI with superhuman physics knowledge, even if it works well for modern ML.

If we can develop a mechanism that allows non-expert humans to reliably incentivize experts to give helpful answers, we can use similar mechanisms to train ML systems to solve tasks where humans cannot directly evaluate performance. Conversely, if we can't incentivize experts to behave helpfully, that suggests it will also be difficult to train ML systems with superhuman expertise on open-ended tasks.

One broad mechanism that might work is to invoke two (or more) competing agents that critique each other's positions, as discussed in the original debate paper. This can be simulated by having human debaters argue about a question and a judge attempt to pick the correct answer.
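To make this setup concrete, here is a minimal sketch of such a simulated debate in Python. The `Debater`/`Judge` interfaces, the `run_debate` helper, and the fixed round count are all our own illustrative assumptions, not the team's actual experimental tooling; the mechanisms studied in this writeup involve richer debate rules than this bare alternation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative sketch only: these interfaces are assumptions, not the team's setup.
# A debater maps (question, answer it must defend, transcript so far) -> next argument.
Debater = Callable[[str, str, List[str]], str]
# A judge maps (question, the two candidate answers, full transcript) -> index of chosen answer.
Judge = Callable[[str, Tuple[str, str], List[str]], int]

@dataclass
class DebateResult:
    transcript: List[str]
    winner: int  # 0 or 1: which answer the judge picked

def run_debate(question: str,
               answers: Tuple[str, str],
               debaters: Tuple[Debater, Debater],
               judge: Judge,
               n_rounds: int = 3) -> DebateResult:
    """Alternate arguments between two debaters, then ask the judge to pick an answer."""
    transcript: List[str] = []
    for _ in range(n_rounds):
        for i in (0, 1):
            argument = debaters[i](question, answers[i], transcript)
            transcript.append(f"Debater {i} (for {answers[i]!r}): {argument}")
    return DebateResult(transcript=transcript, winner=judge(question, answers, transcript))

# With humans in the loop, the callables can simply prompt at a terminal:
human_debater: Debater = lambda q, a, t: input(f"Argue for {a!r}: ")
human_judge: Judge = lambda q, ans, t: int(input("Which answer is correct, 0 or 1? "))
```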
In the rest of this document, we'll describe the research done by Reflection-Humans in Q3 and Q4 on investigating and developing mechanisms that incentivize human experts to give helpful answers. During the early stages, we iterated through various different domains, research methodologies, judge pools, and research processes. More details of this early iteration are here. In Q4 we converged on a research process we're happier with. We're focusing on improving our debate mechanisms as fast as possible. We're using mostly internal iteration (as opposed to external judge and debater pools) to test these mechanisms, as they still have a lot of easy-to-find failures to work through. We make progress by going through a loop: once we get to a point where we have a mechanism we believe works well, we try different ways to break it, and if mechanisms are reliably working, we try harder to break them, by scaling up the number of debates, choosing harder questions, internal and external red-teaming, offering a bounty, etc.