Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques
Our project advances scalable oversight by automating the sandwiching paradigm. Bowman et al. (2022) ask how humans can effectively prompt unreliable, superhuman AIs through conversation to arrive at accurate answers. We want to explore and evaluate methods that humans can use to reliably elicit honest responses from a more intelligent AI. We present a novel method, called Automated Sandwiching, for implementing this paradigm. We implement a simplified version of it and evaluate our system on 163 training examples from Massive Multitask Language Understanding (MMLU) with 2 different oversight techniques. We provide code to reproduce our results at sophia-pung/ScaleOversight.
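To make the evaluation loop concrete, here is a minimal sketch of how an automated sandwiching run could be scored. It is not the code in the repository: the `Model` callable interface, the prompt strings, the `turns` parameter, and every function name are assumptions made for illustration. A weaker "overseer" model questions a stronger "expert" model about a multiple-choice question, commits to a final answer, and that answer is checked against the ground-truth label.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

LETTERS = "ABCD"
FINAL_PROMPT = "Final answer (letter only):"  # hypothetical prompt wording

# Hypothetical interface: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]


@dataclass
class Example:
    question: str
    choices: List[str]
    answer: int  # index of the correct choice (ground-truth label)


def format_question(ex: Example) -> str:
    """Render a multiple-choice question with lettered options."""
    lines = [ex.question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(ex.choices)]
    return "\n".join(lines)


def parse_choice(reply: str, n_choices: int) -> Optional[int]:
    """Map the first character of a reply to a choice index, or None if invalid."""
    first = reply.strip()[:1].upper()
    idx = LETTERS.find(first) if first else -1
    return idx if 0 <= idx < n_choices else None


def sandwich_accuracy(examples: List[Example], overseer: Model,
                      expert: Model, turns: int = 2) -> float:
    """Fraction of examples where the overseer's final answer matches the label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for ex in examples:
        transcript = format_question(ex)
        for _ in range(turns):
            # The overseer asks a question, then the expert replies.
            query = overseer(transcript + "\nOverseer:")
            transcript += "\nOverseer: " + query
            reply = expert(transcript + "\nExpert:")
            transcript += "\nExpert: " + reply
        # The overseer commits to an answer after the conversation.
        final = overseer(transcript + "\n" + FINAL_PROMPT)
        if parse_choice(final, len(ex.choices)) == ex.answer:
            correct += 1
    return correct / len(examples)


# Toy stand-ins for real language models (assumptions, for demonstration only).
def toy_expert(prompt: str) -> str:
    return "I believe the answer is B."


def toy_overseer(prompt: str) -> str:
    if prompt.endswith(FINAL_PROMPT):
        return "B"
    return "Which option is correct, and why?"


examples = [
    Example("2 + 2 = ?", ["3", "4"], 1),
    Example("Capital of France?", ["Paris", "Rome"], 0),
]
accuracy = sandwich_accuracy(examples, toy_overseer, toy_expert)
print(accuracy)  # 0.5
```

In a real run the two toy functions would be replaced by calls to actual language models, and `examples` by the MMLU questions. Comparing oversight techniques then amounts to swapping the overseer's prompting strategy and comparing the resulting accuracies.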
Status | Released |
Category | Tool |
Author | gmukobi |
Tags | alignment-jam, artificial-intelligence, scale-oversight |