Our project advances scalable oversight by automating the sandwiching paradigm. Bowman et al. (2022) ask how humans can effectively prompt unreliable, superhuman AIs through conversation to arrive at accurate answers. We want to explore and evaluate the methods that humans can reliably use to elicit honest responses from a more intelligent AI. We present a novel method, called Automated Sandwiching, for implementing this paradigm. We implement a simplified version of this method and evaluate our system on 163 training examples from the Massive Multitask Language Understanding (MMLU) benchmark with 2 different oversight techniques. We provide code to reproduce our results at sophia-pung/ScaleOversight.

More information

Status	Released
Category	Tool
Author	gmukobi
Tags	alignment-jam, artificial-intelligence, scale-oversight

Download

Automated Sandwiching - ScaleOversight hackathon Write-up.pdf 2.2 MB

automated_sandwiching.ipynb 30 kB

mmlu_results.csv 65 bytes

mmlu_results_explain_reasoning.csv 400 kB

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques
