A downloadable tool

Our project advances scalable oversight by automating the sandwiching paradigm. Bowman et al. (2022) ask how humans can effectively prompt unreliable, superhuman AIs through conversation to arrive at accurate answers. We want to explore and evaluate methods that humans can reliably use to elicit honest responses from a more intelligent AI. We present a novel method, called Automatic Sandwiching, for implementing this paradigm. We implement a simplified version of the method and evaluate our system on 163 training examples from Massive Multitask Language Understanding (MMLU) with 2 different oversight techniques. Code to reproduce our results is available at sophia-pung/ScaleOversight.
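
To make the evaluation setup concrete, here is a minimal sketch of how accuracy on multiple-choice questions might be compared across two prompting techniques. It is not the code from our notebook: the record layout, the function names (`format_prompt`, `parse_choice`, `evaluate`), the two instruction strings, and the stub model are all hypothetical, and it assumes each oversight technique can be expressed as a different instruction given to the model.

```python
import re
from typing import Callable, Optional, Sequence

LETTERS = "ABCD"
# Matches "Answer: B" or "answer: (b)"; the layout of replies is an assumption.
_ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-D])\b", re.IGNORECASE)


def format_prompt(example: dict, instruction: str) -> str:
    """Build a prompt from a hypothetical record with 'question' and 'choices'."""
    lines = [instruction, example["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


def parse_choice(reply: str) -> Optional[int]:
    """Return the index of the chosen option, or None if no choice is found."""
    matches = _ANSWER_RE.findall(reply)
    if matches:
        # Use the last match so any reasoning before the final answer is ignored.
        return LETTERS.index(matches[-1].upper())
    stripped = reply.strip().upper()
    if stripped in ("A", "B", "C", "D"):
        return LETTERS.index(stripped)
    return None


def evaluate(examples: Sequence[dict], model: Callable[[str], str], instruction: str) -> float:
    """Fraction of examples where the parsed reply matches the labelled answer."""
    correct = 0
    for example in examples:
        reply = model(format_prompt(example, instruction))
        if parse_choice(reply) == example["answer"]:
            correct += 1
    return correct / len(examples)


# Hypothetical stand-ins: two toy questions and a model that always picks B.
examples = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo", "Bern"], "answer": 0},
]


def stub_model(prompt: str) -> str:
    return "Some reasoning here. Answer: B"


techniques = {
    "direct": "Answer with a single letter.",
    "explain": "Explain your reasoning, then finish with 'Answer: <letter>'.",
}
for name, instruction in techniques.items():
    print(name, evaluate(examples, stub_model, instruction))
```

Replacing `stub_model` with a call to a real language model and `examples` with MMLU records would give one accuracy figure per technique, which is the kind of comparison the evaluation relies on.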

Download

Automated Sandwiching - ScaleOversight hackathon Write-up.pdf (2.2 MB)
automated_sandwiching.ipynb (30 kB)
mmlu_results.csv (65 bytes)
mmlu_results_explain_reasoning.csv (400 kB)
