The problem
Safety evaluation datasets are how the field tells itself whether a model, a classifier, or a pipeline is working. But many of the datasets in widespread use are silent on the labeling policy that produced them and on the provenance of their content — leaving researchers to benchmark against a moving target, and platforms to ship interventions tuned to harms that don't match what they actually see.
What we're doing
This workstream takes a two-part approach: audit the foundations of existing datasets, and build a higher-fidelity alternative.
The audit examines the policy documentation, provenance, label definitions, and known limitations of the datasets practitioners actually use. For the build, we work with domain experts to develop a new dataset grounded in transparent policies, drawing on either curated real-world data from platforms like Bluesky or expert-vetted synthetic scenarios, depending on the harm category.
The deliverable
We will publish both the data and the "Very Good™" policies used to generate it. The aim is to make accountability and granularity the default — so that anyone benchmarking against the dataset can also inspect, contest, or extend the policy behind it.
The policy is the dataset. Anything else is a black box dressed up as a benchmark.
Updates
As outputs are published, they'll appear here and in the workstreams index.