Evaluation Datasets — Tooling Research Committee

Improving Digital Safety Evaluation Datasets

Auditing the foundations of how we measure safety.

Current safety evaluation datasets often lack transparency about their underlying labeling policies and data provenance, producing benchmarks that may not reflect real-world harms or platform-specific standards. This workstream investigates those limitations and develops a high-fidelity dataset grounded in rigorous, transparent safety policies — published alongside the policies used to generate it.

The problem

Safety evaluation datasets are how the field tells itself whether a model, a classifier, or a pipeline is working. But many of the datasets in widespread use are silent on the labeling policy that produced them and on the provenance of their content — leaving researchers to benchmark against a moving target, and platforms to ship interventions tuned to harms that don't match what they actually see.

What we're doing

This workstream takes a two-part approach: audit the foundations of existing datasets, and build a higher-fidelity alternative.

The audit examines the policy documentation, provenance, label definitions, and known limitations of the datasets practitioners actually use. For the build, we work with domain experts to develop a new dataset grounded in transparent policies. Depending on the harm category, it draws on either curated real-world data from platforms like Bluesky or expert-vetted synthetic scenarios.
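
As a rough illustration, the four audit dimensions named above could be recorded per dataset in a small structure that reports which ones are still undocumented. This is a minimal sketch, not the workstream's actual tooling: the class name `DatasetAudit`, its field names, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetAudit:
    """One audit record per dataset. All names here are hypothetical."""
    name: str
    policy_documentation: str = ""  # where the labeling policy is published, if anywhere
    provenance: str = ""            # where the content came from
    label_definitions: str = ""     # how each label is defined
    known_limitations: str = ""     # documented gaps or caveats

    def undocumented(self) -> list[str]:
        """Return the audit dimensions that are still empty for this dataset."""
        return [
            f.name
            for f in fields(self)
            if f.name != "name" and not getattr(self, f.name).strip()
        ]


# Hypothetical example: only provenance has been established so far.
audit = DatasetAudit(name="example-dataset", provenance="placeholder source description")
print(audit.undocumented())
# → ['policy_documentation', 'label_definitions', 'known_limitations']
```

Keeping the empty dimensions explicit makes a dataset's silence on policy or provenance visible as a finding rather than an omission.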

The deliverable

We will publish both the data and the "Very Good™" policies used to generate it. The aim is to make accountability and granularity the default — so that anyone benchmarking against the dataset can also inspect, contest, or extend the policy behind it.
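
One way to make the policy inspectable from the data is to have every labeled example cite the policy clause that justifies its label. The sketch below assumes such a layout; the clause IDs, the `LabeledExample` type, and the `unjustified` helper are hypothetical and do not describe the published format.

```python
from dataclasses import dataclass

# Hypothetical policy: clause IDs mapped to the text a labeler would apply.
POLICY = {
    "H1": "Placeholder text for clause H1.",
    "H2": "Placeholder text for clause H2.",
}


@dataclass(frozen=True)
class LabeledExample:
    text: str
    label: str
    policy_clause: str  # ID of the clause that justifies the label


def unjustified(examples, policy):
    """Return the examples whose cited clause is missing from the policy."""
    return [ex for ex in examples if ex.policy_clause not in policy]


examples = [
    LabeledExample("example text A", "violating", "H1"),
    LabeledExample("example text B", "violating", "H9"),  # cites a clause that does not exist
]
print([ex.text for ex in unjustified(examples, POLICY)])
# → ['example text B']
```

A check like this lets a reader who contests a label trace it to a specific clause, or discover that no clause supports it.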

The policy is the dataset. Anything else is a black box dressed up as a benchmark.

Updates

As outputs are published, they'll appear here and in the workstreams index.