03 Building · Living resource

Functional Mapping of the Safety Tooling Landscape

A living map of the tools that keep platforms safer.

Existing frameworks for safety tooling lack the granularity required to map a rapidly diversifying ecosystem of internal and user-facing interventions. This workstream develops a comprehensive map, categorizing tools by functional utility, impact on user experience, and role in the safety lifecycle, from preventative design to reactive enforcement — and publishes it as a living, browsable resource.

A living map of the safety stack.

A growing catalogue of trust & safety tools — open-source and commercial, internal and user-facing. Six neighbourhoods on the map group the tools by what they do; switch to the card view to search or filter by topic.

The neighbourhood names are an evocative consolidation of the committee's preliminary topic buckets. A formal mapping by functional utility, user impact, and place in the safety lifecycle is being developed in parallel.

Tooling Map illustration: a hand-drawn continent with six labelled neighbourhoods — Classifier Quarter, Guardrail Heights, Automation Alley, Reviewer's Row, Investigation Inlet, and The Hashlands.

Alice

CoPE

Detoxify

Modulate

OSmod

Roblox Voice Safety Classifier

Sentinel

Cinder

Musubi

Unitary

Osprey

Resolver

Variance

Checkstep

Community Sift

Coop

Retool

Altitude

Hasher-Matcher-Actioner (HMA)

Hasher-Matcher-Actioner (CLIP demo)

hma-matrix

Lattice Extract

MediaModeration (Wiki Extension)

PDQ

Perception

RocketChat CSAM

TMK

VPDQ

Adrift — awaiting a neighbourhood

Nima by Tremau

Alice
Alice (FKA ActiveFence)

No description provided yet.
- AI for Safety
Altitude
Jigsaw

Web UI and hash matching for violent extremism and terrorism content
- Hash matching
Checkstep
Checkstep

No description provided yet.
- Review
Cinder

No description provided yet.
- Content moderation
Community Sift
Microsoft

Community Sift is an AI-powered content moderation platform that combines the best of both worlds: artificial intelligence and human expertise. It is trusted by companies and communities of all sizes to classify, filter, and escalate user-generated content in real-time. By using Community Sift, businesses can enhance online safety, improve user experiences, and focus on growth and innovation.
- Detection
- Review
Content Safety API
Google

Uses machine learning to detect novel CSAM, nudity, and sexually explicit content in images and videos. Free service, but requires registration. Not open source itself, but can be used via Coop (https://roostorg.github.io/coop/SIGNALS.html#content-safety-api-by-google), which is open source.
- Classification
Coop
ROOST

Scaled review tool
- Review
CoPE
Zentropi

small language model trained for accurate, fast, steerable content classification based on developer-defined content policies
- Classification
Detoxify
Unitary AI

detects and mitigates generalized toxic language (including hate speech, harassment, bullying) in text
- Classification
gpt-oss-safeguard
OpenAI

open-weight reasoning model to classify text content based on provided safety policies
- Classification
Granite Guardian
IBM Research

an input-output guardrail for detecting harms in a variety of use cases (general harm, RAG settings, agentic workflows, etc.)
- AI for Safety
Guardrails AI
Guardrails AI

Python framework that helps build safe AI applications by checking input/output for predefined risks
- AI for Safety
Hasher-Matcher-Actioner (HMA)
Meta / ROOST

Hashing algorithm, matching function, and ability to hook into actions
- Hash matching
Hasher-Matcher-Actioner (CLIP demo)
Individual - Juan Mrad

HMA extension for CLIP as reference for adding other format extensions
- Hash matching
Hive Classifiers
Hive

No description provided yet.
- Classification
hma-matrix
Matrix.org Foundation

Matrix-specific extensions to HMA for (primarily) the Matrix ecosystem
- Hash matching
Implio by Besedo
Besedo

A moderation tool with three components. AI automation: machine learning models trained on billions of content items that understand nuance, context, and patterns, and make real-time decisions at scale. Rule-based filters: simple, configurable filters that catch obvious violations quickly and reliably, suited to spam, banned keywords, or clear-cut policies. Human expertise: multilingual, compliance-trained moderators who step in when context, culture, or judgment is required, resolve edge cases, and continuously retrain the AI.
- Review
- Enforcement
Kanana Safeguard
Kakao

harmful content detection model based on Kanana 8B
- AI for Safety
Lasso Moderation
Lasso Moderation

A content moderation solution that's not just an API. Lasso brings the power of AI to protect your brand, tackling 99% of content moderation tasks. Our platform also offers an extensive moderation dashboard for that crucial 1%, where humans can efficiently and effectively moderate at scale.
- Review
- Enforcement
Lattice Extract
Adobe

Grid and lattice detection to guard against false positives in hash matching
- Hash matching
Llama Guard
Meta

AI-powered content moderation model to detect harm in text-based interactions
- AI for Safety
Llama Prompt Guard 2
Meta

Detects prompt injection and jailbreaking attacks in LLM inputs
- AI for Safety
MediaModeration (Wiki Extension)
Wikimedia

CSAM hash matching for Wikimedia
- Hash matching
Modulate

No description provided yet.
- Classification
Musubi

No description provided yet.
- Automated T&S
Nima by Tremau
Tremau

Nima is an AI-driven Trust & Safety platform that protects users through efficient automated and human moderation, with a single API, an AI marketplace, and a policy-centric approach. It centers compliance tracking and reporting as a core value proposition.
NSFW filtering
Individual - Navendu Pottekkat

browser extension to block explicit images from online platforms; user-facing
- Classification
NSFW Keras Model
Individual - Gant Laborde

convolutional neural network (CNN) based explicit image ML model
- Classification
OSmod
Jigsaw

toolkit of machine learning (ML) tools, models, and APIs that platforms can use to moderate content
- Classification
Osprey
ROOST

Rules engine and investigation UI
- Investigation
PDQ
Meta

Perceptual hash algorithm for images
- Hash matching
Perception
Thorn

Provides a common wrapper around existing, popular perceptual hashes (such as those implemented by ImageHash)
- Hash matching
Perspective API
Jigsaw

machine learning-powered tool that helps platforms detect and assess the toxicity of online conversations
- Classification
Private Detector
Bumble

pretrained model for detecting lewd images
- Classification
Purple Llama
Meta

set of tools to assess and improve LLM security. Includes Llama Guard, CyberSec Eval, and Code Shield
- AI for Safety
Resolver

No description provided yet.
- Investigation
Retool
Retool

No description provided yet.
- Review
Risk Atlas Nexus
IBM Research

knowledge-graph toolkit that maps AI risk taxonomies (IBM AI Risk Atlas, IBM Granite Guardian, MIT AI Risk Repository, NIST AI RMF GenAI Profile, AIR 2024, AILuminate Benchmark, Credo Unified Control Framework, OWASP Top 10 for LLM Apps) to evaluations, mitigations and controls, supporting the generation of structured governance workflows
- AI for Safety
Roblox Guard 1.0
Roblox

LLM that helps safeguard unlimited text generation on Roblox
- AI for Safety
Roblox Voice Safety Classifier
Roblox

machine learning model that detects and moderates harmful content in real-time voice chat on Roblox; focuses on spoken language detection
- Classification
RocketChat CSAM
Center for Online Safety and Liberty

CSAM hash matching for RocketChat
- Hash matching
Safer by Thorn
Thorn

No description provided yet.
- Classification
- Review
SafetyKit

No description provided yet.
- Classification
Sentinel
Roblox

Python library designed specifically for real-time detection of extremely rare classes of text by using contrastive learning principles
- Classification
ShieldGemma
Google DeepMind

AI safety toolkit by Google DeepMind designed to help detect and mitigate harmful or unsafe outputs in LLM applications
- AI for Safety
TMK
Meta

Visual similarity match for videos
- Hash matching
Toxic Prompt RoBERTa
Intel

RoBERTa-based model for detecting toxic content in prompts to language models
- Classification
Trust Lab

No description provided yet.
- Automated T&S
TrustedExecBench
OpenGuardrails

Security gateway providing a transparent reverse proxy for OpenAI APIs with integrated safety protection
- AI for Safety
Unitary

No description provided yet.
- Automated T&S
Variance

No description provided yet.
- Investigation
VPDQ
Meta

Visual similarity match for videos using PDQ algorithm
- Hash matching

Dimension i

Functional utility

What the tool actually does — classification, hash matching, review workflow, identity assurance, transparency reporting, and so on.

Dimension ii

Lifecycle position

Where in the safety lifecycle the tool acts — from preventative design, through detection, into responsive enforcement, and on to restorative measures.

Dimension iii

Impact on user experience

Whether the tool is internal (used by reviewers, engineers, or analysts) or user-facing (felt directly by people on the platform), and how visibly it shapes their experience.
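
One way to picture the three dimensions is as fields on a single catalogue record that can be filtered independently. The Python sketch below is illustrative only and does not reflect how the map is actually built: the names `Stage`, `Audience`, `ToolEntry`, and `filter_tools` are hypothetical. The functional tags for PDQ, Osprey, and NSFW filtering come from the cards above, and NSFW filtering is described there as user-facing; every lifecycle placement and the internal/user-facing labels for PDQ and Osprey are assumptions made for the example, since the formal mapping is still in development.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Dimension ii: where in the safety lifecycle a tool acts."""
    PREVENTATIVE = "preventative design"
    DETECTION = "detection"
    ENFORCEMENT = "responsive enforcement"
    RESTORATIVE = "restorative measures"


class Audience(Enum):
    """Dimension iii: who experiences the tool."""
    INTERNAL = "internal"
    USER_FACING = "user-facing"


@dataclass(frozen=True)
class ToolEntry:
    name: str
    functions: frozenset   # Dimension i, e.g. {"Hash matching"}
    stages: frozenset      # Dimension ii, one or more Stage values
    audience: Audience     # Dimension iii


def filter_tools(entries, function=None, stage=None, audience=None):
    """Return the names of entries matching every dimension that is given."""
    matches = []
    for entry in entries:
        if function is not None and function not in entry.functions:
            continue
        if stage is not None and stage not in entry.stages:
            continue
        if audience is not None and entry.audience is not audience:
            continue
        matches.append(entry.name)
    return matches


# Hypothetical placements: the stages, and the audience for PDQ and Osprey,
# are assumptions for illustration, not the committee's mapping.
CATALOGUE = [
    ToolEntry("PDQ", frozenset({"Hash matching"}),
              frozenset({Stage.DETECTION}), Audience.INTERNAL),
    ToolEntry("Osprey", frozenset({"Investigation"}),
              frozenset({Stage.DETECTION, Stage.ENFORCEMENT}), Audience.INTERNAL),
    ToolEntry("NSFW filtering", frozenset({"Classification"}),
              frozenset({Stage.PREVENTATIVE}), Audience.USER_FACING),
]

print(filter_tools(CATALOGUE, stage=Stage.DETECTION))
print(filter_tools(CATALOGUE, audience=Audience.USER_FACING))
```

Keeping the dimensions as separate fields, rather than folding them into one label per tool, lets a tool occupy several lifecycle stages at once while still being searchable by any single dimension.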