gpt-oss-safeguard

Open safety reasoning models with custom safety policies

Open Source · Artificial Intelligence · Development
2025-10-30

Product Introduction

  1. gpt-oss-safeguard is a family of open-weight safety reasoning models from OpenAI, available in 120-billion-parameter (120b) and 20-billion-parameter (20b) versions and designed to classify content against custom safety policies defined by developers. The models reason over inputs and outputs at inference time, producing an explainable chain-of-thought rationale for each classification decision. They are fine-tuned variants of OpenAI’s gpt-oss models, distributed under the permissive Apache 2.0 license, which permits free use, modification, and deployment.
  2. The core value of gpt-oss-safeguard lies in its ability to dynamically interpret developer-defined safety policies during inference, eliminating the need for retraining when policies change. This approach provides transparency through step-by-step reasoning logs, allowing developers to audit decisions and refine policies iteratively. It addresses the limitations of traditional classifiers by combining high flexibility with explainability, particularly in nuanced or evolving risk scenarios.
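
As a minimal sketch of the workflow, the snippet below supplies a developer-written policy as the system message and the content to classify as the user message, through an OpenAI-compatible endpoint such as a local vLLM or Ollama server hosting the open weights. The base URL, serving model name, and verdict format here are assumptions for illustration, not an official schema.

```python
# Minimal sketch: classify one message against a custom policy.
# Assumes an OpenAI-compatible server (e.g., vLLM/Ollama) hosting the weights;
# the base_url and model name below are placeholders, not official values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

POLICY = """\
Policy: No cheating facilitation.
- VIOLATES (1): sharing or requesting game exploits, cheat codes, or bot
  scripts intended to gain an unfair advantage.
- COMPLIES (0): general gameplay discussion, strategy tips, patch notes.
Return the verdict (0 or 1) followed by a brief rationale.
"""

response = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",  # assumed serving name
    messages=[
        {"role": "system", "content": POLICY},  # policy travels with the request
        {"role": "user", "content": "Anyone selling an aimbot that works in ranked?"},
    ],
)
print(response.choices[0].message.content)  # verdict plus chain-of-thought rationale
```

Because the policy travels with the request, updating moderation rules is a prompt edit rather than a retraining run.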

Main Features

  1. The models apply reasoning-based classification using chain-of-thought explanations, breaking down decisions into logical steps that align with the provided policy. For example, when evaluating a user message, the model analyzes text structure, intent, and context to determine compliance with rules like prohibiting hate speech or fraud.
  2. Developers can input custom safety policies directly during inference, enabling real-time adaptation without retraining (see the sketch after this list). Policies can range from simple rules (e.g., blocking profanity) to complex multi-clause guidelines (e.g., detecting subtle harassment patterns), with the model interpreting them dynamically.
  3. Available as open-weight models under Apache 2.0, gpt-oss-safeguard allows full customization, including modifying the architecture, integrating with existing moderation pipelines, or deploying on-premises. The 120b variant prioritizes accuracy in high-stakes scenarios, while the 20b version balances performance with lower computational cost.
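
To make the retraining-free adaptation in the second feature concrete, the sketch below wraps the same assumed endpoint in a helper and evaluates one message against two different policy texts; the policies and message are invented for illustration.

```python
# Sketch: swap policies at inference time; changing the rules is changing a string.
# Endpoint and model name are assumptions, as in the earlier example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def classify(policy: str, content: str) -> str:
    """Evaluate `content` against `policy` and return the model's verdict text."""
    resp = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",  # assumed serving name
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": content},
        ],
    )
    return resp.choices[0].message.content

PROFANITY_POLICY = "Return 1 if the message contains profanity, otherwise 0."
HARASSMENT_POLICY = """\
Return 1 if the message shows harassment patterns, such as:
- repeated unwanted contact after a request to stop, or
- demeaning language or veiled threats aimed at one user.
Otherwise return 0, with a one-line justification either way.
"""

msg = "You're pathetic. I'll keep replying to every post you make until you quit."
# Same input, different rules: no retraining, just a different system prompt.
print(classify(PROFANITY_POLICY, msg))
print(classify(HARASSMENT_POLICY, msg))
```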

Problems Solved

  1. Traditional safety classifiers require extensive labeled datasets and retraining to update policies, creating delays in addressing emerging risks like new misinformation tactics or platform-specific abuse patterns. gpt-oss-safeguard eliminates this bottleneck by allowing policy changes at inference time.
  2. The product targets developers, trust-and-safety teams, and platform operators who need adaptable content moderation systems, particularly in domains with rapidly evolving risks (e.g., social media, gaming, or e-commerce).
  3. Typical use cases include moderating gaming forums for cheat-code discussions, identifying fake product reviews using merchant-defined criteria, or screening biomedical research queries for dual-use risks. It also handles low-data scenarios where training traditional classifiers is impractical.
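
As a concrete version of the fake-review use case, here is a hypothetical merchant-defined policy applied through the same assumed endpoint; the criteria are invented for illustration.

```python
# Hypothetical merchant-defined policy for fake-review screening.
# Criteria, endpoint, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

REVIEW_POLICY = """\
Flag a review as suspect (1) if it:
- mentions getting the product free or discounted in exchange for the review,
- repeats marketing copy verbatim, or
- describes features this product does not have.
Otherwise return 0. Briefly justify the verdict.
"""

review = "Best blender ever!!! Got it free for my honest review. 10/10."
resp = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",  # assumed serving name
    messages=[
        {"role": "system", "content": REVIEW_POLICY},
        {"role": "user", "content": review},
    ],
)
print(resp.choices[0].message.content)
```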

Unique Advantages

  1. Unlike static classifiers such as OpenAI’s Moderation API, which infer policies from training data, gpt-oss-safeguard directly processes explicit policy definitions, reducing misalignment between intended rules and model behavior. This enables precise control over moderation criteria.
  2. The model’s reasoning capability allows it to handle multi-policy evaluations simultaneously, such as checking a single message against hate speech, violence, and privacy-violation guidelines in one pass (a sketch follows after this list). Internal tests show 12% higher accuracy on multi-policy tasks compared to GPT-5-based classifiers.
  3. Competitive advantages include explainable outputs for auditability, Apache 2.0 licensing for commercial flexibility, and latency-optimized deployment options. Despite smaller parameter counts than GPT-5, the 120b model matches or exceeds GPT-5’s safety classification accuracy in 83% of tested scenarios.
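
The multi-policy evaluation mentioned above could be packaged as a single request along the following lines; the combined-prompt layout and per-policy verdict format are assumptions, not an official schema.

```python
# Sketch: check one message against several policies in a single pass.
# Policy texts, prompt layout, endpoint, and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

POLICIES = {
    "hate_speech": "Return 1 for attacks on people based on protected attributes.",
    "violence": "Return 1 for credible threats or incitement of physical harm.",
    "privacy": "Return 1 for sharing someone's personal data without consent.",
}

# One system prompt carrying all three policies, asking for per-policy verdicts.
system_prompt = (
    "Evaluate the message against EACH policy below. "
    "Answer with one line per policy: <policy_name>: 0 or 1, then a short rationale.\n\n"
)
for name, text in POLICIES.items():
    system_prompt += f"## {name}\n{text}\n\n"

resp = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-120b",  # assumed serving name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Here's his home address. Go teach him a lesson."},
    ],
)
print(resp.choices[0].message.content)
```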

Frequently Asked Questions (FAQ)

  1. How does gpt-oss-safeguard differ from OpenAI’s Moderation API? The Moderation API uses pre-trained classifiers based on fixed policies, while gpt-oss-safeguard dynamically interprets custom policies provided at inference. The latter also provides chain-of-thought explanations, which the Moderation API lacks.
  2. Can the model handle multiple policies simultaneously? Yes, gpt-oss-safeguard-120b achieved 89% accuracy in internal tests evaluating compliance with three concurrent policies (e.g., hate speech, self-harm, and copyright violations). Developers can input multiple policy documents or merge them into a single guideline.
  3. Is gpt-oss-safeguard suitable for high-throughput applications? The 20b variant processes 120 tokens per second on A100 GPUs, making it viable for asynchronous batch processing, while the 120b model is recommended for critical real-time checks. OpenAI recommends combining it with faster preliminary filters to optimize costs.
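
The two-stage pattern from the last answer might look like the sketch below, where a cheap keyword prefilter passes only suspicious content on to the reasoning model; the term list and routing logic are placeholders, not a recommended filter.

```python
# Sketch: cheap prefilter in front of the reasoning model to control cost.
# Keyword list, endpoint, and model name are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SUSPICIOUS_TERMS = {"kill", "address", "aimbot", "dox"}  # illustrative only

def cheap_prefilter(text: str) -> bool:
    """Fast first pass: escalate to the reasoning model only on a keyword hit."""
    lowered = text.lower()
    return any(term in lowered for term in SUSPICIOUS_TERMS)

def moderate(text: str, policy: str) -> str:
    if not cheap_prefilter(text):
        return "allow (prefilter)"  # never reaches the model, keeping costs down
    resp = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",  # assumed serving name
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

THREAT_POLICY = "Return 1 for credible threats of violence, otherwise 0, with a rationale."
print(moderate("gg, great match everyone!", THREAT_POLICY))             # resolved by prefilter
print(moderate("I know your address. Watch yourself.", THREAT_POLICY))  # escalated to the model
```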
