ai-safety

Here are 3,771 public repositories matching this topic...

microsoft / agent-governance-toolkit

AI Agent Governance Toolkit — Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers 10/10 OWASP Agentic Top 10.

microsoft python security owasp trust compliance governance ai-safety policy-engine ai-agents zero-trust agent-framework

Updated Jun 13, 2026
Python

jphall663 / awesome-machine-learning-interpretability

A curated list of awesome responsible machine learning resources.

Updated Jun 3, 2026

PKU-Alignment / safe-rlhf

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Updated Nov 24, 2025
Python

OpenLMLab / MOSS-RLHF

Secrets of RLHF in Large Language Models Part I: PPO

alignment ai-safety rlhf

Updated Mar 3, 2024
Python

cvs-health / uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

Updated Jun 8, 2026
Python

tg12 / gpt_jailbreak_status

This repository provides updates on the status of jailbreaking the OpenAI GPT language model.

jailbreak openai gpt ai-safety llm chatgpt prompt-injection

Updated May 16, 2026
HTML

wuyoscar / Internal-Safety-Collapse

Internal Safety Collapse (ISC): Turning the LLM or an AI Agent into a sensitive data generator.

benchmark jailbreak ai-safety red-teaming large-language-models llm-safety safety-evaluation agent-safety

Updated Jun 11, 2026
Python

chrisliu298 / awesome-llm-unlearning

A resource repository for machine unlearning in large language models

Updated Jun 10, 2026

ZhangJinHaHaHa / AgentLens

AgentLens is a trusted agent trading platform where you can quickly find an agent that meets your needs or publish your own agent as a digital asset. We encourage everyone to turn their areas of expertise into agents so that others can see their unique strengths.

Updated Jun 9, 2026
TypeScript

PacificAI / langtest

Deliver safe & effective language models

nlp artificial-intelligence benchmarks benchmark-framework model-assessment ai-safety mlops responsible-ai ml-safety trustworthy-ai ethics-in-ai ml-testing large-language-models llm ai-testing llm-test llm-evaluation-toolkit llm-as-evaluator llm-testing

Updated Apr 22, 2026
Python

agencyenterprise / PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Apr 27, 2026
Python

cordum-io / cordum

The open agent control plane. Govern autonomous AI agents with pre-execution policy enforcement, approval gates, and audit trails. Works with LangChain, CrewAI, MCP, and any framework.

Updated Jun 9, 2026
Go

ifixai-ai / iFixAi

Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections, 32 graded core plus 13 extended for frontier risks like sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry and model agnostic.

Updated Jun 9, 2026
Python

pegasi-ai / reins

Stop AI agents from doing things you didn't ask for.

mcp intervention browser-automation ai-safety cua human-in-the-loop audit-trail ai-monitoring agent-security agent-observability claude-code-plugin claude-code-skill claude-code-marketplace openclaw-security

Updated May 22, 2026
Python

tigerlab-ai / tiger

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

classification data-augmentation ai-safety fine-tuning aisafety rag large-language-models llm llm-training

Updated Dec 2, 2023
Jupyter Notebook

aisa-group / PostTrainBench

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

ai-safety post-training gemini-cli claude-code codex-cli ai-research-automation

Updated Jun 10, 2026
Python

Justin0504 / Aegis

Runtime policy enforcement for AI agents. Cryptographic audit trail, human-in-the-loop approvals, kill switch. Zero code changes.

mcp ai-safety policy-engine ai-agents audit-trail langchain anthropic llm-observability

Updated Jun 11, 2026
TypeScript

hendrycks / ethics

Aligning AI With Shared Human Values (ICLR 2021)

ai-safety machine-ethics ml-safety ethical-ai gpt-3

Updated Apr 21, 2023
Python

ttguy0707 / CyberClaw

python agent cross-platform ai-safety agent-framework ai-agent llm langchain enterprise-ai langgraph transparent-ai claude-code openclaw two-phase-invocation

Updated Jun 7, 2026
Python

Govcraft / rust-docs-mcp-server

🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.

Updated Nov 24, 2025
Rust
