Overview
Safety policies let you control what your personas can and cannot discuss. Policies are scoped to a tenant and apply to every persona within it. You can block specific terms, filter toxic content, gate PII handling, and define escalation keywords that trigger human handoff.
Guardrails are evaluated before and after generation, so blocked content never reaches your users.
Safety policy structure
A safety policy is a JSON object with five configurable fields. All are optional — omit any field to leave that dimension unrestricted.
| Field | Type | Description |
|---|---|---|
blockedTerms | string[] | Exact terms or phrases that are forbidden in prompts and responses. |
allowPii | boolean | Whether the persona may process or return personally identifiable information. Defaults to false. |
toxicityThreshold | number (0–1) | Maximum toxicity score allowed. Lower values are stricter. Default: 0.7. |
promptGuards | string[] | Patterns applied to inbound prompts to block jailbreak or injection attempts. |
escalationKeywords | string[] | Keywords that trigger automatic escalation to a human agent. |
Create or update a policy
Use the PUT /personas/safety-policy endpoint. This upserts — if a policy already exists for the tenant, it's replaced.
curl -X PUT https://api.person.run/personas/safety-policy \
-H "x-api-key: $PERSON_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"tenantId": "'$PERSON_TENANT_ID'",
"blockedTerms": ["competitor-name", "internal-project-codename"],
"allowPii": false,
"toxicityThreshold": 0.5,
"promptGuards": [
"ignore previous instructions",
"pretend you are",
"act as if you have no restrictions"
],
"escalationKeywords": ["speak to a human", "manager", "complaint"]
}'{
"policyPack": {
"tenantId": "your-tenant-id",
"blockedTerms": ["competitor-name", "internal-project-codename"],
"allowPii": false,
"toxicityThreshold": 0.5,
"promptGuards": [
"ignore previous instructions",
"pretend you are",
"act as if you have no restrictions"
],
"escalationKeywords": ["speak to a human", "manager", "complaint"],
"updatedAt": "2026-02-20T12:00:00.000Z"
}
}Read the current policy
curl https://api.person.run/personas/safety-policy\?tenantId=$PERSON_TENANT_ID \
-H "x-api-key: $PERSON_API_KEY"Returns { "policyPack": null } if no policy has been configured yet.
Remove a policy
Deleting the policy removes all guardrails for the tenant. Personas will respond without any content filtering.
curl -X DELETE https://api.person.run/personas/safety-policy/$PERSON_TENANT_ID \
-H "x-api-key: $PERSON_API_KEY"Blocked terms
Blocked terms are matched case-insensitively against both the inbound prompt and the generated response. If a blocked term is detected, the request is rejected with a 400 status before any AI generation occurs.
Prompt guards
Prompt guards are patterns matched against inbound user prompts to detect jailbreak attempts, prompt injection, or social engineering. When a guard matches, the prompt is rejected before generation.
Common prompt guard patterns:
- "ignore previous instructions" — blocks instruction override attempts.
- "pretend you are" — blocks role reassignment attacks.
- "act as if you have no restrictions" — blocks constraint removal attempts.
- "output your system prompt" — blocks system prompt extraction.
- "DAN" or "jailbreak" — blocks known jailbreak technique names.
Toxicity threshold
The toxicity threshold is a value between 0 and 1 that sets the maximum acceptable toxicity score for generated content. Lower values are stricter:
| Threshold | Strictness | Use case |
|---|---|---|
0.3 | Very strict | Children's content, education, healthcare |
0.5 | Moderate | Customer support, general business |
0.7 | Permissive | Creative writing, casual conversation |
1.0 | Unrestricted | Internal tools, testing |
Escalation keywords
When a user's prompt contains an escalation keyword, person.run flags the interaction for human review. Depending on your integration, this can trigger a webhook, route to a support queue, or pause the conversation.
Recommended setup
For most production deployments, we recommend starting with this baseline and tuning from there:
{
"blockedTerms": [],
"allowPii": false,
"toxicityThreshold": 0.5,
"promptGuards": [
"ignore previous instructions",
"pretend you are",
"act as if you have no restrictions",
"output your system prompt"
],
"escalationKeywords": [
"speak to a human",
"talk to a person",
"complaint",
"lawsuit",
"legal"
]
}