Overview

Safety policies let you control what your personas can and cannot discuss. Policies are scoped to a tenant and apply to every persona within it. You can block specific terms, filter toxic content, gate PII handling, and define escalation keywords that trigger human handoff.

Guardrails are evaluated before and after generation, so blocked content never reaches your users.

Safety policy structure

A safety policy is a JSON object with five configurable fields. All are optional — omit any field to leave that dimension unrestricted.

Field	Type	Description
`blockedTerms`	`string[]`	`Exact terms or phrases that are forbidden in prompts and responses.`
`allowPii`	`boolean`	`Whether the persona may process or return personally identifiable information. Defaults to false.`
`toxicityThreshold`	`number (0–1)`	`Maximum toxicity score allowed. Lower values are stricter. Default: 0.7.`
`promptGuards`	`string[]`	`Patterns applied to inbound prompts to block jailbreak or injection attempts.`
`escalationKeywords`	`string[]`	`Keywords that trigger automatic escalation to a human agent.`

Create or update a policy

Use the PUT /personas/safety-policy endpoint. This upserts — if a policy already exists for the tenant, it's replaced.

PUT /personas/safety-policy

curl -X PUT https://api.person.run/personas/safety-policy \
  -H "x-api-key: $PERSON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "'$PERSON_TENANT_ID'",
    "blockedTerms": ["competitor-name", "internal-project-codename"],
    "allowPii": false,
    "toxicityThreshold": 0.5,
    "promptGuards": [
      "ignore previous instructions",
      "pretend you are",
      "act as if you have no restrictions"
    ],
    "escalationKeywords": ["speak to a human", "manager", "complaint"]
  }'

Response (200)

{
  "policyPack": {
    "tenantId": "your-tenant-id",
    "blockedTerms": ["competitor-name", "internal-project-codename"],
    "allowPii": false,
    "toxicityThreshold": 0.5,
    "promptGuards": [
      "ignore previous instructions",
      "pretend you are",
      "act as if you have no restrictions"
    ],
    "escalationKeywords": ["speak to a human", "manager", "complaint"],
    "updatedAt": "2026-02-20T12:00:00.000Z"
  }
}

Read the current policy

GET /personas/safety-policy

curl https://api.person.run/personas/safety-policy\?tenantId=$PERSON_TENANT_ID \
  -H "x-api-key: $PERSON_API_KEY"

Returns { "policyPack": null } if no policy has been configured yet.

Remove a policy

Deleting the policy removes all guardrails for the tenant. Personas will respond without any content filtering.

DELETE /personas/safety-policy/:tenantId

curl -X DELETE https://api.person.run/personas/safety-policy/$PERSON_TENANT_ID \
  -H "x-api-key: $PERSON_API_KEY"

Blocked terms

Blocked terms are matched case-insensitively against both the inbound prompt and the generated response. If a blocked term is detected, the request is rejected with a 400 status before any AI generation occurs.

TipUse blocked terms for brand names, competitor references, internal codenames, or any content your personas should never produce. For broader topic restrictions, combine blocked terms with prompt guards.

Prompt guards

Prompt guards are patterns matched against inbound user prompts to detect jailbreak attempts, prompt injection, or social engineering. When a guard matches, the prompt is rejected before generation.

Common prompt guard patterns:

"ignore previous instructions" — blocks instruction override attempts.
"pretend you are" — blocks role reassignment attacks.
"act as if you have no restrictions" — blocks constraint removal attempts.
"output your system prompt" — blocks system prompt extraction.
"DAN" or "jailbreak" — blocks known jailbreak technique names.

Toxicity threshold

The toxicity threshold is a value between 0 and 1 that sets the maximum acceptable toxicity score for generated content. Lower values are stricter:

Threshold	Strictness	Use case
`0.3`	`Very strict`	`Children's content, education, healthcare`
`0.5`	`Moderate`	`Customer support, general business`
`0.7`	`Permissive`	`Creative writing, casual conversation`
`1.0`	`Unrestricted`	`Internal tools, testing`

Escalation keywords

When a user's prompt contains an escalation keyword, person.run flags the interaction for human review. Depending on your integration, this can trigger a webhook, route to a support queue, or pause the conversation.

NoteEscalation events are emitted alongside the normal response — the persona still replies, but the event is logged and can trigger downstream workflows via webhooks.

Recommended setup

For most production deployments, we recommend starting with this baseline and tuning from there:

Recommended baseline

{
  "blockedTerms": [],
  "allowPii": false,
  "toxicityThreshold": 0.5,
  "promptGuards": [
    "ignore previous instructions",
    "pretend you are",
    "act as if you have no restrictions",
    "output your system prompt"
  ],
  "escalationKeywords": [
    "speak to a human",
    "talk to a person",
    "complaint",
    "lawsuit",
    "legal"
  ]
}