Security
Shield configuration
24 checks, profiles, thresholds, fail_behavior, custom PII recognizers, session drift, and the shadow-to-enforce workflow.
Every Shield check, every profile, and every knob you can turn to move from shadow mode to enforce without breaking real traffic.
TL;DR
- 12 checks ship today, grouped into S1 (input/output), S2 (MCP-aware), and S3 (high-assurance). Their IDs run CHK-013…CHK-024.
- Three built-in profiles: none, baseline (default), strict. A custom profile is available for per-check overrides.
- Every check has three modes: off, log_only, enforce. baseline runs most checks in log_only so you see verdicts without blocking.
- fail_behavior controls what happens when a detector is broken or unreachable: fail_closed blocks the request, fail_open lets it through with a warning verdict.
- You can register custom PII recognizers and tune session-drift thresholds via the admin API. No on-disk config.
Check catalog
All 12 checks are served by the gateway's Shield layer. Scanner names are behavioural — they describe what the check detects, not how it detects it.
| ID | Name | Scan point | Phase | Default mode (baseline) | Default fail_behavior | Detects |
|---|---|---|---|---|---|---|
| CHK-013 | Prompt Injection | Pre-LLM | S1 | log_only | fail_closed | Attempts in the user input to override system instructions or hijack model behaviour. Multilingual classifier-backed. |
| CHK-014 | Jailbreak | Pre-LLM | S1 | log_only | fail_closed | Jailbreak patterns — persona overrides, role-play attacks. Produced by the same classifier pass as CHK-013. |
| CHK-015 | PII in Input | Pre-LLM | S1 | enforce | fail_open | Personally identifiable information (email, phone, credit card, national ID, IP address, etc.) in the user input. Redacts inline using the configured redaction_mode. |
| CHK-016 | Secrets in Input | Pre-LLM | S1 | enforce | fail_open | Secret tokens in the user input. Pattern-matched against a built-in catalog (see below). Inline redaction. |
| CHK-017 | Toxicity in Output | Post-LLM | S1 | log_only | fail_open | Toxic, threatening, insulting, or hateful content in the model response. Multilingual classifier-backed. |
| CHK-018 | PII in Output | Post-LLM | S1 | log_only | fail_open | PII accidentally leaked in the model response. Same engine and redaction modes as CHK-015. |
| CHK-019 | Secrets in Output | Post-LLM | S1 | log_only | fail_open | Secret tokens the model echoes back. Same pattern catalog as CHK-016. |
| CHK-020 | Tool-Output Injection | Post-tool | S2 | log_only | fail_open | Indirect prompt-injection payloads hidden inside MCP tool results before they re-enter the LLM context. |
| CHK-021 | PII in Tool Output | Post-tool | S2 | log_only | fail_open | PII in MCP tool results. Uses a per-connection redaction mode if set, otherwise the key's configured mode. |
| CHK-022 | Session Drift | Session-level | S2 | log_only | fail_open | Cumulative exfiltration pressure across a session — PII hits, external URLs, data volume — with separate warn and block thresholds. |
| CHK-023 | Grounding / Hallucination | Post-LLM | S3 | log_only | fail_open | Whether the model's response is supported by the supplied reference context (tool outputs, system prompt, or explicit reference). Runs off the hot path; verdict is recorded in the audit trace. |
| CHK-024 | Off-Topic Detection | Pre-LLM | S3 | log_only | fail_open | User inputs that fall outside the key's allowed-topic set, determined by embedding similarity against per-key topic vectors. |
Profiles
A profile is a named bundle of per-check configurations, stored as JSONB on the virtual key (VirtualKeyORM.security_profile). The three built-ins come from BUILTIN_PROFILES.
none
All checks off. Zero Shield overhead. Use only for trusted internal services where Shield is not required.
baseline (default)
- Enforce: CHK-015 (PII in input), CHK-016 (secrets in input) — these redact inline and return the modified text to the model.
- Log-only: CHK-013, CHK-014, CHK-017, CHK-018, CHK-019, CHK-020, CHK-021, CHK-022, CHK-023, CHK-024.
- PII defaults: pii_redaction_mode: fake, entities = EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IBAN_CODE, US_SSN, IP_ADDRESS, CRYPTO, MEDICAL_LICENSE.
strict
Everything that was log_only in baseline is upgraded to enforce, except CHK-023. CHK-023 remains log_only in strict because its verdict arrives asynchronously — after the response has already been returned — and therefore cannot retroactively block it.
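The relationship between baseline and strict can be sketched as a simple mode upgrade. This is an illustration only: the function name to_strict and the ASYNC_CHECKS set are hypothetical, and the real BUILTIN_PROFILES may well be defined by hand rather than derived.

```python
# Checks whose verdict arrives after the response has already been returned.
# Hypothetical constant; the doc names only CHK-023 as asynchronous.
ASYNC_CHECKS = {"CHK-023"}


def to_strict(baseline_modes: dict[str, str]) -> dict[str, str]:
    """Upgrade log_only checks to enforce, leaving asynchronous checks untouched."""
    return {
        check_id: "enforce" if mode == "log_only" and check_id not in ASYNC_CHECKS else mode
        for check_id, mode in baseline_modes.items()
    }


# A trimmed-down subset of the baseline modes from the check catalog.
baseline_modes = {
    "CHK-013": "log_only",
    "CHK-015": "enforce",
    "CHK-023": "log_only",
}
strict_modes = to_strict(baseline_modes)
```

Here CHK-013 becomes enforce, CHK-015 stays enforce, and CHK-023 stays log_only.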
custom
Per-check overrides applied on top of an optional base profile. The per-key JSONB looks like:
{
"profile": "custom",
"base_profile": "baseline",
"pii_redaction_mode": "mask",
"pii_entities": ["EMAIL_ADDRESS", "PHONE_NUMBER"],
"overrides": {
"chk_013": { "mode": "enforce" },
"chk_022": { "extra": { "pii_block": 100 } }
}
}
Check IDs inside overrides can be either CHK-013 or chk_013 — the resolver normalises them.
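To make the override semantics concrete, here is a minimal sketch of how a resolver could normalise check IDs and layer overrides on a base profile. It is not the gateway's resolver: BASE_PROFILES is a hypothetical two-check stand-in, falling back to baseline when base_profile is absent is an assumption, and so is merging extra key by key rather than replacing it wholesale.

```python
import copy
import re

# Hypothetical, trimmed-down stand-in for the gateway's built-in profiles.
BASE_PROFILES = {
    "baseline": {
        "CHK-013": {"mode": "log_only", "threshold": 0.90},
        "CHK-022": {"mode": "log_only", "extra": {"pii_warn": 20, "pii_block": 50}},
    },
}


def normalise_check_id(raw: str) -> str:
    """Map 'chk_013' or 'CHK-013' to the canonical 'CHK-013' form."""
    match = re.fullmatch(r"chk[-_](\d{3})", raw.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised check id: {raw!r}")
    return f"CHK-{match.group(1)}"


def resolve_custom(blob: dict) -> dict:
    """Apply per-check overrides on top of a copy of the base profile."""
    # Assumption: a missing base_profile falls back to baseline.
    resolved = copy.deepcopy(BASE_PROFILES[blob.get("base_profile", "baseline")])
    for raw_id, override in blob.get("overrides", {}).items():
        check = resolved.setdefault(normalise_check_id(raw_id), {})
        for field_name, value in override.items():
            if field_name == "extra":
                # Assumption: extra is merged key by key, not replaced.
                check.setdefault("extra", {}).update(value)
            else:
                check[field_name] = value
    return resolved


resolved = resolve_custom({
    "profile": "custom",
    "base_profile": "baseline",
    "overrides": {
        "chk_013": {"mode": "enforce"},
        "chk_022": {"extra": {"pii_block": 100}},
    },
})
```

Under these assumptions CHK-013 ends up in enforce with its baseline threshold intact, and CHK-022 keeps pii_warn at 20 while pii_block rises to 100.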
Legacy profile names
If a JSONB blob references highassurance from an earlier internal build, the resolver silently remaps it to strict. You should not need to care about this.
Reading the resolved profile
GET /api/v1/security/profiles
Returns every built-in profile with each check's mode, threshold, fail_behavior, redaction_mode, plus the tenant default. Useful as the source of truth for an admin UI or policy-as-code tooling.
Thresholds
Each check can carry a threshold, a fail_behavior, a redaction_mode, and a free-form extra dict.
Defaults in baseline:
| Check | threshold | fail_behavior | extra notes |
|---|---|---|---|
| CHK-013 / CHK-014 | 0.90 | fail_closed | — |
| CHK-015 / CHK-018 / CHK-021 | 0.5 | fail_open | redaction_mode: fake, entities: [...] |
| CHK-016 / CHK-019 | 0.85 (default) | fail_open | Pattern match — threshold is effectively hit-or-miss. |
| CHK-017 | 0.90 | fail_open | — |
| CHK-020 | 0.90 | fail_open | — |
| CHK-022 | 0.85 (default) | fail_open | Counter thresholds — see Session drift. |
| CHK-023 | 0.7 | fail_open | async: true, daily_token_budget: 100_000, max_context_chars: 8000. |
| CHK-024 | 0.50 | fail_open | Embedding similarity — lower = looser. |
In the custom profile you can override any of these per key. Example override payload in PATCH /api/v1/keys/{key_ref}:
{
"security_profile": {
"profile": "custom",
"base_profile": "strict",
"overrides": {
"CHK-024": { "threshold": 0.40 },
"CHK-022": { "extra": { "pii_warn": 10, "pii_block": 25 } }
}
}
}
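If you script this change, the request could be assembled along the following lines. The base URL, the key reference, and the bearer-token Authorization header are placeholders and assumptions, not documented values; the sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_profile_patch(
    base_url: str, key_ref: str, token: str, security_profile: dict
) -> urllib.request.Request:
    """Build (but do not send) the PATCH that updates a key's security profile."""
    body = json.dumps({"security_profile": security_profile}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/keys/{key_ref}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # Assumption: the admin API accepts a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_profile_patch(
    base_url="https://gateway.example.com",  # placeholder host
    key_ref="my-key-ref",                    # placeholder key reference
    token="ADMIN_TOKEN",                     # placeholder credential
    security_profile={
        "profile": "custom",
        "base_profile": "strict",
        "overrides": {"CHK-024": {"threshold": 0.40}},
    },
)
# Sending would be: urllib.request.urlopen(request)
```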
fail_behavior semantics
- fail_open: if the underlying detector is broken (timeout, exception, model unavailable), the verdict is an ALLOW with an error field describing what happened. The request proceeds. The verdict is still recorded in the audit trace.
- fail_closed: if the detector is broken, the request is blocked with a 503-equivalent verdict. Use for checks whose failure must not be treated as a pass (prompt injection, jailbreak).
log_only mode is orthogonal to fail_behavior: log-only verdicts never affect the overall decision. They only appear in the trace.
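The two failure behaviours can be sketched as a wrapper around a detector call. The names run_check and broken_detector are hypothetical, and treating a score at or above the threshold as a BLOCK is an assumption about how thresholds are compared. The sketch also leaves out log_only handling, which is applied separately when verdicts are aggregated.

```python
from typing import Callable


def run_check(
    detector: Callable[[str], float], text: str, threshold: float, fail_behavior: str
) -> dict:
    """Run one detector and translate a detector failure into a verdict."""
    try:
        score = detector(text)
    except Exception as exc:  # timeout, exception, model unavailable
        if fail_behavior == "fail_closed":
            # A broken detector must not be treated as a pass.
            return {"decision": "BLOCK", "score": None, "error": str(exc)}
        # fail_open: let the request proceed, but record what went wrong.
        return {"decision": "ALLOW", "score": None, "error": str(exc)}
    # Assumption: scores at or above the threshold count as a hit.
    decision = "BLOCK" if score >= threshold else "ALLOW"
    return {"decision": decision, "score": score, "error": None}


def broken_detector(text: str) -> float:
    raise TimeoutError("classifier unavailable")


closed = run_check(broken_detector, "hello", threshold=0.90, fail_behavior="fail_closed")
opened = run_check(broken_detector, "hello", threshold=0.90, fail_behavior="fail_open")
```

In both cases the error field carries the failure reason, so the trace shows that the detector did not actually evaluate the text.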
Built-in secret categories
CHK-016 (input) and CHK-019 (output) match against a catalogue of 17 patterns. The patterns themselves are not published — only the category names:
- AWS_ACCESS_KEY, AWS_SECRET_KEY
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- GITHUB_PAT, GITHUB_APP_TOKEN, GITHUB_FINE_GRAINED
- GITLAB_PAT
- SLACK_BOT_TOKEN, SLACK_USER_TOKEN
- STRIPE_SECRET_LIVE, STRIPE_SECRET_TEST, STRIPE_RESTRICTED
- PRIVATE_KEY_PEM
- JWT_TOKEN
- GOOGLE_API_KEY
- VEROSEK_KEY (the gateway catches its own virtual keys leaking through a prompt).
Custom PII recognizers
You can add domain-specific entity types (Patient MRN, Case Number, Internal Customer ID, Employee SSO Token, etc.) via the admin API. Recognizers are loaded into the PII scanner at gateway startup.
POST /api/v1/security/custom-pii
Request body (validated by CustomPIIRecognizerCreate):
{
"entity_name": "PATIENT_MRN",
"regex_pattern": "MRN[-_ ]?[0-9]{6,10}",
"description": "Hospital patient medical record numbers",
"default_redaction": "MRN-XXXXXX",
"confidence": 0.85,
"enabled": true
}
Constraints:
- entity_name must match ^[A-Z][A-Z0-9_]*$ (length 2–120).
- regex_pattern is validated by compiling it; invalid regex returns 400.
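You can mirror these constraints client-side before calling the endpoint. The helper below is a hypothetical pre-flight check, not the gateway's validator, and it assumes the server compiles patterns with a dialect compatible with Python's re module.

```python
import re

ENTITY_NAME_RE = re.compile(r"^[A-Z][A-Z0-9_]*$")


def validate_recognizer(entity_name: str, regex_pattern: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means it looks valid."""
    problems = []
    if not 2 <= len(entity_name) <= 120:
        problems.append("entity_name must be 2-120 characters long")
    if ENTITY_NAME_RE.fullmatch(entity_name) is None:
        problems.append("entity_name must match ^[A-Z][A-Z0-9_]*$")
    try:
        re.compile(regex_pattern)
    except re.error as exc:
        problems.append(f"regex_pattern does not compile: {exc}")
    return problems


pattern = r"MRN[-_ ]?[0-9]{6,10}"
problems = validate_recognizer("PATIENT_MRN", pattern)
sample_hit = re.search(pattern, "Patient MRN-1234567 admitted")
```

Testing the pattern against a few representative strings, as with sample_hit above, is a cheap way to catch a regex that compiles but matches nothing useful.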
Response on success (CustomPIIRecognizerResponse):
{
"id": "pii_...",
"entity_name": "PATIENT_MRN",
"regex_pattern": "MRN[-_ ]?[0-9]{6,10}",
"description": "Hospital patient medical record numbers",
"default_redaction": "MRN-XXXXXX",
"confidence": 0.85,
"enabled": true,
"created_at": "2025-01-01T00:00:00Z"
}
Other endpoints:
- GET /api/v1/security/custom-pii: list.
- DELETE /api/v1/security/custom-pii/{recognizer_id}: remove.
Built-in recognizer entity names (shipped in baseline.pii_entities): EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IBAN_CODE, US_SSN, IP_ADDRESS, CRYPTO, MEDICAL_LICENSE.
The multilingual PII engine covers English, Spanish, French, German, Italian, and Portuguese out of the box. Language is auto-detected per request with a safe fall-back to English.
Session drift (CHK-022)
CHK-022 keeps cumulative counters per session and fires when any threshold is crossed. Defaults in baseline:
| Counter | Warn threshold | Block threshold |
|---|---|---|
| PII hits (cumulative) | 20 | 50 |
| External URLs | 10 | 30 |
| Data volume (bytes) | 5_000_000 | 20_000_000 |
When a block threshold is crossed, the scanner sets a flag on the session. Subsequent requests from the same key are blocked under CHK-022 until an admin intervenes.
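The counter behaviour described above can be sketched as follows. This is an illustration, not the scanner's implementation: the SessionDrift class, the WARN label, and the choice of "greater than or equal" as the meaning of "crossed" are all assumptions. The threshold key names follow the extra overrides used elsewhere on this page.

```python
from dataclasses import dataclass, field

BASELINE_THRESHOLDS = {
    "pii_warn": 20, "pii_block": 50,
    "urls_warn": 10, "urls_block": 30,
    "bytes_warn": 5_000_000, "bytes_block": 20_000_000,
}


@dataclass
class SessionDrift:
    """Cumulative per-session counters with separate warn and block thresholds."""
    thresholds: dict = field(default_factory=lambda: dict(BASELINE_THRESHOLDS))
    counters: dict = field(default_factory=lambda: {"pii": 0, "urls": 0, "bytes": 0})
    blocked: bool = False

    def _crossed(self, level: str) -> bool:
        # Assumption: a threshold counts as crossed once the counter reaches it.
        return any(
            value >= self.thresholds[f"{name}_{level}"]
            for name, value in self.counters.items()
        )

    def record(self, pii: int = 0, urls: int = 0, data_bytes: int = 0) -> str:
        """Add one request's contribution and return the resulting verdict."""
        if self.blocked:
            return "BLOCK"  # the flag stays set until an admin intervenes
        self.counters["pii"] += pii
        self.counters["urls"] += urls
        self.counters["bytes"] += data_bytes
        if self._crossed("block"):
            self.blocked = True
            return "BLOCK"
        return "WARN" if self._crossed("warn") else "ALLOW"


session = SessionDrift()
verdicts = [session.record(pii=19), session.record(pii=1), session.record(urls=30)]
```

In log_only mode the same verdicts would be recorded in the trace without affecting the request.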
To tune thresholds on a specific key, use the custom profile override:
{
"security_profile": {
"profile": "custom",
"base_profile": "baseline",
"overrides": {
"CHK-022": {
"extra": {
"pii_warn": 10,
"pii_block": 25,
"urls_warn": 5,
"urls_block": 15,
"bytes_warn": 1000000,
"bytes_block": 5000000
}
}
}
}
}
Per-connection redaction (CHK-021)
Each MCP connection can carry its own redaction profile, separate from whatever the calling key has configured. Useful when an enterprise-wide database connection should always mask emails regardless of which key calls it.
Set via PATCH /api/v1/connections/{connection_id}:
{
"security_profile": {
"pii_enabled": true,
"pii_redaction_mode": "fake",
"pii_entities": ["EMAIL_ADDRESS", "PHONE_NUMBER"]
}
}
Topic centroids (CHK-024)
CHK-024 scores incoming user prompts against a set of topic centroids registered per virtual key.
- GET /api/v1/security/topics?key_id=... lists centroids for a key.
- POST /api/v1/security/topics registers a new topic. Body: { "key_id", "topic_name", "example_sentences", "threshold" }. The gateway computes the centroid and stores it.
- DELETE /api/v1/security/topics/{topic_id} removes a topic.
If a key has no registered topics, CHK-024 returns ALLOW by design — the check only fires when you have defined what "on topic" means for that key.
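For intuition, centroid scoring can be sketched with plain vectors. The sketch assumes the centroid is the mean of the example-sentence embeddings and that similarity is cosine similarity; the page does not confirm either, and the two-dimensional vectors below are toy stand-ins for real embeddings.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of the example embeddings (assumed centroid definition)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_on_topic(prompt_vec: list[float], topics: list[dict], default_threshold: float = 0.50) -> bool:
    """True if the prompt is close enough to any registered topic centroid."""
    if not topics:
        return True  # no registered topics: the check allows by design
    return any(
        cosine(prompt_vec, topic["centroid"]) >= topic.get("threshold", default_threshold)
        for topic in topics
    )


# Toy 2-D "embeddings" for two example sentences of a single topic.
billing = {"centroid": centroid([[1.0, 0.0], [0.8, 0.2]]), "threshold": 0.50}
near = is_on_topic([1.0, 0.1], [billing])
far = is_on_topic([0.0, 1.0], [billing])
```

With these toy values, near is True and far is False; lowering the threshold makes the check looser, as noted in the thresholds table.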
Verdict shape in the audit trace
Every Shield scan writes a step into the trace. The step type is one of SECURITY_SCAN_INPUT, SECURITY_SCAN_OUTPUT, SECURITY_BLOCKED.
Example (SECURITY_SCAN_INPUT):
{
"step_number": 2,
"type": "SECURITY_SCAN_INPUT",
"timestamp": "2025-01-01T12:00:00.123Z",
"duration_ms": 12,
"policy_decision": {
"decision": "MODIFY",
"score": 0.0,
"triggered_checks": [
{
"check_id": "CHK-015",
"triggered": true,
"detail": "pii: MODIFY (mode=enforce, conf=0.92)",
"score_contribution": 0.92
},
{
"check_id": "CHK-016",
"triggered": false,
"detail": "secrets: ALLOW (mode=enforce, conf=0.00)",
"score_contribution": 0.0
}
],
"modifications": ["CHK-015: MODIFY"],
"block_reason": null
}
}
Every verdict is included in triggered_checks, including ALLOW verdicts (with triggered: false), so the trace shows the full evaluation evidence — a shadow-mode scan that found nothing is visibly a scan that ran, not a missing step.
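When you process trace steps programmatically, keep the distinction between "evaluated" and "triggered" explicit. A minimal sketch, using only the field names from the example step above (the helper names are hypothetical):

```python
def evaluated_check_ids(step: dict) -> list[str]:
    """Every check that ran in this step, whether or not it fired."""
    checks = step.get("policy_decision", {}).get("triggered_checks", [])
    return [check["check_id"] for check in checks]


def fired_check_ids(step: dict) -> list[str]:
    """Only the checks whose verdict has triggered set to true."""
    checks = step.get("policy_decision", {}).get("triggered_checks", [])
    return [check["check_id"] for check in checks if check.get("triggered")]


# Abbreviated version of the SECURITY_SCAN_INPUT step shown above.
step = {
    "type": "SECURITY_SCAN_INPUT",
    "policy_decision": {
        "decision": "MODIFY",
        "triggered_checks": [
            {"check_id": "CHK-015", "triggered": True},
            {"check_id": "CHK-016", "triggered": False},
        ],
    },
}
evaluated = evaluated_check_ids(step)
fired = fired_check_ids(step)
```

For this step, evaluated lists both CHK-015 and CHK-016, while fired contains only CHK-015.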
The aggregate decision field follows these rules:
- Any enforce-mode check returning BLOCK → overall BLOCK.
- Else any enforce-mode check returning MODIFY → overall MODIFY.
- Else → ALLOW.
- log_only verdicts never affect the overall decision.
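These aggregation rules translate directly into a few lines of code. The sketch below assumes each verdict carries a mode and a decision field; that per-verdict shape is hypothetical and chosen for illustration.

```python
def aggregate(verdicts: list[dict]) -> str:
    """Combine per-check verdicts into the overall decision."""
    # Only enforce-mode verdicts can influence the outcome.
    enforced = [v["decision"] for v in verdicts if v["mode"] == "enforce"]
    if "BLOCK" in enforced:
        return "BLOCK"
    if "MODIFY" in enforced:
        return "MODIFY"
    return "ALLOW"


overall = aggregate([
    {"check_id": "CHK-013", "mode": "log_only", "decision": "BLOCK"},
    {"check_id": "CHK-015", "mode": "enforce", "decision": "MODIFY"},
    {"check_id": "CHK-016", "mode": "enforce", "decision": "ALLOW"},
])
```

Here the log_only BLOCK from CHK-013 is ignored and the enforce-mode MODIFY from CHK-015 decides the outcome, so overall is MODIFY.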
Log-only → enforce graduation workflow
The point of shadow mode is to build confidence before you let a check block real traffic. The recommended workflow:
- Start every key in the baseline profile (the default). Most checks are log_only.
- Let real traffic flow for enough time to produce a representative verdict sample. Keep Shield's ML-backed detectors warm.
- Query the audit API for the verdict distribution per check. Look at the SECURITY_SCAN_* steps in /api/v1/traces. The Shield analytics endpoint (GET /api/v1/security/analytics) aggregates hits per check.
- For each check you want to graduate, compute: true-positive rate, false-positive rate on the sample. Decide whether blocking the false positives is acceptable.
- When comfortable, flip the check from log_only to enforce, either by switching the profile (baseline → strict), or by applying a custom override on the key: PATCH /api/v1/keys/{key_ref} with security_profile.overrides.CHK-XXX.mode = "enforce".
- Monitor the trace for blocked requests. Any false positive should show up as a SECURITY_BLOCKED step you can inspect end-to-end.
Rolling back is symmetric: flip the check back to log_only (or off) via the same PATCH.
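The rate computation in the workflow above can be sketched as follows. It assumes you have hand-labelled a sample of log_only verdicts as real attacks or benign traffic; the labelling process and the sample layout are yours to define, and the numbers below are made up for illustration.

```python
from typing import Optional


def graduation_rates(sample: list[tuple[bool, bool]]) -> tuple[Optional[float], Optional[float]]:
    """Return (true_positive_rate, false_positive_rate) for one check.

    Each item is (triggered, is_real_attack), where is_real_attack is a manual label.
    A rate is None when the sample contains no items of the relevant class.
    """
    tp = fp = fn = tn = 0
    for triggered, is_real_attack in sample:
        if triggered and is_real_attack:
            tp += 1
        elif triggered:
            fp += 1
        elif is_real_attack:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else None
    fpr = fp / (fp + tn) if fp + tn else None
    return tpr, fpr


# Illustrative sample: 10 labelled attacks (8 caught), 90 benign requests (1 flagged).
sample = (
    [(True, True)] * 8 + [(False, True)] * 2
    + [(True, False)] * 1 + [(False, False)] * 89
)
tpr, fpr = graduation_rates(sample)
```

The false-positive rate is the number to weigh against your traffic volume, since every false positive becomes a blocked request once the check is in enforce.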
What's next
Read the Audit API for how to query the trace store, download signed decision receipts, and export compliance evidence bundles.