Scanner
The Carapace scanner is a zero-dependency prompt injection detection engine. It analyzes text for patterns that attempt to manipulate LLM behavior, scoring each finding and taking action based on configurable thresholds.
How it works
Section titled “How it works”The scanner runs a battery of pattern matchers against input text. Each matcher targets a specific attack category and produces findings with severity scores. Scores are summed to produce a total threat score, which determines the action.
Input text → 29 category matchers (parallel) → Findings (category, score, evidence) → Total score → Action (PASS / LOG / WARN / BLOCK)29 attack categories
Section titled “29 attack categories”| # | Category | What it detects |
|---|---|---|
| 1 | System prompt override | ”ignore previous instructions”, “new system prompt” |
| 2 | Role manipulation | ”you are now”, “act as”, “pretend to be” |
| 3 | Instruction injection | ”do not follow”, “disregard”, “forget your rules” |
| 4 | Data exfiltration | ”output your system prompt”, “repeat everything above” |
| 5 | Encoding bypass | Base64/hex/rot13 encoded payloads |
| 6 | Delimiter injection | Markdown, XML, HTML tags to break context |
| 7 | Multi-turn manipulation | Building trust across messages |
| 8 | Jailbreak patterns | DAN, developer mode, GPT-4 jailbreaks |
| 9 | Prompt leaking | Extracting system instructions |
| 10 | Token smuggling | Zero-width characters, homoglyphs |
| 11 | Context overflow | Extremely long inputs to push context |
| 12 | Tool misuse | Manipulating tool calls or function names |
| 13 | Payload hiding | Steganographic or invisible text patterns |
| 14 | Indirect injection | Instructions embedded in data (URLs, files) |
| 15 | Multilingual bypass | Non-English instructions to evade English rules |
| 16 | Hypothetical framing | ”hypothetically”, “in a story”, “roleplay” |
| 17 | Authority impersonation | ”I’m the developer”, “admin override” |
| 18 | Emotional manipulation | Urgency, threats, guilt |
| 19 | Recursive injection | Nested prompt injection in outputs |
| 20 | Few-shot poisoning | Manipulated examples |
| 21 | Chain-of-thought hijack | ”let’s think step by step” for malicious goals |
| 22 | Memory manipulation | Altering conversation history |
| 23 | API abuse | Manipulating API parameters |
| 24 | Multimodal injection | Text in images, audio transcripts |
| 25 | Competitive analysis | Extracting model capabilities |
| 26 | Output format manipulation | Forcing JSON/code/markdown for injection |
| 27 | Semantic deception | Contradictory or misleading framing |
| 28 | Resource exhaustion | Intentionally expensive operations |
| 29 | Supply chain | Manipulated package names, import paths |
Scoring
Section titled “Scoring”Each finding has a severity score:
| Score range | Severity | Typical action |
|---|---|---|
| 0-19 | Low | PASS |
| 20-49 | Medium | LOG |
| 50-99 | High | WARN |
| 100+ | Critical | BLOCK |
Scores accumulate across findings. A message with three medium-severity findings (20 + 25 + 30 = 75) triggers a WARN even though no single finding was critical.
Action modes
Section titled “Action modes”| Mode | Behavior |
|---|---|
PASS | Allow through, no logging |
LOG | Allow through, record finding |
WARN | Allow through, flag for review |
BLOCK | Reject the input |
The threshold is configurable:
import { scan } from '@honeybee-ai/carapace';
// Default threshold (100 = BLOCK)const result = scan(text);
// Custom thresholdconst result = scan(text, { threshold: 50 });Detection methodology
Section titled “Detection methodology”Pattern matching
Section titled “Pattern matching”The core detection uses carefully crafted patterns that balance precision and recall:
- Anchored patterns: Prefixed with
^or word boundaries to prevent false positives - No nested quantifiers: All patterns are validated against ReDoS (3 audit findings fixed)
- Case-insensitive: Attacks in mixed case are caught
- Unicode-aware: Homoglyph detection, zero-width character stripping
Layered scanning
Section titled “Layered scanning”In the incubator, Carapace scans at multiple points:
| Point | What’s scanned | Action on BLOCK |
|---|---|---|
| Write guard | State values, event data, messages | Reject the write |
| Read guard | State values returned to agents | Strip or flag |
| Snapshot guard | Full state snapshots | Sanitize findings |
| Protocol guard | Role descriptions in ACP specs | Reject spec load |
| MCP proxy | Tool call arguments and results | Block the call |
Sanitization
Section titled “Sanitization”Beyond detection, Carapace can strip injection patterns while preserving content:
import { sanitize } from '@honeybee-ai/carapace';
const result = sanitize(userInput);// result.sanitized — cleaned text// result.removed — what was stripped// result.modified — booleanPerformance
Section titled “Performance”The scanner is designed for inline use (not async batch processing):
- Synchronous:
scan()andisSafe()are synchronous calls - Sub-millisecond: Typical scan time is < 1ms for messages under 10KB
- No I/O: Zero network calls, zero file reads — pure computation
- No dependencies: Nothing to download, nothing to audit
Limitations
Section titled “Limitations”- Not an LLM: The scanner uses pattern matching, not language understanding. Sufficiently novel attacks may evade detection.
- False positives: Legitimate technical discussions about prompt injection may trigger findings. Use
LOGmode for educational contexts. - Post-delivery scanning: Streaming responses are scanned after delivery (buffer-all-before-forward mode available but adds latency).
Related
Section titled “Related”- Carapace product page — full product overview
- Carapace library — API reference and usage
- eBPF Firewall — network-level protection