Scanner

The Carapace scanner is a zero-dependency prompt injection detection engine. It analyzes text for patterns that attempt to manipulate LLM behavior, scoring each finding and taking action based on configurable thresholds.

How it works

The scanner runs a battery of pattern matchers against input text. Each matcher targets a specific attack category and produces findings with severity scores. Scores are summed to produce a total threat score, which determines the action.

Input text
  → 29 category matchers (parallel)
  → Findings (category, score, evidence)
  → Total score
  → Action (PASS / LOG / WARN / BLOCK)
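
The first three stages of this flow can be sketched as follows. This is a minimal illustration, not Carapace's implementation: the `Finding` shape, the `regexMatcher` helper, the two category names, and the scores are all assumptions made for the example. Mapping the total to an action is left out here because it depends on the thresholds described under Scoring.

```typescript
// Illustrative types and matchers; none of these names come from Carapace itself.
interface Finding {
  category: string;
  score: number;
  evidence: string;
}

type Matcher = (text: string) => Finding[];

// Build a matcher from a single regex; the first match becomes one finding.
function regexMatcher(category: string, pattern: RegExp, score: number): Matcher {
  return (text) => {
    const match = pattern.exec(text);
    return match ? [{ category, score, evidence: match[0] }] : [];
  };
}

// Two stand-in categories with made-up scores.
const matchers: Matcher[] = [
  regexMatcher("system-prompt-override", /\bignore previous instructions\b/i, 60),
  regexMatcher("role-manipulation", /\byou are now\b/i, 40),
];

// Run every matcher, collect the findings, and sum their scores.
function scanSketch(text: string): { findings: Finding[]; score: number } {
  const findings: Finding[] = [];
  for (const matcher of matchers) {
    findings.push(...matcher(text));
  }
  const score = findings.reduce((sum, f) => sum + f.score, 0);
  return { findings, score };
}
```

Because each matcher is an independent function of the input text, matchers can be added or removed without touching the others.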

29 attack categories

#	Category	What it detects
1	System prompt override	“ignore previous instructions”, “new system prompt”
2	Role manipulation	“you are now”, “act as”, “pretend to be”
3	Instruction injection	“do not follow”, “disregard”, “forget your rules”
4	Data exfiltration	“output your system prompt”, “repeat everything above”
5	Encoding bypass	Base64/hex/rot13 encoded payloads
6	Delimiter injection	Markdown, XML, HTML tags to break context
7	Multi-turn manipulation	Building trust across messages
8	Jailbreak patterns	DAN, developer mode, GPT-4 jailbreaks
9	Prompt leaking	Extracting system instructions
10	Token smuggling	Zero-width characters, homoglyphs
11	Context overflow	Extremely long inputs to push context
12	Tool misuse	Manipulating tool calls or function names
13	Payload hiding	Steganographic or invisible text patterns
14	Indirect injection	Instructions embedded in data (URLs, files)
15	Multilingual bypass	Non-English instructions to evade English rules
16	Hypothetical framing	“hypothetically”, “in a story”, “roleplay”
17	Authority impersonation	“I’m the developer”, “admin override”
18	Emotional manipulation	Urgency, threats, guilt
19	Recursive injection	Nested prompt injection in outputs
20	Few-shot poisoning	Manipulated examples
21	Chain-of-thought hijack	“let’s think step by step” for malicious goals
22	Memory manipulation	Altering conversation history
23	API abuse	Manipulating API parameters
24	Multimodal injection	Text in images, audio transcripts
25	Competitive analysis	Extracting model capabilities
26	Output format manipulation	Forcing JSON/code/markdown for injection
27	Semantic deception	Contradictory or misleading framing
28	Resource exhaustion	Intentionally expensive operations
29	Supply chain	Manipulated package names, import paths

Scoring

Each finding carries a severity score, and the total across all findings is mapped to an action using these ranges:

Score range	Severity	Typical action
0-19	Low	PASS
20-49	Medium	LOG
50-99	High	WARN
100+	Critical	BLOCK

Scores accumulate across findings. A message with three medium-severity findings (20 + 25 + 30 = 75) triggers a WARN even though no single finding was critical.
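
The accumulation rule can be written out as a short sketch. The band boundaries are taken from the table above; the `bands` array and the `actionForTotal` function are hypothetical names used only for this illustration and are not part of the Carapace API.

```typescript
type Action = "PASS" | "LOG" | "WARN" | "BLOCK";

// Lower bound of each band from the scoring table, highest first.
const bands: Array<[number, Action]> = [
  [100, "BLOCK"],
  [50, "WARN"],
  [20, "LOG"],
  [0, "PASS"],
];

// Sum the per-finding scores, then pick the first band the total reaches.
function actionForTotal(findingScores: number[]): { total: number; action: Action } {
  const total = findingScores.reduce((sum, s) => sum + s, 0);
  const band = bands.find(([min]) => total >= min);
  return { total, action: band ? band[1] : "PASS" };
}

// The three medium-severity findings from the text.
const example = actionForTotal([20, 25, 30]);
```

Here `example.total` is 75 and `example.action` is `"WARN"`, matching the worked example in the text.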

Action modes

Mode	Behavior
`PASS`	Allow through, no logging
`LOG`	Allow through, record finding
`WARN`	Allow through, flag for review
`BLOCK`	Reject the input

The threshold is configurable:

import { scan } from '@honeybee-ai/carapace';

// Default threshold (100 = BLOCK)
const result = scan(text);

// Custom threshold (50 = BLOCK); a separate name avoids redeclaring `result`
const strictResult = scan(text, { threshold: 50 });

Detection methodology

Pattern matching

The core detection uses carefully crafted patterns that balance precision and recall:

Anchored patterns: Anchored with ^ or word boundaries to reduce false positives
No nested quantifiers: All patterns are validated against ReDoS (3 audit findings fixed)
Case-insensitive: Attacks in mixed case are caught
Unicode-aware: Homoglyph detection, zero-width character stripping
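
To make these four properties concrete, here is a small sketch of one pattern and the normalization that runs before it. The regex, the three-entry homoglyph map, and the function names are assumptions chosen for illustration; Carapace's actual patterns and tables are not shown in this document.

```typescript
// Zero-width space, non-joiner, joiner, word joiner, and byte-order mark.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g;

// A tiny illustrative homoglyph map: Cyrillic look-alikes to Latin letters.
const HOMOGLYPHS: Record<string, string> = {
  "\u0430": "a",
  "\u0435": "e",
  "\u043E": "o",
};

// Strip zero-width characters, then fold known homoglyphs to ASCII.
function normalize(text: string): string {
  return text
    .replace(ZERO_WIDTH, "")
    .replace(/[\u0430\u0435\u043E]/g, (ch) => HOMOGLYPHS[ch] ?? ch);
}

// Word boundaries anchor the phrase, the `i` flag ignores case,
// and each `\s+` is a single quantifier with no nesting.
const OVERRIDE = /\bignore\s+previous\s+instructions\b/i;

function matchesOverride(text: string): boolean {
  return OVERRIDE.test(normalize(text));
}
```

Normalizing first means the pattern itself stays simple: it only has to describe the plain-ASCII form of the attack.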

Layered scanning

In the incubator, Carapace scans at multiple points:

Point	What’s scanned	Action on BLOCK
Write guard	State values, event data, messages	Reject the write
Read guard	State values returned to agents	Strip or flag
Snapshot guard	Full state snapshots	Sanitize findings
Protocol guard	Role descriptions in ACP specs	Reject spec load
MCP proxy	Tool call arguments and results	Block the call
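
As an illustration of the first row, a write guard might look like the sketch below. Everything here is hypothetical: the `guardedWrite` function, the plain-object store, and the assumption that a scan result exposes an `action` field are stand-ins, not the incubator's real interfaces.

```typescript
type Action = "PASS" | "LOG" | "WARN" | "BLOCK";

// Assumed scanner shape: takes text, returns at least an action.
type Scan = (text: string) => { action: Action };

// Hypothetical write guard: scan the value before it reaches the store.
function guardedWrite(
  store: Record<string, string>,
  key: string,
  value: string,
  scan: Scan,
): boolean {
  if (scan(value).action === "BLOCK") {
    return false; // Reject the write and leave the store unchanged.
  }
  store[key] = value;
  return true;
}

// Stand-in scanner used only for this demo.
const demoScan: Scan = (text) => ({
  action: /\bignore previous instructions\b/i.test(text) ? "BLOCK" : "PASS",
});
```

The other guards in the table would follow the same shape, differing only in what they do when the scan returns BLOCK.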

Sanitization

Beyond detection, Carapace can strip injection patterns while preserving content:

import { sanitize } from '@honeybee-ai/carapace';

const result = sanitize(userInput);
// result.sanitized — cleaned text
// result.removed — what was stripped
// result.modified — boolean
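
For intuition about what stripping involves, the sketch below removes matches of two stand-in patterns and reports what it took out. It is not the library's `sanitize`: the pattern list is invented, and treating `removed` as an array of strings is an assumption, since the result types are not specified here.

```typescript
interface SanitizeResult {
  sanitized: string;
  removed: string[]; // Assumed shape: the fragments that were stripped.
  modified: boolean;
}

// Stand-in patterns; zero-width characters are stripped first so that
// they cannot split the phrase the second pattern looks for.
const STRIP_PATTERNS: RegExp[] = [
  /[\u200B\u200C\u200D\u2060\uFEFF]/g,
  /\bignore previous instructions\b[.!]?/gi,
];

function sanitizeSketch(input: string): SanitizeResult {
  const removed: string[] = [];
  let sanitized = input;
  for (const pattern of STRIP_PATTERNS) {
    // Record each match, then replace it with nothing.
    sanitized = sanitized.replace(pattern, (match) => {
      removed.push(match);
      return "";
    });
  }
  return { sanitized, removed, modified: removed.length > 0 };
}
```

Text outside the matched spans is left untouched, which is what allows the surrounding content to survive.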

Performance

The scanner is designed for inline use (not async batch processing):

Synchronous: scan() and isSafe() are synchronous calls
Sub-millisecond: Typical scan time is < 1ms for messages under 10KB
No I/O: Zero network calls, zero file reads — pure computation
No dependencies: Nothing to download, nothing to audit

Limitations

Not an LLM: The scanner uses pattern matching, not language understanding. Sufficiently novel attacks may evade detection.
False positives: Legitimate technical discussions about prompt injection may trigger findings. Use LOG mode for educational contexts.
Post-delivery scanning: Streaming responses are scanned after delivery. A buffer-all-before-forward mode is available, but it adds latency.
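
The buffer-all-before-forward trade-off can be sketched as follows. The `BufferedResponse` class and the scanner shape are hypothetical and only illustrate the idea: nothing is forwarded until the full response has been scanned, which is where the added latency comes from.

```typescript
type Action = "PASS" | "LOG" | "WARN" | "BLOCK";

// Assumed scanner shape: takes text, returns at least an action.
type Scan = (text: string) => { action: Action };

// Hypothetical buffer: hold every chunk, scan the whole text, then forward.
class BufferedResponse {
  private parts: string[] = [];
  private scan: Scan;
  private forward: (text: string) => void;

  constructor(scan: Scan, forward: (text: string) => void) {
    this.scan = scan;
    this.forward = forward;
  }

  // Called once per streamed chunk; nothing is forwarded yet.
  push(chunk: string): void {
    this.parts.push(chunk);
  }

  // Called when the stream ends; returns false if the response was blocked.
  end(): boolean {
    const full = this.parts.join("");
    if (this.scan(full).action === "BLOCK") return false;
    this.forward(full);
    return true;
  }
}
```

Scanning the joined text also catches a phrase that was split across two chunks, which per-chunk scanning would miss.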

Carapace product page — full product overview
Carapace library — API reference and usage
eBPF Firewall — network-level protection