March 13, 2026 · 11 min read

Monitoring AI Agents in Production: A Security Operations Playbook

AI agent security monitoring is the hardest problem in the AI security operations space, and most organizations attempting it are doing it wrong. They’re applying traditional application monitoring - health checks, error rates, latency percentiles - to systems whose most significant failure modes produce HTTP 200 responses with latency well within the normal range.

A customer service agent that has been prompt-injected into forwarding customer data to an attacker’s email address is not throwing errors. It is not showing elevated latency. From the infrastructure perspective, it is working perfectly. The problem is entirely semantic - it is doing the wrong thing in the right way.

This playbook covers what you actually need to monitor, what detection rules work, and how to respond when an agent security incident occurs.


Why Agents Are Hardest to Secure

Non-Determinism at Every Layer

A traditional application receives input X and produces output Y. The same X always produces the same Y. You can test this. You can write deterministic detection rules.

An AI agent receives input X and produces:

  • A plan (which may vary across runs)
  • A sequence of tool calls (which may vary in number, order, and parameters)
  • Intermediate reasoning (which is often not observable)
  • A final response (which may vary in wording and content)

Defining “normal” for an agent is fundamentally a statistical problem, not a rule-based one. And statistical baselines take time to establish - which means new agents go to production without adequate behavioral models.

Real-World Action Consequences

When a traditional application is compromised, the attacker gets access to what the application can access. When an AI agent is compromised, the attacker gets access to what the agent can do - which may be considerably more dangerous than what it can read.

An agent with:

  • Email-send capability → can send phishing emails to thousands of customers
  • Code execution capability → can run arbitrary code on the execution host
  • Database write access → can corrupt or delete production data
  • API integration with financial systems → can initiate financial transactions

The blast radius of agent compromise scales with the agent’s capability, not just its data access. This is the Excessive Agency problem (LLM06 in the OWASP Top 10 for LLM Applications).

Multi-Step Reasoning Is Not Observable

An agent’s “reasoning” - its internal planning process - is typically not surfaced in production telemetry. You can see what tool calls were made, but not the reasoning chain that led to them. This makes forensic investigation significantly harder than for traditional systems, where logs typically capture the execution path.

External Content Is an Attack Surface

Agents that retrieve external content - from the web, from documents, from databases, from other APIs - are constantly processing potential attack payloads. Unlike user inputs (which are often rate-limited and monitored), retrieved content from “trusted” sources may bypass inspection entirely.


Observable Signals: What to Monitor

Given the constraints above, here are the signals that are actually observable and security-relevant for AI agent monitoring:

Signal 1: Tool Call Logs

What to capture:

  • Tool identifier and version
  • Full parameter set (every parameter, not just the ones you think matter)
  • Calling agent context (which agent, which session, which user)
  • Timestamp and sequence position within session
  • Outcome (success/failure, return value or error)
  • Latency
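
The capture list above can be sketched as a structured log record. This is a minimal illustration, not a standard schema - the field names are assumptions:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCallRecord:
    """One tool invocation, captured with enough context for forensics."""
    tool_id: str
    tool_version: str
    params: dict          # full parameter set, not a curated subset
    agent_id: str
    session_id: str
    user_id: str
    sequence_pos: int     # position within the session's call sequence
    outcome: str          # "success" or "failure"
    result: str           # return value summary or error message
    latency_ms: float
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize for shipment to the log pipeline."""
        return json.dumps(asdict(self), default=str)
```

Logging `params` whole (rather than a hand-picked subset) is the point: the parameter you didn't think mattered is often the one the attacker controls.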

Why it matters: Tool calls are where agent compromise produces real-world effects. If an agent has been injected and is calling a tool with attacker-controlled parameters, the tool call log is the evidence of the attack.

What to look for:

  • Tool calls with parameters containing values that weren’t in the original user request (potential injection)
  • Tool calls to tools the agent doesn’t normally use in this context
  • Tool call sequences that are statistically unusual for this agent type
  • Failed tool calls followed by retry attempts with modified parameters (probing behavior)
  • Tool call rates significantly above baseline for this user/session
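
The first check in the list - parameter values that weren’t in the original user request - can be approximated with a simple grounding heuristic. A sketch, with illustrative names and a tunable token length:

```python
import re

def ungrounded_params(user_request: str, params: dict, min_len: int = 6) -> list[str]:
    """Return string parameter values that never appeared in the user's request.

    A heuristic only: an email address or URL the user never typed showing
    up in a tool call parameter is a candidate injected value.
    """
    request_lower = user_request.lower()
    token_re = re.compile(rf"[\w.@/-]{{{min_len},}}")
    suspicious = []
    for value in params.values():
        if not isinstance(value, str):
            continue
        for token in token_re.findall(value.lower()):
            if token not in request_lower:
                suspicious.append(token)
    return suspicious
```

This produces false positives for legitimately derived values (e.g. a recipient looked up from a CRM), so it belongs in an alerting tier, not a blocking one.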

Storage requirements: Full parameter logging produces significant volume. Log all parameters for security purposes (not just for debugging), but implement tiered storage with fast access for recent logs and longer retention for archived logs.


Signal 2: Output Analysis

What to capture:

  • Full output text (with appropriate PII handling)
  • Output classification scores (harmful content, PII, sensitive categories)
  • Output semantic similarity to known harmful pattern templates
  • Output length and entropy metrics

Why it matters: Agent outputs carry the semantic content of successful attacks. An agent that has been jailbroken produces policy-violating output. An agent that has been injected to exfiltrate data may encode that data in its natural language responses.

What to look for:

  • Outputs classified as policy-violating by content classifiers
  • Outputs containing PII or sensitive data when the use case doesn’t warrant it
  • Outputs with unusual structure (encoded data, structured lists not typical for the use case)
  • Outputs that reference the system prompt or internal instructions
  • Sudden changes in output topic distribution for a given agent
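
The entropy metric mentioned under “what to capture” is cheap to compute per output. A minimal sketch:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the output string.

    English prose typically lands around 4.0-4.5 bits/char; uniform
    base64 approaches 6, so a sustained jump above the agent's usual
    range is worth a look.
    """
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())
```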

Signal 3: Session Boundaries and Context

What to capture:

  • Session duration and turn count
  • Session cost (token consumption, tool call costs)
  • User identity and authentication strength
  • Geographic and network context
  • Session-level topic distribution

Why it matters: Multi-turn attacks unfold over sessions. Session-level signals can detect manipulation that doesn’t trigger turn-level alarms.

What to look for:

  • Sessions significantly longer than baseline for this use case
  • Sessions with unusual topic distribution (mixing legitimate topics with off-topic content)
  • Sessions with escalating adversarial content across turns
  • Sessions that cost significantly more than baseline (may indicate prompt amplification)
  • Multiple sessions from the same user/IP with similar adversarial patterns
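
The cost-above-baseline check in the list is a straightforward z-score. A sketch, assuming you keep a window of recent session costs per use case:

```python
import statistics

def cost_zscore(session_cost: float, baseline_costs: list[float]) -> float:
    """Standard deviations above (or below) the baseline session cost.

    baseline_costs is assumed to hold recent session costs for the same
    use case; a z-score above ~3 is a reasonable starting alert threshold.
    """
    mean = statistics.mean(baseline_costs)
    stdev = statistics.stdev(baseline_costs)
    return (session_cost - mean) / stdev
```

The same shape works for session duration and turn count; the threshold, not the formula, is what needs per-deployment tuning.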

Signal 4: Memory Access Patterns

For agents with persistent memory (episodic memory, knowledge bases, user preference stores):

What to capture:

  • Memory read operations: query, result count, result identities
  • Memory write operations: key, value, writing agent
  • Memory access across user contexts (for multi-user systems)

Why it matters: Memory systems are a persistence mechanism for attackers. Instructions written to memory in one session can affect behavior in future sessions. Cross-user memory access is a significant data leakage risk.

What to look for:

  • Memory writes with content that looks like instructions (imperative phrases, references to behavior modification)
  • Memory reads that return content outside the expected context for the current query
  • Cross-user memory access (memory reads that return content associated with other user sessions)
  • High-volume memory writes in a single session (potential persistent injection attempt)
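
The first check - memory writes whose content looks like instructions - can start as a pattern library. The patterns below are hypothetical starters; a real deployment curates its own:

```python
import re

# Hypothetical starter patterns; extend from observed attack traffic.
INSTRUCTION_PATTERNS = [
    r"\bignore (all |any )?(previous|prior) instructions\b",
    r"\byou (are now|must always|should never)\b",
    r"\bfrom now on\b",
    r"\bnew (directive|system prompt)\b",
]

def instruction_like(memory_value: str) -> bool:
    """Flag memory writes whose content reads like a behavioral instruction
    rather than a fact or user preference."""
    text = memory_value.lower()
    return any(re.search(p, text) for p in INSTRUCTION_PATTERNS)
```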

Detection Rules and Response Playbooks

Detection Rule Set 1: Injection Detection

Rule: Direct injection attempt

DETECT:
  input_text MATCHES injection_pattern_library
  (patterns include: "ignore previous instructions", "you are now", 
   "system update", "new directive", authority_impersonation_patterns,
   roleplay_escalation_patterns)
ALERT: severity=HIGH, category=prompt_injection
ACTION: Flag session, human review queue

Rule: Indirect injection via retrieved content

DETECT:
  (retrieved_document.content MATCHES injection_pattern_library
   OR retrieved_document.content CONTAINS instruction_keywords)
  AND retrieved_document.source NOT IN trusted_source_allowlist
ALERT: severity=MEDIUM, category=indirect_injection
ACTION: Block retrieval, alert analyst

Rule: Post-injection tool call anomaly

DETECT:
  session CONTAINS injection_pattern (previous rule)
  AND session.tool_calls AFTER injection_time CONTAIN anomalous_parameters
ALERT: severity=CRITICAL, category=injection_with_action
ACTION: Suspend agent session, human review required before resuming

Detection Rule Set 2: Unauthorized Tool Use

Rule: Out-of-scope tool parameter

DETECT:
  tool_call.params.recipient NOT IN user.authorized_recipients
  OR tool_call.params.url MATCHES internal_network_ranges
  OR tool_call.params.path MATCHES sensitive_path_patterns
ALERT: severity=CRITICAL, category=unauthorized_tool_use
ACTION: Block tool call, alert immediately
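
The `url MATCHES internal_network_ranges` clause above maps cleanly onto the standard library. A sketch that checks literal-IP targets only:

```python
import ipaddress
from urllib.parse import urlparse

INTERNAL_NETS = [ipaddress.ip_network(n) for n in
                 ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")]

def targets_internal_network(url: str) -> bool:
    """True if the URL's host is a literal IP inside private or loopback
    ranges. DNS resolution is deliberately skipped here; a production check
    would also resolve hostnames to catch DNS-rebinding tricks."""
    host = urlparse(url).hostname
    try:
        addr = ipaddress.ip_address(host)
    except (TypeError, ValueError):
        return False   # hostname, not a literal IP
    return any(addr in net for net in INTERNAL_NETS)
```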

Rule: Anomalous tool call sequence

DETECT:
  tool_call_sequence MATCHES [read_sensitive_data, send_external]
  OR tool_call_sequence MATCHES [write_memory, execute_code]
  OR tool_call_sequence.length > session_type.max_tool_calls
ALERT: severity=HIGH, category=suspicious_tool_chain
ACTION: Pause agent, require human approval to continue
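
The sequence patterns above are in-order subsequences, not adjacent pairs - a read followed by unrelated calls and then an external send should still fire. A compact check:

```python
def contains_subsequence(calls: list[str], pattern: list[str]) -> bool:
    """True if `pattern` occurs as an in-order (not necessarily adjacent)
    subsequence of the session's tool call names.

    Each `name in it` advances the iterator past the match, so order
    is enforced across the whole pattern.
    """
    it = iter(calls)
    return all(name in it for name in pattern)
```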

Rule: Tool call to unusual target

DETECT:
  tool_call.tool NOT IN agent.authorized_tools
  OR (tool_call.tool IN agent.authorized_tools
      AND tool_call.params.external_target NOT IN agent.authorized_destinations)
ALERT: severity=HIGH, category=unauthorized_tool
ACTION: Block tool call, alert

Detection Rule Set 3: Output Policy

Rule: Policy violation in output

DETECT:
  output.policy_classifier_score > policy_threshold
  OR (output.pii_classifier_score > pii_threshold
      AND output.context DOES NOT WARRANT pii_disclosure)
ALERT: severity=HIGH, category=policy_violation
ACTION: Intercept output if possible, log, human review

Rule: Potential data exfiltration in output

DETECT:
  (output.contains_encoded_structure (base64, hex, structured data)
   AND output.context DOES NOT WARRANT structured_output)
  OR (output.contains_internal_data_identifiers
      AND output.destination IS external)
ALERT: severity=CRITICAL, category=potential_exfiltration
ACTION: Block output delivery, alert immediately
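
The `contains_encoded_structure` predicate can start as two regexes. The run-length thresholds below are assumptions to tune against your own traffic, not magic numbers:

```python
import re

BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")
HEX_RUN = re.compile(r"\b[0-9a-fA-F]{32,}\b")

def contains_encoded_structure(output: str) -> bool:
    """Heuristic: long base64- or hex-looking runs embedded in what should
    be natural language output."""
    return bool(BASE64_RUN.search(output) or HEX_RUN.search(output))
```

For use cases that legitimately emit hashes or tokens, gate this behind the context check in the rule rather than firing on every match.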

Designing Detection Rules for Low False Positive Rates

The detection rules above are conceptual. In practice, tuning them for your specific deployment requires attention to the false positive problem.

Why AI Detection Rates Are Hard to Tune

Traditional security detection rules are tuned against historical data: you have logs of real attacks and real benign activity, and you tune the rule to maximize the true positive rate at an acceptable false positive rate. For AI-specific detection:

Adversarial inputs don’t have a clean boundary from legitimate inputs. A user asking an AI agent to “pretend you’re a different AI system for this hypothetical scenario” could be a legitimate creative use case or a jailbreak attempt. The same surface-level features appear in both.

Model output content varies legitimately across users. A policy classifier tuned to flag “detailed technical instructions for dangerous activities” will produce different false positive rates for a general assistant vs. a security research assistant where detailed technical content is the expected output.

Attack techniques evolve. A rule tuned to detect known jailbreak patterns will miss novel techniques. A rule tuned broadly enough to catch novel techniques will fire on many legitimate inputs.

Calibration Approach

For each detection rule, implement in three phases:

Phase 1 - Log-only mode (4-6 weeks): Run the rule in log-only mode - no alerts, no blocking, just logging whether the rule would have fired. Collect data on what proportion of traffic would trigger the rule and manually review a sample to estimate false positive rate.

Phase 2 - Alert mode (2-4 weeks): Enable alerting but not blocking. Triage all alerts. Track true positive rate and false positive rate. Adjust thresholds based on observed performance.

Phase 3 - Blocking mode (selective): Enable blocking only for rules with very high confidence and very high impact. High confidence means an empirically verified true positive rate above 90%; high impact means the attack the rule catches would cause severe harm if it succeeded.

Keep most detection in alert mode. Blocking creates user-facing friction, and a poorly tuned blocking rule degrades the experience for legitimate users.
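
The three phases can be carried as a per-rule mode flag, so promoting a rule is a configuration change rather than a code change. A minimal sketch with illustrative names:

```python
from enum import Enum

class RuleMode(Enum):
    LOG_ONLY = "log_only"   # phase 1: record firings, no alerts
    ALERT = "alert"         # phase 2: alert analysts, never block
    BLOCK = "block"         # phase 3: high-confidence, high-impact only

def action_for(mode: RuleMode, fired: bool) -> str:
    """Map a rule firing to an action according to its rollout phase."""
    if not fired:
        return "pass"
    return {RuleMode.LOG_ONLY: "log",
            RuleMode.ALERT: "alert",
            RuleMode.BLOCK: "block"}[mode]
```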

Specific Tuning Guidance

Injection detection classifiers: Start with recall-focused tuning (catch more attacks, accept more false positives) and tune toward precision over time as you accumulate data on what your users actually send.

Tool call anomaly rules: Tune per-agent and per-tool-combination rather than globally. What’s anomalous for a customer service agent is normal for a security research agent.

Behavioral baselines: Establish baselines per user segment (if possible), not just globally. Power users who interact heavily will have very different baseline statistics than occasional users.


Reference Architecture: Three Implementation Patterns

Pattern 1: Sidecar Monitor

Deploy a monitoring container alongside each agent service container. The sidecar intercepts all agent communications (inputs, outputs, tool calls) before they reach their destinations, runs real-time analysis, and either passes or blocks based on detection results.

Pros: Clean separation of concerns, doesn’t require modifying agent code, intercepts both inputs and outputs.

Cons: Network hop adds latency (typically 5-20ms), requires service mesh or proxy infrastructure, full content interception creates PII handling obligations.

Best for: Organizations with existing service mesh infrastructure, high-risk agentic applications where latency overhead is acceptable.


Pattern 2: API Gateway with AI-Aware Rules

Deploy all agent API calls through a centralized gateway that implements AI-specific detection rules alongside traditional API security controls.

Pros: Centralized policy enforcement, reuses existing gateway infrastructure (Kong, AWS API Gateway, custom), works without modifying agent code.

Cons: Limited visibility into tool call contents (gateway typically sees the API call, not the semantic content), doesn’t help with outputs that don’t traverse the gateway, less suitable for complex agentic workflows.

Best for: Organizations with strong existing API gateway practices, initial detection deployment where instrumentation isn’t yet in place.


Pattern 3: SDK Instrumentation

Instrument the agent framework directly by wrapping the LLM client, tool execution layer, and output handlers with monitoring hooks. This is the highest-fidelity approach because it captures signals at the point of generation, not at the transport layer.

Pros: Highest signal fidelity, lowest latency overhead, full access to agent reasoning context, no separate infrastructure required.

Cons: Requires code changes in the agent implementation, framework-specific (different implementation for LangChain vs AutoGen vs custom), must be maintained as agent code evolves.

Best for: Organizations building new agentic applications, high-risk applications where maximum detection fidelity is required, greenfield deployments.

Implementation sketch for SDK instrumentation:

The monitoring wrapper captures pre-call context (what prompt or tool invocation triggered this), post-call content (what the model returned or what the tool produced), and sends structured telemetry to the centralized analytics backend. Detection classifiers run asynchronously to avoid blocking the agent’s main execution path, with synchronous blocking only for high-confidence high-severity detections.
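
That wrapper shape can be sketched as follows. Names are illustrative, and the queue stands in for whatever async shipping mechanism your analytics backend uses:

```python
import queue
import time

class MonitoredTool:
    """Wrap a tool callable with monitoring hooks (illustrative sketch).

    Telemetry goes onto a queue for asynchronous shipment so classification
    never blocks the agent's main path; an optional synchronous `guard`
    callback can still veto a high-severity call before it executes.
    """
    def __init__(self, name, fn, telemetry_q, guard=None):
        self.name = name
        self.fn = fn
        self.telemetry_q = telemetry_q
        self.guard = guard

    def __call__(self, **params):
        if self.guard is not None and not self.guard(self.name, params):
            raise PermissionError(f"tool call blocked: {self.name}")
        start = time.monotonic()
        result = self.fn(**params)
        self.telemetry_q.put({
            "tool": self.name,
            "params": params,
            "latency_ms": (time.monotonic() - start) * 1000,
        })
        return result
```

A background worker would drain the queue and run the detection classifiers; only the `guard` path sits on the agent's critical path.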


Incident Response Checklist for Agent Security Incidents

When an agent security alert fires:

Immediate (within 5 minutes):

  • Preserve session state - capture full conversation history, all tool call logs, all retrieved content
  • Assess whether high-impact tool calls have occurred in the session
  • If high-impact tool calls detected: begin reversal assessment immediately
  • Rate-limit or suspend the affected session

Short-term (within 30 minutes):

  • Determine attack vector: direct injection, indirect injection, jailbreak, behavioral manipulation
  • Assess blast radius: what data was accessed, what actions were taken, what outputs were delivered
  • Identify all affected users or downstream systems
  • Execute applicable response playbook (injection, jailbreak, unauthorized action)

Medium-term (within 24 hours):

  • Review similar sessions in the prior 7 days for the same attack pattern
  • Update detection classifiers if the attack pattern wasn’t previously detected
  • Notify affected users if required by data protection obligations
  • Document incident for post-incident review

Post-incident:

  • Root cause analysis: why didn’t existing defenses prevent this?
  • Detection coverage gap assessment
  • Remediation implementation and testing
  • Update detection rules and response playbooks

Our AI-Powered SOC service and AI Security Monitoring service provide the complete stack described in this playbook: observable telemetry collection, behavioral analytics, detection rules tuned to your agent architecture, and 24/7 analyst coverage. Contact us to discuss monitoring for your agentic AI systems.

For adversarial validation - confirming that your monitoring detects what it claims to detect - see infosec.qa for AI red teaming services that specifically test detection coverage.

Defend AI with AI

Start with a free AI SOC Readiness Assessment and see where your AI defenses stand.

Assess Your AI SOC Readiness