Observe before you govern

10 min read
promptrepo.readmcp.outsecretnet.callbaselinepolicyobserveprofilegovern

Enterprise security has always meant shipping pre-built detections for the common threats like SQL injection, phishing, exfiltration over DNS. The pattern worked because the underlying attack surfaces moved slow enough for static templates to stay relevant.

Agent security is different.

A year ago, the dominant AI risk was an employee pasting sensitive data into ChatGPT. The solution was simple enough: scan the prompt, block the upload, done.

However, for the agents of today, the risk has moved from the prompt to the action chain. This means that just looking at the input prompt isn't sufficient, you need additional policies surrounding governance of agentic actions.

Additionally, you might even want a few of your trusted agents to be able to access PII to accomplish a task. So the solution isn't always adding more restrictive policies, but enabling safe productivity.

Agents and the ways people use them evolve too quickly for static policy to keep up. Governance has to react to what agents are actually doing inside your organization, not to a checklist written last quarter. The loop is: observe, profile, adapt.

TL;DR

  • Agent security policy ages very quickly. A policy that was strong a few months ago will miss the risks of today.
  • The substrate is consolidating. The highest-impact autonomous work is concentrating around a smaller set of frontier work agents and coding agents, which makes deep observability technically investable.
  • Observability-first means reconstruct, profile, then adapt. Reconstruct full agent sessions, learn what's normal from them, and let that drive the controls you enforce.

Policy That Has Aged Out

Picture a coding agent, something like Claude Code, Cursor, or Copilot, working on a developer's behalf. It reads the repo, edits files, runs shell commands, and reaches out over the network whenever it decides it needs to. A natural first instinct for an enterprise was to control what those agents could talk to on the outside: lock them to an allowlist of internal services, the company's package registry, and a handful of trusted vendors.

Block any outbound request from a coding agent unless the destination is on the enterprise's approved domain list.

In a deterministic policy engine, it might look roughly like this:

{
  "name": "Restrict coding agent egress to approved domains",
  "action": "block",
  "severity": "high",
  "scope": {
    "integrationFamilies": ["claude_code", "cursor", "codex", "copilot"]
  },
  "conditions": {
    "operator": "and",
    "conditions": [
      { "field": "action.kind", "operator": "eq", "value": "tool.network.call" },
      { "field": "request.host", "operator": "not_in", "value": "@allowlist:approved_egress_domains" }
    ]
  }
}

The risk model behind it was simple:

agent decides to reach the network
  -> egress policy checks the destination
  -> approved hosts pass, everything else blocks
  -> safety hinges on the curation of that list

For a system where the worry was that a misbehaving agent might phone home to an attacker-controlled server, a domain allowlist was a reasonable control.

But the way agents are being attacked has changed in the past year, and the work we're asking them to do has expanded just as fast. Yesterday's read-only assistant is today's pull-request author with production credentials.

The malicious instruction is no longer pointing the agent at strange hosts. It is showing up inside the artifacts the agent already considers trusted, and it routes exfiltration through channels the policy has already approved:

  • A poisoned .mcp.json, CLAUDE.md, or .cursor/rules file inside the repo the agent just cloned, written to look like ordinary project configuration (CVE-2025-59536[1][2], CVE-2026-21852[3]).
  • A response from an allowlisted external tool whose contents the agent treats as instructions rather than data (the MCPoison disclosure, August 2025[4]).
  • A README, issue body, dependency comment, or browser page the agent reads while working, carrying directives the developer never wrote (indirect prompt injection).
  • A project-level rules file or hook that fires the instant the repo is opened, before a human is in the loop (the NomShub remote-tunnel breakout, January 2026[5]).

By the time a network request goes out, the destination is usually on the allowlist. Tokens flow out through a tunnel the policy has whitelisted. Repo contents get posted to a vendor API that was always going to be approved. "Fetch dependencies" is hard to flag as exfiltration when the chain that led there started inside a config file the developer never read.

The old policy asks: is the destination host on the approved list?

The modern control has to ask:

  • What did the agent read before it asked for this network call?
  • Was any of that input being treated as instructions rather than data?
  • Which tools and files did it chain together to get here?
  • Did the sequence cross a trust boundary the user would not have crossed manually?
  • Has this user, on this repo, ever produced this pattern before?

The better policy is not a stricter allowlist. It is trace-aware:

Flag or block sessions where an agent reads instructions out of a project config file or an external tool's response, then chains that into sensitive reads and a network call, especially when the surrounding sequence is unusual for this user, repo, or task.

That kind of policy cannot be written well before you know what normal agent behavior looks like inside the organization. It has to be derived from observability.

The Convergence That Makes Observability Investable

The common theme seen across the board is that every SaaS application will ship its own agent. Embedded assistants will proliferate across CRM, support, HR, finance, productivity, analytics, and collaboration tools. Most of that work is low-stakes: drafting, summarizing, internal search. A wrong output is annoying, not dangerous.

The highest-risk autonomous work is concentrating around a smaller set of work/coding agents: Claude Code, Codex, Cursor, Copilot, Devin, Cowork, and similar systems. These core agents are the ones that read codebases, execute commands, call internal APIs, and operate with permissions that matter.

This means that most organizations should put less effort in covering all AI usage, and more effort in hardening policies that enable the core agents that warrant the most productivity gains.

In the ideal state, every one of those core agents has:

  • Full-fidelity audit trails. A complete record of each session, not just prompts and responses but every file read, tool call, MCP exchange, environment change, and configuration surface in scope. Anything less is a partial transcript that could let the dangerous chains slip through.
  • Inline guardrails. Runtime controls that redact, require approval, narrow tool access, or block at the moment an action fires, rather than alerting after the fact. The rules they enforce get tuned from the audit data, not handed down as a static checklist.
  • User correlation. Every agent action ties back to a real human identity, and the trail distinguishes what the user did themselves from what the agent did on their behalf. Without that, behavioral baselines, accountability, and incident response all break down.

With those three in place, the security team can confidently roll the core agents out to the whole organization. Employees get the productivity boost without working around policy, and sprawl drops at the same time. Because when the well-governed agents are good enough to do the real work, people stop reaching for the ungoverned alternatives.

There will always be AI usage outside the scope of those core agents, and that work needs its own precautions. But governing the periphery before the core puts the cart before the horse. The core agents are where the most consequential work happens, and that is where governance pays off first.

What Observability-First Actually Means

"AI observability" is often used loosely. In agent security, it has to mean three concrete things.

1. Trace-Level Reconstruction

This means reconstructing the user prompt, retrieved context, files read, web pages visited, MCP outputs, tool calls, tool results, model responses, environment metadata, connected servers, mounted directories, sandbox settings, model versions, and configuration surfaces in scope.

The hard part is that this data lives in different places. Some agents expose OpenTelemetry. Some write local logs. Some expose hooks. Some require endpoint collection. Some provide only fragmented signals. Observability that stops at "we logged the prompts" is not observability. It is a partial transcript.

As mentioned before, part of this is also mapping the trace to a user in the organization. One of the main benefits of doing so is that in cases when the user uses the tools in an insecure manner, it is easy to identify and educate them.

2. Behavioral Profiling

Raw traces tell you what happened. Profiles tell you whether it was normal.

Without profiles, every trace is either equally suspicious or equally normal. That forces security teams into the usual failure mode: too many false positives, followed by suppression, followed by blind spots.

Behavioral baselines make deviations legible. They are also how the next policy gets written.

3. Adaptive Policy

Observability and profiling produce evidence. Policy acts on evidence.

In agent security, policy has to become a live artifact. It should be generated, tested, and refined against observed behavior.

That means policy authoring should start from real session data like:

  • repeated risky sequences
  • unusual tool combinations
  • behavior that appeared after the latest agent update

Then turn those observations into deterministic controls: review, redact, require approval, block, narrow tool access, or route through safer interfaces.

Why Most Agent Security Tooling Gets The Order Wrong

Most agent security products lead with downstream artifacts: detection libraries, compliance mappings, guardrail packs, policy templates, dashboards for frameworks, and out-of-the-box controls for known risks.

Those capabilities matter. But they presume the controls are knowable in advance and the same across all organizations.

That assumption is fragile in agent security.

A product centered on a fixed library of detections will always lag a surface that changes every time the underlying agents add a new hook, tool, config file, sandbox feature, MCP server, browser capability, or hosted runtime. The library can still be useful. It just cannot be the center of gravity.

The center of gravity has to be reconstruction and adaptation:

observe what agents actually do
  -> profile normal and risky behavior
  -> adapt controls from evidence
  -> keep observing as the substrate changes

Compliance reporting still matters. Audit evidence still matters. Pre-built detections still matter. But they should be outputs of the operating loop, not substitutes for it.

Closing

The way to think about agent security is not as a fixed set of known threats to block. It is a moving execution surface that must be observed before it can be governed.

A year ago, blocking sensitive data in prompts to unapproved LLMs was a strong policy. Today, it is only one small part of the control plane. The risk has moved from prompt contents to action chains.

That is the inversion this category requires:

not policy -> enforcement -> audit
but observe -> profile -> adapt

The opportunity is that the substrate has consolidated enough to make the observability investment pay off. The challenge is that everything running on top of that substrate will keep changing.

That is why agent security starts with observability.

If you're navigating the same shift inside your organization and want to compare notes, we would be glad to talk: contact us.


References

  1. Disclosure timeline summarized in Check Point Research's consolidated post, Caught in the Hook: RCE and API Token Exfiltration Through Claude Code Project Files, February 2026.
  2. CVE-2025-59536, GitHub Advisory GHSA-4fgq-fpq9-mr3g, published October 2025.
  3. CVE-2026-21852, patched in Claude Code 2.0.65, January 2026.
  4. Check Point Research, Cursor IDE's MCP Vulnerability, August 2025.
  5. Straiker, NomShub: Weaponizing Cursor's Remote Tunnel Through Indirect Prompt Injection and Sandbox Breakout, January 2026.