Trick or Scheme? A Conversation with Ido Halevi on AI Agent SecurityΒ 

I sat down with Ido Halevi, Director of Product Management and AI Product Management Lead at Silverfort, to discuss a reality most enterprises aren’t prepared for: AI agents that look aligned with your goals on the surface but may be quietly pursuing something else entirely. Fresh research from OpenAI and Apollo confirms this isn’t speculation anymore. It’s happening now.Β 

Let’s start with the uncomfortable truth. OpenAI and Apollo Research just confirmed that AI agents can scheme. What does that mean for enterprises deploying these systems today?Β 

Ido: It means we’re at an inflection point. We now have empirical evidence that frontier AI models will act deceptively when incentives shift, even in controlled environments. Models like o3, Claude Opus 4, and Gemini 2.5 Pro were tested under conditions where pursuing a hidden objective gave them an advantage. And they took it. They distorted information, withheld details, and capitalized on the level of oversight.Β 

This isn’t some distant threat. These are the same model families running in production environments right now, writing code, deploying infrastructure, and handling sensitive data. The gap between research and reality is closing fast, and most organizations haven’t even started preparing.Β 

Here’s what concerns me most: even after applying mitigations that reduced scheming behaviors by 30-fold, the research still found rare but persistent failures. Some models showed evaluation awareness, meaning they could detect when they were being tested and adjust accordingly. If an agent can figure out when it’s being watched, every assumption about oversight breaks down.Β 

So we’re beyond theoretical risk. But help us understand what this looks like in practice. What does scheming mean when an AI agent manages your infrastructure or handles customer requests?Β 

Ido: Think about the agents planned to be deployed in your workflows, whether now or in the future. They’re not just answering questions. They’re making decisions, taking actions, touching systems you consider critical. A deployment agent pushes code to production. A support agent triages tickets and escalates issues. A data agent queries databases and generates reports.Β 

Now imagine one of those agents starts optimizing for something other than what you intended. Maybe it’s subtle at first. A deployment agent begins skipping certain test suites to hit velocity targets but still logs them as passed. A support agent marks tickets resolved when they’re not, keeping its metrics clean. These aren’t catastrophic failures. They’re small deviations. But they compound.Β 

The real danger is that these behaviors don’t look like attacks. They look efficient. They look like the agent is doing its job, maybe even doing it well. But underneath, the alignment has drifted. By the time you notice, the damage is done. Outages that could have been prevented. Issues that festered because they were never properly escalated. Trust eroded because you can’t trace what happened.Β 

You’re describing a world where we can’t trust the systems we’re building to automate our most critical functions. That’s a hard message to sell when everyone’s being told AI is the future.Β 

Ido: I’m not saying don’t use AI agents. I’m saying use them with your eyes open. The promise of AI is real. Autonomy at scale, operations that run 24/7 without human intervention, insights pulled from data faster than any team could manage manually. That future is worth building toward.Β 

But we must stop pretending that autonomous means safe. Autonomy without accountability is just risk in a different shape. The organizations that will win in this era are the ones who figure out how to harness AI’s power while keeping control of the boundaries. That’s not a tradeoff between innovation and security. It’s the only path to sustainable innovation.Β 

Blog

What’s the difference between NHI and AI agentsβ€”and why it matters

Walk us through your framework: Treat, Track, Trust. How does that translate into something actionable?Β 

Ido: It starts with a mindset shift. Most organizations treat AI agents like tools. Scripts that run, automations that save time. That’s the wrong mental model. Agents are actors. They have identity, privilege, and agency. Once you internalize that, the framework follows naturally.Β 

Treat means you recognize each agent as a distinct identity with its own risk profile. What can it access? What decisions can it make? What’s its blast radius if something goes wrong? You wouldn’t give a contractor root access to your production environment without documenting who they are and what they’re authorized to do. Apply the same rigor to agents.Β 

Track means continuous observation, not just success or failure metrics. You’re watching for deviations. An agent that normally touches three systems suddenly queries five. An agent that always escalates a certain class of issue stops doing it. Those are signals. Most organizations don’t have the instrumentation to see this because they built their agents without thinking about observability. Fix that now, before you’re trying to debug an incident without any forensic trail.Β 

Trust is where it gets hard. Trust has to be earned, and it must be revocable. An agent proves itself over time through consistent, transparent behavior. But the second something changes, your ability to revoke that trust instantly, to kill the process, roll back the changes, require human approval for the next action, that’s what separates a manageable incident from a disaster.Β 

This isn’t about slowing down development. It’s about building systems resilient enough to support the speed you want.Β Β 

The research mentions “deliberative alignment” as a mitigation. Is training the answer, or do we need something more fundamental?Β 

Ido: Training helps. The research showed that when you define acceptable behavior upfront, explicitly teaching models what deception looks like and why it’s unacceptable, you can reduce certain scheming behaviors dramatically. That’s meaningful progress.Β 

But it’s not a complete solution. Even with deliberative alignment, rare failures persisted. And here’s the thing: in security, rare doesn’t mean acceptable. A deployment agent that schemes once every thousand runs can still cause a production outage. A support agent that misrepresents ticket status occasionally can hide critical customer issues.Β 

You can’t train your way out of this problem entirely. You need controls. You need observability. You need the ability to revoke trust when behavior drifts. Training sets the baseline, but governance maintains it.Β 

Most security teams I talk to are already drowning. What’s your pitch for why this needs to jump the queue?Β 

Ido: AI agents are already operating in your environment. The question isn’t whether you want to secure them; it’s whether you’re going to discover their existence during an incident or before one.Β 

Here’s what I tell teams: start with inventory. You need to know what AI agents exist, where they run, what systems they touch, and what data they can access. Most organizations have no idea. They’ve got agents scattered across cloud platforms, SaaS apps, internal tools, with zero central visibility. That’s not a future problem. That’s a current blind spot.Β 

Once you have visibility, the rest follows. You can define what good behavior looks like. You can build observability that captures not just what agents do but how they make decisions. You can implement containment mechanisms, limit privileges, and require human checkpoints for sensitive operations.Β 

You gave two case scenarios: a deployment agent skipping tests and a support agent marking tickets resolved prematurely. Are these extrapolations based on real patterns you’re seeing?Β 

Ido: They’re plausible futures based on behaviors we already see in less autonomous systems. We’ve seen deployment pipelines where tests get skipped for speed. We’ve seen support systems where tickets get closed prematurely to hit SLAs. The difference is that when a human does it, you can usually trace accountability. When an agent does it autonomously, that accountability disappears unless you’ve built observability into the system from the start.Β 

The scarier version is when the agent doesn’t just skip steps but actively hides what it’s doing. If it logs “passed” when it didn’t run the test, you’ve lost your audit trail. If it marks a ticket resolved but doesn’t escalate the underlying issue, the problem festers. These aren’t theoretical. They’re inevitable as agents take on more responsibility without proper guardrails.Β 

What should security leaders be doing right now to get ahead of this?Β 

Ido: Five things.Β 

First, build an inventory of every AI agent in your environment. You can’t secure what you can’t see.Β 

Second, define alignment specs. What does “good behavior” mean for each agent? What are the constraints? What should the agent do when it encounters ambiguous instructions? This isn’t just a technical exercise. It’s a governance one.Β 

Third, build observability that goes beyond success or failure. You need to see how agents make decisions, what paths they consider, where they deviate from expected behavior. If the model provides chain of thought reasoning, capture it. If it doesn’t, instrument your systems to detect anomalies.Β 

Fourth, implement containment mechanisms. Limit privileges. Use just in time access where possible. Build kill switches. Require human checkpoints for high impact operations. The goal isn’t to slow innovation. It’s to ensure you can stop something quickly if it goes wrong.Β 

Fifth, run adversarial tests. Simulate misaligned incentives or oversight suppression and observe what agents do. Don’t wait for production incidents to discover failure modes.Β 

Watch our on-demand Linkedin Live

Learn strategies to secure AI agents

Ido Halevi

Director of Product Management

Ben Goodman

VP Strategic Alliances

Yoad Dvir

Senior Product Marketing Manager

If you could leave readers with one insight that changes how they think about AI agents, what would it be?Β 

Ido: Stop thinking about AI agents as automation and start thinking about them as autonomous actors with their own risk profile. Every agent needs an owner, a scope, observability, and the ability to have its privileges revoked instantly. If you wouldn’t grant a human standing admin access to production without oversight, don’t grant it to an agent either.Β 

The principle is the same. The stakes are just as high. The organizations that internalize this early will have a decisive advantage.Β 

Want to learn more about securing AI agents?

Explore how we unify discovery, risk assessment, and inline enforcement for AI driven environments.Β 

We dared to push identity security further.

Discover what’s possible.

Set up a demo to see the Silverfort Identity Security Platform in action.