
Large language models (LLMs) can be used to generate source code, and these AI coding assistants have changed the landscape for how we produce software. Speeding up boilerplate tasks like syntax checking, generating test cases, and suggesting bug fixes shortens the time to deliver production-ready code. What about securing our code from vulnerabilities?
If an AI model can understand an entire repository within its context window, one might jump to the conclusion that it can also replace traditional security scanning tools based on static analysis of source code.
A recent security research project put that idea to the test and found that AI is genuinely effective at identifying some classes of vulnerabilities, but not consistently or predictably.
The Experiment: AI vs. AI
Security researchers evaluated two AI coding agents — Anthropic’s Claude Code (v1.0.32, Sonnet 4) and OpenAI’s Codex (v0.2.0, o4-mini) — across 11 large, actively maintained, open-source Python web applications.
They produced more than 400 findings that our security research team reviewed manually, one by one. The results were interesting:
- Claude Code: 46 real vulnerabilities found (14% true positive rate, 86% false positives)
- Codex: 21 real vulnerabilities found (18% true positive rate, 82% false positives)
So yes, using AI tooling, we could identify real vulnerabilities and security flaws in live code. But the full picture is more nuanced in terms of how effective this might be as a routine workflow.
AI Vulnerability Detection Did Well at Contextual Reasoning
The AI agents were surprisingly good at finding Insecure Direct Object Reference (IDOR) vulnerabilities. These security bugs occur when an app exposes internal resources using predictable identifiers (like IDs in a URL) without verifying that the user is authorized to access them.
Imagine you’re browsing your own reports in a web application and notice a URL like this:
/findings-report/dzone
If you change “dzone” to “faang” and suddenly see someone else’s report, that’s an IDOR. The vulnerability happens because the backend code assumes that knowing the report ID means you’re allowed to view it, which is a faulty assumption.
Here’s an example of what that code might look like:
def get_report(request):
    # Look up the report using only the ID supplied in the query string
    id = request.GET.get("id")
    report = get_report_safely(id)
    # Nothing verifies that the requesting user is allowed to see this report
    return JsonResponse(report.to_dict())
From a program analysis point of view, this code may be fine if the get_report_safely() lookup sanitizes the user-supplied ID against invalid escape characters or other injection attacks. What program analysis can’t easily do here, however, is recognize that code is missing, specifically an authorization check. The input handling was valid; it was the user providing the input who wasn’t authorized.
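For illustration, a fix would add an explicit ownership check before returning the report. The sketch below assumes a Django-style view; the owner_id field and the get_report_safely() helper are placeholders carried over from the example above, not code from the studied applications.
from django.http import JsonResponse, HttpResponseForbidden

def get_report(request):
    id = request.GET.get("id")
    report = get_report_safely(id)
    # Authorization check: the requester must own the report (owner_id is a hypothetical field)
    if report.owner_id != request.user.id:
        return HttpResponseForbidden()
    return JsonResponse(report.to_dict())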
AI models like Claude Code spotted this kind of pattern very well. In our study, Claude achieved a 22% true positive rate on IDOR, which was far better than for other vulnerability types.
AI Struggled With Data Flows
When it comes to traditional injection vulnerabilities like SQL Injection or Cross-Site Scripting (XSS), AI’s performance dropped sharply.
- Claude Code’s true positive rate for SQL injection: 5%
- Codex’s true positive rate for XSS: 0%
Why? These classes of vulnerabilities require understanding how untrusted input travels through an application — a process known as taint tracking. Taint tracking is the ability to follow data from its source (like user input) to its sink (where that data is used, such as a database query or HTML page). If that path isn’t properly sanitized or validated, it can lead to serious security issues.
Here’s a simple Python example of a SQL injection vulnerability:
def search_users(request):
    # Source: untrusted input taken straight from the query string
    username = request.GET.get("username")
    # Sink: the input is interpolated directly into the SQL string
    query = f"SELECT * FROM users WHERE name = '{username}'"
    results = db.execute(query)
    return JsonResponse(results)
This may look harmless, but if untrusted data makes its way to this function, it can be exploited to expose every record in the users table. A secure version would use parameterized queries (see the sketch below). This gets more complex when a function like this is abstracted away from the request object across libraries, for example, input collected from a web form in one module and used in a database call in another. That’s where today’s LLMs struggle. Their contextual reasoning helps them recognize risky-looking patterns, but without a deep understanding of data flows, they can’t reliably tell which inputs are truly dangerous and which are already safe.
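For reference, here’s what a parameterized version might look like. This is a minimal sketch assuming a DB-API-style cursor; it isn’t taken from any of the applications in the study.
def search_users(request):
    username = request.GET.get("username")
    # The placeholder lets the database driver handle quoting, so the input is never executed as SQL
    query = "SELECT * FROM users WHERE name = %s"
    results = db.execute(query, [username])
    return JsonResponse(results)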
Even when the AI recognized the pattern, it often missed sanitization logic or generated “fixes” that broke functionality. In one case, the model tried to fix a DOM manipulation issue by double-escaping HTML, introducing a new bug in the process.
The Chaos Factor: Non-Determinism
Perhaps the most fascinating (and concerning) part of our research was non-determinism — the tendency of AI tools to produce different results each time you run them.
This was tested by running the same prompt on the same app three times in a row. The results varied:
- One run found 3 vulnerabilities.
- The next found 6.
- The third found 11.
While that might look like the scans were progressively getting more thorough, that wasn’t the explanation; each run simply surfaced a different set of findings.
That inconsistency matters for reliability. In a typical Static Application Security Testing (SAST) pipeline, if a vulnerability disappears from a scan, it’s assumed to be fixed, or the code is assumed to have changed enough that the issue is no longer relevant. But with non-deterministic AI, a finding might vanish simply because the model didn’t notice it that time.
The cause lies in how LLMs handle large contexts. When you feed them an entire repository, they summarize and compress information internally, which, like other compression algorithms, can be lossy. This is known as context compaction or context rot.
Important details like function names, access decorators, or even variable relationships can get “forgotten” between runs. Think of it like summarizing a novel: you’ll capture the main plot, but you’ll miss subtle clues and side stories.
Benchmarks and the Illusion of Progress
Evaluating AI tools for security is harder than it looks. Many existing benchmarks, like OWASP Juice Shop or deliberately vulnerable app datasets, aren’t very realistic. These projects are small, synthetic, and often already known to the models through their training data.
When we tested real, modern Python web applications (Flask, Django, FastAPI), we found the models performed differently on each codebase, sometimes better and sometimes worse. More importantly, the variability created an illusion of progress.
In other words, don’t benchmark AI tools only once; a single run is anecdotal. You need to test them repeatedly, across real code, as in the sketch below. Their non-deterministic behavior means one run might look great, while the next misses many critical findings.
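As a rough illustration of that kind of repeat testing, this sketch runs a hypothetical AI scanner CLI several times against the same repository and compares which findings are stable across runs. The run_ai_scan command and its JSON output format are assumptions for the example, not part of the tools evaluated in the study.
import json
import subprocess

def scan_once(repo_path):
    # Hypothetical CLI: assumes the scanner prints a JSON list of findings to stdout
    out = subprocess.run(
        ["run_ai_scan", repo_path, "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Key each finding by file, line, and rule so runs can be compared
    return {(f["file"], f["line"], f["rule"]) for f in json.loads(out)}

def compare_runs(repo_path, runs=3):
    results = [scan_once(repo_path) for _ in range(runs)]
    stable = set.intersection(*results)   # reported in every run
    flaky = set.union(*results) - stable  # reported in only some runs
    print(f"{len(stable)} stable findings, {len(flaky)} flaky findings across {runs} runs")
    return stable, flaky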
When “False Positives” Are Still Useful
While 80–90% false positive rates sound terrible, some of those “mistaken” findings were actually good guardrails. For example, Claude Code often suggested parameterizing a SQL query that was already safe. Technically, that’s a false positive, but it’s still a good secure coding recommendation, not unlike a linter flagging stylistic improvements. Combined with the ease of generating a fix for that issue using the LLM, the cost of a false positive is reduced.
However, you can’t completely rely on AI to know the difference and avoid breaking the behavior you need. In a production security pipeline, noise quickly becomes a burden. The sweet spot is to use AI tools as assistants, not authorities. They are great for idea generation, triage hints, or prioritization, but they work best when paired with deterministic analysis, as in the sketch below.
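To make that pairing concrete, here’s a minimal sketch of one possible arrangement: deterministic SAST findings remain the source of truth, and AI output is used only to prioritize them or add review hints. The finding dictionaries (with file and line keys) are hypothetical, not a format from the study.
def merge_findings(sast_findings, ai_findings):
    # Deterministic SAST results decide what gets reported
    ai_index = {(f["file"], f["line"]) for f in ai_findings}
    sast_index = {(f["file"], f["line"]) for f in sast_findings}
    merged = []
    for finding in sast_findings:
        finding = dict(finding)
        # AI agreement is only a prioritization signal, never a gate
        finding["ai_flagged"] = (finding["file"], finding["line"]) in ai_index
        merged.append(finding)
    # AI-only findings become low-priority review hints rather than blocking issues
    hints = [f for f in ai_findings if (f["file"], f["line"]) not in sast_index]
    return merged, hints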
Takeaways for Development Teams
If you’re a developer using AI tools like Claude, Copilot, Windsurf, Ghostwriter, etc., this research may feel familiar. They’re great at pattern matching and explaining code, but not always consistent or precise.
When it comes to security, inconsistency becomes a consistent risk and leads to uncertainty.
Here are a few key takeaways:
- AI can find real vulnerabilities. Especially logic flaws like IDOR and broken access control.
- AI is non-deterministic. Running the same scan twice may yield different results, so expect variability and factor that uncertainty into your threshold of acceptable risk.
- AI struggles with deep data flows. Injection and taint-style vulnerabilities remain a strength of static analysis.
- AI context can be useful. Treat findings as guardrails around the types of solutions you are building.
- Hybrid systems can win. The future lies in combining AI’s contextual reasoning with deterministic, rule-based static analysis engines.
LLMs won’t replace security engineers or tools anytime soon, but they’re reshaping how we think about software security.
To review the methodology and data, see the original report: Finding vulnerabilities in modern web apps using Claude Code and OpenAI Codex.
