v2.5 — Stability + Safety Hardening

AI images that
actually work

A security-hardened fork of Fooocus that runs on weak hardware, won't crash on low VRAM, and tells you exactly what it did. Scales from 16 GB RAM laptops to Apple Silicon M4.

4 Prompt modes
2 Safety layers
6 Hardware modes
8 Motion presets

Generate an image right now

Free. No account. No API key. Powered by Pollinations.ai with NSFW filtering enforced — the same safety principle as Cookie-Fooocus. For the full experience, install locally.

🌐
This demo Static site · client-side prompt filter · Pollinations.ai backend (external, uncontrolled)
vs
🍪
Full Cookie-Fooocus Local / server install · VRAM governor · priority queue · decision chain · 4-mode prompt engine · full 2-layer safety
🎨

Your image will appear here

Type a prompt above and hit Generate

Built different

Not more features — better architecture.

Full deployment only

Won't crash on low VRAM

Predictive VRAM model estimates memory before a job starts. If budget is tight, it auto-scales: steps first, then resolution, then precision. You get a result, not an OOM error.

estimated = 3.5GB + (megapixels × cost) + (steps × cost)
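The formula above can be sketched in a few lines. This is a minimal illustration of the predictive model and the documented downscale order (steps first, then resolution); the per-megapixel and per-step coefficients here are assumed placeholder values, not the project's calibrated ones.

```python
# Illustrative sketch of the predictive VRAM model. The base cost matches
# the formula above; the two coefficients are assumptions for illustration.
BASE_GB = 3.5
GB_PER_MEGAPIXEL = 1.2   # assumed coefficient
GB_PER_STEP = 0.02       # assumed coefficient

def estimate_vram_gb(width: int, height: int, steps: int) -> float:
    megapixels = (width * height) / 1_000_000
    return BASE_GB + megapixels * GB_PER_MEGAPIXEL + steps * GB_PER_STEP

def fit_to_budget(width, height, steps, budget_gb):
    """Auto-scale in the documented order: steps first, then resolution."""
    while estimate_vram_gb(width, height, steps) > budget_gb and steps > 10:
        steps -= 5                                           # 1) reduce steps
    while estimate_vram_gb(width, height, steps) > budget_gb and width > 512:
        width, height = int(width * 0.9), int(height * 0.9)  # 2) then resolution
    return width, height, steps
```

The point of the cascade is that the job always runs: parameters shrink until the estimate fits the budget, instead of the job failing with an OOM error.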
Full deployment only
🔍

No silent changes

Every job has a decision chain — an audit log of what each stage decided about your parameters. No more "something silently reduced my steps."

{"stage": "vram_model", "action": "reduce_steps", "reason": "predicted_vram_high"}
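A decision chain like the entry above can be modelled as an append-only log. This sketch is illustrative: the field names mirror the example entry, but the recorder API is an assumption, not the project's actual interface.

```python
# Hypothetical sketch: a decision chain as an append-only list of entries.
import json

class DecisionChain:
    def __init__(self):
        self.entries = []

    def record(self, stage: str, action: str, reason: str, **extra):
        # Every stage that touches the job's parameters appends one entry.
        entry = {"stage": stage, "action": action, "reason": reason, **extra}
        self.entries.append(entry)
        return entry

    def to_json(self) -> str:
        return json.dumps(self.entries, indent=2)

chain = DecisionChain()
chain.record("vram_model", "reduce_steps", "predicted_vram_high",
             steps_before=30, steps_after=20)
```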
🧠

4 prompt modes

RAW passes your prompt unchanged. BALANCED adds deterministic keyword expansion with no LLM. STANDARD uses the original Fooocus GPT-2 engine. LLM uses Ollama for creative rewrites. Every result shows what mode actually ran and why.

🛡️

2-layer safety

Layer 1 is fast deterministic rules — no ML, always active. Layer 2 is an ML classifier that only runs when Layer 1 passes. Image moderation is post-generation only — no wasted GPU cycles.

Full deployment only
📊

Real performance data

p50 and p95 per stage, VRAM peak tracking. The VRAM prediction model learns from actual usage using EWA smoothing — gets more accurate over time without oscillating.
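The EWA update itself is one line. A minimal sketch, assuming a smoothing factor of 0.2 (the project's actual value is not stated):

```python
# Exponentially weighted average correction for the VRAM prediction model.
ALPHA = 0.2  # assumed smoothing factor; smaller = slower, smoother adaptation

def ewa_update(current_estimate_gb: float, observed_gb: float) -> float:
    """Blend a new observation into the running estimate.
    A small alpha keeps the model from oscillating on outlier jobs."""
    return (1 - ALPHA) * current_estimate_gb + ALPHA * observed_gb
```

After three observations of 5.0 GB against an initial 4.0 GB estimate, the model has moved most of the way toward the observed value without jumping on any single sample.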

🔌

Safe by default

n8n webhooks are disabled by default. Auto-tune is opt-in. Telemetry collects data but never modifies behaviour unless explicitly enabled. Everything sensitive requires a deliberate config change.

Full deployment only
💾

L1/L2 cache hierarchy

In-memory LRU for speed, SQLite for persistence. Cold starts warm from disk automatically. Cache failures are non-fatal — a write error never fails a generation job.

Full deployment only
🎥

Video pipeline

Same UI, same safety, same queue — switch to video output in one click. 8 motion presets. Frame cost cap (96 frames max) prevents one request from running 144 sequential inference passes.

New in v2.5
🖥

Multi-GPU scheduling

GPU topology layer detects all CUDA / Metal devices. Jobs route to the least-loaded GPU. Per-GPU free VRAM and throughput scores updated after every job — no more queue stacking on one device.

New in v2.5
🌐

Distributed worker protocol

Control plane / execution plane split. Register local or remote HTTP workers. Job lease system with automatic reclaim on timeout. Foundation for home lab → production rendering farm scaling.

New in v2.5
🗄

Cache lifecycle hardening

SQLite prompt cache no longer grows indefinitely. Size-based eviction (soft cap 5k / hard cap 10k rows), age-based pruning (30-day TTL), async VACUUM compaction, and eviction metrics in stats().
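The eviction policy described above can be sketched against a plain SQLite table. Table and column names here are assumptions, not the project's actual schema:

```python
# Illustrative size- and age-based eviction for a SQLite prompt cache.
import sqlite3
import time

SOFT_CAP, HARD_CAP = 5_000, 10_000
TTL_SECONDS = 30 * 24 * 3600  # 30-day TTL

def evict(conn: sqlite3.Connection) -> int:
    evicted = 0
    # 1) Age-based pruning: drop anything past the TTL.
    cutoff = time.time() - TTL_SECONDS
    evicted += conn.execute(
        "DELETE FROM prompt_cache WHERE created_at < ?", (cutoff,)).rowcount
    # 2) Size-based eviction: above the soft cap, delete the oldest rows
    #    until the table is back under it.
    rows = conn.execute("SELECT COUNT(*) FROM prompt_cache").fetchone()[0]
    if rows > SOFT_CAP:
        excess = rows - SOFT_CAP
        conn.execute(
            "DELETE FROM prompt_cache WHERE rowid IN "
            "(SELECT rowid FROM prompt_cache ORDER BY created_at LIMIT ?)",
            (excess,))
        evicted += excess
    conn.commit()
    return evicted
```

In the real system the VACUUM compaction runs asynchronously afterwards so the reclaimed pages are returned to the filesystem without blocking generation jobs.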

New in v2.5
🔬

VRAM calibration tool

The EWA feedback model can drift after hardware changes. calibrate() runs a benchmark sweep across all resolution/step combinations and reports drift percentage. reset_vram_model() returns it to baseline.

report = governor.calibrate() → {"max_drift_pct": 3.2}
New in v2.5
🔎

Safety Explainability UI

The decision_chain is now surfaced in the UI as a visual timeline. Every stage shows what it decided, what it changed, and why — VRAM adjustments, safety layer decisions, scheduler actions, all auditable in one panel.

Runs on your hardware

Pick the row that matches your machine.

Hardware | Mode | Speed | Notes
16 GB RAM, no GPU | 5 — No VRAM | ~10–20 min/image | Minimum viable. Works.
4–8 GB VRAM GPU | 1 or 2 | ~30–90 s/image | Recommended minimum for regular use
Apple Silicon 32 GB+ | 6 — MPS | ~40–90 s/image | No VRAM ceiling — unified memory handles large jobs
12 GB+ VRAM GPU | 1 — GPU | ~10–30 s/image | Full feature set including LLM mode + stable video
🍏
Apple Silicon tip: Select Mode 6 on first run. Unified memory means large jobs that would OOM a same-spec NVIDIA card will complete. LLM mode needs Ollama. Best results with 32 GB+ but 16 GB works with BALANCED mode.

Install in 3 commands

git clone https://github.com/FreddieSparrow/cookiefooocus.git
cd cookiefooocus
bash install_local.sh
bash run.sh

Then open http://localhost:7865

Optional: LLM prompt mode

Install Ollama to enable LLM mode (creative prompt rewriting). The engine falls back to BALANCED automatically if Ollama is not installed.

curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3
ollama serve

Common errors, solved

CUDA out of memory
The VRAM governor should prevent this. If it happens, select a lower hardware mode or reduce resolution to 512–768.
Ollama unavailable
Run ollama serve, or switch to BALANCED mode. The trace will always say which mode actually ran.
Job timed out
Generation exceeded 600s. Reduce resolution or steps. Check VRAM with telemetry.dashboard().
Invalid HMAC signature
Secret mismatch between safety_policy.json and the n8n Code node. They must be identical.
Output hidden / blank
NSFW score above block threshold. Lower nsfw_block_threshold in safety_policy.json if the content is legitimate.
--server on macOS / Apple Silicon
Server mode is not supported on macOS or Apple Silicon — it will not start. Use local mode instead: bash run.sh. Server mode requires Linux + CUDA GPU.
Very slow on Apple Silicon
You may be in CPU mode accidentally. Confirm Mode 6 is selected. Check startup logs for mps.

Documentation

Everything you need to know — architecture, modules, configuration, and the v2.5 engineering improvements.

Architecture

Three strict layers: Core (SDXL inference), Orchestration (scheduler, VRAM governor, cache, telemetry), and Policy (safety, rate limits, profiles). UI never calls pipelines directly.

Read more →

Prompt Engine (4 modes)

RAW — unchanged passthrough.
BALANCED — deterministic keyword expansion, no LLM, always reproducible.
STANDARD — original Fooocus GPT-2 engine.
LLM — Ollama with constrained JSON output, falls back to BALANCED.

Read more →

VRAM Governor

Pre-execution VRAM budget check with predictive model and EWA feedback correction. Auto-downscale cascade: steps → resolution → precision → reject. New: calibrate() detects model drift; reset_vram_model() restores baseline.

Read more →

2-Layer Safety

Layer 1: fast deterministic rules (regex, keyword, fuzzy — always on, no ML). Layer 2: ML classifier (DeBERTa v3, runs only when Layer 1 passes). Image moderation post-generation only. Structured SafetyDecision output on every check.
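The gating between the two layers can be sketched as follows. The rule pattern, the classifier stub, and the SafetyDecision fields are all illustrative stand-ins for the real implementation:

```python
# Minimal sketch of the two-layer gate: Layer 1 is cheap and always runs;
# Layer 2 (the ML classifier) is invoked only when Layer 1 passes.
import re
from dataclasses import dataclass

@dataclass
class SafetyDecision:
    allowed: bool
    layer: int     # which layer made the final call
    reason: str

BLOCKED_PATTERNS = [re.compile(r"\bforbidden_term\b")]  # placeholder rule

def layer1_rules(prompt: str) -> bool:
    # Deterministic: regex / keyword rules, no ML, no model load.
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

def layer2_classifier(prompt: str) -> bool:
    # Stand-in for the DeBERTa v3 classifier; always passes in this sketch.
    return True

def check(prompt: str) -> SafetyDecision:
    if not layer1_rules(prompt):
        return SafetyDecision(False, 1, "deterministic_rule_match")
    if not layer2_classifier(prompt):
        return SafetyDecision(False, 2, "ml_classifier_block")
    return SafetyDecision(True, 2, "passed_both_layers")
```

The design choice is cost ordering: the cheap deterministic layer filters first, so the expensive classifier only spends cycles on prompts that already look clean.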

Read more →

Scheduler & Queue

Priority queue with starvation prevention (low-priority jobs promoted after 30s). Per-user job limit (max 2 active). Job lifecycle: QUEUED → SCHEDULED → RUNNING → COMPLETE/FAILED/CANCELLED/TIMED_OUT.
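The lifecycle states and the 30-second promotion rule can be sketched directly; the queue internals and the one-level promotion step are assumptions:

```python
# Sketch of the documented job states and starvation prevention.
from enum import Enum, auto

class JobState(Enum):
    QUEUED = auto()
    SCHEDULED = auto()
    RUNNING = auto()
    COMPLETE = auto()
    FAILED = auto()
    CANCELLED = auto()
    TIMED_OUT = auto()

STARVATION_WINDOW_S = 30

def effective_priority(base_priority: int, enqueued_at: float, now: float) -> int:
    """Promote a job one priority level once it has waited past the window,
    so low-priority jobs cannot be starved indefinitely."""
    if now - enqueued_at > STARVATION_WINDOW_S:
        return base_priority + 1
    return base_priority
```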

Read more →

Multi-GPU & Distributed Queue

GPUTopology layer detects all CUDA / Metal devices and routes jobs to the least-loaded device. ControlPlane / WorkerNode protocol enables multi-machine rendering with lease-based job ownership and heartbeat monitoring.
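Least-loaded routing over the detected devices might look like this. The Device fields mirror the metrics named above, but the selection heuristic (free VRAM first, throughput as tiebreaker) is an assumed simplification of the real scoring:

```python
# Illustrative least-loaded routing across detected GPU devices.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    free_vram_gb: float       # updated after every job
    throughput_score: float   # updated after every job

def pick_device(devices, required_vram_gb):
    # Only consider devices that can actually fit the job.
    candidates = [d for d in devices if d.free_vram_gb >= required_vram_gb]
    if not candidates:
        return None  # no device fits; the job waits in the queue
    # Prefer the most free VRAM, breaking ties by throughput score.
    return max(candidates, key=lambda d: (d.free_vram_gb, d.throughput_score))
```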

Read more →

Cache System

L1 in-memory LRU + L2 SQLite. Separate caches for prompt expansions (no TTL) and NSFW scores (TTL 300s). New: size-based eviction (5k/10k row caps), age-based pruning (30 days), async VACUUM compaction, eviction metrics.

Read more →

n8n Integration

HMAC-SHA256 signed requests with replay protection (nonce + timestamp). Rate limiting (30 req/60s per IP). Disabled by default in safety_policy.json. Callbacks on: blocked, complete, queue_wait.
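The signing scheme can be sketched with the standard library. Header names, payload layout, and the 60-second skew window here are assumptions, not the project's actual wire format — only the primitives (HMAC-SHA256, nonce, timestamp) come from the description above:

```python
# Sketch of HMAC-SHA256 request signing with nonce + timestamp replay protection.
import hashlib
import hmac
import json
import time
import uuid

SECRET = b"shared-secret-from-safety_policy.json"  # placeholder value

def sign_payload(payload: dict) -> dict:
    # Nonce and timestamp make each signed body unique, defeating replays.
    body = dict(payload, nonce=uuid.uuid4().hex, ts=int(time.time()))
    raw = json.dumps(body, sort_keys=True).encode()
    sig = hmac.new(SECRET, raw, hashlib.sha256).hexdigest()
    return {"body": raw.decode(), "signature": sig}

def verify(msg: dict, max_skew_s: int = 60) -> bool:
    expected = hmac.new(SECRET, msg["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, msg["signature"]):
        return False  # tampered body or wrong secret
    ts = json.loads(msg["body"])["ts"]
    return abs(time.time() - ts) <= max_skew_s  # reject stale timestamps
```

This is also why the "Invalid HMAC signature" error in the troubleshooting section comes down to one thing: both sides must hold the identical secret.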

Read more →

Decision Chain & Explainability

Every job produces an ordered audit log of all parameter decisions. New: explainability module formats this as plain text, structured JSON, or an HTML timeline for the Gradio UI — what the system changed and why, fully visible.

Read more →

Policy Profiles

Four profiles: balanced (default), creative (minimal filtering, LLM), strict (max moderation, public deploy), api_safe (programmatic / n8n). Copy values into safety_policy.json to apply — never auto-applied.

Read more →

Video Pipeline

Same UI, queue, and safety as image generation. Backends: SVD (img2vid) and AnimateDiff. 8 motion presets: smooth, cinematic, handheld, zoom, orbit, parallax, dolly, drone. Frame cost cap: max 96 frames per job.
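The frame cost cap amounts to a validation step before scheduling. A minimal sketch, assuming the cap clamps rather than rejects (the project's exact behaviour on over-limit requests is not stated):

```python
# Sketch of the frame cost cap applied before a video job is scheduled.
MAX_FRAMES = 96
MOTION_PRESETS = {"smooth", "cinematic", "handheld", "zoom",
                  "orbit", "parallax", "dolly", "drone"}

def validate_video_job(preset: str, frames: int) -> int:
    if preset not in MOTION_PRESETS:
        raise ValueError(f"unknown motion preset: {preset}")
    return min(frames, MAX_FRAMES)  # clamp to the cap instead of rejecting
```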

Read more →

Full Readme

Complete documentation including all command-line flags, hardware modes, Apple Silicon guide, n8n signing examples, telemetry dashboard output, and full version comparison table.

Open full readme →