Safety model and limits

CodeGate is a pre-flight inspection tool, not a safety guarantee. Understanding how its layers work and where they stop is essential for using it responsibly.

Detection layers

Layers 1 and 2 — Offline-first and deterministic

Layers 1 and 2 run on every scan. They make no network requests. Results are deterministic given the same input files and rule configuration.

Layer 1 — Discovery: CodeGate enumerates known AI config file paths, installed tools, MCP server definitions, hook locations, rule and skill markdown, workspace settings, and extension manifests. It builds a surface map of what can influence agent behavior. Layer 2 — Static risk detection: CodeGate runs a rule-based scanner across the collected surface. Detection categories include COMMAND_EXEC, ENV_OVERRIDE, CONSENT_BYPASS, RULE_INJECTION, IDE_SETTINGS, SYMLINK_ESCAPE, GIT_HOOK, NEW_SERVER, and CONFIG_CHANGE. Unicode analysis runs across rule-file and tool-description content to detect visually hidden characters. Because Layers 1 and 2 are offline-first, they have no dependency on network availability, no exposure from fetching remote content, and no interaction with external services.

Layer 3 only runs when you pass --deep. Each eligible remote resource requires explicit per-resource consent before CodeGate fetches or analyzes it. Skipped consent is recorded in the output for auditability.

Layer 3 extends the scan with:

Discovery of eligible external resources from known config paths.
Discovery of eligible local instruction files from the selected markdown/text scan surface.
Remote tool-description fetching and analysis (MCP server endpoints).
Meta-agent analysis of local instruction files (AGENTS.md, CODEX.md, discovered rule/skill markdown) via a supported local tool (currently Claude Code for text-only analysis).
TOXIC_FLOW findings for cross-tool manipulation patterns.
PARSE_ERROR findings when Layer 3 content cannot be parsed.

Exposure trade-off: Deep scan increases exposure compared to static-only scans because it may fetch remote metadata or content and invoke a selected local AI tool. This is the explicit trade-off for broader detection coverage. CodeGate does not execute untrusted MCP stdio command arrays during tool-description scanning. Local instruction-file analysis is text-only: CodeGate passes file content and referenced URL strings as inert text and does not execute referenced content.

Layer 4 — Remediation (opt-in)

Remediation is best-effort and reversible. Backup sessions are written to .codegate-backup/. Use codegate undo to restore the most recent session. Review all changes before committing them.

Layer 4 supports guided and automatic remediation:

--remediate: guided file remediation with prompts.
--fix-safe: auto-applies unambiguous critical fixes without prompting.
--dry-run: shows proposed changes without writing anything.
--patch: generates a patch file for external review workflows.

Even with --fix-safe, users should review the changes before proceeding. Remediation is best-effort: not every finding has a safe automatic fix, and the scanner cannot verify whether a fix is correct in all project contexts.

What CodeGate does not claim

CodeGate is designed to improve visibility and decision quality. It is not designed to function as an absolute safety net.

Not a guarantee of safety. A clean scan result means CodeGate found no known patterns matching its current rule set. It does not mean the project is safe.
Not a replacement for secure engineering judgment, review, and hardening. Human review of configuration files, MCP server definitions, hooks, and instruction content is still necessary.
Not perfect. CodeGate can produce false positives (flagging benign content) and false negatives (missing malicious content). Both are expected and documented.
Not a promise that every malicious pattern will be detected. New attack techniques can appear before signatures and heuristics are updated. Rule packs can be extended with rule_pack_paths, but coverage is never complete.

False positives and false negatives

Both are inherent to pattern-based detection. False positives occur when CodeGate flags content that is not actually malicious. Common causes include legitimate uses of shell commands in hooks, MCP servers from known safe providers that are not in the known_safe_mcp_servers list, and rule/skill markdown that uses assertive language for valid reasons. Suppress known-safe findings using suppress_findings (by finding ID or fingerprint) or suppression_rules (by rule ID, file path, severity, category, CWE, or fingerprint with AND semantics). False negatives occur when CodeGate does not flag content that is malicious. Causes include novel attack patterns not yet in the rule set, obfuscation techniques that evade static analysis, and content that only becomes risky at runtime or through multi-step agent reasoning. Layer 3 deep scan improves coverage for remote and dynamic surfaces, but introduces its own limitations.

Guiding principles

From why-codegate.md:

Inspect before trust. Scan first; decide whether to trust and run after reviewing findings.
Prefer explicit user consent over silent execution. Layer 3 resources require per-resource consent. codegate run requires confirmation for warning-level findings unless explicitly bypassed.
Keep high-risk operations explainable and reviewable. Every finding includes a category, severity, and location. Remediation is backed by a reversible backup.
Treat documented risk as real risk when users are likely to miss it. “This is documented behavior” is not sufficient justification when execution surfaces are broad and defaults are permissive.
Preserve operator control with backups, undo, and policy thresholds. Config keys like severity_threshold, trusted_directories, blocked_commands, and suppression_rules put policy decisions in operator hands.

​Detection layers

​Layers 1 and 2 — Offline-first and deterministic

​Layer 3 — Opt-in, consent-driven, increases exposure

​Layer 4 — Remediation (opt-in)

​What CodeGate does not claim

​False positives and false negatives

​Guiding principles

Detection layers

Layers 1 and 2 — Offline-first and deterministic

Layer 3 — Opt-in, consent-driven, increases exposure

Layer 4 — Remediation (opt-in)

What CodeGate does not claim

False positives and false negatives

Guiding principles