
Building Digital Shadow Agent: A Recursive OSINT Tool Powered by LangGraph and Local LLMs

Disclaimer: This project is intended for ethical, educational, and authorized use only — such as researching your own digital footprint, authorized penetration testing, or security awareness training. Always comply with applicable laws and platform terms of service.


Table of Contents

  1. What Is Digital Shadow Agent?
  2. The Problem It Solves
  3. Tech Stack
  4. Architecture Overview
  5. Designing the State
  6. The Four Nodes
  7. Wiring It Together with LangGraph
  8. The Flask Web Server
  9. Live Demo — My Own Digital Shadow
  10. Vehicle Intelligence
  11. Forensics Lab — File Metadata Extraction
  12. Key Design Decisions
  13. What I Learned

What Is Digital Shadow Agent?

Digital Shadow Agent is a recursive OSINT (Open Source Intelligence) footprinting tool. Given a username or email address, it autonomously:

  1. Discovers online profiles across hundreds of platforms
  2. Scrapes those profiles for emails, aliases, and links
  3. Uses a local LLM to analyse the data and find new identities to pivot to
  4. Repeats recursively until no new leads are found
  5. Compiles everything into a structured Markdown intelligence dossier

The key word is recursive. Most OSINT tools are single-pass. This one keeps going, using each round of findings to fuel the next search — just like a real investigator would.


The Problem It Solves

Manual OSINT is tedious. You find a GitHub username, manually check Twitter, find an email in a bio, then search that email, find another alias… It compounds quickly and takes hours.

I wanted to automate that entire investigative loop: discover → harvest → analyse → pivot → repeat.

The challenge was keeping it intelligent enough to not spiral into noise. That’s where the LLM comes in — it acts as the judgement layer, deciding which new leads are actually related to the original target.


Tech Stack

| Layer | Tool |
| --- | --- |
| Agentic Workflow | LangGraph |
| LLM (Local) | Ollama + qwen3:8b |
| Profile Discovery | Maigret |
| Web Search | DuckDuckGo Search (ddgs) |
| HTTP Client | httpx |
| HTML Parsing | BeautifulSoup4 |
| Breach Checking | XposedOrNot API |
| Web UI | Flask + Server-Sent Events |
| Terminal UI | Rich |

Everything runs locally. No external AI API keys required.


Architecture Overview

The agent follows a directed graph pattern:

START
  │
  ▼
SCOUT ──────────────────────────────────┐
  │                                     │
  ▼                                     │ (recurse if new pivots
HARVESTER                               │  AND iteration < max)
  │                                     │
  ▼                                     │
ANALYST ───────────────────────────────┘
  │
  │ (no more pivots OR max depth reached)
  ▼
DOSSIER
  │
  ▼
 END

The conditional edge after Analyst is the core of the recursive behaviour. If the LLM finds new identities to investigate, the graph loops back to Scout. Otherwise, it falls through to Dossier.


Designing the State

LangGraph passes a shared state dict between every node. The key design challenge was: how do you accumulate data across iterations without duplicating it?

The answer is reducers — custom functions that LangGraph calls when a node tries to update a state field.

# state.py

def _merge_lists(left: list, right: list) -> list:
    """Append new items; no deduplication."""
    return left + right

def _merge_sets(left: set, right: set) -> set:
    """Union two sets — natural deduplication."""
    return left | right

def _merge_demographics(left: dict, right: dict) -> dict:
    """Merge dicts, keeping non-empty values and avoiding overwrites with 'Unknown'."""
    merged = left.copy() if left else {}
    for k, v in right.items():
        old_val = merged.get(k)
        if not old_val or str(old_val).lower() in ["unknown", "none", "n/a"]:
            merged[k] = v
        elif isinstance(old_val, list) and isinstance(v, list):
            merged[k] = list(set(old_val + v))
    return merged

The AgentState TypedDict then uses Annotated to attach these reducers:

from typing import Annotated, Literal, TypedDict

class AgentState(TypedDict, total=False):
    seed_input: str
    input_type: Literal["username", "email"]
    current_targets: list[str]

    discovered_profiles: Annotated[list[dict], _merge_lists]
    harvested_data:       Annotated[list[dict], _merge_lists]
    seen_targets:         Annotated[set[str],   _merge_sets]
    demographics:         Annotated[dict,        _merge_demographics]
    # ...

This means every node just returns a partial update dict. LangGraph handles the merging automatically.


The Four Nodes

Scout — Profile Discovery

The Scout node is responsible for finding where the target exists online.

It runs two strategies in sequence:

Primary: Maigret CLI

Maigret checks a username against 3,000+ sites and outputs a JSON report. I call it as a subprocess and parse the generated report file:

import json
import os
import subprocess
import sys

def _run_maigret(username: str) -> list[dict]:
    cmd = [
        sys.executable, "-m", "maigret",
        username,
        "--json", "simple",
        "--timeout", "10",
        "--no-progressbar",
        "--folder", "reports",
    ]
    subprocess.run(cmd, capture_output=True, text=True, timeout=180)

    # Find the generated JSON report
    for f in os.listdir("reports"):
        if f.startswith(f"report_{username}") and f.endswith(".json"):
            with open(os.path.join("reports", f)) as fp:
                data = json.load(fp)
            return [p for p in data.get("profiles", []) if p.get("is_found")]
    return []  # no report produced — caller falls back to direct HTTP checks

Fallback: Direct HTTP Checks

If Maigret fails, Scout directly requests URLs on major platforms and inspects the response:

platforms = {
    "GitHub": f"https://github.com/{username}",
    "Reddit": f"https://www.reddit.com/user/{username}",
    # ... etc
}

for platform, url in platforms.items():
    resp = client.get(url)
    if resp.status_code == 200 and not any(sig in resp.text[:2000].lower()
                                           for sig in NOT_FOUND_SIGNALS):
        profiles.append({"platform": platform, "url": url, ...})

Profile Validation

Before moving on, each discovered URL is re-validated in parallel using ThreadPoolExecutor. This catches soft 404s (pages that return 200 but show “user not found” text) and also attempts to extract profile images via regex patterns specific to each platform:

# GitHub: avatar URL pattern
match = re.search(r'(https://avatars\.githubusercontent\.com/u/\d+)', html)

# TikTok: CDN pattern
match = re.search(r'(https://p16-common-sign\.tiktokcdn[^"\'\s>]+)', html)

Relevance Scoring

Each profile gets a relevance score (0.0–1.0) based on how closely its username/URL matches the original target. This prevents noise from being processed further downstream:

def _score_profile(profile: dict, target: str) -> float:
    score = 0.0
    if profile["username"].lower() == target.lower():
        score += 0.8
    if target.lower() in profile["url"].lower():
        score += 0.3
    return min(1.0, score)
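A quick sanity check of the scoring (the function is reproduced so the snippet runs standalone):

```python
def _score_profile(profile: dict, target: str) -> float:
    score = 0.0
    if profile["username"].lower() == target.lower():
        score += 0.8  # exact username match is the strongest signal
    if target.lower() in profile["url"].lower():
        score += 0.3  # target appearing in the URL adds a weaker signal
    return min(1.0, score)  # cap at 1.0 when both signals fire

# Exact match plus URL hit: 0.8 + 0.3, capped at 1.0
exact = _score_profile({"username": "johndoe", "url": "https://github.com/johndoe"}, "johndoe")

# URL-only hit (different username, target in the path): 0.3
partial = _score_profile({"username": "jd-dev", "url": "https://example.com/johndoe"}, "johndoe")

print(exact, partial)  # 1.0 0.3
```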

Breach Checking

If the target is an email address, Scout also queries the XposedOrNot API to check for known data breaches:

import httpx

def _check_breaches(email: str) -> list[str]:
    resp = httpx.get(f"https://api.xposedornot.com/v1/check-email/{email}")
    # Parse the response and flatten the breach names
    data = resp.json() if resp.status_code == 200 else {}
    return [b for grp in data.get("breaches", []) for b in (grp if isinstance(grp, list) else [grp])]

Harvester — Data Extraction

Once profiles are found, the Harvester scrapes each one for intelligence.

It uses BeautifulSoup to extract structured data from HTML:

def _extract_from_html(html: str, base_url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # Remove noise
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    text = soup.get_text(separator=" ", strip=True)

    # Bio: try meta description, og:description, or first <p>
    bio = (soup.find("meta", {"name": "description"}) or {}).get("content", "")

    # Extract emails with regex
    emails = EMAIL_RE.findall(text)

    # Extract @mentions
    usernames = USERNAME_RE.findall(text)

    # Collect outbound links
    external_links = [
        a["href"] for a in soup.find_all("a", href=True)
        if a["href"].startswith("http") and base_domain not in a["href"]
    ]

    return {"bio": bio, "emails_found": emails, "usernames_found": usernames, ...}

Anti-bot Courtesy

The Harvester adds a random jitter delay between requests (1.5–3.5 seconds) and rotates User-Agent headers per platform. This isn’t about evasion — it’s about being a respectful scraper and not hammering servers.
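A minimal sketch of that courtesy layer — the user-agent strings and helper names here are illustrative, not the project's exact values:

```python
import random
import time

# Hypothetical UA pool — the project rotates real browser strings per platform.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def polite_headers() -> dict:
    """Pick a fresh User-Agent for each outgoing request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def jitter_sleep(low: float = 1.5, high: float = 3.5) -> float:
    """Sleep a random interval between requests; returns the delay used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Each scrape then becomes `client.get(url, headers=polite_headers())` followed by `jitter_sleep()`, so request timing never falls into a detectable (or server-hammering) rhythm.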

DuckDuckGo Search

After scraping profiles directly, it also runs DuckDuckGo searches for each target and scrapes those result pages too, widening the intelligence net:

with DDGS() as ddgs:
    for r in ddgs.text(f'"{target}" social profile', max_results=5):
        # Scrape the result URL as well

Analyst — LLM-Powered Intelligence

This is the brain of the agent. The Analyst sends all harvested data to a local Ollama LLM, which returns structured JSON containing new aliases to pivot to, associated email addresses, key observations, and demographic fields.

The System Prompt

You are an expert OSINT analyst. Your job is to examine harvested intelligence
data from online profiles and extract:

1. New usernames or aliases the target might use on other platforms.
2. New email addresses associated with the target.
3. Key observations — notable findings.
4. Target Demographic Intelligence: Name, Age (guess from career/education),
   Sex, Race/Ethnicity, Phone Numbers, LinkedIn Profiles.

Rules:
- ONLY include intelligence that belongs to the PRIMARY target.
- If data belongs to a DIFFERENT person, DISCARD IT.
- Return valid JSON only.

Pivot Filtering

A critical piece of logic: the LLM might hallucinate or include irrelevant results. So new pivots are filtered against an identity-match function before being accepted:

def _simple_identity_match(handle: str, target: str) -> bool:
    """Only accept pivots that share a 3-char prefix with the original target."""
    h = handle.lower().replace(".", "").replace("_", "")
    t = target.lower().replace(" ", "")
    if h[:3] != t[:3]:
        return False
    return t in h or h in t

This prevents the agent from going off on tangents and investigating unrelated people.

Demographic Locking

Same logic applies to demographic data. If the LLM extracts a name that doesn’t match our target, those demographics are discarded:

if extracted_name and not _simple_identity_match(extracted_name, primary_target):
    # Discard — this is someone else's data
    locked_demographics = {}

Dossier — Report Generation

The final node compiles everything into a structured Markdown report and writes it to disk:

dossier_parts = [
    "# Intelligence Dossier",
    f"**Target:** `{seed}` ({input_type})",
    f"**Generated:** {timestamp}",
    "## Summary Statistics",
    f"| Profiles discovered | {len(profiles)} |",
    # ... all sections from every iteration
]

filename = f"dossier_{safe_seed}_{date_str}.md"
with open(filepath, "w") as f:
    f.write("\n".join(dossier_parts))

Because dossier_sections in the state accumulates one section per node per iteration, the final report naturally tells the story of the whole investigation chronologically.


Wiring It Together with LangGraph

The graph is defined in graph.py. The key part is the conditional edge after Analyst:

def _should_recurse(state: dict) -> str:
    pivots = state.get("pivots", [])
    iteration = state.get("iteration", 0)
    max_iter = state.get("max_iterations", 3)

    if pivots and iteration < max_iter:
        return "scout"   # Loop back
    else:
        return "dossier" # Finish up

from langgraph.graph import END, StateGraph

graph = StateGraph(AgentState)
graph.add_node("scout",     scout_node)
graph.add_node("harvester", harvester_node)
graph.add_node("analyst",   analyst_node)
graph.add_node("dossier",   dossier_node)

graph.set_entry_point("scout")
graph.add_edge("scout", "harvester")
graph.add_edge("harvester", "analyst")
graph.add_conditional_edges("analyst", _should_recurse, {
    "scout":   "scout",
    "dossier": "dossier",
})
graph.add_edge("dossier", END)

compiled = graph.compile()

Running the agent is then just:

initial_state = make_initial_state(target="johndoe", input_type="username")
final_state = compiled.invoke(initial_state)

LangGraph handles all the state merging, edge routing, and iteration tracking.


The Flask Web Server

Beyond the CLI, I built a Flask web server (web_server.py) that exposes the agent over HTTP with a real-time UI using Server-Sent Events (SSE).

The frontend starts an investigation via a POST request, then opens an EventSource stream to receive live updates:

POST /api/investigate  → starts the agent in a background thread
GET  /api/stream/<id>  → SSE stream of log messages, profiles, progress
GET  /api/result/<id>  → final dossier JSON when done

Each node pushes events onto a per-investigation queue.Queue. The SSE endpoint reads from that queue and streams events to the browser as they happen. This gives the UI live feedback without polling.
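The queue-to-SSE plumbing can be sketched with the standard library alone. Names here are illustrative; in the real server the generator is wrapped in a Flask `Response` with mimetype `text/event-stream`:

```python
import json
import queue

_SENTINEL = None  # pushed by the agent thread when the run finishes

def format_sse(event: dict) -> str:
    """Serialise one event in Server-Sent Events wire format."""
    return f"data: {json.dumps(event)}\n\n"

def sse_stream(q: "queue.Queue[dict]"):
    """Drain the per-investigation queue, yielding SSE frames until done."""
    while True:
        event = q.get()
        if event is _SENTINEL:
            break
        yield format_sse(event)

# Simulate a node pushing events, then signalling completion:
q = queue.Queue()
q.put({"type": "log", "msg": "Mission initiated"})
q.put({"type": "profile", "platform": "GitHub"})
q.put(_SENTINEL)

frames = list(sse_stream(q))  # two "data: ..." frames
```

Because `queue.Queue.get()` blocks, the endpoint simply waits for the agent thread with no polling loop on either side.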


Live Demo — My Own Digital Shadow

The best way to understand what this tool does is to see it in action. I ran it on myself — Rusheel Raj — to map out my own public digital footprint. Here’s exactly what it found.


The Dashboard — Three Modes in One UI

Digital Shadow Agent — Main Dashboard

The UI is split into three independent investigation modes, all accessible from a single dashboard: the recursive OSINT scan, Vehicle Intelligence, and the Forensics Lab.

The bottom bar shows the current Recursion Depth setting (DEEP DIVE L2 = 2 iterations) and the system status. All three modules run independently — you don’t need to run a full OSINT scan to use the Forensics Lab.


Operation Log and Discovered Profiles


Once a scan is started, the left panel becomes a live Operation Log — a real-time stream of every action the agent takes, timestamped to the second:

[16:03:59] Mission initiated. Target: Rusheel Raj
[16:03:59] Max recursion depth: 2
[16:07:44] Checking breaches for matching identity: rusheelraj.p@gmail.com
[16:07:44] ⚠ Found 1 breach(es) for rusheelraj.p@gmail.com!
[16:15:39] MISSION COMPLETE — Dossier generated.

This is powered by Server-Sent Events (SSE) — each node pushes log events into a queue as it runs, and the browser receives them live without polling.

On the right, a Profile Image Carousel displays photos collected from discovered profiles. In this run it found 3 images (3/3) pulled from different platforms — shown cycling through automatically.

Below, the Discovered Profiles panel lists every claimed account found — 45 profiles in total for my name. Each entry shows the platform, the direct URL, and its status (CLAIMED / AVAILABLE). Platforms found in this run included LinkedIn, Twitch, GitHub, YouTube, Twitter/X, Steam, and Instagram — all verified as active.


Evidence Eye — Collected Media

Massive Evidence Eye — Collected Media

The Massive Evidence Eye section aggregates all profile images harvested from across platforms into a single visual gallery. Each image is labelled with its source platform and view type.

In my scan, three images were collected:
- Instagram — a black and white side profile photo
- Instagram — a candid photo from a snow trip
- GitHub — my professional headshot in a white shirt

This is built by the Harvester node, which looks for og:image meta tags and platform-specific avatar CDN patterns in the scraped HTML. The fact that three photos from two different platforms were linked to the same identity demonstrates how consistent usernames across platforms make cross-referencing trivial.


Pivot Targets, Recursive Scout & Breach Detection

Pivot Targets and Scout Results — Iteration 2

This screenshot shows the agent mid-investigation at Iteration 2. After the first pass found my primary profiles, the Analyst node extracted two new pivot targets from the harvested data.

These become the new seed targets for the next Scout pass. The Scout then scans those identifiers across platforms — finding 15 additional profiles in iteration 2 alone.

The breach detection module also triggered: rusheelraj.p@gmail.com was found in 1 known data breach (flagged in orange), queried live against the XposedOrNot API.

The profiles table shows the recursive harvest in action — the same email address appearing across LinkedIn, Twitch, Twitter/X, Steam, Instagram, and TikTok, all with claimed status. This is exactly the snowball effect the recursive architecture is designed to exploit: one email unlocks a second wave of platform accounts.


Analyst Report — Intelligence Extracted by the LLM

Analyst Report — Iteration 3 with LLM Observations

After the Harvester scrapes 14 pages in Iteration 3, the raw data is passed to qwen3:8b running locally via Ollama. The Analyst node returns a structured report with two parts:

Demographic Intelligence — structured fields the LLM attempts to populate from scraped text. In this run most fields were Unknown because the pages didn’t contain explicit personal details, which is the correct behaviour — the model doesn’t hallucinate.

Observations — a free-text summary of what the LLM actually found and inferred:

“Rusheel Raj Panakadan is a cybersecurity graduate student at Florida Tech specialising in penetration testing, secure application deployment, and enterprise security hardening. He has a portfolio website, GitHub profile, and LinkedIn presence. His work includes cyber forensics, quantum cryptography integration, and threat detection using ELK stack. He is based in Melbourne, Florida, USA.”

This was generated entirely from scraped public page text — no manual input. The LLM correctly identified my university, field of study, technical specialisations, and location purely from what’s publicly visible across my profiles.

The qwen3:8b model label is visible in the report header, confirming it’s running fully locally with no external API calls.


Vehicle Intelligence

Beyond usernames and emails, the agent has a dedicated Vehicle Intelligence node — a specialized lookup engine for vehicle registration data from public records.

How It Works

The vehicle_node is triggered separately from the main OSINT graph. You supply a vehicle registration (plate) number and a region, and it routes to the appropriate data source:

Vehicle Plate Input
        │
        ├── Region: IN ──▶ vahanx.in  (India RTO records)
        └── Region: US ──▶ findbyplate.com (USA NHTSA/VIN data)

India — RTO Lookup


For Indian plates, the node scrapes the public portal and extracts a comprehensive record:

| Field | Example |
| --- | --- |
| Owner Name | Rahul Sharma |
| Vehicle Model | Maruti Suzuki Swift |
| Registration Number | MH12AB1234 |
| Registration Date | 2019-03-15 |
| Fuel Type | Petrol |
| RTO Location | Pune, Maharashtra |
| Insurance Company | HDFC ERGO (Exp: 2026-03-14) |
| Chassis / Engine (last 5) | X1234 / Y5678 |
| Vehicle Age | 5 Years |
| Fitness Valid Until | 2034-03-14 |
| Vehicle Class | Motor Car |
| Financier | HDFC Bank |

The parser targets the specific CSS structure of VahanX’s card layout (div.hrcd-cardbody, div.col-sm-6.mb-4), pulling both the summary cards and the detailed grid fields.

USA — NHTSA / VIN Lookup (findbyplate.com)


For US plates, owner data is legally restricted under the Driver’s Privacy Protection Act (DPPA), so the node focuses on vehicle technical intelligence instead:

| Field | What’s Available |
| --- | --- |
| Make / Model / Year | ✅ Full detail |
| VIN Descriptor | ✅ Partial VIN |
| Fuel Type | ✅ |
| Vehicle Class | ✅ |
| Engine Displacement | ✅ |
| Safety Systems | ✅ (ABS, ESC, TPMS, Traction Control) |
| GVWR | ✅ |
| Assembly Plant | ✅ |
| Owner Name | ❌ DPPA Restricted |

The node flags any technical alerts or recall notices found on the page as well.

The LLM Layer

In cases where structured scraping is insufficient (e.g., insurance snippets in search results, resale listings), the node falls back to the qwen3:8b LLM with a specialised RTO analyst prompt:

You are an expert RTO OSINT analyst. Extract:
owner name, vehicle model, registration number,
registration date, fuel type, RTO location,
insurance details, chassis/engine numbers.
Return valid JSON only.

The result from either the scraper or LLM is stored in vehicle_records in the agent state, and the deduplication reducer (_merge_vehicles) ensures the same plate is never double-counted across iterations.
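The `_merge_vehicles` reducer itself isn't shown in the repo excerpt; a plausible plate-keyed version looks like this (a reconstruction under the assumption that records carry a `registration_number` field, not the project's exact code):

```python
def _merge_vehicles(left: list, right: list) -> list:
    """Merge vehicle records, deduplicating on the plate number."""
    merged = list(left)
    seen_plates = {rec.get("registration_number") for rec in merged}
    for rec in right:
        plate = rec.get("registration_number")
        if plate not in seen_plates:
            merged.append(rec)
            seen_plates.add(plate)
    return merged

existing = [{"registration_number": "MH12AB1234", "model": "Swift"}]
incoming = [
    {"registration_number": "MH12AB1234", "model": "Swift"},   # duplicate plate — dropped
    {"registration_number": "KA01CD5678", "model": "Creta"},   # new plate — kept
]
records = _merge_vehicles(existing, incoming)  # two unique plates
```

Like the other reducers, this lets the vehicle node return only what it just looked up and leave dedup to the state layer.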


Forensics Lab — File Metadata Extraction

The Forensics Lab is an independent feature within the web UI that analyses uploaded files for hidden metadata — the kind that can reveal far more than the file’s visible content.

Supported File Types

| File Type | Extensions | What’s Extracted |
| --- | --- | --- |
| Images | JPG, PNG, TIFF, BMP, WEBP | EXIF data, camera info, GPS coordinates |
| PDF | PDF | Author, creator, title, keywords, creation/modification dates, page count |
| Word Documents | DOCX | Author, last modified by, revision count, word count, creation/modification dates |

Image EXIF — The Most Revealing

Photos taken on smartphones almost always carry EXIF metadata embedded silently in the file. The Forensics Lab uses Pillow to extract all EXIF tags:

from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open(filepath)
exif_data = img._getexif()  # private but de-facto standard accessor
for tag_id, value in (exif_data or {}).items():
    tag_name = TAGS.get(tag_id, str(tag_id))  # human-readable tag name
    metadata["fields"][tag_name] = value

Commonly found fields include the camera make and model, the original capture timestamp, the editing software used, lens and exposure settings, and, most critically, GPS coordinates.

GPS Coordinate Extraction

GPS data in EXIF is stored in DMS format (Degrees, Minutes, Seconds). The lab converts it to decimal coordinates for direct use in maps:

def _dms_to_decimal(dms_tuple, ref):
    degrees = float(dms_tuple[0])
    minutes = float(dms_tuple[1])
    seconds = float(dms_tuple[2])
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    if ref in ("S", "W"):
        decimal = -decimal
    return round(decimal, 6)

If GPS data is found, the UI displays the coordinates and a direct Google Maps link — pinpointing exactly where a photo was taken.
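For example, converting an illustrative EXIF-style DMS tuple and building the map link (helper reproduced from above; the coordinates are made up for the demo):

```python
def _dms_to_decimal(dms_tuple, ref):
    degrees = float(dms_tuple[0])
    minutes = float(dms_tuple[1])
    seconds = float(dms_tuple[2])
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    if ref in ("S", "W"):
        decimal = -decimal  # south/west hemispheres are negative
    return round(decimal, 6)

# Illustrative DMS tuples: 28°36'50.4"N, 80°36'28.8"W
lat = _dms_to_decimal((28, 36, 50.4), "N")   # 28.614
lon = _dms_to_decimal((80, 36, 28.8), "W")   # -80.608
maps_url = f"https://www.google.com/maps?q={lat},{lon}"
```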


Why This Matters

Most people don’t know their files are talking. A photo posted online, a PDF submitted to a form, a Word document shared via email — all of them carry silent metadata that can reveal the device used, author names, editing history, and even the exact time and place a photo was taken.

The Forensics Lab makes this visible in seconds through a simple drag-and-drop interface.


Key Design Decisions

1. Local LLM, no cloud dependency

Using Ollama means the agent runs entirely offline. No API costs, no data leaving your machine. The qwen3:8b model was chosen for its strong instruction-following and JSON output capabilities.

2. Strict identity matching as a guardrail

The 3-character prefix check sounds crude, but it’s effective. It prevents the LLM from pivoting to a completely different person just because they share a common name. You could replace this with something more sophisticated (fuzzy matching, vector similarity) for a production system.

3. Relevance scoring before harvesting

Maigret can return hundreds of results. Only profiles scoring ≥ 0.2 relevance and up to the top 12 are passed to the Harvester. This keeps each iteration fast and focused.

4. Reducers for clean state accumulation

The custom _merge_* reducers mean nodes don’t need to worry about the accumulated state — they just return their new data. LangGraph merges correctly. This keeps each node simple and testable in isolation.

5. Graceful fallbacks everywhere

Maigret → HTTP fallback. ChatOllama → direct httpx fallback. Primary DDG library → secondary import. The agent keeps running even if individual dependencies are missing or fail.
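The fallback chains above all reduce to the same pattern: try strategies in order and take the first that doesn't raise. A generic sketch (names illustrative, not from the repo):

```python
def first_working(*strategies):
    """Try each strategy in order; return the first result that doesn't raise."""
    last_error = None
    for strategy in strategies:
        try:
            return strategy()
        except Exception as exc:  # deliberately broad — any failure triggers fallback
            last_error = exc
    raise RuntimeError("all strategies failed") from last_error

def primary():
    raise ConnectionError("Maigret unavailable")  # simulate the primary path failing

def fallback():
    return [{"platform": "GitHub", "url": "https://github.com/johndoe"}]

profiles = first_working(primary, fallback)  # fallback result wins
```

The agent applies this shape at three levels: Maigret vs. direct HTTP, ChatOllama vs. raw httpx, and the primary vs. secondary DDG import.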


What I Learned

LangGraph is excellent for iterative, stateful agents. The reducer pattern eliminates a whole class of bugs around state accumulation that I would have had to manage manually.

Local LLMs are surprisingly capable for structured extraction tasks. qwen3:8b reliably returns valid JSON from my prompt template. Temperature 0.1 is key — you want determinism, not creativity, for structured output.

OSINT tools need careful noise filtering. Without the identity-match guardrail, the agent spiraled off to investigate unrelated people early in testing. The LLM is powerful but not magic — you need heuristic guards around its outputs.

Threading + SSE is a great pattern for long-running agent UIs. Spawning the agent in a background thread and streaming results via SSE gives a much better UX than waiting for a response on a single HTTP request that might take minutes.


Final Thoughts

This project started as a curiosity — “what if I automated the recursive loop of OSINT research?” — and turned into a genuinely useful tool for understanding your own digital footprint.

The combination of LangGraph’s graph-based orchestration, Maigret’s breadth of platform coverage, and a local LLM for intelligent pivot decisions produces something more capable than any of those parts alone.

If you build on this, I’d suggest exploring:
- Async execution in Scout/Harvester for faster parallel scraping
- Vector similarity for better identity matching across aliases
- More data sources — Shodan, Certificate Transparency logs, WHOIS
- Export formats — PDF, HTML, JSON alongside Markdown

The digital shadow you cast online is larger than most people realise. Tools like this help bring it into focus.


Built with Python, LangGraph, Ollama, and too much curiosity.