Defensive cybersecurity CLI for detecting phishing domains and suspicious URLs using typosquatting analysis, similarity scoring, heuristic detection, and explainable risk scoring.
PhishGuard has three focused commands:
- `domain`: generate bounded typo variants for a target brand/domain and score risky lookalikes.
- `url`: analyze a suspicious URL with structural checks, intelligence enrichment, and rule-based overrides.
- `compare`: compute Levenshtein similarity between two hosts for quick lookalike checks.
This is a defensive analysis tool designed for security triage, demos, and training.
Use this project only for defensive cybersecurity, awareness, and detection. Do not use it to facilitate phishing, impersonation, or malicious operations.
- CLI & UX
  - `typer` for command parsing
  - `rich` for hacker-style terminal tables/panels
- Core Models
  - `pydantic` for typed report/data models
- Network & Intelligence
  - `httpx` for HTTP calls (RDAP + VirusTotal + redirect handling)
  - `dnspython` for DNS record resolution (`A`, `AAAA`, `NS`, `MX`, `TXT`, `CNAME`)
- Domain & Similarity
  - `tldextract` for registrable-domain parsing
  - `python-Levenshtein` for distance/similarity
- Reporting
  - `jinja2` for HTML report templates
  - JSON/CSV via stdlib
- Configuration
  - `python-dotenv` for `.env` (VirusTotal API key)
For input like google.com, the generator creates bounded candidate variants using:
- `omission`: remove one character (`gogle.com`)
- `repetition`: repeat one character (`gooogle.com`)
- `character_swap`: swap adjacent characters (`googel.com`)
- `keyboard_adjacency`: QWERTY neighbor substitution
- `hyphenation`: insert hyphen at safe positions
- `tld_swap`: swap to common TLDs (`.co`, `.net`, `.io`, etc.)
Safety and quality controls:
- strict domain label validation
- de-duplication with a set
- hard cap of `max_variants` (clamped to `1..25`)
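The generation and safety steps above can be sketched roughly as follows. This is a minimal illustration, not PhishGuard's actual code: `generate_variants`, `LABEL_RE`, and `COMMON_TLDS` are hypothetical names, the input is assumed to be a simple `label.tld` domain, and `keyboard_adjacency` is left out for brevity.

```python
import re

# RFC 1035-style label check: letters, digits, inner hyphens, max 63 chars.
LABEL_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")
COMMON_TLDS = ("co", "net", "io")  # illustrative subset, not the tool's full list


def generate_variants(domain: str, max_variants: int = 25) -> list[tuple[str, str]]:
    """Return (technique, variant) pairs for a simple `label.tld` domain."""
    cap = max(1, min(max_variants, 25))  # clamp the cap to 1..25
    domain = domain.lower()
    label, _, tld = domain.partition(".")

    candidates: list[tuple[str, str, str]] = []
    for i in range(len(label)):
        candidates.append(("omission", label[:i] + label[i + 1:], tld))
        candidates.append(("repetition", label[:i] + label[i] + label[i:], tld))
    for i in range(len(label) - 1):
        swapped = label[:i] + label[i + 1] + label[i] + label[i + 2:]
        candidates.append(("character_swap", swapped, tld))
    for i in range(1, len(label)):
        candidates.append(("hyphenation", label[:i] + "-" + label[i:], tld))
    for other in COMMON_TLDS:
        candidates.append(("tld_swap", label, other))

    seen = {domain}  # never report the original domain as its own lookalike
    results: list[tuple[str, str]] = []
    for technique, new_label, new_tld in candidates:
        variant = f"{new_label}.{new_tld}"
        # Strict label validation plus set-based de-duplication.
        if LABEL_RE.match(new_label) and variant not in seen:
            seen.add(variant)
            results.append((technique, variant))
    return results[:cap]
```

Calling `generate_variants("google.com")` in this sketch yields pairs such as `("omission", "gogle.com")` and `("character_swap", "googel.com")`, and never more than 25 entries.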
Each candidate stores:
- technique
- mutation description
- Levenshtein distance
- normalized similarity score
Levenshtein computation is done on normalized alphanumeric strings:
- lowercase
- non-alphanumeric stripped
Similarity score formula:
- `score = (1 - distance / max_len) * 100`
- clamped to `0..100`
This normalized score is used in both domain findings and brand impersonation signals.
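As a rough sketch of the normalization and scoring described above, the following uses a pure-Python edit distance in place of `python-Levenshtein` so that it runs without extra dependencies. The function names are illustrative, and treating two empty strings as identical is an assumption.

```python
import re


def normalize(host: str) -> str:
    """Lowercase and strip everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", host.lower())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(host_a: str, host_b: str) -> float:
    """Normalized similarity in the range 0..100."""
    a, b = normalize(host_a), normalize(host_b)
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 100.0  # assumption: two empty strings count as identical
    score = (1 - levenshtein(a, b) / max_len) * 100
    return max(0.0, min(100.0, score))
```

With these definitions, `similarity("google.com", "g00gle.com")` comes out at roughly 77.8: a distance of 2 over 9 normalized characters.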
- substring matching against curated phishing keyword sets:
  - auth/login
  - urgency/threat
  - finance/payment
  - account recovery/actions
  - bait/scam terms
  - technical lure terms
- tokenizes hostname labels/hyphen tokens
- applies leet normalization (`0->o`, `1->l`, `3->e`, etc.)
- checks:
  - direct leet brand equivalence
  - near-match with Levenshtein (`distance <= 2` or `similarity >= 78`)
Returns:
- impersonation signal text
- strength
- similarity
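The tokenization and direct leet-equivalence check could look roughly like this. It is a sketch under assumptions: `BRANDS` is a hypothetical brand list, `LEET_MAP` contains only the three substitutions named above, and the Levenshtein near-match branch is omitted.

```python
import re
from typing import Optional

# Only the substitutions named above; the tool's full table is not shown here.
LEET_MAP = str.maketrans({"0": "o", "1": "l", "3": "e"})
BRANDS = frozenset({"google", "github", "microsoft"})  # hypothetical brand list


def tokenize(hostname: str) -> list[str]:
    """Split a hostname into label and hyphen tokens."""
    return [token for token in re.split(r"[.-]", hostname.lower()) if token]


def find_leet_brand(hostname: str, brands: frozenset = BRANDS) -> Optional[str]:
    """Return the brand that a token imitates via leet substitution, if any."""
    for token in tokenize(hostname):
        decoded = token.translate(LEET_MAP)
        # A token that already equals the brand is not a leet lookalike.
        if decoded in brands and decoded != token:
            return decoded
    return None
```

For example, `find_leet_brand("g00gle-login.example")` returns `"google"`, while `find_leet_brand("google.com")` returns `None` because no substitution was needed.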
Safe baseline domains reduce false positives:
- `google.com`
- `github.com`
- `microsoft.com`
Matching works on registrable domain and subdomains.
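One simple way to match both the registrable domain and its subdomains is a suffix check on a dot boundary. The sketch below assumes that approach; `is_whitelisted` is a hypothetical helper, and the real tool may rely on `tldextract` instead.

```python
SAFE_DOMAINS = ("google.com", "github.com", "microsoft.com")


def is_whitelisted(hostname: str, safe_domains=SAFE_DOMAINS) -> bool:
    """True if hostname is a safe domain or one of its subdomains."""
    host = hostname.lower().rstrip(".")  # tolerate a trailing root dot
    return any(host == safe or host.endswith("." + safe) for safe in safe_domains)
```

Requiring the leading dot keeps `notgoogle.com` and `google.com.evil.example` from matching, while `mail.google.com` still does.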
Resolves record types:
`A`, `AAAA`, `NS`, `MX`, `CNAME`, `TXT`
Returns:
- record map
- hit list
- resolver errors/timeouts
For domain and URL modes:
- registrar extraction (`vcardArray`)
- registration/update/expiry event parsing
- nameservers and `port43`
Domain age:
- parse registration timestamp
- compute age in days
- classify windows:
  - `0..14`: very new
  - `15..30`: new
  - `31..180`: young
  - `180+`: mature
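The age calculation can be sketched as below. The helper names are illustrative, the timestamp is assumed to be ISO 8601 as RDAP events usually are, and since the listed windows overlap at exactly 180 days, this sketch treats 180 as young.

```python
from datetime import datetime, timezone


def domain_age_days(registered: str, now: datetime) -> int:
    """Days between an ISO 8601 registration timestamp and `now`."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    created = datetime.fromisoformat(registered.replace("Z", "+00:00"))
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return (now - created).days


def classify_age(days: int) -> str:
    """Map an age in days onto the windows listed above."""
    if days <= 14:
        return "very new"
    if days <= 30:
        return "new"
    if days <= 180:
        return "young"
    return "mature"
```

Passing `now` explicitly, for instance `datetime.now(timezone.utc)`, keeps the function deterministic and easy to test.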
Uses:
GET /api/v3/domains/{hostname}
Extracts:
`malicious`, `suspicious`, `harmless`, `undetected`, `reputation`
Result is cached in-process for performance.
URL analysis computes signals for:
- HTTP vs HTTPS
- raw IP host usage
- URL shortener host
- redirect chain depth
- mixed-script / IDN suspicion
- suspicious TLD (`.xyz`, `.top`, `.click`, `.shop`, `.buzz`, etc.)
- very long URL
- `@` symbol obfuscation
- subdomain depth
- uncommon ports (non-80/443)
- high query-parameter count
- encoded obfuscation tokens (`%40`, `%2f`, `%2e`, `%25`, `%3a`, `%3d`)
- suspicious keywords in host/path/query
- brand impersonation signal
- domain age (from RDAP)
- whitelist match
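A few of the structural checks in this list can be derived with the standard library alone. The sketch below is an assumption about how such signals might be computed, not the tool's implementation; `structural_signals` and its dictionary keys are hypothetical.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit

ENCODED_TOKENS = ("%40", "%2f", "%2e", "%25", "%3a", "%3d")


def is_ip(host: str) -> bool:
    """True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def structural_signals(url: str) -> dict:
    """Compute a handful of URL-structure signals."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "uses_http": parts.scheme == "http",
        "ip_host": is_ip(host),
        "at_symbol": "@" in parts.netloc,  # userinfo trick: real host follows the @
        "uncommon_port": parts.port not in (None, 80, 443),
        "query_params": len(parse_qsl(parts.query, keep_blank_values=True)),
        "encoded_tokens": [t for t in ENCODED_TOKENS if t in lowered],
    }
```

For `http://192.168.0.1:8080/login?a=1&b=2`, this reports plain HTTP, a raw IP host, an uncommon port, and two query parameters.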
All scores are additive with explicit breakdown lines, then clamped to 0..100.
- HTTP: `+10`
- IP host: `+25`
- shortener: `+16`
- long redirects: `+14`
- mixed script: `+24`
- suspicious TLD: `+10`
- very long URL: up to `+10`
- `@` obfuscation: `+10`
- many subdomains/params: up to `+8`
- encoded obfuscation: up to `+10`
- host/path keywords: capped additions
- brand impersonation: dynamic base (`+25` + boosts)
- domain age:
  - `<=30 days`: `+20`
  - `<=90 days`: `+10`
- whitelist trust: `-25`
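The additive model with a breakdown and a final clamp might be sketched like this. Only the fixed weights from the list are included; the "up to" and dynamic entries are omitted, and the signal keys and `score_signals` are hypothetical names.

```python
# Fixed weights taken from the list above; variable ("up to") entries are omitted.
WEIGHTS = {
    "http": 10,
    "ip_host": 25,
    "shortener": 16,
    "long_redirects": 14,
    "mixed_script": 24,
    "suspicious_tld": 10,
    "at_obfuscation": 10,
    "whitelist": -25,
}


def score_signals(active: list[str]) -> tuple[int, list[str]]:
    """Sum the weights of the active signals and clamp the total to 0..100."""
    breakdown = [f"{name}: {WEIGHTS[name]:+d}" for name in active]
    total = sum(WEIGHTS[name] for name in active)
    return max(0, min(100, total)), breakdown
```

Here `score_signals(["http", "ip_host"])` returns `(35, ["http: +10", "ip_host: +25"])`, and a whitelist match on its own clamps to `0` rather than going negative.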
Uses ratio-based malicious scoring:
- `total = vt_malicious + vt_suspicious + vt_harmless`
- `ratio = vt_malicious / total if total > 0 else 0`

Thresholds:
- `vt_malicious >= 5` or `ratio > 0.2` -> `+25` and `VirusTotal strongly malicious`
- `vt_malicious >= 2` -> `+10` and `VirusTotal mildly suspicious`
- else -> ignored (noise filtering)
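Taken together, the ratio and thresholds above amount to a small function. The sketch below follows them directly; the name `vt_adjustment` and the `(points, label)` return shape are assumptions.

```python
from typing import Optional


def vt_adjustment(
    vt_malicious: int, vt_suspicious: int, vt_harmless: int
) -> tuple[int, Optional[str]]:
    """Return (score delta, reason) from VirusTotal verdict counts."""
    total = vt_malicious + vt_suspicious + vt_harmless
    ratio = vt_malicious / total if total > 0 else 0
    if vt_malicious >= 5 or ratio > 0.2:
        return 25, "VirusTotal strongly malicious"
    if vt_malicious >= 2:
        return 10, "VirusTotal mildly suspicious"
    return 0, None  # a single stray detection is treated as noise
```

The ratio branch matters when few engines have seen the domain: one malicious verdict out of three total gives a ratio above `0.2` and still triggers the strong case.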
- brand impersonation + login-style keyword => force at least HIGH band
- IP host + no HTTPS => force at least HIGH band
- new domain (<=90d) + suspicious keywords => force at least HIGH band
- `0..34` -> `LOW`
- `35..59` -> `MEDIUM`
- `60..79` -> `HIGH`
- `80..100` -> `CRITICAL`
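The band mapping and the "force at least HIGH" overrides can be combined as in the sketch below. The function and parameter names are hypothetical; the thresholds and the three override rules are the ones listed above.

```python
from typing import Optional

BANDS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def band_for(score: int) -> str:
    """Map a clamped 0..100 score onto a risk band."""
    if score >= 80:
        return "CRITICAL"
    if score >= 60:
        return "HIGH"
    if score >= 35:
        return "MEDIUM"
    return "LOW"


def final_band(
    score: int,
    *,
    brand_impersonation: bool = False,
    login_keyword: bool = False,
    ip_host: bool = False,
    https: bool = True,
    domain_age_days: Optional[int] = None,
    suspicious_keywords: bool = False,
) -> str:
    """Apply the rule-based overrides on top of the score band."""
    band = band_for(score)
    new_domain = domain_age_days is not None and domain_age_days <= 90
    forced_high = (
        (brand_impersonation and login_keyword)
        or (ip_host and not https)
        or (new_domain and suspicious_keywords)
    )
    # Overrides only raise the band; they never lower CRITICAL to HIGH.
    if forced_high and BANDS.index(band) < BANDS.index("HIGH"):
        return "HIGH"
    return band
```

A score of 20 with brand impersonation and a login keyword therefore reports `HIGH`, while a score of 90 stays `CRITICAL` regardless of the overrides.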
URL command prints:
- Risk Summary
- Reason
- Score Breakdown
- Indicators
- DNS
- Verdict
This makes the model behavior demo-friendly and auditable.
- Clone the repository: `git clone https://github.com/Pranavvvv-09/PhishGuard`
- Create and activate a virtual environment: `python3 -m venv .venv`, then `source .venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
Optional `.env`:
```bash
VIRUSTOTAL_API_KEY="your_api_key_here"
```

Help:

```bash
python3 -m phishguard_py --help
```

Domain analysis:

```bash
python3 -m phishguard_py domain google.com --rdap --vt --max-variants 25 --top 5
```

URL analysis:

```bash
python3 -m phishguard_py url "http://g00gle-login.example" --dns --rdap --vt
```

Compare hosts:

```bash
python3 -m phishguard_py compare google.com g00gle.com
```

Report export:

```bash
python3 -m phishguard_py url "https://example.com" --out reports/url_report --formats json,csv,html
python3 -m phishguard_py domain google.com --out reports/domain_report --formats json,csv,html
```