filescan is a Python tool for analyzing code at scale through flat graph representations.
Scan two layers of your codebase and export them as graph data:
- 📁 Filesystem Graph: Directory structure → parent/child relationships
- 🧠 Python AST Graph: Python source code → modules, classes, functions, and cross-file semantic relationships
Instead of nested trees, filescan produces flat node and edge tables (CSV/JSON) suitable for:
- 🔍 SQL and Pandas analysis
- 🤖 LLM code understanding (token-efficient)
- 📊 Static analysis pipelines
- 🗂️ Graph algorithms and embeddings
Traditional nested JSON/tree structures are:
- 📈 Verbose and deeply nested
- 🚩 Hard to filter, join, or aggregate
- 💰 Expensive for LLM token usage
- ⚠️ Inefficient for large codebases
filescan instead uses explicit node and edge tables:
nodes.csv (id, type/kind, name, metadata...)
edges.csv (id, source, target, relation, ...)
Benefits:
- ✅ Materialize relationships explicitly
- ✅ Works naturally with SQL, Pandas, DuckDB
- ✅ Lower token overhead for LLMs
- ✅ Great for graph algorithms
- ✅ Deterministic and reproducible
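To make the flat layout concrete, here is a minimal sketch that flattens a small nested tree into node and edge rows. The nested dict, the `flatten` and `to_csv` helpers, and the sequential integer IDs are illustrative assumptions, not filescan's implementation; filescan's real IDs are hash-based and its tables carry more columns.

```python
import csv
import io

# A tiny nested tree, the shape filescan avoids (illustrative data only).
tree = {
    "name": "src",
    "type": "d",
    "children": [
        {"name": "app.py", "type": "f", "children": []},
        {
            "name": "core",
            "type": "d",
            "children": [{"name": "utils.py", "type": "f", "children": []}],
        },
    ],
}


def flatten(root):
    """Walk the tree depth-first and return (nodes, edges) as flat row lists."""
    nodes, edges = [], []

    def visit(item, parent_id):
        node_id = len(nodes)  # sequential ID for the sketch; filescan uses hashes
        nodes.append({"id": node_id, "type": item["type"], "name": item["name"]})
        if parent_id is not None:
            edges.append(
                {"id": len(edges), "source": parent_id, "target": node_id, "relation": "contains"}
            )
        for child in item["children"]:
            visit(child, node_id)

    visit(root, None)
    return nodes, edges


def to_csv(rows):
    """Serialize a list of uniform dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes, edges = flatten(tree)
print(to_csv(nodes))
print(to_csv(edges))
```

Every parent/child relationship becomes one row in the edge table, so filtering or joining no longer requires walking the nesting.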
- Recursive directory traversal in deterministic order
- Parent → child edge relationships
- File size and metadata
- Supports .gitignore-style ignore rules (.fscanignore)
- CSV + JSON export
- Deterministic, collision-safe node IDs
- Definitions: Modules, classes, functions, methods
- Docstrings: First-line capture for context
- Signatures: Function/method parameters (best-effort)
- Line numbers: Exact source locations
- Cross-file relationships:
  - contains: class contains method
  - imports: module imports symbol
  - calls: function calls another function
  - inherits: class inherits from base class
  - references: any other reference
- Multi-root scanning support
- Unified programmatic API
- CLI with multiple commands (scan, watch, search, context, uml)
- File watcher with auto-rescan
- Hybrid text + semantic search
- Built-in ignore rules (Python-aware defaults)
pip install filescan
git clone https://github.com/DreamSoul-AI/filescan.git
cd filescan
pip install -e .
Requirements: Python 3.10+
# Scan filesystem structure
filescan scan ./src
# Include Python AST analysis
filescan scan ./src --ast
# Custom output location
filescan scan ./src --ast -o results/myproject
Output:
results/
├── myproject_nodes.csv
├── myproject_edges.csv
├── myproject.json
├── myproject_ast_nodes.csv
├── myproject_ast_edges.csv
└── myproject_ast.json
filescan watch ./src --debounce 0.5
Re-scans automatically when .py files change. Useful during development.
filescan search ./src "MyClass" \
--nodes graph_ast_nodes.csv \
--edges graph_ast_edges.csv
Finds all references to MyClass in the codebase with semantic context.
filescan uml ./src -o results/uml.md
from filescan import GraphBuilder
builder = GraphBuilder()
builder.build(roots=["./data"], include_filesystem=True, include_ast=False)
builder.export_filesystem("output/fs")
print(f"Found {len(builder.filesystem.nodes)} filesystem nodes")
from filescan import GraphBuilder
# Build and export
builder = GraphBuilder()
builder.build(roots=["./src"], include_filesystem=False, include_ast=True, ignore_file=".fscanignore")
builder.export_ast("output/ast")
print(f"Indexed symbols: {len(builder.ast.by_qname)}")
from filescan import GraphBuilder
from pathlib import Path
# Single-pass builder for both graphs
builder = GraphBuilder()
builder.build(
    roots=[Path("./src"), Path("./tests")],  # Multiple roots
    include_filesystem=True,
    include_ast=True,
    ignore_file=".fscanignore",
)
# Export with custom prefixes
builder.export_filesystem("output/fs")
builder.export_ast("output/ast")
# Access graph data programmatically
fs_nodes = builder.filesystem.nodes
fs_edges = list(builder.filesystem.edges)
ast_nodes = builder.ast.nodes
ast_edges = list(builder.ast.edges)
filescan supports gitignore-style patterns via pathspec.
- --ignore-file CLI argument (if provided)
- ./.fscanignore in working directory (if exists)
- Built-in defaults (ignore .git, __pycache__, .pyc, etc.)
# Version control
.git/
.hg/
# IDEs
.vscode/
.idea/
*.swp
# Python
__pycache__/
*.pyc
*.pyo
*.egg-info/
.venv/
venv/
# Build & dist
build/
dist/
*.egg
# Project-specific
node_modules/
.DS_Store
Patterns apply to both filesystem and AST scanning (ignored files are skipped).
Nodes (*_nodes.csv):
| Field | Type | Description |
|---|---|---|
| id | String | Unique node identifier (hash-based) |
| type | Char | 'd' (directory) or 'f' (file) |
| name | String | Base name of file/directory |
| abs_path | String | Absolute file system path |
| size | Integer | File size in bytes (null for dirs) |
Edges (*_edges.csv):
| Field | Type | Description |
|---|---|---|
| id | String | Unique edge identifier |
| source | String | Parent node ID |
| target | String | Child node ID |
| relation | String | Always 'contains' |
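Because the filesystem graph is just these two tables, aggregate questions reduce to a join plus a walk over `contains` edges. The sketch below sums file sizes per directory using only the standard library; the sample rows, the shortened IDs, and the `total_size` helper are made up for illustration and follow the schema described above.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the filesystem schema (IDs shortened for readability).
NODES_CSV = """id,type,name,abs_path,size
n1,d,src,/repo/src,
n2,f,app.py,/repo/src/app.py,1200
n3,d,core,/repo/src/core,
n4,f,utils.py,/repo/src/core/utils.py,800
"""
EDGES_CSV = """id,source,target,relation
e1,n1,n2,contains
e2,n1,n3,contains
e3,n3,n4,contains
"""

nodes = {row["id"]: row for row in csv.DictReader(io.StringIO(NODES_CSV))}
children = defaultdict(list)
for edge in csv.DictReader(io.StringIO(EDGES_CSV)):
    children[edge["source"]].append(edge["target"])


def total_size(node_id):
    """Sum file sizes under a node by following 'contains' edges."""
    node = nodes[node_id]
    if node["type"] == "f":
        return int(node["size"])
    return sum(total_size(child) for child in children[node_id])


for node_id, node in nodes.items():
    if node["type"] == "d":
        print(node["abs_path"], total_size(node_id))
```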
Nodes (*_nodes.csv):
| Field | Type | Description |
|---|---|---|
| id | String | Unique symbol identifier |
| kind | String | module, class, function, method |
| name | String | Symbol name (unqualified) |
| qualified_name | String | Full dotted path (e.g., module.Class.method) |
| module_path | String | File path relative to scan root |
| lineno | Integer | Starting line number (1-based) |
| end_lineno | Integer | Ending line number |
| signature | String | Function/method signature (best-effort) |
| doc | String | First line of docstring (if present) |
Edges (*_edges.csv):
| Field | Type | Description |
|---|---|---|
| id | String | Unique edge identifier |
| source | String | Source node ID |
| target | String | Target node ID |
| relation | String | Relationship type (see below) |
| lineno | Integer | Line where relationship occurs |
| end_lineno | Integer | End line of relationship |
Relation Types:
- contains: Parent symbol contains child (class → method, module → class)
- imports: Module imports a symbol
- calls: Function/method calls another
- inherits: Class inherits from base class
- references: Other semantic reference
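As an illustration of how the relation column is meant to be used, the sketch below loads a handful of rows into an in-memory SQLite database and asks "who calls `app.parse`?". The rows, the shortened IDs, and the reduced column set are assumptions made for the example; real exports contain the full schema listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ast_nodes (id TEXT PRIMARY KEY, kind TEXT, name TEXT, qualified_name TEXT);
    CREATE TABLE ast_edges (id TEXT PRIMARY KEY, source TEXT, target TEXT, relation TEXT, lineno INTEGER);
    """
)
# Made-up rows; real IDs are hash-based and real tables have more columns.
conn.executemany(
    "INSERT INTO ast_nodes VALUES (?, ?, ?, ?)",
    [
        ("m1", "module", "app", "app"),
        ("f1", "function", "main", "app.main"),
        ("f2", "function", "load", "app.load"),
        ("f3", "function", "parse", "app.parse"),
    ],
)
conn.executemany(
    "INSERT INTO ast_edges VALUES (?, ?, ?, ?, ?)",
    [
        ("e1", "m1", "f1", "contains", 1),
        ("e2", "m1", "f2", "contains", 5),
        ("e3", "m1", "f3", "contains", 9),
        ("e4", "f1", "f2", "calls", 2),
        ("e5", "f1", "f3", "calls", 3),
        ("e6", "f2", "f3", "calls", 6),
    ],
)

# Join edges to nodes twice: once for the caller, once for the callee.
callers = conn.execute(
    """
    SELECT caller.qualified_name, e.lineno
    FROM ast_edges e
    JOIN ast_nodes caller ON caller.id = e.source
    JOIN ast_nodes callee ON callee.id = e.target
    WHERE e.relation = 'calls' AND callee.qualified_name = ?
    ORDER BY caller.qualified_name
    """,
    ("app.parse",),
).fetchall()
print(callers)
```

Filtering on `relation` matters here: without it, the `contains` edge from the module to `app.parse` would also show up as an incoming edge.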
Feed flat CSVs to Claude/GPT for code analysis without token bloat:
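from pathlib import Path
# Read the exported CSVs (paths assume an AST scan with the default "graph" prefix)
nodes_csv = Path("graph_ast_nodes.csv").read_text(encoding="utf-8")
edges_csv = Path("graph_ast_edges.csv").read_text(encoding="utf-8")
context = f"# nodes\n{nodes_csv}\n# edges\n{edges_csv}"
# Pass `context` to the LLM client of your choice
Use SQL/DuckDB for queries: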
-- Find all functions with no docstring
SELECT name, qualified_name, module_path
FROM ast_nodes
WHERE kind = 'function' AND doc IS NULL;
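-- Find unused classes (no incoming calls, references, imports, or inheritance)
-- 'contains' edges are excluded: every class has one from its parent module
SELECT n.name
FROM ast_nodes n
LEFT JOIN ast_edges e
  ON n.id = e.target AND e.relation <> 'contains'
WHERE n.kind = 'class' AND e.id IS NULL;
Build dependency graphs, identify circular imports, plan migrations.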
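One way to identify circular imports from the edge table is a depth-first search for back edges. The sketch below assumes the `imports` edges have already been reduced to (importing module, imported module) pairs; that reduction, the `find_import_cycles` helper, and the module names are hypothetical and not part of filescan.

```python
from collections import defaultdict


def find_import_cycles(module_edges):
    """Return import cycles found by depth-first search over (importer, imported) pairs."""
    graph = defaultdict(list)
    for importer, imported in module_edges:
        graph[importer].append(imported)

    cycles = []
    state = {}  # module -> "visiting" while on the current DFS path, "done" afterwards
    path = []

    def visit(module):
        state[module] = "visiting"
        path.append(module)
        for dep in graph.get(module, []):
            if state.get(dep) == "visiting":
                # Back edge: the cycle is the path segment from dep to here.
                cycles.append(path[path.index(dep):] + [dep])
            elif dep not in state:
                visit(dep)
        path.pop()
        state[module] = "done"

    for module in sorted(graph):
        if module not in state:
            visit(module)
    return cycles


# Hypothetical module-level pairs derived from 'imports' edges.
edges = [
    ("app.main", "app.db"),
    ("app.db", "app.models"),
    ("app.models", "app.db"),
    ("app.main", "app.utils"),
]
print(find_import_cycles(edges))
```

This reports one cycle per back edge rather than enumerating every elementary cycle, which is usually enough to locate the imports that need untangling.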
Compute embeddings per node, build semantic search indexes.
Load into Pandas for custom analysis:
import pandas as pd
nodes = pd.read_csv("ast_nodes.csv")
edges = pd.read_csv("ast_edges.csv")
# Classes per module
classes_per_module = nodes[nodes['kind'] == 'class'].groupby('module_path').size()
print(classes_per_module)
Run filesystem and/or AST scan.
filescan scan <ROOT> [OPTIONS]
Options:
- --ignore-file FILE: Use custom ignore file
- --ast: Include Python AST scan (default: filesystem only)
- --ast-only: Skip filesystem, only scan AST
- -o, --output PREFIX: Output file prefix (default: graph)
- --output-ast PREFIX: Separate prefix for AST output
Examples:
# Filesystem only
filescan scan ./src -o results/fs
# Both filesystem and AST
filescan scan ./src --ast -o results/project
# Custom ignore file
filescan scan ./src --ast --ignore-file .scanignore -o results/custom
Watch a project directory and auto-scan on Python file changes.
filescan watch <ROOT> [OPTIONS]
Options:
- --ignore-file FILE: Use custom ignore file
- -o, --output PREFIX: Output file prefix (default: graph)
- --output-ast PREFIX: Separate prefix for AST output
- --debounce SECONDS: Debounce interval (default: 0.5)
Example:
# Watch for changes; rescans are debounced by 0.5 seconds
filescan watch ./src -o results/live --debounce 0.5
Press Ctrl+C to stop watching.
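The debounce interval can be pictured with a small model: a rescan fires only once the interval has passed with no further file events, so a burst of saves collapses into one rescan. The function below is a sketch of that trailing-edge behaviour under this assumption; `rescan_times` is a hypothetical helper, not filescan's watcher code.

```python
def rescan_times(event_times, debounce=0.5):
    """Return the moments a debounced watcher would fire for the given event timestamps.

    A rescan fires once `debounce` seconds pass with no further events, so a burst
    of rapid saves collapses into a single rescan.
    """
    fired = []
    last_event = None
    for t in sorted(event_times):
        if last_event is not None and t - last_event >= debounce:
            # The quiet period elapsed before this event arrived: the timer fired.
            fired.append(last_event + debounce)
        last_event = t
    if last_event is not None:
        fired.append(last_event + debounce)  # timer for the final burst
    return fired


# Three quick saves, a pause, then one more save (timestamps in seconds).
print(rescan_times([0.0, 0.25, 0.5, 2.0]))
```

With these timestamps the first three events produce a single rescan and the later save produces a second one.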
Search an existing AST graph with semantic context.
filescan search <ROOT> <QUERY> --nodes <FILE> --edges <FILE>
Required:
- <ROOT>: Project root (must match AST scan root)
- <QUERY>: Search query (symbol name)
- --nodes FILE: Path to AST nodes CSV
- --edges FILE: Path to AST edges CSV
Example:
filescan search ./src "MyClass" \
--nodes graph_ast_nodes.csv \
--edges graph_ast_edges.csv
Output:
Shows all occurrences with semantic context:
- Match type (definition, call, reference, import, inherit)
- File path and line number
- Source code context
- Definition source (if available)
Build AST graph and export a Mermaid class diagram markdown file.
filescan uml <ROOT> [OPTIONS]
Options:
- --ignore-file FILE: Use custom ignore file
- -o, --output FILE: Output markdown path (default: graph_uml.md)
- --show-private: Include private methods in class diagrams
- --module-path-filter TEXT: Include only nodes whose module_path contains TEXT
- --title TEXT: Markdown title (default: AST UML)
Example:
filescan uml ./src -o results/uml.md --module-path-filter "core/"
filescan/
├── src/filescan/
│ ├── __init__.py # Public API
│ ├── base.py # ScannerBase (ID generation, ignore handling)
│ ├── scanner.py # Filesystem scanner
│ ├── ast_scanner.py # Python AST scanner (astroid)
│ ├── graph_builder.py # Unified builder
│ ├── search_engine.py # Hybrid search (ripgrep + AST)
│ ├── file_watcher.py # File change watcher
│ ├── utils.py # Utilities
│ ├── commands/
│ │ └── cli.py # CLI entry point
│ └── default.fscanignore # Built-in ignore rules
├── tests/ # Unit tests
├── examples/ # Usage examples
└── pyproject.toml
pytest tests/
python examples/scan_self.py
python examples/search_self.py
python -m examples.watch_self
# Build wheel
python -m build
# Upload to PyPI
python -m twine upload dist/*
| Feature | filescan | AST-only tools | Tree JSON | ripgrep |
|---|---|---|---|---|
| Filesystem structure | ✅ | ❌ | ❌ | ❌ |
| Python AST | ✅ | ✅ | ❌ | ❌ |
| Flat graph design | ✅ | ❌ | ❌ | ❌ |
| CSV/SQL-friendly | ✅ | ❌ | ❌ | ✅ |
| Cross-file semantics | ✅ | ~Some | ❌ | ❌ |
| CLI + Library | ✅ | ~Some | ❌ | ❌ |
| LLM-optimized | ✅ | ❌ | ❌ | ❌ |
All node and edge IDs are generated deterministically from content hashes. This ensures:
- ✅ Reproducible scans (same input → same IDs)
- ✅ Collision detection and handling
- ✅ Stable linking across multiple scans
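A minimal sketch of content-derived IDs, assuming the identifying parts of a node (for example its type and path) are hashed with SHA-256 and truncated. The `make_id` and `assign_ids` helpers, the separator, and the 12-character length are illustrative choices; filescan's actual hashing scheme and collision handling may differ.

```python
import hashlib


def make_id(*parts, length=12):
    """Derive a stable ID by hashing the parts that identify a node or edge."""
    # Join with a control character that is unlikely to appear in names or paths,
    # so ("ab", "c") and ("a", "bc") hash differently.
    key = "\x1f".join(parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length]


def assign_ids(keys):
    """Map each key tuple to its ID and fail loudly if two keys collide."""
    ids = {}
    seen = {}
    for key in keys:
        node_id = make_id(*key)
        if seen.setdefault(node_id, key) != key:
            raise ValueError(f"ID collision between {seen[node_id]!r} and {key!r}")
        ids[key] = node_id
    return ids


ids = assign_ids([("f", "src/app.py"), ("d", "src")])
for key, node_id in ids.items():
    print(node_id, key)
```

Because the ID depends only on the inputs, scanning the same tree twice yields the same IDs, which is what makes joins across scans possible.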
- Pass 1 — Definitions: Collect all module/class/function definitions across all Python files
- Pass 2 — Relationships: Resolve cross-file imports, calls, inherits, and references
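Because every definition is indexed before any relationship is resolved, cross-file references can be resolved in the second pass regardless of the order in which files are scanned.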
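The two passes can be sketched with the standard library `ast` module. This is a simplified stand-in: filescan's scanner is built on astroid, and the in-memory sources, module names, and tuple-shaped edges below are assumptions made for the example. Only top-level functions, `from ... import ...` statements, and calls by bare name are handled.

```python
import ast

# Two in-memory "files" (hypothetical module names and contents).
SOURCES = {
    "app": "from helpers import greet\n\ndef main():\n    greet()\n",
    "helpers": "def greet():\n    print('hi')\n",
}

trees = {name: ast.parse(code) for name, code in SOURCES.items()}

# Pass 1: index every top-level function definition by qualified name.
definitions = {}
for module, tree in trees.items():
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            definitions[f"{module}.{node.name}"] = node.lineno

# Pass 2: resolve imports and calls against the completed index.
edges = []
for module, tree in trees.items():
    imported = {}  # local name -> qualified name
    for node in tree.body:
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                qname = f"{node.module}.{alias.name}"
                if qname in definitions:
                    imported[alias.asname or alias.name] = qname
                    edges.append((module, qname, "imports", node.lineno))
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                local = f"{module}.{node.func.id}"
                target = imported.get(node.func.id) or (local if local in definitions else None)
                if target:
                    edges.append((f"{module}.{func.name}", target, "calls", node.lineno))

for edge in edges:
    print(edge)
```

Note that `app` is processed before `helpers`, yet the call to `greet` still resolves, because the definition index was completed in the first pass.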
The search engine combines:
- ripgrep for fast text search
- AST index for semantic enrichment
Results are ranked by semantic priority (definition > call > reference > import).
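The ranking step can be pictured as a sort keyed on match type. The sketch below uses the priority order stated above; the match dictionaries, the `rank` helper, and the tie-breaking by path and line are assumptions for illustration rather than the search engine's actual data model.

```python
# Lower number = higher priority, following the order stated above.
PRIORITY = {"definition": 0, "call": 1, "reference": 2, "import": 3}

# Hypothetical matches, as a text search might return them before ranking.
matches = [
    {"path": "app/cli.py", "line": 3, "type": "import"},
    {"path": "app/cli.py", "line": 40, "type": "call"},
    {"path": "app/models.py", "line": 12, "type": "definition"},
    {"path": "app/views.py", "line": 8, "type": "reference"},
]


def rank(results):
    """Sort by semantic priority, then by path and line for a stable, readable order."""
    return sorted(
        results,
        key=lambda m: (PRIORITY.get(m["type"], len(PRIORITY)), m["path"], m["line"]),
    )


ranked = rank(matches)
for m in ranked:
    print(f'{m["type"]:<10} {m["path"]}:{m["line"]}')
```

Match types not listed in `PRIORITY` sort last in this sketch.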
Typical scan times (on modern hardware):
| Target | Filesystem | AST | Combined |
|---|---|---|---|
| 1K files | ~50ms | ~200ms | ~250ms |
| 10K files | ~200ms | ~2s | ~2.2s |
| 100K files | ~2s | ~20s | ~22s |
filescan is designed for sub-second iteration during development via watch mode.
- AST signatures are best-effort (lambda functions, comprehensions may be incomplete)
- Dynamically resolved imports (e.g., __import__, exec) are not captured
- Relationship extraction depends on static analysis (no runtime tracing)
- Search is file-relative; cross-filesystem projects need unified root
Contributions welcome! Open issues or PRs on GitHub.
MIT License — see LICENSE file.
- 📖 Documentation: See this README and examples/
- 🐛 Report Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
Made with ❤️ by DreamSoul