filescan

filescan is a Python tool for analyzing code at scale through flat graph representations.

Scan two layers of your codebase and export them as graph data:

📁 Filesystem Graph: Directory structure → parent/child relationships
🧠 Python AST Graph: Python source code → modules, classes, functions, and cross-file semantic relationships

Instead of nested trees, filescan produces flat node and edge tables (CSV/JSON) suitable for:

🔍 SQL and Pandas analysis
🤖 LLM code understanding (token-efficient)
📊 Static analysis pipelines
🗂️ Graph algorithms and embeddings

Why Flat Graphs?

Traditional nested JSON/tree structures are:

📈 Verbose and deeply nested
🚩 Hard to filter, join, or aggregate
💰 Expensive for LLM token usage
⚠️ Inefficient for large codebases

filescan instead uses explicit node and edge tables:

nodes.csv     (id, type/kind, name, metadata...)
edges.csv     (id, source, target, relation, ...)

Benefits:

✅ Materialize relationships explicitly
✅ Works naturally with SQL, Pandas, DuckDB
✅ Lower token overhead for LLMs
✅ Great for graph algorithms
✅ Deterministic and reproducible

Features

📁 Filesystem Scanner

Recursive directory traversal in deterministic order
Parent → child edge relationships
File size and metadata
Supports .gitignore-style ignore rules (.fscanignore)
CSV + JSON export
Deterministic, collision-safe node IDs

🧠 Python AST Scanner

Definitions: Modules, classes, functions, methods
Docstrings: First-line capture for context
Signatures: Function/method parameters (best-effort)
Line numbers: Exact source locations
Cross-file relationships:
- contains — class contains method
- imports — module imports symbol
- calls — function calls another function
- inherits — class inherits from base class
- references — any other reference

⚙️ Core Capabilities

Multi-root scanning support
Unified programmatic API
CLI with multiple commands (scan, watch, search, context, uml)
File watcher with auto-rescan
Hybrid text + semantic search
Built-in ignore rules (Python-aware defaults)

Installation

From PyPI

pip install filescan

Development Mode

git clone https://github.com/DreamSoul-AI/filescan.git
cd filescan
pip install -e .

Requirements: Python 3.10+

Quick Start

CLI: Scan a Directory

# Scan filesystem structure
filescan scan ./src

# Include Python AST analysis
filescan scan ./src --ast

# Custom output location
filescan scan ./src --ast -o results/myproject

Output:

results/
├── myproject_nodes.csv
├── myproject_edges.csv
├── myproject.json
├── myproject_ast_nodes.csv
├── myproject_ast_edges.csv
└── myproject_ast.json

CLI: Watch Mode (Auto-Rescan)

filescan watch ./src --debounce 0.5

Re-scans automatically when .py files change. Useful during development.

CLI: Semantic Search

filescan search ./src "MyClass" \
  --nodes graph_ast_nodes.csv \
  --edges graph_ast_edges.csv

Finds all references to MyClass in the codebase with semantic context.

CLI: Export Mermaid UML

filescan uml ./src -o results/uml.md

Library Usage

Filesystem Scanner

from filescan import GraphBuilder

builder = GraphBuilder()
builder.build(roots=["./data"], include_filesystem=True, include_ast=False)
builder.export_filesystem("output/fs")

print(f"Found {len(builder.filesystem.nodes)} filesystem nodes")

Python AST Scanner

from filescan import GraphBuilder

# Build and export
builder = GraphBuilder()
builder.build(roots=["./src"], include_filesystem=False, include_ast=True, ignore_file=".fscanignore")
builder.export_ast("output/ast")

print(f"Indexed symbols: {len(builder.ast.by_qname)}")

Advanced: GraphBuilder

from filescan import GraphBuilder
from pathlib import Path

# Single-pass builder for both graphs
builder = GraphBuilder()

builder.build(
    roots=[Path("./src"), Path("./tests")],  # Multiple roots
    include_filesystem=True,
    include_ast=True,
    ignore_file=".fscanignore"
)

# Export with custom prefixes
builder.export_filesystem("output/fs")
builder.export_ast("output/ast")

# Access graph data programmatically
fs_nodes = builder.filesystem.nodes
fs_edges = list(builder.filesystem.edges)
ast_nodes = builder.ast.nodes
ast_edges = list(builder.ast.edges)

Ignore Rules (`.fscanignore`)

filescan supports gitignore-style patterns via pathspec.

Resolution order:

--ignore-file CLI argument (if provided)
./.fscanignore in working directory (if exists)
Built-in defaults (ignore .git, __pycache__, .pyc, etc.)

Example `.fscanignore`:

# Version control
.git/
.hg/

# IDEs
.vscode/
.idea/
*.swp

# Python
__pycache__/
*.pyc
*.pyo
*.egg-info/
.venv/
venv/

# Build & dist
build/
dist/
*.egg

# Project-specific
node_modules/
.DS_Store

Patterns apply to both filesystem and AST scanning (ignored files are skipped).

Output Schemas

Filesystem Graph

Nodes (*_nodes.csv):

Field	Type	Description
`id`	String	Unique node identifier (hash-based)
`type`	Char	`'d'` (directory) or `'f'` (file)
`name`	String	Base name of file/directory
`abs_path`	String	Absolute file system path
`size`	Integer	File size in bytes (`null` for dirs)

Edges (*_edges.csv):

Field	Type	Description
`id`	String	Unique edge identifier
`source`	String	Parent node ID
`target`	String	Child node ID
`relation`	String	Always `'contains'`

Python AST Graph

Nodes (*_nodes.csv):

Field	Type	Description
`id`	String	Unique symbol identifier
`kind`	String	`module`, `class`, `function`, `method`
`name`	String	Symbol name (unqualified)
`qualified_name`	String	Full dotted path (e.g., `module.Class.method`)
`module_path`	String	File path relative to scan root
`lineno`	Integer	Starting line number (1-based)
`end_lineno`	Integer	Ending line number
`signature`	String	Function/method signature (best-effort)
`doc`	String	First line of docstring (if present)

Edges (*_edges.csv):

Field	Type	Description
`id`	String	Unique edge identifier
`source`	String	Source node ID
`target`	String	Target node ID
`relation`	String	Relationship type (see below)
`lineno`	Integer	Line where relationship occurs
`end_lineno`	Integer	End line of relationship

Relation Types:

contains — Parent symbol contains child (class → method, module → class)
imports — Module imports a symbol
calls — Function/method calls another
inherits — Class inherits from base class
references — Other semantic reference

Use Cases

1. LLM-Based Code Understanding

Feed flat CSVs to Claude/GPT for code analysis without token bloat:

nodes_csv, edges_csv = scan_project(root)
context = build_context_for_llm(nodes_csv, edges_csv)
response = llm.analyze(context)

2. Static Analysis & Code Quality

Use SQL/DuckDB for queries:

-- Find all functions with no docstring
SELECT name, qualified_name, module_path
FROM ast_nodes
WHERE kind = 'function' AND doc IS NULL;

-- Find unused classes (no incoming calls/references)
SELECT n.name
FROM ast_nodes n
LEFT JOIN ast_edges e ON n.id = e.target
WHERE n.kind = 'class' AND e.id IS NULL;

3. Refactoring & Architecture

Build dependency graphs, identify circular imports, plan migrations.

4. Embeddings & Search

Compute embeddings per node, build semantic search indexes.

5. Data Pipelines

Load into Pandas for custom analysis:

import pandas as pd

nodes = pd.read_csv("ast_nodes.csv")
edges = pd.read_csv("ast_edges.csv")

# Classes per module
classes_per_module = nodes[nodes['kind'] == 'class'].groupby('module_path').size()
print(classes_per_module)

Commands Reference

`filescan scan`

Run filesystem and/or AST scan.

filescan scan <ROOT> [OPTIONS]

Options:

--ignore-file FILE — Use custom ignore file
--ast — Include Python AST scan (default: filesystem only)
--ast-only — Skip filesystem, only scan AST
-o, --output PREFIX — Output file prefix (default: graph)
--output-ast PREFIX — Separate prefix for AST output

Examples:

# Filesystem only
filescan scan ./src -o results/fs

# Both filesystem and AST
filescan scan ./src --ast -o results/project

# Custom ignore file
filescan scan ./src --ast --ignore-file .scanignore -o results/custom

`filescan watch`

Watch a project directory and auto-scan on Python file changes.

filescan watch <ROOT> [OPTIONS]

Options:

--ignore-file FILE — Use custom ignore file
-o, --output PREFIX — Output file prefix (default: graph)
--output-ast PREFIX — Separate prefix for AST output
--debounce SECONDS — Debounce interval (default: 0.5)

Example:

# Watch for changes, rescan every 0.5 seconds
filescan watch ./src -o results/live --debounce 0.5

Press Ctrl+C to stop watching.

`filescan search`

Search an existing AST graph with semantic context.

filescan search <ROOT> <QUERY> --nodes <FILE> --edges <FILE>

Required:

<ROOT> — Project root (must match AST scan root)
<QUERY> — Search query (symbol name)
--nodes FILE — Path to AST nodes CSV
--edges FILE — Path to AST edges CSV

Example:

filescan search ./src "MyClass" \
  --nodes graph_ast_nodes.csv \
  --edges graph_ast_edges.csv

Output:

Shows all occurrences with semantic context:

Match type (definition, call, reference, import, inherit)
File path and line number
Source code context
Definition source (if available)

`filescan uml`

Build AST graph and export a Mermaid class diagram markdown file.

filescan uml <ROOT> [OPTIONS]

Options:

--ignore-file FILE 鈥?Use custom ignore file
-o, --output FILE 鈥?Output markdown path (default: graph_uml.md)
--show-private 鈥?Include private methods in class diagrams
--module-path-filter TEXT 鈥?Include only nodes whose module_path contains TEXT
--title TEXT 鈥?Markdown title (default: AST UML)

Example:

filescan uml ./src -o results/uml.md --module-path-filter "core/"

Development

Project Structure

filescan/
├── src/filescan/
│   ├── __init__.py              # Public API
│   ├── base.py                  # ScannerBase (ID generation, ignore handling)
│   ├── scanner.py               # Filesystem scanner
│   ├── ast_scanner.py           # Python AST scanner (astroid)
│   ├── graph_builder.py         # Unified builder
│   ├── search_engine.py         # Hybrid search (ripgrep + AST)
│   ├── file_watcher.py          # File change watcher
│   ├── utils.py                 # Utilities
│   ├── commands/
│   │   └── cli.py               # CLI entry point
│   └── default.fscanignore      # Built-in ignore rules
├── tests/                        # Unit tests
├── examples/                     # Usage examples
└── pyproject.toml

Running Tests

pytest tests/

Running Examples

python examples/scan_self.py
python examples/search_self.py
python -m examples.watch_self

Building & Publishing

# Build wheel
python -m build

# Upload to PyPI
python -m twine upload dist/*

Comparison to Alternatives

Feature	`filescan`	AST-only tools	Tree JSON	`ripgrep`
Filesystem structure	✅	❌	❌	❌
Python AST	✅	✅	❌	❌
Flat graph design	✅	❌	❌	❌
CSV/SQL-friendly	✅	❌	❌	✅
Cross-file semantics	✅	~Some	❌	❌
CLI + Library	✅	~Some	❌	❌
LLM-optimized	✅	❌	❌	❌

Architecture Notes

Deterministic IDs

All node and edge IDs are generated deterministically from content hashes. This ensures:

✅ Reproducible scans (same input → same IDs)
✅ Collision detection and handling
✅ Stable linking across multiple scans

Two-Pass AST Scanning

Pass 1 — Definitions: Collect all module/class/function definitions across all Python files
Pass 2 — Relationships: Resolve cross-file imports, calls, inherits, and references

This ensures all semantic relationships are resolvable in a single pass.

Hybrid Search

The search engine combines:

ripgrep for fast text search
AST index for semantic enrichment

Results are ranked by semantic priority (definition > call > reference > import).

Performance

Typical scan times (on modern hardware):

Target	Filesystem	AST	Combined
1K files	~50ms	~200ms	~250ms
10K files	~200ms	~2s	~2.2s
100K files	~2s	~20s	~22s

filescan is designed for sub-second iteration during development via watch mode.

Limitations & Known Issues

AST signatures are best-effort (lambda functions, comprehensions may be incomplete)
Dynamically resolved imports (e.g., __import__, exec) are not captured
Relationship extraction depends on static analysis (no runtime tracing)
Search is file-relative; cross-filesystem projects need unified root

Contributing

Contributions welcome! Open issues or PRs on GitHub.

License

MIT License — see LICENSE file.

Support

📖 Documentation: See this README and examples/
🐛 Report Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ by DreamSoul

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github/workflows		.github/workflows
examples		examples
src/filescan		src/filescan
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
main.sh		main.sh
package.ps1		package.ps1
package.sh		package.sh
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

filescan

Why Flat Graphs?

Features

📁 Filesystem Scanner

🧠 Python AST Scanner

⚙️ Core Capabilities

Installation

From PyPI

Development Mode

Quick Start

CLI: Scan a Directory

CLI: Watch Mode (Auto-Rescan)

CLI: Semantic Search

CLI: Export Mermaid UML

Library Usage

Filesystem Scanner

Python AST Scanner

Advanced: GraphBuilder

Ignore Rules (.fscanignore)

Resolution order:

Example .fscanignore:

Output Schemas

Filesystem Graph

Python AST Graph

Use Cases

1. LLM-Based Code Understanding

2. Static Analysis & Code Quality

3. Refactoring & Architecture

4. Embeddings & Search

5. Data Pipelines

Commands Reference

filescan scan

filescan watch

filescan search

filescan uml

Development

Project Structure

Running Tests

Running Examples

Building & Publishing

Comparison to Alternatives

Architecture Notes

Deterministic IDs

Two-Pass AST Scanning

Hybrid Search

Performance

Limitations & Known Issues

Contributing

License

Support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Ignore Rules (`.fscanignore`)

Example `.fscanignore`:

`filescan scan`

`filescan watch`

`filescan search`

`filescan uml`

Packages