A multithreaded web crawler and content analysis engine built in Java 17. Point it at any seed URL and it fans out across the site, then produces a rich analysis report covering word frequency, TF-IDF scoring, link-graph structure, and crawl performance — exported as JSON and CSV.
| Capability | Details |
|---|---|
| Concurrent crawling | CompletableFuture + fixed thread pool, BFS traversal |
| Rate limiting | Semaphore-based global request cap + configurable delay |
| robots.txt | Parsed and cached per host; opt-out flag available |
| Word analysis | Global frequency + per-page TF-IDF scoring via Streams API |
| Link graph | In-degree map, hub pages, orphan detection, external link count |
| Performance stats | Fetch time distribution, status code breakdown, slowest pages |
| Dual export | Pretty-printed JSON report + RFC 4180 CSV for pandas/Excel |
| Unit tested | JUnit 5 suite covering analyzers, config builder, and records |
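The "per-page TF-IDF scoring via Streams API" row can be illustrated with a minimal sketch. It assumes the textbook weighting (term frequency = count / page length, inverse document frequency = ln(total pages / pages containing the term)); the README does not state which variant `WordFrequencyAnalyzer` uses, and the class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: not the project's WordFrequencyAnalyzer.
class TfIdfSketch {

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Document frequency: number of pages in which each term occurs at least once.
    static Map<String, Long> documentFrequency(List<List<String>> pages) {
        return pages.stream()
                .flatMap(words -> words.stream().distinct())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // TF-IDF for one page (assumed weighting): tf = count / pageLength, idf = ln(totalPages / df).
    static Map<String, Double> tfIdf(List<String> page, Map<String, Long> df, int totalPages) {
        Map<String, Long> counts = page.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> ((double) e.getValue() / page.size())
                                * Math.log((double) totalPages / df.get(e.getKey()))));
    }

    public static void main(String[] args) {
        List<List<String>> pages = List.of(
                tokenize("Java crawler crawler"),
                tokenize("Java streams"));
        Map<String, Long> df = documentFrequency(pages);
        System.out.println(tfIdf(pages.get(0), df, pages.size()));
    }
}
```

With this weighting, a term that appears on every page scores zero, which is why a site-wide word such as a brand name drops out of the per-page rankings.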
src/
└── main/java/scraper/
    ├── Main.java                       # CLI entry point
    ├── config/
    │   └── ScraperConfig.java          # Immutable config (Builder pattern)
    ├── core/
    │   ├── WebCrawler.java             # BFS orchestrator (CompletableFuture)
    │   ├── PageFetcher.java            # Rate-limited HTTP + jsoup parser
    │   └── RobotsParser.java           # robots.txt cache (ConcurrentHashMap)
    ├── model/
    │   ├── PageData.java               # Java 17 record, immutable page snapshot
    │   └── AnalysisResult.java         # Aggregated analysis record
    ├── analyzer/
    │   ├── DataAnalyzer.java           # @FunctionalInterface, Strategy pattern
    │   ├── WordFrequencyAnalyzer.java  # TF-IDF via Streams
    │   ├── LinkGraphAnalyzer.java      # Link graph metrics
    │   ├── PerformanceAnalyzer.java    # Crawl health stats
    │   └── AnalysisPipeline.java       # Composite runner
    ├── export/
    │   ├── JsonExporter.java           # Gson pretty-print
    │   └── CsvExporter.java            # RFC 4180 CSV
    └── util/
        └── AppLogger.java              # Thread-safe timestamped logger
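How the `core/` pieces might fit together can be sketched as follows: a fixed thread pool runs one BFS level as `CompletableFuture` tasks, while a `Semaphore` caps the number of requests in flight and a delay follows each request. This is an illustration under assumptions, not the project's code: the class name, the `fetchLevel` helper, and the choice to hold the permit during the delay are all hypothetical, and the fetch itself is passed in as a function so that no network access is needed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: not the project's PageFetcher or WebCrawler.
class RateLimitedFetchSketch {
    private final Semaphore permits;                       // caps requests in flight across all threads
    private final long delayMs;                            // pause after each request
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    RateLimitedFetchSketch(int maxConcurrent, long delayMs) {
        this.permits = new Semaphore(maxConcurrent);
        this.delayMs = delayMs;
    }

    // Runs the supplied fetch function while holding a permit, then waits delayMs.
    <T> T fetch(String url, Function<String, T> fetcher) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a permit", e);
        }
        try {
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            T result = fetcher.apply(url);
            Thread.sleep(delayMs);                         // assumption: delay is served while holding the permit
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during the delay", e);
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    // Highest number of simultaneous requests observed so far.
    int peakInFlight() {
        return peak.get();
    }

    // Fetches every URL of one BFS level concurrently and waits for all of them.
    static <T> List<T> fetchLevel(List<String> urls, RateLimitedFetchSketch limiter,
                                  Function<String, T> fetcher, ExecutorService pool) {
        List<CompletableFuture<T>> futures = urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> limiter.fetch(u, fetcher), pool))
                .collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            RateLimitedFetchSketch limiter = new RateLimitedFetchSketch(2, 50);
            List<String> level = List.of("/a", "/b", "/c", "/d");
            List<Integer> lengths = fetchLevel(level, limiter, String::length, pool);
            System.out.println(lengths + " peak in flight: " + limiter.peakInFlight());
        } finally {
            pool.shutdown();
        }
    }
}
```

Even with four worker threads, the two-permit semaphore keeps at most two requests active at once, so the thread count and the request cap can be tuned independently.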
- Builder: `ScraperConfig` keeps construction readable and enforces immutability
- Strategy: the `DataAnalyzer` interface lets new analyzers be plugged in without changing how the pipeline runs them
- Composite: `AnalysisPipeline` runs and merges all strategies transparently
- Template Method: `AnalysisPipeline.run()` defines the fixed sequence; analyzers fill in the steps
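The Builder item above can be sketched as an immutable configuration object with fluent setters. The class, field names, and validation rules here are assumptions for illustration, not the real `ScraperConfig` API; only the default values (depth 3, 100 pages, 500 ms delay) are taken from the CLI options in this README.

```java
import java.util.Objects;

// Hypothetical sketch: field names and validation are assumptions, not the real ScraperConfig.
final class CrawlConfigSketch {
    private final String seedUrl;
    private final int maxDepth;
    private final int maxPages;
    private final long delayMs;

    // Only the builder can construct instances; all fields are final afterwards.
    private CrawlConfigSketch(Builder b) {
        this.seedUrl = b.seedUrl;
        this.maxDepth = b.maxDepth;
        this.maxPages = b.maxPages;
        this.delayMs = b.delayMs;
    }

    String seedUrl() { return seedUrl; }
    int maxDepth() { return maxDepth; }
    int maxPages() { return maxPages; }
    long delayMs() { return delayMs; }

    static Builder builder(String seedUrl) {
        return new Builder(seedUrl);
    }

    static final class Builder {
        private final String seedUrl;
        private int maxDepth = 3;      // README default for --depth
        private int maxPages = 100;    // README default for --pages
        private long delayMs = 500;    // README default for --delay

        private Builder(String seedUrl) {
            this.seedUrl = Objects.requireNonNull(seedUrl, "seedUrl");
        }

        Builder maxDepth(int value) {
            if (value < 0) throw new IllegalArgumentException("maxDepth must be >= 0");
            this.maxDepth = value;
            return this;
        }

        Builder maxPages(int value) {
            if (value < 1) throw new IllegalArgumentException("maxPages must be >= 1");
            this.maxPages = value;
            return this;
        }

        Builder delayMs(long value) {
            if (value < 0) throw new IllegalArgumentException("delayMs must be >= 0");
            this.delayMs = value;
            return this;
        }

        CrawlConfigSketch build() {
            return new CrawlConfigSketch(this);
        }
    }

    public static void main(String[] args) {
        CrawlConfigSketch config = CrawlConfigSketch.builder("https://example.com")
                .maxDepth(2)
                .maxPages(50)
                .delayMs(300)
                .build();
        System.out.println(config.seedUrl() + " depth=" + config.maxDepth());
    }
}
```

Because validation happens in the setters and the object exposes no mutators, a built configuration can be shared across crawler threads without synchronization.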
- Java 17+
- Maven 3.8+
mvn package -q

This produces a runnable fat-jar at target/web-scraper-analyzer-1.0.0.jar.
# Crawl example.com, max 2 levels deep, 50 pages
java -jar target/web-scraper-analyzer-1.0.0.jar https://example.com \
--depth 2 --pages 50 --delay 300
# Restrict to one domain, 8 threads
java -jar target/web-scraper-analyzer-1.0.0.jar https://news.ycombinator.com \
--domain ycombinator.com --threads 8 --out results/hn

--depth <n>     Max crawl depth (default: 3)
--pages <n>     Max pages to scrape (default: 100)
--delay <ms>    Request delay per thread (default: 500 ms)
--threads <n>   Worker thread count (default: CPU cores)
--domain <d>    Restrict to this domain
--out <dir>     Output directory (default: output/)
--no-robots     Ignore robots.txt
--no-json       Skip JSON export
--no-csv        Skip CSV export
output/
├── report.json # Full analysis — word freq, TF-IDF, link graph, perf stats
└── pages.csv # One row per page — URL, title, status, word count, etc.
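Page titles often contain commas or quotes, which is where RFC 4180 quoting matters for `pages.csv`. The sketch below shows the quoting rule only (wrap a field in double quotes when it contains a comma, quote, CR, or LF, and double any embedded quotes); the class and method names are hypothetical and not taken from `CsvExporter`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: not the project's CsvExporter.
class CsvEscapeSketch {

    // Quotes a field only when RFC 4180 requires it; embedded quotes are doubled.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins escaped fields with commas and ends the record with CRLF, as RFC 4180 specifies.
    static String row(List<String> fields) {
        return fields.stream()
                .map(CsvEscapeSketch::escape)
                .collect(Collectors.joining(",")) + "\r\n";
    }

    public static void main(String[] args) {
        System.out.print(row(List.of("url", "title", "status")));
        System.out.print(row(List.of("https://example.com", "Home, sweet home", "200")));
    }
}
```

Files written this way load without extra options in tools that follow the same convention, such as `pandas.read_csv`.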
mvn test

Implement `DataAnalyzer` and register it in `AnalysisPipeline.defaultPipeline()`:
import java.util.List;
import java.util.Map;

public class SentimentAnalyzer implements DataAnalyzer {
    @Override public String name() { return "Sentiment"; }

    @Override
    public Map<String, Object> analyze(List<PageData> pages) {
        // your logic here
        return Map.of("positivePages", ...);
    }
}

That's it: apart from the registration in `defaultPipeline()`, no other files change.
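To show why registration is the only other change, here is a self-contained sketch of a Strategy interface plus a composite runner. The types `PageSketch`, `AnalyzerSketch`, and `PipelineSketch` are hypothetical stand-ins for `PageData`, `DataAnalyzer`, and `AnalysisPipeline`; their real signatures are not shown in this README, so the shapes below (including the default `name()` method) are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for PageData.
record PageSketch(String url, int wordCount) {}

// Hypothetical stand-in for DataAnalyzer: one abstract method, so it stays a functional interface.
@FunctionalInterface
interface AnalyzerSketch {
    Map<String, Object> analyze(List<PageSketch> pages);

    default String name() {
        return getClass().getSimpleName();
    }
}

final class PageCountAnalyzer implements AnalyzerSketch {
    @Override public String name() { return "PageCount"; }

    @Override
    public Map<String, Object> analyze(List<PageSketch> pages) {
        return Map.of("pages", pages.size());
    }
}

final class TotalWordsAnalyzer implements AnalyzerSketch {
    @Override public String name() { return "TotalWords"; }

    @Override
    public Map<String, Object> analyze(List<PageSketch> pages) {
        return Map.of("totalWords", pages.stream().mapToInt(PageSketch::wordCount).sum());
    }
}

// Hypothetical stand-in for AnalysisPipeline.
final class PipelineSketch {
    private final List<AnalyzerSketch> analyzers;

    PipelineSketch(List<AnalyzerSketch> analyzers) {
        this.analyzers = List.copyOf(analyzers);
    }

    // Registration point: a new analyzer is added by extending this list.
    static PipelineSketch defaultPipeline() {
        return new PipelineSketch(List.of(new PageCountAnalyzer(), new TotalWordsAnalyzer()));
    }

    // Runs every analyzer in order and stores each result under the analyzer's name.
    Map<String, Map<String, Object>> run(List<PageSketch> pages) {
        Map<String, Map<String, Object>> merged = new LinkedHashMap<>();
        for (AnalyzerSketch analyzer : analyzers) {
            merged.put(analyzer.name(), analyzer.analyze(pages));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<PageSketch> pages = List.of(
                new PageSketch("https://example.com/a", 120),
                new PageSketch("https://example.com/b", 80));
        System.out.println(defaultPipeline().run(pages));
    }
}
```

The `run` loop never refers to a concrete analyzer type, so adding a third analyzer touches only the list in `defaultPipeline()`.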
- Java 17 — records, text blocks, pattern matching, sealed types
- jsoup 1.17 — HTML parsing and link extraction
- Gson 2.10 — JSON serialisation
- JUnit 5 — unit tests
- Maven — build and dependency management
MIT