Web Scraper + Data Analyzer

A multithreaded web crawler and content analysis engine built in Java 17. Point it at a seed URL and it crawls outward breadth-first, then produces an analysis report covering word frequency, TF-IDF scoring, link-graph structure, and crawl performance, exported as JSON and CSV.

Features

Capability	Details
Concurrent crawling	`CompletableFuture` + fixed thread pool, BFS traversal
Rate limiting	`Semaphore`-based global request cap + configurable delay
robots.txt	Parsed and cached per host; opt-out flag available
Word analysis	Global frequency + per-page TF-IDF scoring via Streams API
Link graph	In-degree map, hub pages, orphan detection, external link count
Performance stats	Fetch time distribution, status code breakdown, slowest pages
Dual export	Pretty-printed JSON report + RFC 4180 CSV for pandas/Excel
Unit tested	JUnit 5 suite covering analyzers, config builder, and records
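
As an illustration of the per-page TF-IDF scoring listed above, here is a minimal sketch using the Streams API. It is not the project's WordFrequencyAnalyzer: the class name TfIdfSketch, the tokenizer, and the unsmoothed formula tf × ln(N / df) are all assumptions chosen for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the project source.
final class TfIdfSketch {

    /** Lower-cases the text and splits it on runs of non-letter characters. */
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    /** Term frequency: occurrences of each term divided by the document length. */
    static Map<String, Double> termFrequency(List<String> tokens) {
        Map<String, Long> counts = tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue() / (double) tokens.size()));
    }

    /** Inverse document frequency: ln(N / number of documents containing the term). */
    static double inverseDocumentFrequency(String term, List<List<String>> docs) {
        long containing = docs.stream().filter(doc -> doc.contains(term)).count();
        return containing == 0 ? 0.0 : Math.log((double) docs.size() / containing);
    }

    /** TF-IDF score for every term of one document, relative to the whole corpus. */
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> docs) {
        return termFrequency(doc).entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue() * inverseDocumentFrequency(e.getKey(), docs)));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                tokenize("Java crawler crawls pages"),
                tokenize("Java streams"));
        // "java" appears in every document, so its IDF (and TF-IDF) is 0.
        System.out.println(tfIdf(docs.get(0), docs));
    }
}
```

Terms that occur on every page score zero under this formula, which is why TF-IDF tends to surface page-specific vocabulary rather than site-wide boilerplate.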

Architecture

src/
└── main/java/scraper/
    ├── Main.java                   # CLI entry point
    ├── config/
    │   └── ScraperConfig.java      # Immutable config (Builder pattern)
    ├── core/
    │   ├── WebCrawler.java         # BFS orchestrator (CompletableFuture)
    │   ├── PageFetcher.java        # Rate-limited HTTP + jsoup parser
    │   └── RobotsParser.java       # robots.txt cache (ConcurrentHashMap)
    ├── model/
    │   ├── PageData.java           # Java 17 record — immutable page snapshot
    │   └── AnalysisResult.java     # Aggregated analysis record
    ├── analyzer/
    │   ├── DataAnalyzer.java       # @FunctionalInterface — Strategy pattern
    │   ├── WordFrequencyAnalyzer.java  # TF-IDF via Streams
    │   ├── LinkGraphAnalyzer.java  # Link graph metrics
    │   ├── PerformanceAnalyzer.java    # Crawl health stats
    │   └── AnalysisPipeline.java   # Composite runner
    ├── export/
    │   ├── JsonExporter.java       # Gson pretty-print
    │   └── CsvExporter.java        # RFC 4180 CSV
    └── util/
        └── AppLogger.java          # Thread-safe timestamped logger
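
To make the "BFS orchestrator (CompletableFuture)" role more concrete, here is a rough sketch of level-by-level breadth-first crawling over a fixed thread pool. It is not the actual WebCrawler: the class BfsCrawlSketch and the fetchLinks function (which stands in for the real HTTP fetch and link extraction) are hypothetical, and rate limiting and robots.txt checks are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch, not part of the project source.
final class BfsCrawlSketch {

    /** Returns every URL discovered within maxDepth link hops of the seed, capped at maxPages. */
    static Set<String> crawl(String seed, int maxDepth, int maxPages, int threads,
                             Function<String, List<String>> fetchLinks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Set<String> discovered = new LinkedHashSet<>();
            discovered.add(seed);
            List<String> frontier = List.of(seed);
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // Fetch every page of the current BFS level in parallel.
                List<CompletableFuture<List<String>>> futures = new ArrayList<>();
                for (String url : frontier) {
                    futures.add(CompletableFuture.supplyAsync(() -> fetchLinks.apply(url), pool));
                }
                // Join in submission order so the result is deterministic.
                List<String> next = new ArrayList<>();
                for (CompletableFuture<List<String>> future : futures) {
                    for (String link : future.join()) {
                        if (discovered.size() < maxPages && discovered.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
            return discovered;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // In-memory link graph used instead of real HTTP requests.
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("d", "a"),
                "d", List.of("e"));
        Set<String> found = crawl("a", 2, 10, 4, url -> graph.getOrDefault(url, List.of()));
        System.out.println(found); // prints [a, b, c, d]
    }
}
```

Only the calling thread mutates the discovered set here, so no concurrent collection is needed; a design that lets worker threads enqueue links directly would need thread-safe structures instead.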

Design Patterns Used

Builder — ScraperConfig keeps construction readable and enforces immutability
Strategy — DataAnalyzer interface lets new analyzers be plugged in with zero changes to the pipeline
Composite — AnalysisPipeline runs and merges all strategies transparently
Template Method — AnalysisPipeline.run() defines the fixed sequence; analyzers fill in the steps
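
The Strategy and Composite roles described above could look roughly like the sketch below. All names here (Page, Analyzer, Pipeline, and the two example analyzers) are hypothetical stand-ins; the real DataAnalyzer and AnalysisPipeline signatures may differ, and the default name() method is an assumption made so that the interface can stay a functional interface.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for PageData; the real record's fields are not shown here. */
record Page(String url, int wordCount) {}

/** Strategy: a single abstract method, so lambdas can implement it as well. */
@FunctionalInterface
interface Analyzer {
    Map<String, Object> analyze(List<Page> pages);

    /** Assumed default; concrete analyzers override it with a readable name. */
    default String name() { return "unnamed"; }
}

final class PageCountAnalyzer implements Analyzer {
    @Override public String name() { return "PageCount"; }

    @Override public Map<String, Object> analyze(List<Page> pages) {
        return Map.of("pages", pages.size());
    }
}

final class TotalWordsAnalyzer implements Analyzer {
    @Override public String name() { return "TotalWords"; }

    @Override public Map<String, Object> analyze(List<Page> pages) {
        return Map.of("words", pages.stream().mapToInt(Page::wordCount).sum());
    }
}

/** Composite: runs each analyzer in order and files its result under the analyzer's name. */
final class Pipeline {
    private final List<Analyzer> analyzers;

    Pipeline(List<Analyzer> analyzers) { this.analyzers = List.copyOf(analyzers); }

    Map<String, Map<String, Object>> run(List<Page> pages) {
        Map<String, Map<String, Object>> report = new LinkedHashMap<>();
        for (Analyzer analyzer : analyzers) {
            report.put(analyzer.name(), analyzer.analyze(pages));
        }
        return report;
    }
}
```

With this shape, adding an analyzer means writing one class and adding it to the list passed to the pipeline; the run loop itself never changes.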

Quick Start

Prerequisites

Java 17+
Maven 3.8+

Build

mvn package -q

This produces a runnable fat-jar at target/web-scraper-analyzer-1.0.0.jar.

Run

# Crawl example.com: max depth 2, up to 50 pages, 300 ms request delay
java -jar target/web-scraper-analyzer-1.0.0.jar https://example.com \
     --depth 2 --pages 50 --delay 300

# Restrict the crawl to ycombinator.com, use 8 threads, write results to results/hn
java -jar target/web-scraper-analyzer-1.0.0.jar https://news.ycombinator.com \
     --domain ycombinator.com --threads 8 --out results/hn

Options

--depth  <n>    Max crawl depth            (default: 3)
--pages  <n>    Max pages to scrape        (default: 100)
--delay  <ms>   Request delay per thread   (default: 500 ms)
--threads <n>   Worker thread count        (default: CPU cores)
--domain <d>    Restrict to this domain
--out   <dir>   Output directory           (default: output/)
--no-robots     Ignore robots.txt
--no-json       Skip JSON export
--no-csv        Skip CSV export
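
The options above map naturally onto an immutable, builder-constructed config. The following is a sketch only, not the project's ScraperConfig or Main: the class CrawlConfig, its field names, and the fromArgs parser are hypothetical, while the flag names and defaults are taken from the table above.

```java
import java.util.Objects;

// Hypothetical sketch, not part of the project source.
final class CrawlConfig {
    final String seedUrl;
    final int maxDepth;
    final int maxPages;
    final long delayMillis;
    final int threads;
    final String domain;        // null means "no domain restriction"
    final String outputDir;
    final boolean respectRobots;
    final boolean exportJson;
    final boolean exportCsv;

    private CrawlConfig(Builder b) {
        seedUrl = b.seedUrl; maxDepth = b.maxDepth; maxPages = b.maxPages;
        delayMillis = b.delayMillis; threads = b.threads; domain = b.domain;
        outputDir = b.outputDir; respectRobots = b.respectRobots;
        exportJson = b.exportJson; exportCsv = b.exportCsv;
    }

    static final class Builder {
        private final String seedUrl;
        private int maxDepth = 3;                 // documented defaults
        private int maxPages = 100;
        private long delayMillis = 500;
        private int threads = Runtime.getRuntime().availableProcessors();
        private String domain;
        private String outputDir = "output/";
        private boolean respectRobots = true;
        private boolean exportJson = true;
        private boolean exportCsv = true;

        Builder(String seedUrl) { this.seedUrl = Objects.requireNonNull(seedUrl, "seedUrl"); }

        Builder maxDepth(int v) { maxDepth = v; return this; }
        Builder maxPages(int v) { maxPages = v; return this; }
        Builder delayMillis(long v) { delayMillis = v; return this; }
        Builder threads(int v) { threads = v; return this; }
        Builder domain(String v) { domain = v; return this; }
        Builder outputDir(String v) { outputDir = v; return this; }
        Builder respectRobots(boolean v) { respectRobots = v; return this; }
        Builder exportJson(boolean v) { exportJson = v; return this; }
        Builder exportCsv(boolean v) { exportCsv = v; return this; }

        CrawlConfig build() {
            if (maxDepth < 0 || maxPages < 1 || delayMillis < 0 || threads < 1) {
                throw new IllegalArgumentException("depth, pages, delay or threads out of range");
            }
            return new CrawlConfig(this);
        }
    }

    /** Parses "<seed-url> [options]"; a flag with a missing value fails with an index error. */
    static CrawlConfig fromArgs(String[] args) {
        Builder b = new Builder(args[0]);
        for (int i = 1; i < args.length; i++) {
            switch (args[i]) {
                case "--depth" -> b.maxDepth(Integer.parseInt(args[++i]));
                case "--pages" -> b.maxPages(Integer.parseInt(args[++i]));
                case "--delay" -> b.delayMillis(Long.parseLong(args[++i]));
                case "--threads" -> b.threads(Integer.parseInt(args[++i]));
                case "--domain" -> b.domain(args[++i]);
                case "--out" -> b.outputDir(args[++i]);
                case "--no-robots" -> b.respectRobots(false);
                case "--no-json" -> b.exportJson(false);
                case "--no-csv" -> b.exportCsv(false);
                default -> throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return b.build();
    }
}
```

Keeping validation in build() means an invalid combination is rejected once, before any crawling starts, and every CrawlConfig instance that exists is known to be usable.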

Output

output/
├── report.json   # Full analysis — word freq, TF-IDF, link graph, perf stats
└── pages.csv     # One row per page — URL, title, status, word count, etc.
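
For readers unfamiliar with RFC 4180, the quoting rules that pages.csv relies on can be sketched as follows. This is an illustrative helper, not the project's CsvExporter; the class name CsvSketch and the example column values are assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of the project source.
final class CsvSketch {

    /**
     * Quotes a field only when it contains a comma, a double quote, or a line break,
     * and doubles any embedded double quotes, as RFC 4180 requires.
     */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Joins escaped fields with commas and terminates the record with CRLF. */
    static String row(List<String> fields) {
        return fields.stream().map(CsvSketch::escape).collect(Collectors.joining(",")) + "\r\n";
    }

    public static void main(String[] args) {
        // URL, title, status: the title contains both a comma and quotes.
        System.out.print(row(List.of("https://example.com/", "Hello, \"World\"", "200")));
    }
}
```

Page titles routinely contain commas and quotes, so escaping per field rather than per row is what keeps the file loadable in pandas or Excel.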

Tests

mvn test

Extending the Analyzer

Implement DataAnalyzer and register it in AnalysisPipeline.defaultPipeline():

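import java.util.List;
import java.util.Map;

public class SentimentAnalyzer implements DataAnalyzer {
    @Override public String name() { return "Sentiment"; }

    @Override
    public Map<String, Object> analyze(List<PageData> pages) {
        // Your logic here; this placeholder simply reports zero positive pages.
        long positivePages = 0L;
        return Map.of("positivePages", positivePages);
    }
}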

That's it — no other files change.

Tech Stack

Java 17 — records, text blocks, pattern matching, sealed types
jsoup 1.17 — HTML parsing and link extraction
Gson 2.10 — JSON serialisation
JUnit 5 — unit tests
Maven — build and dependency management

License

MIT
