ManishSharma2026/web-scraper-analyzer

Web Scraper + Data Analyzer

A multithreaded web crawler and content analysis engine built in Java 17. Point it at any seed URL and it fans out across the site, then produces a rich analysis report covering word frequency, TF-IDF scoring, link-graph structure, and crawl performance — exported as JSON and CSV.


Features

Capability           Details
Concurrent crawling  CompletableFuture + fixed thread pool, BFS traversal
Rate limiting        Semaphore-based global request cap + configurable delay
robots.txt           Parsed and cached per host; opt-out flag available
Word analysis        Global frequency + per-page TF-IDF scoring via Streams API
Link graph           In-degree map, hub pages, orphan detection, external link count
Performance stats    Fetch time distribution, status code breakdown, slowest pages
Dual export          Pretty-printed JSON report + RFC 4180 CSV for pandas/Excel
Unit tested          JUnit 5 suite covering analyzers, config builder, and records
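
To make the "per-page TF-IDF scoring via Streams API" row concrete, here is a minimal sketch. It assumes tf = term count / page length and idf = ln(N / df); the weighting and tokenisation the project actually uses are not stated here, and TfIdfSketch is a hypothetical name, not the project's WordFrequencyAnalyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the project's WordFrequencyAnalyzer.
final class TfIdfSketch {

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Document frequency: the number of pages in which each term occurs at least once.
    static Map<String, Long> documentFrequency(List<String> pages) {
        return pages.stream()
                .flatMap(p -> tokenize(p).stream().distinct())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Assumed weighting: tf = count / pageLength, idf = ln(N / df).
    static Map<String, Double> score(String page, List<String> allPages) {
        List<String> tokens = tokenize(page);
        Map<String, Long> df = documentFrequency(allPages);
        double n = allPages.size();
        return tokens.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> ((double) e.getValue() / tokens.size())
                                * Math.log(n / df.getOrDefault(e.getKey(), 1L))));
    }
}
```

With this weighting, a term that appears on every page scores 0, so site-wide boilerplate words drop out of the per-page ranking.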

Architecture

src/
└── main/java/scraper/
    ├── Main.java                   # CLI entry point
    ├── config/
    │   └── ScraperConfig.java      # Immutable config (Builder pattern)
    ├── core/
    │   ├── WebCrawler.java         # BFS orchestrator (CompletableFuture)
    │   ├── PageFetcher.java        # Rate-limited HTTP + jsoup parser
    │   └── RobotsParser.java       # robots.txt cache (ConcurrentHashMap)
    ├── model/
    │   ├── PageData.java           # Java 17 record — immutable page snapshot
    │   └── AnalysisResult.java     # Aggregated analysis record
    ├── analyzer/
    │   ├── DataAnalyzer.java       # @FunctionalInterface — Strategy pattern
    │   ├── WordFrequencyAnalyzer.java  # TF-IDF via Streams
    │   ├── LinkGraphAnalyzer.java  # Link graph metrics
    │   ├── PerformanceAnalyzer.java    # Crawl health stats
    │   └── AnalysisPipeline.java   # Composite runner
    ├── export/
    │   ├── JsonExporter.java       # Gson pretty-print
    │   └── CsvExporter.java        # RFC 4180 CSV
    └── util/
        └── AppLogger.java          # Thread-safe timestamped logger
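
The tree above describes WebCrawler as a BFS orchestrator built on CompletableFuture. The sketch below shows one way such a traversal could look: each depth level is fetched concurrently on a fixed thread pool, and the next level is built from the links found. BfsCrawlSketch and the linkFetcher parameter are hypothetical stand-ins; robots.txt handling, rate limiting, and the project's exact depth semantics are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch for illustration; not the project's WebCrawler.
final class BfsCrawlSketch {

    // linkFetcher stands in for "HTTP fetch + parse": URL in, outgoing links out.
    // Returns the fetched URLs in BFS order; the seed counts as depth 0.
    static List<String> crawl(String seed, int maxDepth, int maxPages, int threads,
                              Function<String, List<String>> linkFetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> seen = new HashSet<>(Set.of(seed));
        List<String> fetched = new ArrayList<>();
        List<String> frontier = List.of(seed);
        try {
            for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
                // Trim the level so the total never exceeds maxPages.
                int room = Math.max(0, maxPages - fetched.size());
                List<String> level = frontier.subList(0, Math.min(room, frontier.size()));

                // Fetch every page of the current level concurrently.
                List<CompletableFuture<List<String>>> futures = new ArrayList<>();
                for (String url : level) {
                    futures.add(CompletableFuture.supplyAsync(() -> linkFetcher.apply(url), pool));
                }

                // Join in submission order so the result order is deterministic.
                List<String> next = new ArrayList<>();
                for (int i = 0; i < level.size(); i++) {
                    List<String> links = futures.get(i).join();
                    fetched.add(level.get(i));
                    for (String link : links) {
                        if (seen.add(link)) {   // add() is false for already-seen URLs
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return fetched;
    }
}
```

Processing one level at a time keeps the depth bookkeeping trivial, at the cost of waiting for the slowest page in each level before moving on.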

Design Patterns Used

  • Builder: ScraperConfig keeps construction readable and enforces immutability
  • Strategy: the DataAnalyzer interface lets new analyzers be plugged in with zero changes to the pipeline
  • Composite: AnalysisPipeline runs and merges all strategies transparently
  • Template Method: AnalysisPipeline.run() defines the fixed sequence; analyzers fill in the steps
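
As an illustration of the Builder item, here is a minimal sketch of an immutable config object. CrawlConfigSketch and its field names are hypothetical, not the project's ScraperConfig; the default values are taken from the CLI defaults listed under Options.

```java
import java.util.Objects;

// Hypothetical sketch for illustration; not the project's ScraperConfig.
final class CrawlConfigSketch {
    private final String seedUrl;
    private final int maxDepth;
    private final int maxPages;
    private final long delayMs;

    private CrawlConfigSketch(Builder b) {
        this.seedUrl = b.seedUrl;
        this.maxDepth = b.maxDepth;
        this.maxPages = b.maxPages;
        this.delayMs = b.delayMs;
    }

    static Builder builder(String seedUrl) { return new Builder(seedUrl); }

    // Getters only: once built, an instance cannot change.
    String seedUrl() { return seedUrl; }
    int maxDepth() { return maxDepth; }
    int maxPages() { return maxPages; }
    long delayMs() { return delayMs; }

    static final class Builder {
        private final String seedUrl;
        // Defaults mirror the documented CLI defaults (depth 3, pages 100, delay 500 ms).
        private int maxDepth = 3;
        private int maxPages = 100;
        private long delayMs = 500;

        private Builder(String seedUrl) {
            this.seedUrl = Objects.requireNonNull(seedUrl, "seedUrl");
        }

        Builder maxDepth(int v) { this.maxDepth = v; return this; }
        Builder maxPages(int v) { this.maxPages = v; return this; }
        Builder delayMs(long v) { this.delayMs = v; return this; }

        // Validate once, at build time, so every instance is known to be consistent.
        CrawlConfigSketch build() {
            if (maxDepth < 0 || maxPages < 1 || delayMs < 0) {
                throw new IllegalArgumentException("need depth >= 0, pages >= 1, delay >= 0");
            }
            return new CrawlConfigSketch(this);
        }
    }
}
```

A call such as CrawlConfigSketch.builder("https://example.com").maxDepth(2).build() then reads close to the command line it represents.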

Quick Start

Prerequisites

  • Java 17+
  • Maven 3.8+

Build

mvn package -q

This produces a runnable fat-jar at target/web-scraper-analyzer-1.0.0.jar.

Run

# Crawl example.com: max depth 2, at most 50 pages, 300 ms delay
java -jar target/web-scraper-analyzer-1.0.0.jar https://example.com \
     --depth 2 --pages 50 --delay 300

# Restrict to one domain, use 8 threads, write results to results/hn
java -jar target/web-scraper-analyzer-1.0.0.jar https://news.ycombinator.com \
     --domain ycombinator.com --threads 8 --out results/hn

Options

--depth  <n>    Max crawl depth            (default: 3)
--pages  <n>    Max pages to scrape        (default: 100)
--delay  <ms>   Request delay per thread   (default: 500 ms)
--threads <n>   Worker thread count        (default: CPU cores)
--domain <d>    Restrict to this domain
--out   <dir>   Output directory           (default: output/)
--no-robots     Ignore robots.txt
--no-json       Skip JSON export
--no-csv        Skip CSV export
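
The --delay and --threads options interact with the semaphore-based request cap listed under Features. The sketch below shows one plausible arrangement, assuming a global permit count plus a pause on the calling thread after each request; RateLimiterSketch is a hypothetical name, and the project's PageFetcher may combine the two differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical sketch for illustration; not the project's PageFetcher.
final class RateLimiterSketch {
    private final Semaphore permits;   // global cap on requests in flight
    private final long delayMs;        // pause applied on the calling thread

    RateLimiterSketch(int maxInFlight, long delayMs) {
        this.permits = new Semaphore(maxInFlight);
        this.delayMs = delayMs;
    }

    // Runs the request while holding a permit, then sleeps before returning.
    // The permit is released in finally, so a failed request cannot leak it.
    <T> T call(Callable<T> request) throws Exception {
        T result;
        permits.acquire();
        try {
            result = request.call();
        } finally {
            permits.release();
        }
        Thread.sleep(delayMs);
        return result;
    }
}
```

Sleeping after the permit is released means the delay throttles each worker without holding back the others.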

Output

output/
├── report.json   # Full analysis — word freq, TF-IDF, link graph, perf stats
└── pages.csv     # One row per page — URL, title, status, word count, etc.
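
For readers who want to produce compatible rows themselves, the quoting rules of RFC 4180 can be sketched as follows. CsvSketch is a hypothetical helper, not the project's CsvExporter, and the columns in any call are up to the caller.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the project's CsvExporter.
final class CsvSketch {

    // Quote a field if it contains a comma, a double quote, CR or LF,
    // and double every embedded double quote.
    static String escape(String field) {
        if (field == null) return "";
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\r") || field.contains("\n");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    // Join the escaped fields with commas; RFC 4180 ends each record with CRLF.
    static String row(List<String> fields) {
        return fields.stream().map(CsvSketch::escape).collect(Collectors.joining(",")) + "\r\n";
    }
}
```

Page titles are the usual reason this matters: a title containing a comma or a quote would otherwise shift every following column.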

Tests

mvn test

Extending the Analyzer

Implement DataAnalyzer and register it in AnalysisPipeline.defaultPipeline():

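// Assumes this class lives in scraper.analyzer, next to DataAnalyzer
import java.util.List;
import java.util.Map;

import scraper.model.PageData;  // package assumed from the source tree

public class SentimentAnalyzer implements DataAnalyzer {
    @Override public String name() { return "Sentiment"; }

    @Override
    public Map<String, Object> analyze(List<PageData> pages) {
        // your logic here; 0 is only a placeholder so the skeleton compiles
        return Map.of("positivePages", 0);
    }
}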

That's it. Apart from the registration in AnalysisPipeline, no other files change.
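
To show why adding an analyzer can stay this small, here is a minimal sketch of a composite pipeline that runs each registered strategy and files its result under the analyzer's name. AnalyzerSketch, PipelineSketch and PageCountAnalyzer are hypothetical stand-ins (pages are plain strings to keep the example self-contained); the real AnalysisPipeline may merge results differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the DataAnalyzer strategy interface.
interface AnalyzerSketch {
    String name();
    Map<String, Object> analyze(List<String> pages);
}

// Hypothetical composite: runs every registered analyzer in registration order.
final class PipelineSketch {
    private final List<AnalyzerSketch> analyzers = new ArrayList<>();

    PipelineSketch register(AnalyzerSketch analyzer) {
        analyzers.add(analyzer);
        return this;
    }

    // Each analyzer's result is stored under its own name in the report.
    Map<String, Map<String, Object>> run(List<String> pages) {
        Map<String, Map<String, Object>> report = new LinkedHashMap<>();
        for (AnalyzerSketch analyzer : analyzers) {
            report.put(analyzer.name(), analyzer.analyze(pages));
        }
        return report;
    }
}

// Example strategy: adding it needs one register() call and no change to run().
final class PageCountAnalyzer implements AnalyzerSketch {
    @Override public String name() { return "PageCount"; }

    @Override
    public Map<String, Object> analyze(List<String> pages) {
        return Map.of("pages", pages.size());
    }
}
```

Because run() only depends on the interface, the pipeline never needs to know which concrete analyzers exist.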

Tech Stack

  • Java 17 — records, text blocks, pattern matching, sealed types
  • jsoup 1.17 — HTML parsing and link extraction
  • Gson 2.10 — JSON serialisation
  • JUnit 5 — unit tests
  • Maven — build and dependency management

License

MIT
