A comprehensive Node.js application for scraping DistroWatch data to track Linux distributions, their metadata, and visual assets.
- 🚀 Modern modular architecture with single-responsibility modules
- 📊 Complete DistroWatch data extraction (metadata, descriptions, ratings, homepage)
- 🖼️ Comprehensive image downloading: Logos, thumbnails, and high-resolution screenshots
- 📁 Organized output structure with separate folders for different content types
- ⚙️ Smart processing modes: Update existing data or full replacement
- ⚡ Intelligent skipping: Avoid re-scraping existing distributions (with force option)
- 🔧 Environment configuration with `.env` support
- 📈 Popularity and rating tracking from DistroWatch rankings
- 🚫 Rate limiting and graceful error handling
- 🛡️ Robust parsing that adapts to DistroWatch's HTML structure
- 💾 JSON output with both URLs and local file paths
- 📝 Command line interface with helpful options
- Node.js >= 18.0.0
- npm (comes with Node.js)
- Clone or download this project
- Navigate to the project directory: `cd distrowatch-scraper`
- Install dependencies: `npm install`
- Copy the environment template: `cp .env.example .env`
- Edit `.env` to configure your scraping settings:
  - `INDEX_URL`: DistroWatch popularity page URL
  - `DISTRO_URL`: Base URL for individual distribution pages
  - `OUTPUT_DIR`: Directory to save scraped data and images
  - `OUTPUT_FILE`: JSON filename for scraped data
  - `REQUEST_DELAY`: Delay between requests (milliseconds)
  - `MAX_RETRIES`: Maximum retry attempts for failed requests
  - `TIMEOUT`: Request timeout in milliseconds
  - `USER_AGENT`: HTTP User-Agent string for requests
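For orientation, a filled-in `.env` might look like the sketch below. The values (including both URLs) are illustrative guesses, not the project's shipped defaults — check `.env.example` for the real ones:

```bash
# Illustrative values only — see .env.example for the actual defaults
INDEX_URL=https://distrowatch.com/popularity
DISTRO_URL=https://distrowatch.com/table.php?distribution=
OUTPUT_DIR=./data
OUTPUT_FILE=distros.json
REQUEST_DELAY=1000
MAX_RETRIES=3
TIMEOUT=10000
USER_AGENT=Mozilla/5.0 (compatible; distrowatch-scraper)
```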
The scraper creates a comprehensive directory structure:
data/
├── distros.json # Complete distribution data
└── images/
├── logos/ # Distribution logos
│ ├── ubuntu.png
│ ├── fedora.png
│ └── ...
├── thumbnails/ # Small screenshot thumbnails
│ ├── ubuntu.png
│ ├── fedora.png
│ └── ...
└── screenshots/ # High-resolution screenshots
├── ubuntu.png
├── fedora.png
└── ...
Each distribution entry includes:
{
"slug": "ubuntu",
"name": "Ubuntu",
"lastUpdate": "2026-01-10 21:20",
"description": "Ubuntu is a complete desktop Linux...",
"homepage": "https://ubuntu.com/",
"osType": "Linux",
"basedOn": "Debian",
"origin": "Isle of Man",
"architecture": "armhf, ppc64el, riscv, s390x, x86_64",
"desktop": "GNOME, Unity",
"category": "Beginners, Desktop, Server, Live Medium",
"status": "Active",
"defaultDesktop": "GNOME",
"installation": "Calamares",
"defaultBrowser": "Firefox",
"popularity": 10,
"rating": 7.7,
"reviewCount": 370,
"logo": "https://distrowatch.com/images/...",
"thumbnail": "https://distrowatch.com/images/...",
"screenshot": "https://distrowatch.com/images/...",
"localPaths": {
"logo": "./data/images/logos/ubuntu.png",
"thumbnail": "./data/images/thumbnails/ubuntu.png",
"screenshot": "./data/images/screenshots/ubuntu_large.png"
}
}

The scraper extracts data from multiple sections of each DistroWatch distribution page:
- Basic metadata: From the structured info tables (OS Type, Based on, Origin, Architecture, etc.)
- Feature data: From the version comparison tables (Default Desktop, Installation, Default Browser)
- Popularity & ratings: From the text content and rating displays
- Images: Logo, thumbnail, and screenshot images are identified and downloaded
Note: defaultDesktop, installation, and defaultBrowser fields are extracted from the latest version in the feature comparison table. These fields may be null if not available for a distribution.
The scraper supports several command line options for flexible operation:
`node index.js [options]`

- `--update` or `-u`: Update mode - merge new data with the existing `distros.json` instead of replacing it
- `--force` or `-f`: Force refresh - scrape all distributions, ignoring existing data
- `--help` or `-h`: Show help - display usage information and exit
# Default: Replace all data, skip existing distributions
node index.js
# Update existing data, only scrape new distributions
node index.js --update
# Force refresh all distributions (ignore existing data)
node index.js --force
# Update mode + force refresh all distributions
node index.js --update --force
# Show help information
node index.js --help

- Default Mode: Replaces all data but skips distributions already in the JSON file
- Update Mode (`--update`): Merges new data with the existing file, preserving untouched distributions
- Force Mode (`--force`): Scrapes all distributions regardless of existing data
- Combined (`--update --force`): Updates the existing file and refreshes all distribution data
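The flag handling described above could be sketched roughly like this (a hypothetical helper; the real `src/cli.js` may parse arguments differently):

```javascript
// Hypothetical sketch of CLI flag parsing; the real src/cli.js may differ.
function parseArgs(argv) {
  const opts = { update: false, force: false, help: false };
  for (const arg of argv) {
    if (arg === '--update' || arg === '-u') opts.update = true;
    else if (arg === '--force' || arg === '-f') opts.force = true;
    else if (arg === '--help' || arg === '-h') opts.help = true;
  }
  return opts;
}

// Typical call site: parseArgs(process.argv.slice(2))
```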
- `npm start`: Run the scraper
- `npm run dev`: Run in development mode (auto-restart via nodemon)

- Fetch Distribution List: Scrapes DistroWatch popularity rankings to get all active distributions
- Individual Scraping: For each distribution, extracts:
- Basic metadata (name, description, last update)
- Technical details (architecture, desktop, base distribution)
- Popularity metrics and user ratings
- Visual assets (logo, screenshots)
- Image Download: Downloads and saves all images locally with organized naming
- Data Export: Saves complete dataset as JSON with both original URLs and local paths
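The loop described above, including the skip/force logic, can be sketched as follows. This is a simplified illustration, not the actual implementation: `processAll` and its parameters are hypothetical, and the scraping function is injected so the sketch stays testable offline.

```javascript
// Hedged sketch of the per-distribution loop. scrapeFn stands in for the
// real scraping + image-download steps and is injected for testability.
async function processAll(slugs, existing, { force = false, delayMs = 0 } = {}, scrapeFn) {
  const results = { ...existing };
  for (const slug of slugs) {
    if (!force && results[slug]) continue; // incremental: skip known distros
    results[slug] = await scrapeFn(slug); // scrape data + download images
    if (delayMs) await new Promise((r) => setTimeout(r, delayMs)); // rate limit
  }
  return results;
}
```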
distrowatch-scraper/
├── .github/
│ └── copilot-instructions.md # GitHub Copilot configuration
├── data/ # Output directory (created automatically)
│ ├── distros.json # Main output file
│ └── images/ # Downloaded images organized by type
│ ├── logos/
│ ├── thumbnails/
│ └── screenshots/
├── src/ # Source code directory
│ ├── config.js # Configuration management
│ ├── cli.js # Command line interface
│ ├── scraper.js # Web scraping logic
│ ├── imageDownloader.js # Image downloading functions
│ └── fileOperations.js # File read/write operations
├── index.js # Main application entry point
├── package.json # Node.js project configuration
├── .env.example # Environment variables template
├── .env # Your local configuration (create from template)
├── .gitignore # Git ignore rules
└── README.md # This documentation
- `config.js`: Configuration management and environment variables
- `cli.js`: Command line interface with argument parsing and help
- `scraper.js`: Web scraping functionality
  - `fetchAllDistroSlugs()`: Get all active distributions from the popularity page
  - `scrapeDistro(slug)`: Extract complete data for a specific distribution
- `imageDownloader.js`: Image download management
  - `downloadDistributionImages()`: Handle all image types for a distribution
  - `downloadImage()`: Download individual images with error handling
- `fileOperations.js`: Data persistence and file management
  - `loadExistingData()`: Load existing distribution data
  - `saveDataToFile()`: Save data with proper formatting
  - `mergeDistributionData()`: Smart merging of existing and new data
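A plausible shape for the merge step in update mode — newly scraped entries replace stale ones by slug while untouched distributions survive. This is an assumption about the data layout (an array of entries keyed by `slug`); the real `mergeDistributionData()` may differ:

```javascript
// Hypothetical sketch of update-mode merging: fresh entries win by slug,
// existing entries not re-scraped are preserved.
function mergeDistributionData(existing, fresh) {
  const merged = new Map(existing.map((d) => [d.slug, d]));
  for (const d of fresh) merged.set(d.slug, d); // fresh data overwrites by slug
  return [...merged.values()];
}
```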
- Parse CLI arguments and initialize configuration
- Load existing data (if not force refresh) to determine what to skip
- Fetch distribution list from DistroWatch popularity rankings
- Process each distribution:
- Skip if exists (unless force refresh)
- Scrape distribution data
- Download all associated images
- Add local file paths to data
- Save results using update or replace mode
- Provide summary of operations performed
This project follows modern Node.js best practices with a clean, modular architecture:
- ✅ Modular design with single-responsibility modules
- ✅ Separation of concerns (config, CLI, scraping, images, file ops)
- ✅ Modern JavaScript ES6+ features throughout
- ✅ Clean architecture with dependency injection
- ✅ Comprehensive error handling and graceful degradation
- ✅ Environment-based configuration with `.env` support
- ✅ Graceful process management and signal handling
- Follow modular structure: Add new functionality to the appropriate module in `src/`
- Single responsibility: Each function should have one clear purpose
- Update configuration: Add new settings to `src/config.js` if needed
- Test thoroughly: Use `npm run dev` for development testing
- Update documentation: Reflect changes in this README
- `src/config.js`: Add new environment variables and computed paths
- `src/cli.js`: Add new command line options and help text
- `src/scraper.js`: Add new parsing functions or data extraction
- `src/imageDownloader.js`: Add new image types or download strategies
- `src/fileOperations.js`: Add new file formats or data operations
- axios: HTTP client for making web requests
- cheerio: Server-side jQuery-like HTML parsing
- dotenv: Loads environment variables from a `.env` file
- nodemon: Restarts the application automatically on file changes (development only)
This scraper is designed to be respectful to DistroWatch:
- Configurable delays between requests (default: 1 second)
- Proper User-Agent identification
- Error handling and retry logic
- No concurrent requests to avoid server overload
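The retry-with-delay behavior described above can be sketched as a small wrapper. This is an illustrative helper, not the project's actual code; the request function is injected so the sketch is testable without network access:

```javascript
// Hedged sketch of retry logic with a fixed delay between attempts.
// fetchFn stands in for the real HTTP call (e.g. an axios request).
async function withRetries(fetchFn, maxRetries = 3, delayMs = 1000) {
  let lastErr;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fetchFn();
    } catch (err) {
      lastErr = err;
      if (attempt < maxRetries) await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw lastErr; // all attempts failed: surface the last error
}
```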
- "No new data to save": All distributions were skipped (use
--forceto refresh) - Images not downloading: Check internet connection and DistroWatch accessibility
- Empty results: Verify DistroWatch hasn't changed their HTML structure
- Permission errors: Ensure write access to the output directory
- Rate limiting: Increase
REQUEST_DELAYif getting blocked - Module not found errors: Ensure all files in
src/directory exist
- Regular updates: Use `--update` for daily or weekly runs to scrape only new distributions
- Incremental scraping: The default behavior skips existing distributions automatically
- Force refresh: Only use `--force` when you need to update all existing data
- Rate limiting: Adjust `REQUEST_DELAY` based on your internet connection and DistroWatch's response times
ISC License - see package.json for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
For issues or questions:
- Check the existing issues
- Create a new issue with detailed information
- Include Node.js version and OS information