A tool for processing and filtering Common Crawl data using Slurm-based parallel processing.
Derived from original work here: CC-Filtering
This repository contains scripts for efficiently downloading and processing Common Crawl data using a Slurm computing cluster. The workflow uses SLURM array jobs for parallel processing within each crawl date, submitted sequentially across dates via a Python runner. This allows scalable handling of large datasets while controlling concurrency to avoid overwhelming resources.
- `setup-conda-env.sh` - Sets up the conda environment (`.conda_env`) for processing, with the required Python packages.
- `slurm-sequential-runner.py` - Python script that reads crawl dates from `crawl_data.txt` and sequentially submits/manages a SLURM array job for each date (waiting for completion before the next).
- `job-template.sh` - Template for individual SLURM array jobs; configures the environment and runs the processor.
- `common-crawl-processor.py` - Core Python script that downloads and processes Common Crawl WET files for a given date and task ID.
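To make the division of labor concrete, here is a minimal sketch of how the runner turns one `crawl_data.txt` entry into a throttled array submission. This is a hypothetical simplification: the function name, defaults, and flag assembly here are illustrative, not the actual API of `slurm-sequential-runner.py`.

```python
# Illustrative sketch only; the real slurm-sequential-runner.py may differ.
import math

def build_sbatch_command(date, n_files, segments_per_task=25, throttle=50,
                         template="job-template.sh", prefix="crawl_job"):
    """Build the sbatch command that submits one throttled array job for a date."""
    n_tasks = math.ceil(n_files / segments_per_task)   # files split across tasks
    array_spec = f"0-{n_tasks - 1}%{throttle}"         # SLURM %N concurrency throttle
    return (f"sbatch --array={array_spec} "
            f"--job-name={prefix}_{date} {template}")

# Example: the 202104 crawl with 79840 WET files, 25 per task, 50 concurrent
cmd = build_sbatch_command("202104", 79840)
```

The `%50` suffix is what keeps a multi-thousand-task array from flooding the cluster: SLURM schedules at most 50 of its tasks at once.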
These launcher scripts configure and submit the sequential runner at different parallelism levels. Run with `sbatch <script.sh>`.
- `run-fast-parallel.sh` - High-parallelism mode: 50 concurrent array tasks, 25 files per task, 7-day timeout.
- `run-ultra-parallel.sh` - Maximum throughput: 100 concurrent array tasks, 10 files per task, 7-day timeout.
- `run-disk-safe-parallel.sh` - Conservative mode: 20 concurrent tasks, 50 files per task, focuses on I/O safety.
- `run-optimized-job.sh` - Balanced default: 30 concurrent tasks, 25 files per task.
- `run-test-config.sh` - Dry-run mode: generates scripts without submitting jobs, for validation.
- `monitor-job.sh` - Real-time monitoring of running jobs: progress, resource usage, estimated completion time.
- `job-analyser.sh` - Post-completion analysis: success rates, runtimes, failures, and performance metrics from SLURM logs.
- `BristolPostcodeLookup.parquet` - Lookup table for Bristol postcodes used in data filtering.
- `crawl_data.txt` - List of crawl dates and file counts (format: `date num_files`, e.g., `202104 79840`).
- `wet.paths` - Paths to Common Crawl WET files.
- `scripts.txt` - Quick reference to script purposes.
- `slurm-config-guide.txt` - SLURM configuration tips and best practices.
- `script-comparison.md` - Comparison of run script configurations.
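Entries in `crawl_data.txt` can be read with a few lines of Python; a minimal sketch, assuming only the whitespace-separated `date num_files` format described above (the helper name is illustrative, not part of the repo):

```python
def parse_crawl_data(text):
    """Parse 'date num_files' lines, skipping blanks and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        date, n_files = line.split()
        entries.append((date, int(n_files)))
    return entries

# Example using the entry quoted above
entries = parse_crawl_data("# date  num_files\n202104 79840\n")
print(entries)  # → [('202104', 79840)]
```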
- Setup environment:

  ```bash
  bash setup-conda-env.sh   # Creates .conda_env
  source runme.sh           # Sets SLURM account (create if needed: export SLURM_ACCOUNT=your_account)
  ```

- Run pipeline (submit via SLURM):

  ```bash
  # Recommended: Fast parallel mode
  sbatch run-fast-parallel.sh
  # Or ultra-parallel for max speed (higher resource use)
  sbatch run-ultra-parallel.sh
  # Disk-safe mode (lower concurrency)
  sbatch run-disk-safe-parallel.sh
  # Dry-run test (no submission)
  sbatch run-test-config.sh
  ```
Each launcher runs `slurm-sequential-runner.py` with tailored arguments, e.g.:

```bash
./.conda_env/bin/python slurm-sequential-runner.py \
    --template-file job-template.sh \
    --crawl-dates-file crawl_data.txt \
    --partition compute \
    --time 168 \
    --mem 2G \
    --cpus 2 \
    --segments-per-task 25 \
    --throttle 50 \
    --job-prefix crawl_job_fast
# --time 168 = 7 days; --segments-per-task = files per array task;
# --throttle = max concurrent tasks
```
While jobs run (parent launcher + child arrays):

```bash
# Monitor running jobs (parent and children)
./monitor-job.sh

# SLURM commands for details
squeue -u $USER                   # All jobs
squeue -u $USER | grep crawl_job  # Child jobs
sinfo -p compute                  # Partition status
```

After the sequential runner finishes all dates:
```bash
# Analyse completed jobs (success, runtime, failures)
./job-analyser.sh

# Check logs/output
sacct -j <JOB_ID> --format=JobID,State,ExitCode,MaxRSS,Elapsed  # Parent job
ls *_%j.out *_%j.err  # Child job logs
```

Analysis features (via `job-analyser.sh`):
- 📊 Success/failure rates and exit codes.
- ⏱️ Runtime stats (min/max/avg per task/date).
- 💾 Resource utilization (memory, CPU).
- 📈 Performance insights and tuning recommendations.
- Launcher script (e.g., `run-fast-parallel.sh`) runs `slurm-sequential-runner.py`.
- The Python runner loads `crawl_data.txt` and, for each date:
  - Calculates the array size: `n_files / segments_per_task`.
  - Submits a SLURM array job (e.g., `0-3199%50` for ~80k files, 25 per task, 50 concurrent).
  - Uses `job-template.sh` to generate the job script, which activates `.conda_env` and calls `common-crawl-processor.py --task-id $SLURM_ARRAY_TASK_ID`.
  - Waits (polls `squeue`) for the array to complete before starting the next date.
- Each array task processes its file segments: downloads WET files (from `wet.paths`), filters them (using `BristolPostcodeLookup.parquet`), and outputs Parquet/CSV.
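The completion wait amounts to polling `squeue` until the array's job ID no longer appears in the queue. Below is a sketch of that check, operating on captured `squeue` text rather than calling SLURM; the parsing and column layout are assumptions for illustration, not the runner's actual code.

```python
def array_still_queued(squeue_output, job_id):
    """True if any queued row's JOBID column belongs to the given array job.

    Array rows typically look like '12345_7' (a running task) or
    '12345_[8-3199%50]' (pending tasks under a %50 throttle)."""
    for line in squeue_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0].split("_")[0] == str(job_id):
            return True
    return False

# Hypothetical squeue snapshot while job 12345's array is draining
sample = ("JOBID PARTITION NAME USER ST TIME NODES\n"
          "12345_[8-3199%50] compute crawl_job_fast me PD 0:00 1\n"
          "12345_7 compute crawl_job_fast me R 5:12 1\n")
```

In practice the runner would sleep between polls and resubmit the next date's array once this check turns false.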
Processed files are saved to `./output/<date>/` (configurable in `common-crawl-processor.py`).
- SLURM cluster access.
- Bash, Python 3.x (via `.conda_env`: pandas, pyarrow, requests, etc., installed by `setup-conda-env.sh`).
- Git LFS for large files (setup below).

Note: Uses a stable conda env in `.conda_env` (no Micromamba, due to HPC compatibility issues).
Edit these for customization:

- `crawl_data.txt`: Add/remove dates and file counts.
- `job-template.sh`: Tweak SLURM params or env setup.
- `common-crawl-processor.py`: Adjust filtering logic or output paths.
- `run-*.sh`: Modify Python args for your needs (e.g., `--throttle 30` for medium clusters).
- Time: `--time 168` (7 days) for large dates; check the partition maximum with `sinfo`.
- Parallelism: A lower `--segments-per-task` means more tasks (higher parallelism); use `--throttle` to limit concurrency.
- Account: Set in `runme.sh` and source it before `sbatch`.
- Configurable Parallelism: `--throttle` controls concurrent array tasks (default 50; up to 100+ on large clusters).
- Sequential Safety: Processes dates one by one to avoid overload, while running in parallel within each date via arrays.
- Extended Timeouts: 7-day limits prevent failures on big crawls.
- Resource Balance: 2G memory and 2 CPUs per task; low overhead for the launcher.
- Dry-Run: Test with `run-test-config.sh`, which generates scripts in `./generated_scripts/`.
- Improvements: 5-10x faster than the original sequential approach (via arrays + throttling); optimized for compute partitions.
Performance Comparison:
- Sequential (old): 10 tasks max, 24h timeout → frequent failures.
- Array (now): 50-100 concurrent, 7-day timeout → full dataset completion.
- Account Setup: Use `runme.sh` for portability.
- Throttle Tuning: 10-20 for small clusters; 50+ for large ones.
- Timeouts: 24h for tests; 168h for production.
- Balance Workload: `--segments-per-task 10-50` (trades parallelism vs. overhead).
- Test First: Always dry-run.
- Parquet Output: Half the size of CSV; better for large data.
- SLURM Arrays: Use `%throttle` (e.g., `--array=0-899%50`) for concurrency control.
- Monitor Resources: `sacct` for usage; adjust memory/CPUs if needed.
- Git Hygiene: `.gitignore` excludes outputs/logs; commit only templates/scripts.
(See original README for detailed equivalence to old manual chunking—unchanged, but now fully integrated via slurm-sequential-runner.py.)
- Parent Job Times Out: The launcher (e.g., `run-fast-parallel.sh`) needs `#SBATCH --time=168:00:00` to wait out the full sequential run. Add it to the scripts if missing.
- Child Arrays Stop Early: Caused by the parent being killed; ensure the parent's time limit exceeds the total estimated runtime (~1-2 days per date × number of dates).
- PENDING "PartitionConfig": Source `runme.sh` to set the account.
- Low Concurrency: Increase `--throttle`; check `sinfo` for available resources.
- Exit Code 120: Application errors (e.g., download failures); investigate with `./job-analyser.sh`.
- Array Not Starting: Normal queuing; SLURM throttles via `%`.
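For the two timeout problems above, a rough sanity check can derive the parent's `--time` from the number of dates. This is a sketch: the 36 h default encodes the 1-2 days-per-date estimate, and both defaults are illustrative assumptions, not measured values.

```python
import math

def parent_time_limit(num_dates, hours_per_date=36, margin=1.25):
    """Estimate the launcher's #SBATCH --time as HOURS:00:00.

    hours_per_date=36 reflects the rough 1-2 days-per-date estimate;
    margin adds headroom for queue waits between array submissions."""
    hours = math.ceil(num_dates * hours_per_date * margin)
    return f"{hours}:00:00"

# e.g. 4 crawl dates at ~36 h each with a 25% margin
print(parent_time_limit(4))  # → 180:00:00
```

If the estimate exceeds the partition's maximum (check `sinfo`), split `crawl_data.txt` across multiple launcher runs instead.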
```bash
squeue -u $USER                                    # Running jobs
sacct -j <ID> --format=JobID,State,Elapsed,MaxRSS  # Completed jobs
./monitor-job.sh                                   # Custom progress
./job-analyser.sh                                  # Analysis
scontrol show job <ID>                             # Details
```

- ✅ Success rate (e.g., 95% of tasks complete).
- ⏱️ Avg runtime per date.
- 🚨 Failures by code/pattern.
- 🎯 Recommendations (e.g., "Increase throttle to 75").
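A success-rate figure like the one above can be computed directly from `sacct` text; a minimal sketch, assuming the whitespace-separated `JobID State ...` layout produced by the `--format` flags shown earlier (this is not `job-analyser.sh`'s actual implementation):

```python
def success_rate(sacct_text):
    """Fraction of top-level array tasks in COMPLETED state.

    Assumes columns start with JobID and State; skips the header, the
    dashed separator line, and the .batch/.extern sub-steps sacct prints."""
    states = []
    for line in sacct_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and "." not in fields[0] and set(fields[0]) != {"-"}:
            states.append(fields[1])
    done = sum(s == "COMPLETED" for s in states)
    return done / len(states) if states else 0.0

# Hypothetical sacct excerpt: 3 of 4 array tasks completed
sample = ("JobID State ExitCode\n"
          "------------ ---------- --------\n"
          "12345_0 COMPLETED 0:0\n"
          "12345_0.batch COMPLETED 0:0\n"
          "12345_1 COMPLETED 0:0\n"
          "12345_2 FAILED 120:0\n"
          "12345_3 COMPLETED 0:0\n")
print(success_rate(sample))  # → 0.75
```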
Uses Git LFS for large files (e.g., `.parquet`).

```bash
git lfs install                     # If needed
git lfs pull                        # Download all
git lfs pull --include="*.parquet"  # Specific files
git lfs ls-files                    # List LFS-tracked files
git lfs track                       # Show tracking patterns
```

Modify/commit as usual: `git add <file> && git commit && git push`. `.gitattributes` auto-tracks large files.