wparc (WordPress Archive) is a powerful command-line tool for backing up and archiving public data from WordPress websites using the WordPress REST API. It provides a simple, efficient way to extract posts, pages, media metadata, comments, and other content from any WordPress site that has the REST API enabled.
wparc connects to WordPress sites via their /wp-json/ REST API endpoint (available by default in WordPress 4.7+) and extracts all publicly accessible data. Unlike traditional backup tools that require database access or FTP credentials, wparc only needs the site URL and works entirely through the public API.
Key capabilities:
- Extract all public WordPress content (posts, pages, media, comments, etc.)
- Download media files (images, videos, documents)
- Analyze and discover WordPress API routes
- Generate structured, machine-readable backups (JSONL format)
- Work with any WordPress site without special permissions
- Data extraction: Dump all WordPress REST API routes and data
- Media download: Download all media files referenced in the API
- Route analysis: Analyze and categorize WordPress API routes, automatically test unknown routes, and generate YAML updates
- Smart pagination: Automatically detects and uses WordPress pagination headers (X-WP-TotalPages, X-WP-Total) for accurate progress tracking
- Progress tracking: Shows "page X of Y" progress when pagination headers are available
- SSL verification: Secure by default with configurable SSL verification
- Configurable: Customize timeout, page size, retry count, and more
- Type-safe: Full type hints for better IDE support and code quality
- Error handling: Comprehensive error handling with custom exceptions and actionable error messages
# Install from PyPI
pip install --upgrade pip setuptools
pip install wparc
# Or install from source for development
git clone https://github.com/ruarxive/wparc.git
cd wparc
pip install -e ".[dev]"
Python 3.6 or later is required. Python 3.9+ is recommended for best performance and compatibility.
- Python 3.6+
- Internet connection for downloading data
- Sufficient disk space (depends on site size - typically 100MB to several GB)
- Write permissions in the current directory (for creating output folders)
Here's a typical workflow for backing up a WordPress site:
# Step 1: Verify the site's REST API is accessible
wparc ping example.com
# Step 2: Analyze available routes (optional, but recommended)
wparc analyze example.com --verbose
# Step 3: Dump all data from the WordPress site
wparc dump example.com --verbose
# Step 4: Download all media files
wparc getfiles example.com --verbose
Get help:
wparc --help
# Or get help for a specific command
wparc ping --help
wparc dump --help
Ping a WordPress site (verify API accessibility):
wparc ping example.com
Dump all data from a WordPress site:
wparc dump example.com
Download media files (requires dump to be run first):
wparc getfiles example.com
Analyze WordPress API routes and test unknown routes:
wparc analyze example.com
The ping command verifies that a WordPress site's REST API is accessible and returns basic information about available endpoints. This is useful as a first step to check if a site supports the WordPress REST API before attempting to dump data.
Syntax:
wparc ping <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output with detailed logging information
- --https: Force HTTPS protocol (default: True, use --no-https to disable)
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
- --timeout INTEGER: Request timeout in seconds (default: 360)
What it does:
- Connects to the WordPress REST API endpoint (/wp-json/)
- Verifies the API is accessible and responding
- Counts total available routes
- Returns endpoint URL and route count
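The steps above can be sketched in Python. This is a minimal illustration, not wparc's actual implementation: the helper names `build_endpoint` and `count_routes` are hypothetical, and the only assumption about the API is that the index document served at /wp-json/ carries a `routes` object keyed by route path. The sample document is made up and stands in for a real HTTP response.

```python
import json


def build_endpoint(domain: str, https: bool = True) -> str:
    """Build the REST API index URL for a bare domain (hypothetical helper)."""
    scheme = "https" if https else "http"
    return f"{scheme}://{domain}/wp-json/"


def count_routes(index: dict) -> int:
    """Count the routes advertised in a parsed /wp-json/ index document."""
    return len(index.get("routes", {}))


# A trimmed, made-up index document standing in for a real HTTP response.
sample_index = json.loads(
    '{"name": "Example", "routes": {"/": {}, "/wp/v2/posts": {}, "/wp/v2/pages": {}}}'
)

print(f"Endpoint {build_endpoint('example.com')} is OK")
print(f"Total routes: {count_routes(sample_index)}")
```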
Examples:
Basic ping to check if API is accessible:
wparc ping example.com
Ping with HTTPS and verbose output to see detailed connection information:
wparc ping example.com --https --verbose
Ping a site with self-signed SSL certificate (development/testing only):
wparc ping localhost --no-verify-ssl --no-https
Ping with custom timeout for slow connections:
wparc ping slow-site.com --timeout 600
Expected Output:
✓ Endpoint https://example.com/wp-json/ is OK
✓ Total routes: 45
Use Cases:
- Quick health check before running a full dump
- Verifying REST API is enabled on a WordPress site
- Testing connectivity and SSL configuration
- Discovering how many routes are available
The dump command extracts all data from a WordPress site's REST API and saves it to local JSONL files. This is the primary command for backing up WordPress content including posts, pages, media metadata, comments, users, and other API endpoints.
Syntax:
wparc dump <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output showing detailed progress and operations
- -a, --all: Include unknown API routes in the dump (default: True). Set to --no-all to only dump known routes
- --https: Force HTTPS protocol (default: True, use --no-https to disable)
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
- --timeout INTEGER: Request timeout in seconds (default: 360). Increase for slow sites or large datasets
- --page-size INTEGER: Number of items per page (default: 100). Lower values use less memory but more requests
- --retry-count INTEGER: Number of retry attempts for failed requests (default: 5)
What it does:
- Discovers all available WordPress REST API routes
- Iterates through paginated endpoints (posts, pages, media, etc.)
- Downloads all data and saves to JSONL files (one JSON object per line)
- Creates organized directory structure: <domain>/data/
- Shows progress with pagination information when available
- Handles errors gracefully with automatic retries
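The automatic-retry behaviour can be pictured with a small Python sketch. It is an illustration under stated assumptions rather than wparc's own code: `with_retries` is a hypothetical helper, `retry_count` is read here as the number of extra attempts after the first one, and only `OSError`-style failures (which include `ConnectionError`) are retried.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], retry_count: int = 5, delay: float = 0.0) -> T:
    """Call fetch(); on an OSError, try again up to retry_count more times."""
    attempts = max(retry_count, 0) + 1
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay * (attempt + 1))  # simple linear back-off
    raise RuntimeError("unreachable")


# A fake request that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retries(flaky_fetch, retry_count=5)
print(result, "after", calls["n"], "attempts")
```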
Examples:
Basic dump of all data from a WordPress site:
wparc dump example.com
Dump with verbose output to see detailed progress:
wparc dump example.com --verbose
Dump only known routes (skip unknown/custom routes):
wparc dump example.com --no-all
Dump from a large site with custom settings for better performance:
wparc dump large-site.com --timeout 600 --page-size 50 --retry-count 3
Dump from a development site with self-signed certificate:
wparc dump dev.local --no-verify-ssl --no-https
Dump with HTTP instead of HTTPS (for local development):
wparc dump localhost --no-https
Expected Output:
Processing route: /wp/v2/posts
Processing page 1 of 5 (100 items per page)
Processing page 2 of 5 (100 items per page)
...
✓ Data collection complete: 45 routes processed, 2 skipped
Output Files:
After completion, you'll find files in <domain>/data/:
- wp-json.json - Main API index with all routes
- wp_v2_posts.jsonl - All posts (one JSON object per line)
- wp_v2_pages.jsonl - All pages
- wp_v2_media.jsonl - Media metadata (use getfiles to download actual files)
- wp_v2_comments.jsonl - Comments
- wp_v2_users.jsonl - Users (public data only)
- Additional route files as discovered
Note: The dump command automatically uses WordPress pagination headers (X-WP-TotalPages and X-WP-Total) when available to show accurate progress like "Processing page 1 of 5". This provides better visibility into the extraction progress for large sites.
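How a progress line could be derived from these headers is sketched below. The function names are hypothetical and the fallback message without a page total is an assumption, since this document only shows the "page X of Y" form. Headers are passed as a plain mapping here; HTTP libraries usually expose them case-insensitively.

```python
import math
from typing import Mapping, Optional


def total_pages(headers: Mapping[str, str], page_size: int) -> Optional[int]:
    """Return the page count advertised by the response headers, or None if absent."""
    if "X-WP-TotalPages" in headers:
        return int(headers["X-WP-TotalPages"])
    if "X-WP-Total" in headers:
        # Derive the page count from the total number of items.
        return math.ceil(int(headers["X-WP-Total"]) / page_size)
    return None


def progress_line(page: int, headers: Mapping[str, str], page_size: int = 100) -> str:
    """Format a progress message, with a total when the headers provide one."""
    pages = total_pages(headers, page_size)
    if pages is None:
        return f"Processing page {page}"  # assumed fallback wording
    return f"Processing page {page} of {pages} ({page_size} items per page)"


print(progress_line(1, {"X-WP-TotalPages": "5", "X-WP-Total": "450"}))
```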
Use Cases:
- Full site backup before migration or updates
- Content archival and preservation
- Data analysis and research
- Creating local copies for development
- Extracting content for static site generation
The getfiles command downloads all media files (images, videos, documents, etc.) that were referenced in the media metadata collected by the dump command. It reads from wp_v2_media.jsonl and downloads each file to the local filesystem, preserving the original directory structure.
Syntax:
wparc getfiles <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output showing download progress and file details
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
What it does:
- Reads media metadata from <domain>/data/wp_v2_media.jsonl (created by dump command)
- Downloads each media file referenced in the metadata
- Preserves original WordPress directory structure (wp-content/uploads/...)
- Supports resumable downloads (can be interrupted and resumed)
- Uses concurrent workers for faster downloads (default: 5 workers)
- Creates checkpoint files to track progress
Prerequisites:
- Must run wparc dump <domain> first to generate the media metadata file
- Requires the wp_v2_media.jsonl file to exist in <domain>/data/
Examples:
Download all media files after running dump:
# First, dump the data
wparc dump example.com
# Then download the media files
wparc getfiles example.com
Download with verbose output to see progress:
wparc getfiles example.com --verbose
Download from a site with SSL issues (development only):
wparc getfiles dev.local --no-verify-ssl
Expected Output:
Reading media files from example.com/data/wp_v2_media.jsonl
Found 1,234 media files to download
Downloading: image1.jpg [████████████] 100%
Downloading: video1.mp4 [████████████] 100%
...
✓ File download complete: 1234 downloaded, 0 failed, 0 skipped
Output Structure:
Files are downloaded to <domain>/files/wp-content/uploads/ preserving the original WordPress structure:
example.com/
└── files/
    └── wp-content/
        └── uploads/
            ├── 2024/
            │   └── 12/
            │       └── image.jpg
            └── 2025/
                └── 01/
                    └── video.mp4
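The mapping from a media URL to its place in this tree can be sketched as follows. This is an illustration, not the tool's code: `local_path_for` is a hypothetical helper, the sample line is made up, and it assumes each record in wp_v2_media.jsonl carries the file URL in the `source_url` field, as WordPress media objects do.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path_for(domain: str, source_url: str) -> PurePosixPath:
    """Map a media URL to <domain>/files/<URL path> (hypothetical helper)."""
    relative = PurePosixPath(urlparse(source_url).path.lstrip("/"))
    if ".." in relative.parts:
        # Refuse paths that could escape the output directory.
        raise ValueError(f"Refusing suspicious path: {source_url}")
    return PurePosixPath(domain) / "files" / relative


# One made-up line in the shape of wp_v2_media.jsonl.
line = '{"id": 123, "source_url": "https://example.com/wp-content/uploads/2024/12/image.jpg"}'
record = json.loads(line)
target = local_path_for("example.com", record["source_url"])
print(target)
```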
Features:
- Resumable: If interrupted, can resume from checkpoint
- Concurrent: Downloads multiple files simultaneously (5 workers by default)
- Progress Tracking: Shows download progress for each file
- Error Handling: Continues downloading even if some files fail
- Checkpoint System: Saves progress to resume later
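One way to picture the resume behaviour is a checkpoint file listing the URLs that have already been downloaded. The sketch below is purely illustrative: the file name, the one-URL-per-line format, and the helper names are assumptions, and wparc's real checkpoint format may differ.

```python
import os
import tempfile
from typing import Iterable, List, Set


def load_checkpoint(path: str) -> Set[str]:
    """Return the set of URLs recorded as done, or an empty set if no checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(path: str, url: str) -> None:
    """Append a finished URL to the checkpoint file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")


def pending(urls: Iterable[str], done: Set[str]) -> List[str]:
    """Keep only the URLs that still need downloading, in their original order."""
    return [u for u in urls if u not in done]


urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.txt")  # hypothetical file name
    mark_done(checkpoint, urls[0])  # pretend the first download finished
    remaining = pending(urls, load_checkpoint(checkpoint))
print(remaining)
```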
Use Cases:
- Complete site backup including all media files
- Migrating media files to a new server
- Creating offline archives of WordPress sites
- Downloading media for local development environments
- Preserving media assets for archival purposes
The analyze command performs a comprehensive analysis of a WordPress site's REST API routes. It compares discovered routes against a database of known routes, identifies unknown routes, automatically tests them to determine their characteristics, and generates YAML updates that can be added to the known routes database.
Syntax:
wparc analyze <domain> [OPTIONS]
Options:
- -v, --verbose: Enable verbose output showing detailed analysis and route testing progress
- --https: Force HTTPS protocol (default: True, use --no-https to disable)
- --no-verify-ssl: Disable SSL certificate verification (not recommended for security)
- --timeout INTEGER: Request timeout in seconds (default: 360)
What it does:
- Route Discovery: Fetches all available routes from /wp-json/
- Route Comparison: Compares against known routes database (known_routes.yml)
- Route Categorization: Categorizes routes into:
  - protected: Routes requiring authentication (401/403 responses)
  - public-list: Public routes returning arrays/lists (e.g., posts, pages)
  - public-dict: Public routes returning objects/dictionaries
  - useless: Routes that don't provide useful data (individual items, regex patterns)
  - unknown: Routes not in the known routes database
- Automatic Testing: Tests unknown routes to determine their category
- YAML Generation: Creates ready-to-use YAML for updating known_routes.yml
Route Categories Explained:
- Protected: Requires authentication, returns 401/403 errors. Not useful for public backups.
- Public-list: Returns arrays of items (posts, pages, comments). Useful for bulk data extraction.
- Public-dict: Returns single objects/dictionaries. May contain useful site information.
- Useless: Individual item endpoints (e.g., /wp/v2/posts/123) or regex patterns. Not useful for bulk extraction.
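The category rules above can be expressed as a small decision function. This is a sketch of the described logic, not the analyzer's actual code: `categorize` is a hypothetical name, and detecting regex routes by the `(?P<` marker is an assumption based on how WordPress writes parameterized routes such as /wp/v2/posts/(?P<id>[\d]+).

```python
from typing import Any, Optional


def categorize(route: str, status: Optional[int] = None, body: Any = None) -> str:
    """Classify a route from its path, HTTP status, and parsed JSON body."""
    if "(?P<" in route:
        return "useless"  # regex pattern for individual items
    if status in (401, 403):
        return "protected"  # authentication required
    if status == 200 and isinstance(body, list):
        return "public-list"
    if status == 200 and isinstance(body, dict):
        return "public-dict"
    return "unknown"  # not tested yet, or an unexpected response


print(categorize("/wp/v2/posts", 200, [{"id": 1}]))
print(categorize("/wp/v2/users/me", 401))
print(categorize(r"/wp/v2/posts/(?P<id>[\d]+)"))
```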
Examples:
Basic analysis of a WordPress site:
wparc analyze example.com
Analysis with verbose output to see route testing details:
wparc analyze example.com --verbose
Analyze a site with custom plugins that may have unknown routes:
wparc analyze custom-site.com --verbose
Analyze a development site:
wparc analyze dev.local --no-verify-ssl --no-https
Expected Output:
✓ Analysis complete for https://example.com/wp-json/
✓ Total routes: 45
Route Statistics:
Protected: 12
Public-list: 20
Public-dict: 5
Useless: 3
Unknown: 5
⚠ Found 5 unknown routes
Testing routes: 100%|████████████| 5/5 [00:02<00:00, 2.1route/s]
✓ Testing complete for unknown routes
Categorized routes:
public-list: 3
protected: 2
======================================================================
YAML Update for known_routes.yml:
======================================================================
protected:
- /wp/v2/users/me
- /wp/v2/settings
public-list:
- /wp/v2/custom-post-type
- /wp/v2/another-route
- /wp/v2/third-route
======================================================================
You can add the above YAML to known_routes.yml
With Verbose Output:
When using --verbose, you'll see additional details:
Testing route: /wp/v2/custom-post-type
Status: 200
Response type: list
Category: public-list
Testing route: /wp/v2/users/me
Status: 401
Category: protected
...
Using the Generated YAML:
The command outputs YAML that can be directly added to wparc/data/known_routes.yml:
- Copy the YAML output from the command
- Open wparc/data/known_routes.yml (or your local copy)
- Add the routes under the appropriate category
- This helps improve route recognition for future dumps
Use Cases:
- Discovering custom WordPress plugins and their API endpoints
- Understanding what data is available from a WordPress site
- Contributing to the known routes database
- Planning data extraction strategies
- Identifying protected vs. public endpoints
- Researching WordPress API capabilities
After running wparc dump <domain>, the following directory structure is created in your current working directory:
<domain>/
├── data/
│ ├── wp-json.json # Main API index with all routes and endpoints
│ ├── wp_v2_posts.jsonl # All posts (one JSON object per line)
│ ├── wp_v2_pages.jsonl # All pages
│ ├── wp_v2_media.jsonl # Media metadata (URLs, titles, descriptions)
│ ├── wp_v2_comments.jsonl # Comments
│ ├── wp_v2_users.jsonl # Users (public data only)
│ ├── wp_v2_categories.jsonl # Categories
│ ├── wp_v2_tags.jsonl # Tags
│ └── ... # Other routes discovered from the API
└── files/ # Media files (created after running getfiles)
    └── wp-content/
        └── uploads/
            ├── 2024/
            │   └── 12/
            │       └── image.jpg
            └── 2025/
                └── 01/
                    └── video.mp4
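The file names in data/ appear to follow the route paths. The sketch below shows one naming rule that reproduces the names listed above. It is inferred from those names only: `route_to_filename` is hypothetical, and how the tool treats other characters (hyphens, regex routes) is not covered here.

```python
def route_to_filename(route: str) -> str:
    """Turn a route such as /wp/v2/posts into a file name such as wp_v2_posts.jsonl."""
    stem = route.strip("/").replace("/", "_")
    return f"{stem}.jsonl"


for route in ("/wp/v2/posts", "/wp/v2/media", "/wp/v2/categories"):
    print(route, "->", route_to_filename(route))
```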
JSONL Format: Most data files use JSONL (JSON Lines) format where each line is a valid JSON object. This format is:
- Memory efficient (can process line by line)
- Easy to parse programmatically
- Suitable for large datasets
Example JSONL file content (wp_v2_posts.jsonl):
{"id":1,"date":"2024-01-01T00:00:00","title":{"rendered":"Hello World"},"content":{"rendered":"<p>Welcome to WordPress!</p>"},"excerpt":{"rendered":"<p>Welcome...</p>"},"author":1,"featured_media":0}
{"id":2,"date":"2024-01-02T00:00:00","title":{"rendered":"Sample Post"},"content":{"rendered":"<p>This is a sample post.</p>"},"excerpt":{"rendered":"<p>This is...</p>"},"author":1,"featured_media":123}
Reading JSONL files:
import json

# Read the dump one line at a time so large files never load fully into memory
with open('example.com/data/wp_v2_posts.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        post = json.loads(line)  # each line is one complete JSON object
        print(post['title']['rendered'])
Here's a complete example of backing up a WordPress site:
# 1. Check if the site is accessible
$ wparc ping mysite.com
✓ Endpoint https://mysite.com/wp-json/ is OK
✓ Total routes: 52
# 2. Analyze routes to understand what's available
$ wparc analyze mysite.com --verbose
✓ Analysis complete for https://mysite.com/wp-json/
✓ Total routes: 52
Route Statistics:
Protected: 15
Public-list: 28
Public-dict: 4
Useless: 3
Unknown: 2
# 3. Dump all data
$ wparc dump mysite.com --verbose
Processing route: /wp/v2/posts
Processing page 1 of 12 (100 items per page)
...
✓ Data collection complete: 50 routes processed, 2 skipped
# 4. Download media files
$ wparc getfiles mysite.com --verbose
Found 1,234 media files to download
Downloading: image1.jpg [████████████] 100%
...
✓ File download complete: 1234 downloaded, 0 failed, 0 skipped
# Result: Complete backup in mysite.com/ directory
$ ls -lh mysite.com/
data/ files/
# Run tests
pytest
# Format code
black wparc/
# Type checking
mypy wparc/
# Linting
flake8 wparc/
The most common use case - creating a complete backup of a WordPress site:
# Step 1: Verify connectivity
wparc ping example.com
# Step 2: Extract all data
wparc dump example.com --verbose
# Step 3: Download all media files
wparc getfiles example.com --verbose
Analyze what content is available without downloading everything:
# Get route statistics
wparc analyze example.com --verbose
# Check specific route counts
wparc ping example.com
For sites with thousands of posts or slow connections:
# Use smaller page size and longer timeout
wparc dump large-site.com \
--timeout 900 \
--page-size 25 \
--retry-count 10 \
--verbose
For local development sites or sites with self-signed certificates:
# Disable SSL verification (development only!)
wparc dump dev.local --no-verify-ssl --no-https
# Download media files
wparc getfiles dev.local --no-verify-ssl
For regular backups, you can run dump multiple times - it will overwrite existing files:
#!/bin/bash
# Daily backup script
DATE=$(date +%Y-%m-%d)
wparc dump example.com --verbose > "backup-$DATE.log" 2>&1
wparc getfiles example.com --verbose >> "backup-$DATE.log" 2>&1
Find and document custom WordPress plugin endpoints:
# Analyze and get YAML for unknown routes
wparc analyze custom-site.com --verbose > analysis.txt
# The output will include YAML that can be added to known_routes.yml
If you encounter SSL certificate errors, you can temporarily disable verification:
wparc dump example.com --no-verify-ssl
Warning: This is not recommended for production use as it makes you vulnerable to man-in-the-middle attacks.
If requests are timing out, increase the timeout:
wparc dump example.com --timeout 600
For large WordPress sites, you may want to adjust the page size:
wparc dump example.com --page-size 50
The dump command automatically detects pagination information from WordPress API headers, so you'll see progress like "Processing page 1 of 10" when available. This helps you estimate completion time for large extractions.
If you see domain validation errors, ensure you're using a valid domain format:
- Valid: example.com, www.example.com, subdomain.example.com
- Auto-corrected: http://example.com (protocol will be stripped automatically)
- Auto-corrected: example.com/ (trailing slash will be removed automatically)
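The normalization rules above can be sketched in Python. This is an illustration under assumptions, not wparc's validator: `normalize_domain` is a hypothetical name, the `DomainValidationError` class defined here is a local stand-in for the exception named in this document, and only the rules stated in this section are checked.

```python
import re


class DomainValidationError(ValueError):
    """Local stand-in for the exception named in this document; not imported from wparc."""


def normalize_domain(raw: str) -> str:
    """Strip a protocol prefix and trailing slashes, then reject obviously bad input."""
    domain = raw.strip()
    domain = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", domain)  # drop http://, https://, ...
    domain = domain.rstrip("/")
    if not domain:
        raise DomainValidationError("Domain cannot be empty")
    if ".." in domain:
        raise DomainValidationError(
            f"Invalid domain '{domain}': Domain cannot contain consecutive dots"
        )
    return domain


print(normalize_domain("http://example.com"))
print(normalize_domain("example.com/"))
```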
wparc provides detailed error messages with actionable suggestions:
DomainValidationError: Invalid domain format
Error: Invalid domain 'example..com': Domain cannot contain consecutive dots
Solution: Check the domain name format. Valid formats: example.com, www.example.com, subdomain.example.com
APIError: WordPress API request failed
WordPress API error for https://example.com/wp-json/ (HTTP 404)
Suggestion: Check if the WordPress REST API is enabled on this site.
Solution:
- Verify the site is accessible: curl https://example.com/wp-json/
- Check if REST API is disabled by plugins or theme
- Ensure WordPress version is 4.7+ (REST API was introduced in 4.7)
SSLVerificationError: SSL certificate verification failed
SSL verification failed for https://example.com/wp-json/: certificate verify failed
Suggestion: If you trust this site, you can use --no-verify-ssl (not recommended for production).
Solution:
- For production sites: Fix SSL certificate issues on the server
- For development/testing: Use --no-verify-ssl flag (not recommended for production)
FileDownloadError: File download failed
Failed to download https://example.com/wp-content/uploads/image.jpg: Connection timeout
Suggestion: Check your internet connection and verify the URL is accessible.
Solution:
- Check internet connectivity
- Verify the media file URL is accessible
- Try downloading manually to verify the file exists
- Check if the site requires authentication for media files
MediaFileNotFoundError: Media file list not found
Media file not found: example.com/data/wp_v2_media.jsonl
Suggestion: Run 'wparc dump <domain>' first to generate the media file list.
Solution: Run wparc dump <domain> before running wparc getfiles <domain>
Issue: "Connection timeout" errors
# Solution: Increase timeout
wparc dump example.com --timeout 900
Issue: "Too many requests" or rate limiting
# Solution: Reduce page size and increase retry count
wparc dump example.com --page-size 25 --retry-count 10
Issue: "SSL certificate verify failed" on valid sites
# Solution: Update certificates (macOS/Linux)
# Or temporarily disable for testing (not recommended)
wparc dump example.com --no-verify-ssl
Issue: Dump completes but getfiles fails
# Solution: Check if wp_v2_media.jsonl exists
ls -lh example.com/data/wp_v2_media.jsonl
# If missing, the site may not have media endpoints
# Try running dump again with --verbose to see what routes were processed
Issue: Out of memory errors on large sites
# Solution: Use smaller page size
wparc dump example.com --page-size 25
Issue: Some routes return 401/403 errors
# This is normal - protected routes require authentication
# These routes are automatically skipped during dump
# Use analyze command to see which routes are protected
wparc analyze example.com
For large sites:
- Use smaller --page-size (25-50) to reduce memory usage
- Increase --timeout for slow connections
- Run during off-peak hours to avoid impacting site performance
- Use --verbose to monitor progress
For faster downloads:
- The getfiles command uses 5 concurrent workers by default
- Ensure stable internet connection for best results
- Consider running getfiles separately if dump takes a long time
File organization:
- Each domain creates its own directory structure
- JSONL files can be processed line-by-line (memory efficient)
- Media files preserve original WordPress directory structure
Backup strategy:
- Run regular dumps to capture content changes
- Store backups in version control or cloud storage
- Consider compressing old backups to save space
Custom post types:
- Use analyze command to discover custom endpoints
- Custom routes are automatically included when using --all flag (default)
- Generated YAML from analyze can improve future dumps
Plugin-specific content:
- Many WordPress plugins expose their data via REST API
- Use analyze to discover plugin endpoints
- Some plugin data may require authentication (will be skipped)
Local development:
# Backup production site
wparc dump production.com
# Restore to local (requires custom import script)
# Use JSONL files to import data into local WordPress
Testing:
- Use ping command to verify API accessibility
- Use analyze to understand available endpoints
- Test with --verbose to see detailed operations
What wparc can do:
- Extract all public WordPress content
- Download publicly accessible media files
- Work with any WordPress site (4.7+)
- Discover and analyze API routes
What wparc cannot do:
- Access private/protected content (requires authentication)
- Extract database structure or settings
- Backup WordPress core files or themes
- Access content behind paywalls or membership plugins
- Extract user passwords or sensitive data
- SSL verification enabled by default: All HTTPS connections verify SSL certificates
- Secure file operations: All file operations use secure context managers
- No command injection: Safe subprocess usage prevents command injection vulnerabilities
- Error handling: Proper error handling prevents information leakage
- No authentication: Only accesses publicly available data (no credentials required or stored)
Security recommendations:
- Always use --https for production sites (default)
- Only use --no-verify-ssl for development/testing
- Review downloaded content before using in production
- Keep wparc updated to latest version
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
See LICENSE file for details.
For detailed information about WordPress REST API endpoints, see WP_API_ENDPOINTS.md.
See CHANGELOG.md for a list of changes.