A high-performance .NET application for scraping and downloading images from the Hittite Portal archaeological database at the University of Würzburg.
The Hittite Portal Scraper automates the extraction and download of archaeological images from the Hethitologie Portal Mainz. It uses Playwright for browser automation and implements optimized batch processing with concurrent downloads.
- Automated Scraping: Systematically crawls the Hittite Portal catalog
- Batch Processing: Handles large datasets with configurable batch sizes
- Concurrent Downloads: Parallel processing with rate limiting to prevent overload
- Memory Optimized: Includes garbage collection strategies for long-running operations
- Retry Logic: Automatic retry mechanism for failed downloads
- Docker Support: Containerized deployment for consistent environments
- .NET 8.0 SDK or higher
- Docker (optional, for containerized deployment)
- Clone the repository
  git clone https://github.com/yourusername/HittitePortalScraper.git
  cd HittitePortalScraper
- Restore dependencies
  dotnet restore
- Build the project
  dotnet build
- Run the application
  dotnet run --project Test/HittitePortalScraper.csproj
docker-compose up --build
# Build the image
docker build -t hittite-scraper .
# Run the container
docker run -v $(pwd)/images:/app/images hittite-scraper
The downloaded images will be saved to the ./images directory on your host machine.
You can adjust the following parameters in Program.cs:
const int batchSize = 50; // Number of items per batch
const int maxConcurrent = 5; // Maximum concurrent downloads
Playwright browser settings can be modified for your environment:
await playwright.Chromium.LaunchAsync(new BrowserTypeLaunchOptions
{
    Headless = true,
    Args = new[] {
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--no-sandbox"
    }
});
- Microsoft.Playwright - Browser automation
- RestSharp - HTTP client library
- .NET 8.0 - Runtime framework
HittitePortalScraper/
├── HittitePortalScraper/
│   ├── HittitePortalScraper.csproj
│   ├── Program.cs # Main application logic
│   └── RestSharpLinks.cs # Image scraping utilities
├── images/ # Downloaded images (created at runtime)
├── Dockerfile
├── docker-compose.yml
├── .dockerignore
├── .gitignore
└── README.md
- Initial Crawl: Starts from the main CTH portal page
- Link Extraction: Recursively extracts catalog links using XPath selectors
- Image Discovery: Identifies touchpic.php endpoints with image parameters
- Batch Processing: Groups downloads into manageable batches
- Concurrent Execution: Uses semaphores to control parallel downloads
- Response Interception: Captures pixl3.php image responses via Playwright
- File Storage: Saves images with bildnr as filename
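The Image Discovery and File Storage steps can be illustrated with a minimal sketch that derives a local file name from the bildnr parameter of a touchpic.php link. This is not the project's actual code: the helper name FileNameFromLink, the example.org URL, the exact query-string layout, and the .jpg extension are all assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not from the project): returns "<bildnr><extension>"
// for a link that carries a bildnr query parameter, or null otherwise.
static string? FileNameFromLink(string link, string extension = ".jpg")
{
    if (!Uri.TryCreate(link, UriKind.Absolute, out var uri)) return null;

    // Split "?a=1&b=2" into key/value pairs and pick the bildnr value.
    var bildnr = uri.Query.TrimStart('?')
        .Split('&', StringSplitOptions.RemoveEmptyEntries)
        .Select(pair => pair.Split('=', 2))
        .Where(kv => kv.Length == 2 && kv[0] == "bildnr")
        .Select(kv => Uri.UnescapeDataString(kv[1]))
        .FirstOrDefault();

    if (string.IsNullOrWhiteSpace(bildnr)) return null;

    // Replace characters that are not allowed in file names on this platform.
    var invalid = Path.GetInvalidFileNameChars();
    var safe = new string(bildnr.Select(c => invalid.Contains(c) ? '_' : c).ToArray());
    return safe + extension;
}

// Assumed URL shape; the real portal links may differ.
var fileName = FileNameFromLink("https://example.org/touchpic.php?ori=&bildnr=BoFN00123");
Console.WriteLine(fileName); // prints "BoFN00123.jpg"
```

Sanitizing the value before using it as a file name keeps a stray slash or similar character in bildnr from escaping the images directory.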
- Processes 50 items per batch by default
- Runs up to 5 downloads concurrently
- Includes automatic garbage collection between batches
- Built-in retry logic (2 attempts per image)
- Timeout protection (20s page load, 6s image wait)
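The defaults listed above can be combined roughly as in the following sketch: items are chunked into batches, a SemaphoreSlim caps the number of downloads in flight, each image gets a bounded number of attempts with a per-attempt timeout, and a garbage collection is requested between batches. It is a sketch under stated assumptions, not the code in Program.cs: RunAsync, DownloadWithRetryAsync, and FlakyDownload are hypothetical names, and the download itself is simulated so the example runs without network access.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Defaults taken from the list above.
const int batchSize = 50;
const int maxConcurrent = 5;
const int maxAttempts = 2;
var imageTimeout = TimeSpan.FromSeconds(6);

// Tries one download up to maxAttempts times, each with its own timeout token.
async Task<bool> DownloadWithRetryAsync(string url, Func<string, CancellationToken, Task> download)
{
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        using var cts = new CancellationTokenSource(imageTimeout);
        try
        {
            await download(url, cts.Token);
            return true;
        }
        catch (Exception)
        {
            if (attempt == maxAttempts) return false; // give up after the last attempt
        }
    }
    return false;
}

// Processes all URLs batch by batch and returns the number of successful downloads.
async Task<int> RunAsync(IEnumerable<string> urls, Func<string, CancellationToken, Task> download)
{
    var succeeded = 0;
    using var gate = new SemaphoreSlim(maxConcurrent);

    foreach (var batch in urls.Chunk(batchSize))
    {
        var tasks = batch.Select(async url =>
        {
            await gate.WaitAsync(); // wait for a free download slot
            try { return await DownloadWithRetryAsync(url, download); }
            finally { gate.Release(); }
        });

        var results = await Task.WhenAll(tasks);
        succeeded += results.Count(r => r);

        GC.Collect(); // explicit collection between batches, as the README describes
    }
    return succeeded;
}

// Simulated download (assumption): fails on the first attempt for every URL.
var attempts = new ConcurrentDictionary<string, int>();
Task FlakyDownload(string url, CancellationToken ct)
{
    var n = attempts.AddOrUpdate(url, 1, (_, old) => old + 1);
    if (n == 1) throw new InvalidOperationException("simulated failure");
    return Task.CompletedTask;
}

var demoUrls = Enumerable.Range(1, 120).Select(i => $"https://example.org/img/{i}").ToList();
var succeededCount = await RunAsync(demoUrls, FlakyDownload);
Console.WriteLine($"{succeededCount}/{demoUrls.Count} downloaded"); // prints "120/120 downloaded"
```

Creating the semaphore once, outside the batch loop, keeps the concurrency cap global, while awaiting Task.WhenAll per batch ensures one batch finishes before the next begins.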
The Docker setup includes:
- Multi-stage build for optimized image size
- Playwright browser dependencies pre-installed
- Volume mounting for persistent image storage
- Non-root user for security
- Health checks for container monitoring
- Respect Rate Limits: The tool includes rate limiting, but always use responsibly
- Data Usage: Downloading large image collections will consume significant bandwidth
- Legal Compliance: Ensure you have permission to scrape and store these images
- Attribution: Images belong to the Hethitologie Portal Mainz at the University of Würzburg
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hethitologie Portal Mainz - Source of the archaeological data
- University of Würzburg - Maintaining the Hittite Portal
- Playwright Team - For the excellent browser automation framework
For questions or suggestions, please open an issue on GitHub.
Note: This tool is for academic and research purposes. Always respect the terms of service of the target website and applicable copyright laws.