🏛️ Hittite Portal Scraper

A .NET application that scrapes and downloads images from the Hethitologie Portal Mainz (the "Hittite Portal"), an archaeological database maintained at the University of Würzburg.

📋 Overview

The Hittite Portal Scraper automates the extraction and download of archaeological images from the Hethitologie Portal Mainz. It uses Playwright for browser automation and implements optimized batch processing with concurrent downloads.

Key Features

Automated Scraping: Systematically crawls the Hittite Portal catalog
Batch Processing: Handles large datasets with configurable batch sizes
Concurrent Downloads: Parallel processing with rate limiting to prevent overload
Memory Optimized: Includes garbage collection strategies for long-running operations
Retry Logic: Automatic retry mechanism for failed downloads
Docker Support: Containerized deployment for consistent environments

🚀 Quick Start

Prerequisites

.NET 8.0 SDK or later
Docker (optional, for containerized deployment)

Local Development

Clone the repository
```
git clone https://github.com/yourusername/HittitePortalScraper.git
cd HittitePortalScraper
```
Restore dependencies
```
dotnet restore
```
Build the project
```
dotnet build
```
Run the application (the project file lives in the HittitePortalScraper folder, as shown under Project Structure)
```
dotnet run --project HittitePortalScraper/HittitePortalScraper.csproj
```

Docker Deployment

Using Docker Compose (Recommended)

```
docker-compose up --build
```

Manual Docker Build
```
# Build the image
docker build -t hittite-scraper .

# Run the container
docker run -v $(pwd)/images:/app/images hittite-scraper
```

The downloaded images will be saved to the ./images directory on your host machine.

🛠️ Configuration

Performance Tuning

You can adjust the following parameters in Program.cs:

```
const int batchSize = 50;        // Number of items per batch
const int maxConcurrent = 5;     // Maximum concurrent downloads
```
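
To illustrate how these two settings interact, here is a minimal sketch of batched processing with a semaphore-based concurrency cap. It is not the code from Program.cs: `DownloadAsync` and `RunAsync` are hypothetical names, and the download is simulated with a short delay.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int batchSize = 50;     // items per batch, same value as above
const int maxConcurrent = 5;  // downloads allowed in flight at once

// Hypothetical stand-in for the real download; it only simulates I/O latency.
async Task DownloadAsync(int id)
{
    await Task.Delay(10);
}

// Processes ids batch by batch; the semaphore caps how many downloads overlap.
async Task<(int Completed, int Peak)> RunAsync(IEnumerable<int> ids, int size, int maxParallel)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var sync = new object();
    int running = 0, maxSeen = 0, done = 0;

    foreach (int[] batch in ids.Chunk(size))
    {
        IEnumerable<Task> tasks = batch.Select(async id =>
        {
            await gate.WaitAsync();              // wait for a free slot
            try
            {
                lock (sync) { running++; maxSeen = Math.Max(maxSeen, running); }
                await DownloadAsync(id);
                lock (sync) { done++; }
            }
            finally
            {
                lock (sync) { running--; }
                gate.Release();                  // free the slot even on failure
            }
        });
        await Task.WhenAll(tasks);               // finish this batch before starting the next
    }
    return (done, maxSeen);
}

var result = await RunAsync(Enumerable.Range(1, 120), batchSize, maxConcurrent);
Console.WriteLine($"completed={result.Completed}, peak concurrency={result.Peak}");
```

In this sketch a larger `batchSize` only changes how much work is queued per round, while `maxConcurrent` bounds the load placed on the remote server at any moment.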

Browser Options

Playwright browser settings can be modified for your environment:

```
await playwright.Chromium.LaunchAsync(new BrowserTypeLaunchOptions
{
    Headless = true,
    Args = new[] {
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--no-sandbox"
    }
});
```

📦 Dependencies

Microsoft.Playwright - Browser automation
RestSharp - HTTP client library
.NET 8.0 - Runtime framework

🏗️ Project Structure

```
HittitePortalScraper/
├── HittitePortalScraper/
│   ├── HittitePortalScraper.csproj
│   ├── Program.cs              # Main application logic
│   └── RestSharpLinks.cs       # Image scraping utilities
├── images/                     # Downloaded images (created at runtime)
├── Dockerfile
├── docker-compose.yml
├── .dockerignore
├── .gitignore
└── README.md
```

🔄 How It Works

Initial Crawl: Starts from the main CTH portal page
Link Extraction: Recursively extracts catalog links using XPath selectors
Image Discovery: Identifies touchpic.php endpoints with image parameters
Batch Processing: Groups downloads into manageable batches
Concurrent Execution: Uses semaphores to control parallel downloads
Response Interception: Captures pixl3.php image responses via Playwright
File Storage: Saves images with bildnr as filename
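
The link extraction, image discovery, and file naming steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the sample markup, the base URL, the bildnr values, and the `.jpg` extension are all made up, `GetBildnr` is a hypothetical helper, and the sketch assumes the query parameter is literally named `bildnr`. It also parses a small well-formed snippet with `XDocument`, whereas real pages would need an HTML-tolerant parser or the browser's own XPath engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Xml.Linq;
using System.Xml.XPath;

// Returns the bildnr query value of a touchpic.php link, or null for any other link.
string? GetBildnr(Uri link)
{
    if (!link.AbsolutePath.EndsWith("touchpic.php", StringComparison.OrdinalIgnoreCase))
        return null;
    return HttpUtility.ParseQueryString(link.Query)["bildnr"];
}

// Made-up, well-formed sample page; the real portal markup will differ.
const string sampleHtml =
    "<html><body>" +
    "<a href='touchpic.php?ori=&amp;bildnr=N00001'>photo</a>" +
    "<a href='catalog.php?cth=1'>catalog entry</a>" +
    "<a href='touchpic.php?bildnr=N00002'>photo</a>" +
    "</body></html>";

var baseUri = new Uri("https://example.org/portal/");  // placeholder base URL
XDocument doc = XDocument.Parse(sampleHtml);

// Select every anchor with an href via XPath, resolve it against the base URL,
// keep only touchpic.php links, and derive a file name from bildnr.
List<string> fileNames = doc.XPathSelectElements("//a[@href]")
    .Select(a => new Uri(baseUri, a.Attribute("href")!.Value))
    .Select(uri => GetBildnr(uri))
    .Where(id => !string.IsNullOrEmpty(id))
    .Select(id => id + ".jpg")                          // extension is an assumption
    .ToList();

Console.WriteLine(string.Join(", ", fileNames));
```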

📊 Performance

Processes 50 items per batch by default
Runs up to 5 downloads concurrently by default
Triggers garbage collection between batches
Retries failed downloads (2 attempts per image)
Enforces timeouts (20 s for page load, 6 s waiting for the image response)
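
As a rough illustration of how retry and timeout limits like these can be combined, the sketch below wraps an operation in a helper that gives each attempt its own deadline. `WithRetryAsync` and `FlakyDownloadAsync` are hypothetical names, the failure is simulated, and the sketch assumes "2 attempts" means two tries in total; the scraper's actual implementation may differ.

```csharp
using System;
using System.Threading.Tasks;

// Tries an operation up to maxAttempts times, giving each attempt its own timeout.
// The last failure (including TimeoutException) is rethrown to the caller.
async Task<T> WithRetryAsync<T>(Func<Task<T>> operation, int maxAttempts, TimeSpan timeout)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation().WaitAsync(timeout);
        }
        catch (Exception ex) when (attempt < maxAttempts)
        {
            Console.WriteLine($"attempt {attempt} failed: {ex.GetType().Name}");
        }
    }
}

int calls = 0;

// Hypothetical flaky download: fails on the first call, succeeds on the second.
async Task<byte[]> FlakyDownloadAsync()
{
    calls++;
    await Task.Delay(5);
    if (calls == 1) throw new InvalidOperationException("simulated failure");
    return new byte[] { 1, 2, 3 };
}

byte[] image = await WithRetryAsync<byte[]>(
    () => FlakyDownloadAsync(), maxAttempts: 2, timeout: TimeSpan.FromSeconds(6));
Console.WriteLine($"got {image.Length} bytes after {calls} calls");
```

Note that `Task.WaitAsync` only stops waiting when the timeout elapses; it does not cancel the underlying operation, so a real download would also want a `CancellationToken`.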

🐳 Docker Details

The Docker setup includes:

Multi-stage build for optimized image size
Playwright browser dependencies pre-installed
Volume mounting for persistent image storage
Non-root user for security
Health checks for container monitoring

⚠️ Important Notes

Respect Rate Limits: The tool includes rate limiting, but always use responsibly
Data Usage: Downloading large image collections will consume significant bandwidth
Legal Compliance: Ensure you have permission to scrape and store these images
Attribution: Images belong to the Hethitologie Portal Mainz at the University of Würzburg

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Hethitologie Portal Mainz - Source of the archaeological data
University of Würzburg - Maintaining the Hittite Portal
Playwright Team - For the excellent browser automation framework

📧 Contact

For questions or suggestions, please open an issue on GitHub.

Note: This tool is for academic and research purposes. Always respect the terms of service of the target website and applicable copyright laws.
