🎵 Discogsography


A modern Python 3.13+ microservices platform for transforming the complete Discogs music database into powerful, queryable knowledge graphs and analytics engines.

🚀 Quick Start | 📖 Documentation | 🎯 Features | 💬 Community | 📋 Emoji Guide

🎯 What is Discogsography?

Discogsography transforms monthly Discogs data dumps (50GB+ compressed XML) into:

  • 🔗 Neo4j Graph Database: Navigate complex music industry relationships
  • 🐘 PostgreSQL Database: High-performance queries and full-text search
  • 🤖 AI Discovery Engine: Intelligent recommendations and analytics
  • 📊 Real-time Dashboard: Monitor system health and processing metrics

Perfect for music researchers, data scientists, developers, and music enthusiasts who want to explore the world's largest music database.

πŸ›οΈ Architecture Overview

βš™οΈ Core Services

Service Purpose Key Technologies
πŸ“₯ Python Extractor Downloads & processes Discogs XML dumps (Python) asyncio, orjson, aio-pika
⚑ Rust Extractor High-performance Rust-based extractor tokio, quick-xml, lapin
πŸ”— Graphinator Builds Neo4j knowledge graphs neo4j-driver, graph algorithms
🐘 Tableinator Creates PostgreSQL analytics tables psycopg3, JSONB, full-text search
🎡 Discovery AI-powered music intelligence sentence-transformers, plotly, networkx
πŸ“Š Dashboard Real-time system monitoring FastAPI, WebSocket, reactive UI

πŸ“ System Architecture

graph TD
    S3[("🌐 Discogs S3<br/>Monthly Data Dumps<br/>~50GB XML")]
    PYEXT[["πŸ“₯ Python Extractor<br/>XML β†’ JSON<br/>Deduplication"]]
    RSEXT[["⚑ Rust Extractor<br/>High-Performance<br/>XML Processing"]]
    RMQ{{"🐰 RabbitMQ<br/>Message Broker<br/>4 Queues"}}
    NEO4J[("πŸ”— Neo4j<br/>Graph Database<br/>Relationships")]
    PG[("🐘 PostgreSQL<br/>Analytics DB<br/>Full-text Search")]
    REDIS[("πŸ”΄ Redis<br/>Cache Layer<br/>Query & ML Cache")]
    GRAPH[["πŸ”— Graphinator<br/>Graph Builder"]]
    TABLE[["🐘 Tableinator<br/>Table Builder"]]
    DASH[["πŸ“Š Dashboard<br/>Real-time Monitor<br/>WebSocket"]]
    DISCO[["🎡 Discovery<br/>AI Engine<br/>ML Models"]]

    S3 -->|1a. Download & Parse| PYEXT
    S3 -->|1b. Download & Parse| RSEXT
    PYEXT -->|2. Publish Messages| RMQ
    RSEXT -->|2. Publish Messages| RMQ
    RMQ -->|3a. Artists/Labels/Releases/Masters| GRAPH
    RMQ -->|3b. Artists/Labels/Releases/Masters| TABLE
    GRAPH -->|4a. Build Graph| NEO4J
    TABLE -->|4b. Store Data| PG

    DISCO -.->|Query| NEO4J
    DISCO -.->|Query via asyncpg| PG
    DISCO -.->|Cache| REDIS
    DISCO -.->|Analyze| DISCO

    DASH -.->|Monitor| PYEXT
    DASH -.->|Monitor| RSEXT
    DASH -.->|Monitor| GRAPH
    DASH -.->|Monitor| TABLE
    DASH -.->|Monitor| DISCO
    DASH -.->|Cache| REDIS
    DASH -.->|Stats| RMQ
    DASH -.->|Stats| NEO4J
    DASH -.->|Stats| PG

    style S3 fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style PYEXT fill:#fff9c4,stroke:#f57c00,stroke-width:2px
    style RSEXT fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style RMQ fill:#fff3e0,stroke:#e65100,stroke-width:2px
    style NEO4J fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style PG fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style REDIS fill:#ffebee,stroke:#b71c1c,stroke-width:2px
    style DASH fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    style DISCO fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px
Loading

🌟 Key Features

🚀 Performance & Scale

  • ⚡ High-Speed Processing: 5,000-10,000 records/second XML parsing
  • 🔄 Smart Deduplication: SHA256 hash-based change detection prevents reprocessing (see the sketch after this list)
  • 📈 Handles Big Data: Processes 15M+ releases, 2M+ artists efficiently
  • 🎯 Concurrent Processing: Multi-threaded parsing with async message handling
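
The deduplication idea, as a minimal sketch: hash each record's canonical JSON with SHA256 and skip publishing when the hash matches the one stored previously. This is an illustration under assumptions (orjson for canonical serialization, an in-memory `seen_hashes` dict standing in for the real persistence layer), not the project's exact implementation:

```python
import hashlib

import orjson  # fast JSON library used elsewhere in this project

# Hypothetical stand-in for the real hash store (the actual services keep a
# `hash` value per record in PostgreSQL/Neo4j).
seen_hashes: dict[str, str] = {}


def record_hash(record: dict) -> str:
    """SHA256 over a canonical (key-sorted) JSON encoding of the record."""
    canonical = orjson.dumps(record, option=orjson.OPT_SORT_KEYS)
    return hashlib.sha256(canonical).hexdigest()


def should_process(record: dict) -> bool:
    """Return True only if the record is new or has changed since last dump."""
    data_id = str(record["id"])
    digest = record_hash(record)
    if seen_hashes.get(data_id) == digest:
        return False  # unchanged: skip reprocessing
    seen_hashes[data_id] = digest
    return True


if __name__ == "__main__":
    artist = {"id": 1, "name": "Pink Floyd"}
    print(should_process(artist))  # True (first time seen)
    print(should_process(artist))  # False (hash unchanged)
```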

🛡️ Reliability & Operations

  • 🔁 Auto-Recovery: Automatic retries with exponential backoff
  • 💾 Message Durability: RabbitMQ persistence with dead letter queues
  • 🏥 Health Monitoring: HTTP health checks for all services
  • 📊 Real-time Metrics: WebSocket dashboard with live updates

🔒 Security & Quality

  • 🐋 Container Security: Non-root users, read-only filesystems, dropped capabilities
  • 🔐 Code Security: Bandit scanning, secure defaults, parameterized queries
  • 📝 Type Safety: Full type hints with strict mypy validation
  • ✅ Comprehensive Testing: Unit, integration, and E2E tests with Playwright

🤖 AI & Analytics

  • 🧠 ML-Powered Discovery: Semantic search using sentence transformers (sketched after this list)
  • 📊 Industry Analytics: Genre trends, label insights, market analysis
  • 🔍 Graph Algorithms: PageRank, community detection, path finding
  • 🎨 Interactive Visualizations: Plotly charts, vis.js network graphs
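
To make the semantic-search claim concrete, here is a hedged sketch of the core idea using the sentence-transformers library: embed text profiles once, then rank them by cosine similarity against a free-text query. The model choice (`all-MiniLM-L6-v2`) and the toy corpus are illustrative assumptions, not the Discovery service's actual pipeline:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative corpus; the real service would pull profiles from PostgreSQL/Neo4j.
profiles = [
    "English progressive rock band known for concept albums",
    "American jazz trumpeter and bandleader, pioneer of cool jazz",
    "Detroit techno producer and DJ",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used model
corpus_embeddings = model.encode(profiles, convert_to_tensor=True)

query = "influential jazz musician"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every profile, best match first.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {profiles[idx]}")
```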

📖 Documentation

🎯 Essential Guides

| Document | Purpose |
| --- | --- |
| CLAUDE.md | 🤖 Claude Code integration guide & development standards |
| Documentation Index | 📚 Complete documentation directory with all guides |
| GitHub Actions Guide | 🚀 CI/CD workflows, automation & best practices |
| Task Automation | ⚡ Complete taskipy command reference |

🏗️ Development Standards

| Document | Purpose |
| --- | --- |
| Monorepo Guide | 📦 Managing a Python monorepo with shared dependencies |
| Testing Guide | 🧪 Comprehensive testing strategies and patterns |
| Logging Guide | 📊 Structured logging standards and practices |
| Python Version Management | 🐍 Managing Python 3.13+ across the project |

🛡️ Operations & Security

| Document | Purpose |
| --- | --- |
| Docker Security | 🔒 Container hardening & security practices |
| Dockerfile Standards | 🐋 Best practices for writing Dockerfiles |
| Database Resilience | 💾 Database connection patterns & error handling |
| Performance Guide | ⚡ Performance optimization strategies |

📋 Features & References

| Document | Purpose |
| --- | --- |
| Consumer Cancellation | 🔄 File completion and consumer lifecycle |
| Platform Targeting | 🎯 Cross-platform compatibility |
| Emoji Guide | 📋 Standardized emoji usage |
| Recent Improvements | 🚀 Latest platform enhancements |
| Service Guides | 📚 Individual README for each service |

🚀 Quick Start

✅ Prerequisites

| Requirement | Minimum | Recommended | Notes |
| --- | --- | --- | --- |
| Python | 3.13+ | Latest | Install via uv |
| Docker | 20.10+ | Latest | With Docker Compose v2 |
| Storage | 100GB | 200GB SSD | For data + processing |
| Memory | 8GB | 16GB+ | More RAM = faster processing |
| Network | 10 Mbps | 100 Mbps+ | Initial download ~50GB |

🐳 Using Docker Compose (Recommended)

```bash
# 1. Clone and navigate to the repository
git clone https://github.com/SimplicityGuy/discogsography.git
cd discogsography

# 2. Copy the environment template (optional - has sensible defaults)
cp .env.example .env

# 3. Start all services (default: Python Extractor)
docker-compose up -d

# 3b. (Optional) Use the high-performance Rust Extractor instead
./scripts/switch-extractor.sh rust
# To switch back to the Python Extractor: ./scripts/switch-extractor.sh python

# 4. Watch the magic happen!
docker-compose logs -f

# 5. Access the dashboard
open http://localhost:8003
```

🌐 Service Access

| Service | URL | Default Credentials | Purpose |
| --- | --- | --- | --- |
| 📊 Dashboard | http://localhost:8003 | None | System monitoring |
| 🎵 Discovery | http://localhost:8005 | None | AI music discovery |
| 🐰 RabbitMQ | http://localhost:15672 | discogsography / discogsography | Queue management |
| 🔗 Neo4j | http://localhost:7474 | neo4j / discogsography | Graph exploration |
| 🐘 PostgreSQL | localhost:5433 | discogsography / discogsography | Database access |

💻 Local Development

Quick Setup

```bash
# 1. Install uv (10-100x faster than pip)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install just (task runner)
brew install just  # macOS
# or: cargo install just
# or: https://just.systems/install.sh

# 3. Install all dependencies
just install

# 4. Set up pre-commit hooks
just init

# 5. Run any service
just dashboard         # Monitoring UI
just discovery         # AI discovery
just pyextractor       # Python data ingestion
just rustextractor-run # Rust data ingestion (requires cargo)
just graphinator       # Neo4j builder
just tableinator       # PostgreSQL builder
```

Environment Setup

Create a .env file or export variables:

```bash
# Core connections
export AMQP_CONNECTION="amqp://guest:guest@localhost:5672/"

# Neo4j settings
export NEO4J_ADDRESS="bolt://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"

# PostgreSQL settings
export POSTGRES_ADDRESS="localhost:5433"
export POSTGRES_USERNAME="postgres"
export POSTGRES_PASSWORD="password"
export POSTGRES_DATABASE="discogsography"
```

βš™οΈ Configuration

πŸ”§ Environment Variables

All configuration is managed through environment variables. Copy .env.example to .env:

cp .env.example .env

Core Settings

Variable Description Default Used By
AMQP_CONNECTION RabbitMQ URL amqp://guest:guest@localhost:5672/ All services
DISCOGS_ROOT Data storage path /discogs-data Python/Rust Extractors
PERIODIC_CHECK_DAYS Update check interval 15 Python/Rust Extractors
PYTHON_VERSION Python version for builds 3.13 Docker, CI/CD

Database Connections

| Variable | Description | Default | Used By |
| --- | --- | --- | --- |
| NEO4J_ADDRESS | Neo4j Bolt URL | bolt://localhost:7687 | Graphinator, Dashboard, Discovery |
| NEO4J_USERNAME | Neo4j username | neo4j | Graphinator, Dashboard, Discovery |
| NEO4J_PASSWORD | Neo4j password | Required | Graphinator, Dashboard, Discovery |
| POSTGRES_ADDRESS | PostgreSQL host:port | localhost:5432 | Tableinator, Dashboard, Discovery |
| POSTGRES_USERNAME | PostgreSQL username | postgres | Tableinator, Dashboard, Discovery |
| POSTGRES_PASSWORD | PostgreSQL password | Required | Tableinator, Dashboard, Discovery |
| POSTGRES_DATABASE | Database name | discogsography | Tableinator, Dashboard, Discovery |

Consumer Management Settings

| Variable | Description | Default | Used By |
| --- | --- | --- | --- |
| CONSUMER_CANCEL_DELAY | Seconds before canceling idle consumers after file completion | 300 (5 min) | Graphinator, Tableinator |
| QUEUE_CHECK_INTERVAL | Seconds between queue checks when all consumers are idle | 3600 (1 hr) | Graphinator, Tableinator |

📝 Note: The consumer management system implements smart connection lifecycle management:

  • Automatic Idle Detection: When all consumers complete processing, RabbitMQ connections are automatically closed to conserve resources
  • Periodic Queue Checking: Every QUEUE_CHECK_INTERVAL seconds, the service briefly connects to check for new messages in all queues
  • Auto-Reconnection: When new messages are detected, connections are re-established and consumers restart automatically
  • Silent When Idle: Progress logging stops when all queues are complete to reduce log noise

This ensures efficient resource usage while maintaining automatic responsiveness to new data.
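
As a rough sketch of that periodic-check loop (aio-pika API; the queue names are assumed from the four data types, and the caller restarting consumers is implied, not shown):

```python
import asyncio

import aio_pika

QUEUE_CHECK_INTERVAL = 3600  # seconds, matching the env default above
QUEUES = ["artists", "labels", "masters", "releases"]  # assumed queue names


async def wait_for_new_messages(amqp_url: str) -> None:
    """While idle, briefly connect every interval and look for queued messages."""
    while True:
        await asyncio.sleep(QUEUE_CHECK_INTERVAL)
        connection = await aio_pika.connect_robust(amqp_url)
        try:
            channel = await connection.channel()
            for name in QUEUES:
                # passive=True only inspects the queue; it never creates it
                queue = await channel.declare_queue(name, passive=True)
                if queue.declaration_result.message_count > 0:
                    return  # new data found: caller re-establishes consumers
        finally:
            await connection.close()
```

When this coroutine returns, the service would reconnect and restart its consumers, matching the auto-reconnection behavior described above.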

Discovery Service & ML Configuration

| Variable | Description | Default | Used By |
| --- | --- | --- | --- |
| REDIS_URL | Redis cache connection URL | redis://localhost:6379/0 | Discovery, Dashboard |
| HF_HOME | Hugging Face models cache directory | /models/huggingface | Discovery |
| SENTENCE_TRANSFORMERS_HOME | Sentence transformers cache directory | /models/sentence-transformers | Discovery |
| EMBEDDINGS_CACHE_DIR | Embeddings cache directory | /tmp/embeddings_cache | Discovery |
| XDG_CACHE_HOME | General cache directory | /tmp/.cache | Discovery |

📝 Note: The Discovery service uses several cache directories for ML models and embeddings:

  • HF_HOME: Primary cache for Hugging Face transformers models (replaces the deprecated TRANSFORMERS_CACHE)
  • SENTENCE_TRANSFORMERS_HOME: Specific cache for sentence transformer models
  • EMBEDDINGS_CACHE_DIR: Configurable cache for generated embeddings
  • All cache directories must be writable by the service user (UID 1000)
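
For illustration, a minimal query-cache pattern over REDIS_URL using redis-py; the key scheme, TTL, and `run_query` stub are assumptions, not the service's actual cache layout:

```python
import hashlib
import json
import os

import redis

r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))


def run_query(sql: str) -> list:
    """Stub standing in for a real database call."""
    return []


def cached_query(sql: str, ttl: int = 300) -> list:
    """Return a cached result for `sql`, computing and storing it on a miss."""
    key = "query:" + hashlib.sha256(sql.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip the database entirely
    result = run_query(sql)
    r.setex(key, ttl, json.dumps(result))  # expire after `ttl` seconds
    return result
```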

💿 Dataset Scale

| Data Type | Record Count | XML Size | Processing Time |
| --- | --- | --- | --- |
| 📀 Releases | ~15 million | ~40GB | 1-3 hours |
| 🎤 Artists | ~2 million | ~5GB | 15-30 mins |
| 🎵 Masters | ~2 million | ~3GB | 10-20 mins |
| 🏢 Labels | ~1.5 million | ~2GB | 10-15 mins |

📊 Total: ~20 million records • 50GB compressed • 100GB processed

💡 Usage Examples

Once your data is loaded, explore the music universe through powerful queries and AI-driven insights.

🔗 Neo4j Graph Queries

Navigate the interconnected world of music with Cypher queries:

Find all albums by an artist

```cypher
MATCH (a:Artist {name: "Pink Floyd"})-[:BY]-(r:Release)
RETURN r.title, r.year
ORDER BY r.year
LIMIT 10
```

Discover band members

```cypher
MATCH (member:Artist)-[:MEMBER_OF]->(band:Artist {name: "The Beatles"})
RETURN member.name, member.real_name
```

Explore label catalogs

```cypher
MATCH (r:Release)-[:ON]->(l:Label {name: "Blue Note"})
WHERE r.year >= 1950 AND r.year <= 1970
RETURN r.title, r.artist, r.year
ORDER BY r.year
```

Find artist collaborations

```cypher
MATCH (a1:Artist {name: "Miles Davis"})-[:COLLABORATED_WITH]-(a2:Artist)
RETURN DISTINCT a2.name
ORDER BY a2.name
```
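
These queries can also be run from Python with the official neo4j driver; a hedged sketch, with credentials matching the Docker Compose defaults above:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "discogsography"))

query = """
MATCH (a:Artist {name: $name})-[:BY]-(r:Release)
RETURN r.title AS title, r.year AS year
ORDER BY r.year LIMIT 10
"""

with driver.session() as session:
    # $name is bound as a query parameter rather than interpolated into the string
    for record in session.run(query, name="Pink Floyd"):
        print(record["year"], record["title"])

driver.close()
```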

🐘 PostgreSQL Queries

Fast structured queries on denormalized data:

Full-text search releases

```sql
SELECT
    data->>'title' AS title,
    data->>'artist' AS artist,
    data->>'year' AS year
FROM releases
WHERE data->>'title' ILIKE '%dark side%'
ORDER BY (data->>'year')::int DESC
LIMIT 10;
```

Artist discography

```sql
SELECT
    data->>'title' AS title,
    data->>'year' AS year,
    data->'genres' AS genres
FROM releases
WHERE data->>'artist' = 'Miles Davis'
  AND (data->>'year')::int BETWEEN 1950 AND 1960
ORDER BY (data->>'year')::int;
```

Genre statistics

```sql
SELECT
    genre,
    COUNT(*) AS release_count,
    MIN((data->>'year')::int) AS first_release,
    MAX((data->>'year')::int) AS last_release
FROM releases,
     jsonb_array_elements_text(data->'genres') AS genre
GROUP BY genre
ORDER BY release_count DESC
LIMIT 20;
```
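
From Python, the same JSONB queries run through psycopg (version 3) with parameterized queries, as the security section recommends; connection details follow the Docker Compose defaults above:

```python
import psycopg

conninfo = "host=localhost port=5433 dbname=discogsography user=discogsography password=discogsography"

with psycopg.connect(conninfo) as conn:
    with conn.cursor() as cur:
        # %s placeholders keep user input out of the SQL string itself
        cur.execute(
            """
            SELECT data->>'title', data->>'year'
            FROM releases
            WHERE data->>'artist' = %s
            ORDER BY (data->>'year')::int
            LIMIT 10
            """,
            ("Miles Davis",),
        )
        for title, year in cur.fetchall():
            print(year, title)
```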

📈 Monitoring & Operations

📊 Dashboard

Access the real-time monitoring dashboard at http://localhost:8003:

  • Service Health: Live status of all microservices
  • Queue Metrics: Message rates, depths, and consumer counts
  • Database Stats: Connection pools and storage usage
  • Activity Log: Recent system events and processing updates
  • WebSocket Updates: Real-time data without page refresh
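
If you want the dashboard's live feed outside the browser, here is a small client sketch using the websockets library; the /ws path is a hypothetical endpoint, so adjust it to whatever the dashboard actually exposes:

```python
import asyncio

import websockets


async def tail_dashboard() -> None:
    # Hypothetical WebSocket path on the dashboard service
    async with websockets.connect("ws://localhost:8003/ws") as ws:
        while True:
            message = await ws.recv()  # dashboard pushes updates as they happen
            print(message)


asyncio.run(tail_dashboard())
```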

πŸ” Debug Utilities

Monitor and debug your system with built-in tools:

# Check service logs for errors
uv run task check-errors

# Monitor RabbitMQ queues in real-time
uv run task monitor

# Comprehensive system health dashboard
uv run task system-monitor

# View logs for all services
uv run task logs

📊 Metrics

Each service provides detailed telemetry:

  • Processing Rates: Records/second for each data type
  • Queue Health: Depth, consumer count, throughput
  • Error Tracking: Failed messages, retry counts
  • Performance: Processing time, memory usage
  • Stall Detection: Alerts when processing stops

👨‍💻 Development

🛠️ Modern Python Stack

The project leverages cutting-edge Python tooling:

| Tool | Purpose | Configuration |
| --- | --- | --- |
| uv | 10-100x faster package management | pyproject.toml |
| ruff | Lightning-fast linting & formatting | pyproject.toml |
| mypy | Strict static type checking | pyproject.toml |
| bandit | Security vulnerability scanning | pyproject.toml |
| pre-commit | Git hooks for code quality | .pre-commit-config.yaml |

🧪 Testing

Comprehensive test coverage with multiple test types:

```bash
# Run all tests (excluding E2E)
uv run task test

# Run with coverage report
uv run task test-cov

# Run specific test suites
uv run pytest tests/extractor/      # Extractor tests (Python)
uv run pytest tests/graphinator/    # Graphinator tests
uv run pytest tests/tableinator/    # Tableinator tests
uv run pytest tests/dashboard/      # Dashboard tests
```

🎭 E2E Testing with Playwright

```bash
# One-time browser setup
uv run playwright install chromium
uv run playwright install-deps chromium

# Run E2E tests (automatic server management)
uv run task test-e2e

# Run with a specific browser
uv run pytest tests/dashboard/test_dashboard_ui.py -m e2e --browser firefox
```

🔧 Development Workflow

```bash
# Set up the development environment
uv sync --all-extras
uv run task init  # Install pre-commit hooks

# Before committing
just lint            # Run linting
just format          # Format code
uv run task test     # Run tests
just security        # Security scan

# Or run everything at once
uv run pre-commit run --all-files
```

πŸ“ Project Structure

discogsography/
β”œβ”€β”€ πŸ“¦ common/              # Shared utilities and configuration
β”‚   β”œβ”€β”€ config.py           # Centralized configuration management
β”‚   └── health_server.py    # Health check endpoint server
β”œβ”€β”€ πŸ“Š dashboard/           # Real-time monitoring dashboard
β”‚   β”œβ”€β”€ dashboard.py        # FastAPI backend with WebSocket
β”‚   └── static/             # Frontend HTML/CSS/JS
β”œβ”€β”€ πŸ“₯ extractor/           # Data extraction services
β”‚   β”œβ”€β”€ pyextractor/        # Python-based Discogs data ingestion
β”‚   β”‚   β”œβ”€β”€ extractor.py    # Main processing logic
β”‚   β”‚   └── discogs.py      # S3 download and validation
β”‚   └── rustextractor/      # Rust-based high-performance extractor
β”‚       β”œβ”€β”€ src/
β”‚       β”‚   └── main.rs     # Rust processing logic
β”‚       └── Cargo.toml      # Rust dependencies
β”œβ”€β”€ πŸ”— graphinator/         # Neo4j graph database service
β”‚   └── graphinator.py      # Graph relationship builder
β”œβ”€β”€ 🐘 tableinator/         # PostgreSQL storage service
β”‚   └── tableinator.py      # Relational data management
β”œβ”€β”€ πŸ”§ utilities/           # Operational tools
β”‚   β”œβ”€β”€ check_errors.py     # Log analysis
β”‚   β”œβ”€β”€ monitor_queues.py   # Real-time queue monitoring
β”‚   └── system_monitor.py   # System health dashboard
β”œβ”€β”€ πŸ§ͺ tests/               # Comprehensive test suite
β”œβ”€β”€ πŸ“ docs/                # Additional documentation
β”œβ”€β”€ πŸ‹ docker-compose.yml   # Container orchestration
└── πŸ“¦ pyproject.toml       # Project configuration

Logging Conventions

All logger calls (logger.info, logger.warning, logger.error) in this project follow a consistent emoji pattern for visual clarity. Each message starts with an emoji followed by exactly one space before the message text.

Emoji Key

| Emoji | Usage | Example |
| --- | --- | --- |
| 🚀 | Startup messages | logger.info("🚀 Starting service...") |
| ✅ | Success/completion messages | logger.info("✅ Operation completed successfully") |
| ❌ | Errors | logger.error("❌ Failed to connect to database") |
| ⚠️ | Warnings | logger.warning("⚠️ Connection timeout, retrying...") |
| 🛑 | Shutdown/stop messages | logger.info("🛑 Shutting down gracefully") |
| 📊 | Progress/statistics | logger.info("📊 Processed 1000 records") |
| 📥 | Downloads | logger.info("📥 Starting download of data") |
| ⬇️ | Downloading files | logger.info("⬇️ Downloading file.xml") |
| 🔄 | Processing operations | logger.info("🔄 Processing batch of messages") |
| ⏳ | Waiting/pending | logger.info("⏳ Waiting for messages...") |
| 📋 | Metadata operations | logger.info("📋 Loaded metadata from cache") |
| 🔍 | Checking/searching | logger.info("🔍 Checking for updates...") |
| 📄 | File operations | logger.info("📄 File created successfully") |
| 🆕 | New versions | logger.info("🆕 Found newer version available") |
| ⏰ | Periodic operations | logger.info("⏰ Running periodic check") |
| 🔧 | Setup/configuration | logger.info("🔧 Creating database indexes") |
| 🐰 | RabbitMQ connections | logger.info("🐰 Connected to RabbitMQ") |
| 🔗 | Neo4j connections | logger.info("🔗 Connected to Neo4j") |
| 🐘 | PostgreSQL operations | logger.info("🐘 Connected to PostgreSQL") |
| 💾 | Database save operations | logger.info("💾 Updated artist ID=123 in Neo4j") |
| 🏥 | Health server | logger.info("🏥 Health server started on port 8001") |
| ⏩ | Skipping operations | logger.info("⏩ Skipped artist ID=123 (no changes)") |

Example Usage

```python
logger.info("🚀 Starting Discogs data extractor")
logger.error("❌ Failed to connect to Neo4j: connection refused")
logger.warning("⚠️ Slow consumer detected, processing delayed")
logger.info("✅ All files processed successfully")
```

🗄️ Data Schema

🔗 Neo4j Graph Model

The graph database models complex music industry relationships:

Node Types

| Node | Description | Key Properties |
| --- | --- | --- |
| Artist | Musicians, bands, producers | id, name, real_name, profile |
| Label | Record labels and imprints | id, name, profile, parent_label |
| Master | Master recordings | id, title, year, main_release |
| Release | Physical/digital releases | id, title, year, country, format |
| Genre | Musical genres | name |
| Style | Sub-genres and styles | name |

Relationships

```text
🎤 Artist Relationships:
├── MEMBER_OF ──────→ Artist (band membership)
├── ALIAS_OF ───────→ Artist (alternative names)
├── COLLABORATED_WITH → Artist (collaborations)
└── PERFORMED_ON ───→ Release (credits)

📀 Release Relationships:
├── BY ────────────→ Artist (performer credits)
├── ON ────────────→ Label (release label)
├── DERIVED_FROM ──→ Master (master recording)
├── IS ────────────→ Genre (genre classification)
└── IS ────────────→ Style (style classification)

🏢 Label Relationships:
└── SUBLABEL_OF ───→ Label (parent/child labels)

🎵 Classification:
└── Style -[:PART_OF]→ Genre (hierarchy)
```

🐘 PostgreSQL Schema

Optimized for fast queries and full-text search:

```sql
-- Artists table with JSONB for a flexible schema
CREATE TABLE artists (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_artists_name ON artists ((data->>'name'));
CREATE INDEX idx_artists_gin ON artists USING GIN (data);

-- Labels table
CREATE TABLE labels (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_labels_name ON labels ((data->>'name'));

-- Masters table
CREATE TABLE masters (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_masters_title ON masters ((data->>'title'));
CREATE INDEX idx_masters_year ON masters ((data->>'year'));

-- Releases table with extensive indexing
CREATE TABLE releases (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_releases_title ON releases ((data->>'title'));
CREATE INDEX idx_releases_artist ON releases ((data->>'artist'));
CREATE INDEX idx_releases_year ON releases ((data->>'year'));
CREATE INDEX idx_releases_gin ON releases USING GIN (data);
```

⚡ Performance & Optimization

📊 Processing Speed

Typical processing rates on modern hardware:

| Service | Records/Second | Bottleneck |
| --- | --- | --- |
| 📥 Python Extractor | 5,000-10,000 | XML parsing, I/O |
| ⚡ Rust Extractor | 20,000-400,000+ | Network I/O |
| 🔗 Graphinator | 1,000-2,000 | Neo4j transactions |
| 🐘 Tableinator | 3,000-5,000 | PostgreSQL inserts |

💻 Hardware Requirements

Minimum Specifications

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 200GB HDD
  • Network: 10 Mbps

Recommended Specifications

  • CPU: 8+ cores
  • RAM: 16GB+
  • Storage: 200GB+ SSD (NVMe preferred)
  • Network: 100 Mbps+

🚀 Optimization Guide

Database Tuning

Neo4j Configuration:

```properties
# neo4j.conf
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g
```

PostgreSQL Configuration:

```properties
# postgresql.conf
shared_buffers = 4GB
work_mem = 256MB
maintenance_work_mem = 1GB
effective_cache_size = 12GB
```

Message Queue Optimization

```yaml
# RabbitMQ prefetch for consumers
PREFETCH_COUNT: 100  # Adjust based on processing speed
```
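
In the Python consumers this maps to a channel QoS setting; a hedged sketch with aio-pika (reading PREFETCH_COUNT from the environment is illustrative):

```python
import asyncio
import os

import aio_pika


async def main() -> None:
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost:5672/")
    channel = await connection.channel()
    # Cap unacknowledged deliveries per consumer; higher values raise
    # throughput but hold more messages in memory if processing lags.
    await channel.set_qos(prefetch_count=int(os.environ.get("PREFETCH_COUNT", "100")))
    await connection.close()


asyncio.run(main())
```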

Storage Performance

  • Use SSD/NVMe for /discogs-data directory
  • Enable compression for PostgreSQL tables
  • Configure Neo4j for SSD optimization
  • Use separate disks for databases if possible

🔧 Troubleshooting

❌ Common Issues & Solutions

Python/Rust Extractor Download Failures

```bash
# Check connectivity
curl -I https://discogs-data-dumps.s3.us-west-2.amazonaws.com

# Verify disk space
df -h /discogs-data

# Check permissions
ls -la /discogs-data
```

Solutions:

  • ✅ Ensure internet connectivity
  • ✅ Verify 100GB+ free space
  • ✅ Check directory permissions

RabbitMQ Connection Issues

```bash
# Check RabbitMQ status
docker-compose ps rabbitmq
docker-compose logs rabbitmq

# Test connection
curl -u discogsography:discogsography http://localhost:15672/api/overview
```

Solutions:

  • ✅ Wait for RabbitMQ startup (30-60s)
  • ✅ Check firewall settings
  • ✅ Verify credentials in .env

Database Connection Errors

Neo4j:

```bash
# Check Neo4j status
docker-compose logs neo4j
curl http://localhost:7474

# Test the Bolt connection
echo "MATCH (n) RETURN count(n);" | cypher-shell -u neo4j -p discogsography
```

PostgreSQL:

```bash
# Check PostgreSQL status
docker-compose logs postgres

# Test connection
PGPASSWORD=discogsography psql -h localhost -U discogsography -d discogsography -c "SELECT 1;"
```

πŸ› Debugging Guide

  1. πŸ“‹ Check Service Health

    curl http://localhost:8000/health  # Python/Rust Extractor
    curl http://localhost:8001/health  # Graphinator
    curl http://localhost:8002/health  # Tableinator
    curl http://localhost:8003/health  # Dashboard
    curl http://localhost:8004/health  # Discovery
  2. πŸ“Š Monitor Real-time Logs

    # All services
    uv run task logs
    
    # Specific service
    docker-compose logs -f extractor-python  # For Python Extractor
    docker-compose logs -f extractor-rust    # For Rust Extractor
  3. πŸ” Analyze Errors

    # Check for errors across all services
    uv run task check-errors
    
    # Monitor queue health
    uv run task monitor
  4. πŸ—„οΈ Verify Data Storage

    -- Neo4j: Check node counts
    MATCH (n) RETURN labels(n)[0] as type, count(n) as count;
    -- PostgreSQL: Check table counts
    SELECT 'artists' as table_name, COUNT(*) FROM artists
    UNION ALL
    SELECT 'releases', COUNT(*) FROM releases
    UNION ALL
    SELECT 'labels', COUNT(*) FROM labels
    UNION ALL
    SELECT 'masters', COUNT(*) FROM masters;
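
The step-1 health checks are easy to script; a small sketch with requests, using the service-to-port mapping from the curl commands above:

```python
import requests

# Ports as listed in step 1 above
SERVICES = {
    "extractor": 8000,
    "graphinator": 8001,
    "tableinator": 8002,
    "dashboard": 8003,
    "discovery": 8004,
}

for name, port in SERVICES.items():
    try:
        resp = requests.get(f"http://localhost:{port}/health", timeout=5)
        status = "✅" if resp.ok else f"❌ HTTP {resp.status_code}"
    except requests.ConnectionError:
        status = "❌ unreachable"
    print(f"{name:12s} {status}")
```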

🤝 Contributing

We welcome contributions! Please follow these guidelines:

📋 Contribution Process

  1. Fork & Clone

```bash
git clone https://github.com/YOUR_USERNAME/discogsography.git
cd discogsography
```

  2. Set Up the Development Environment

```bash
uv sync --all-extras
uv run task init  # Install pre-commit hooks
```

  3. Create a Feature Branch

```bash
git checkout -b feature/amazing-feature
```

  4. Make Changes

    • Write clean, documented code
    • Add comprehensive tests
    • Update relevant documentation

  5. Validate Changes

```bash
just lint      # Fix any linting issues
just test      # Ensure tests pass
just security  # Check for vulnerabilities
```

  6. Commit with Conventional Commits

```bash
git commit -m "feat: add amazing feature"
# Types: feat, fix, docs, style, refactor, test, chore
```

  7. Push & Create a PR

```bash
git push origin feature/amazing-feature
```

πŸ“ Development Standards

  • Code Style: Follow ruff and black formatting
  • Type Hints: Required for all functions
  • Tests: Maintain >80% coverage
  • Docs: Update README and docstrings
  • Logging: Use emoji conventions (see above)
  • Security: Pass bandit checks

🔧 Maintenance

Package Upgrades

Keep dependencies up-to-date with the provided upgrade script:

```bash
# Safely upgrade all dependencies (minor/patch versions)
./scripts/upgrade-packages.sh

# Preview what would be upgraded
./scripts/upgrade-packages.sh --dry-run

# Include major version upgrades
./scripts/upgrade-packages.sh --major
```

The script includes:

  • 🔒 Automatic backups before upgrades
  • ✅ Git safety checks (requires a clean working directory)
  • 🧪 Automatic testing after upgrades
  • 📦 Comprehensive dependency management across all services

See scripts/README.md for more maintenance scripts.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • 🎵 Discogs for providing the monthly data dumps
  • 🐍 The Python community for excellent libraries and tools
  • 🌟 All contributors who help improve this project
  • 🚀 uv for blazing-fast package management
  • 🔥 Ruff for lightning-fast linting

💬 Support & Community

Get Help

Documentation

Project Status

This project is actively maintained. We welcome contributions, bug reports, and feature requests!


Made with ❤️ by the Discogsography community
