Database Media Optimization and Compression (DBMOC)

This project connects to a PostgreSQL database, reads media file URLs from the defined schemas, downloads the images from S3-compatible object storage, compresses and converts them to WebP format, then re-uploads the results and updates the corresponding database records with the new URLs.

Project Structure

src/
├── aws/              # AWS S3 operations (download, upload, compression)
│   ├── __init__.py   # Main AWS module with S3 client and image processing
│   └── aws.py        # Legacy code (can be removed)
├── config/           # Configuration directory
│   └── .env          # Environment variables (create this file)
├── db/               # Database connection and session management
│   ├── __init__.py
│   └── db.py         # Database initialization and session factory
├── schemas/          # SQLAlchemy ORM models
│   ├── __init__.py   # Base model and data fetching utilities
│   ├── events.py     # Event model (cover, poster fields)
│   ├── posts.py      # Post model (media_file field)
│   └── profiles.py   # AccountProfile model (avatar, banner fields)
├── main.py           # Main script that orchestrates the processing
└── pyproject.toml    # Project dependencies

Features

  • Database Integration: Connects to PostgreSQL database using SQLAlchemy
  • Schema Support: Processes media files from multiple schemas:
    • AccountProfile: avatar and banner images
    • Post: media_file images (filtered by media_type='image')
    • Event: cover and poster images
  • S3 Integration: Downloads and uploads files from S3-compatible object storage
  • Image Compression: Converts images to WebP format with configurable quality
  • Automatic Updates: Updates database records with new WebP file URLs
  • Error Handling: Robust error handling with rollback on failures
  • Statistics: Provides processing statistics for each schema

Installation

  1. Install dependencies using uv (or your preferred package manager):

    cd src
    uv sync
  2. Create a .env file in the config/ directory with the variables described in the next section.

Environment Variables

Create a config/.env file with the following required variables:

# Database Configuration
DATABASE_URL=postgresql://user:password@localhost:5432/dbname

# AWS S3 Configuration
AWS_STORAGE_BUCKET_NAME=your-bucket-name
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_S3_ENDPOINT_URL=https://s3.amazonaws.com
AWS_SERVICE_NAME=s3

# Image Processing Configuration
WEBP_QUALITY=80

Environment Variable Descriptions

  • DATABASE_URL: PostgreSQL connection string
  • AWS_STORAGE_BUCKET_NAME: Name of your S3 bucket
  • AWS_ACCESS_KEY_ID: S3 access key ID
  • AWS_SECRET_ACCESS_KEY: S3 secret access key
  • AWS_S3_ENDPOINT_URL: S3 endpoint URL (default: https://s3.amazonaws.com)
    • For AWS S3: https://s3.amazonaws.com or https://s3.region.amazonaws.com
    • For DigitalOcean Spaces: https://region.digitaloceanspaces.com
    • For MinIO: http://localhost:9000
  • AWS_SERVICE_NAME: S3 service name (default: s3)
  • WEBP_QUALITY: WebP compression quality (0-100, default: 80)
    • Lower values = smaller files but lower quality
    • Higher values = better quality but larger files
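
The src/ modules read these values through python-dotenv. A minimal sketch of that loading step (the path config/.env is relative to where the script is run; variable names match the list above):

    # sketch: load config/.env into the environment with python-dotenv
    import os
    from dotenv import load_dotenv

    load_dotenv("config/.env")          # reads KEY=value pairs into os.environ

    DATABASE_URL = os.environ["DATABASE_URL"]
    WEBP_QUALITY = int(os.getenv("WEBP_QUALITY", "80"))   # documented default is 80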

Usage

Run the main script:

cd src
python main.py

The script will:

  1. Connect to the database
  2. Fetch records from all schemas (AccountProfile, Post, Event)
  3. For each media URL:
    • Download the image from S3
    • Compress and convert to WebP format
    • Upload the compressed image to S3
    • Update the database record with the new URL
  4. Print processing statistics

How It Works

1. Database Connection

  • The db/db.py module initializes a SQLAlchemy engine and session factory
  • Uses connection pooling for efficient database access
  • Loads database URL from config/.env
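
A minimal sketch of such a setup (names like SessionLocal are illustrative; the actual code lives in db/db.py):

    # sketch of a pooled SQLAlchemy engine and session factory, configured from config/.env
    import os
    from dotenv import load_dotenv
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    load_dotenv("config/.env")

    engine = create_engine(
        os.environ["DATABASE_URL"],
        pool_size=5,            # keep a small pool of reusable connections
        pool_pre_ping=True,     # check connections are alive before handing them out
    )
    SessionLocal = sessionmaker(bind=engine)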

2. Schema Data Fetching

  • The schemas/__init__.py module provides a fetch_db_data() function
  • Fetches records with specific fields and optional filter conditions
  • Supports loading only required fields for efficiency
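
Conceptually, the query it issues for the Post schema would look something like this (a sketch assuming the Post model shown under Database Schemas below and the SessionLocal factory sketched above; the real fetch_db_data() signature may differ):

    # conceptual equivalent of fetching only the required Post fields with a filter
    from sqlalchemy import select

    stmt = select(Post.id, Post.media_file).where(Post.media_type == "image")
    with SessionLocal() as session:
        rows = session.execute(stmt).all()   # list of (id, media_file) tuples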

3. Image Processing Pipeline

  • URL Extraction: Extracts S3 key from full URL (supports multiple URL formats)
  • Download: Downloads image from S3 as bytes
  • Compression: Converts image to WebP format with configurable quality
  • Upload: Uploads compressed image to S3
  • Database Update: Updates database record with new URL
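
The download and upload steps go through boto3 using the credentials from config/.env. A rough sketch (function names here are illustrative, not the project's exact API):

    # sketch of the S3 download/upload steps with boto3
    import os
    import boto3
    from dotenv import load_dotenv

    load_dotenv("config/.env")

    s3 = boto3.client(
        os.getenv("AWS_SERVICE_NAME", "s3"),
        endpoint_url=os.getenv("AWS_S3_ENDPOINT_URL", "https://s3.amazonaws.com"),
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    BUCKET = os.environ["AWS_STORAGE_BUCKET_NAME"]

    def download_bytes(key: str) -> bytes:
        """Fetch an object from S3 and return its raw bytes."""
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

    def upload_bytes(key: str, data: bytes) -> None:
        """Write bytes back to S3 under the given key."""
        s3.put_object(Bucket=BUCKET, Key=key, Body=data, ContentType="image/webp")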

4. S3 URL Formats Supported

  • Virtual-hosted-style: https://bucket.s3.amazonaws.com/key
  • Path-style: https://s3.amazonaws.com/bucket/key
  • Custom endpoints: https://endpoint/bucket/key or https://bucket.endpoint/key
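
As a simplified illustration of the key extraction (not the project's exact logic):

    # simplified sketch: extract the S3 object key from a full URL
    from urllib.parse import urlparse

    def extract_key(url: str, bucket: str) -> str:
        parsed = urlparse(url)
        path = parsed.path.lstrip("/")
        if parsed.netloc.startswith(f"{bucket}."):
            return path                        # virtual-hosted-style: key is the whole path
        if path.startswith(f"{bucket}/"):
            return path[len(bucket) + 1:]      # path-style: strip the leading bucket name
        return path                            # fallback for other custom-endpoint layouts

    print(extract_key("https://my-bucket.s3.amazonaws.com/media/avatar.png", "my-bucket"))
    # -> media/avatar.png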

Database Schemas

AccountProfile (accounts_profile)

  • id: Integer (primary key)
  • avatar: String(255) - Avatar image URL
  • banner: String(255) - Banner image URL

Post (post_postmedia)

  • id: Integer (primary key)
  • media_type: String(20) - Media type (filtered for 'image')
  • media_file: String(255) - Media file URL

Event (event_event)

  • id: Integer (primary key)
  • cover: String(255) - Cover image URL
  • poster: String(255) - Poster image URL
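
For reference, a hedged sketch of how one of these models could be declared with SQLAlchemy's declarative base (the actual models live in src/schemas/):

    # sketch of the AccountProfile model, mirroring the fields listed above
    from sqlalchemy import Column, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class AccountProfile(Base):
        __tablename__ = "accounts_profile"

        id = Column(Integer, primary_key=True)
        avatar = Column(String(255))   # avatar image URL
        banner = Column(String(255))   # banner image URL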

Image Processing

WebP Compression

  • Preserves transparency when present (RGBA mode)
  • Converts images to RGB for non-transparent images
  • Configurable quality setting (0-100)
  • Uses method 6 (best compression) for WebP encoding
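
A small Pillow sketch of that conversion (the project's helper may differ in naming and details):

    # sketch of WebP conversion with Pillow, preserving transparency when present
    import io
    from PIL import Image

    def compress_to_webp(data: bytes, quality: int = 80) -> bytes:
        img = Image.open(io.BytesIO(data))
        if img.mode in ("RGBA", "LA", "P"):
            img = img.convert("RGBA")    # keep the alpha channel
        else:
            img = img.convert("RGB")
        out = io.BytesIO()
        img.save(out, format="WEBP", quality=quality, method=6)  # method=6: slowest, best compression
        return out.getvalue()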

File Replacement Strategy

  • By default, replaces original file with .webp extension
  • Original file extension is removed
  • If replace_original=False, creates a new file with .webp appended
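
An illustrative key rewrite along those lines:

    # illustrative rewrite of the object key for the compressed file
    import os.path

    def webp_key(key: str, replace_original: bool = True) -> str:
        if replace_original:
            root, _ext = os.path.splitext(key)
            return root + ".webp"    # media/avatar.png -> media/avatar.webp
        return key + ".webp"         # media/avatar.png -> media/avatar.png.webp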

Error Handling

  • Database errors: Rolls back transaction on failure
  • S3 errors: Logs error and continues with next file
  • Image processing errors: Logs error and skips file
  • Statistics track processed, failed, and skipped files
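
A hedged sketch of the per-file handling (exception types, counters, and the helper name are illustrative):

    # illustrative per-file error handling with rollback on database failure
    from botocore.exceptions import ClientError
    from sqlalchemy.exc import SQLAlchemyError

    stats = {"processed": 0, "failed": 0, "skipped": 0}

    def process_one(session, record, field):
        try:
            ...                          # download, compress, upload, update the record
            session.commit()
            stats["processed"] += 1
        except SQLAlchemyError:
            session.rollback()           # undo the partial database change
            stats["failed"] += 1
        except (ClientError, OSError):
            stats["failed"] += 1         # S3 or image error: log and continue with the next file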

Statistics

The script provides statistics for each schema:

  • processed: Number of successfully processed files
  • failed: Number of files that failed to process
  • skipped: Number of files that were skipped (e.g., already WebP, no URL)

Dependencies

  • boto3: AWS S3 client
  • sqlalchemy: Database ORM
  • psycopg2: PostgreSQL adapter
  • pillow: Image processing
  • python-dotenv: Environment variable management

Notes

  • The script processes files sequentially (not in parallel) to avoid overwhelming the database and S3 service
  • Each file is committed to the database immediately after processing
  • Files that are already in WebP format are skipped
  • The script handles various image formats (JPEG, PNG, GIF, BMP, TIFF, etc.)

Troubleshooting

Database Connection Issues

  • Verify DATABASE_URL is correct
  • Check database server is running
  • Verify network connectivity

S3 Connection Issues

  • Verify AWS credentials are correct
  • Check AWS_S3_ENDPOINT_URL matches your S3 provider
  • Verify bucket name is correct
  • Check network connectivity to S3 endpoint

Image Processing Issues

  • Verify image files are valid
  • Check file permissions
  • Verify Pillow supports the image format

URL Parsing Issues

  • Check URL format matches supported formats
  • Verify bucket name in URL matches AWS_STORAGE_BUCKET_NAME
  • For custom endpoints, ensure URL structure is correct

About

A simple Python project using SQLAlchemy and Boto3 that reads media files from object storage, compresses them, writes them back to object storage, and updates the corresponding database records.
