
Common Crawl Census Bureau Data Usage Analysis

A comprehensive Databricks-based pipeline for ingesting, processing, and analyzing Common Crawl web data to identify how U.S. Census Bureau data is being used across the internet. This project distinguishes between content that properly cites Census Bureau sources versus content that repackages Census data without proper attribution.

🎯 Project Purpose

This project addresses a critical data governance question: How is U.S. Census Bureau data being used across the web, and is it being properly attributed?

The pipeline:

  • Ingests Common Crawl metadata and web content from AWS S3
  • Processes web text to identify Census Bureau-related content
  • Classifies content as either properly citing Census sources or repackaging data without attribution (a minimal rule-based sketch follows this list)
  • Analyzes usage patterns through n-gram analysis and domain investigation
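
The classification step above is, at its simplest, a dictionary lookup. A minimal rule-based sketch, assuming a `text` column on the silver table; the phrases below are illustrative placeholders, and the project's real terminology lists and models live in analysis/dictionary.py and analysis/classification_model.py:

from pyspark.sql import functions as F

# Assumed source table and column name; `spark` is the session Databricks notebooks provide.
pages = spark.table("census_bureau_capstone.silver.census_product_cleaned")

# Illustrative citation phrases only; see analysis/dictionary.py for the real dictionaries.
citation_phrases = ["source: united states census bureau", "according to the census bureau"]
cites = F.lower(F.col("text")).rlike("|".join(citation_phrases))

# Pages that use Census data but never cite it are treated as "repackaged".
labeled = pages.withColumn("usage_label", F.when(cites, "cited").otherwise("repackaged"))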

πŸ“ Project Structure

common_crawl/
├── cc_segement_ingestion/      # Common Crawl metadata ingestion pipeline
│   ├── src/                    # Ingestion notebooks and helper functions
│   ├── databricks.yml          # Databricks Asset Bundle configuration
│   └── pyproject.toml          # Python dependencies
│
├── nlp_pipeline/               # NLP processing pipeline for WET files
│   ├── src/                    # Pipeline notebooks and logic
│   ├── resources/              # Job configurations
│   ├── databricks.yml          # Databricks Asset Bundle configuration
│   └── pyproject.toml          # Python dependencies
│
├── analysis/                   # Classification and analysis notebooks
│   ├── classification_model.py # ML models for cite vs. repackage classification
│   ├── dictionary.py           # Census Bureau terminology dictionaries
│   ├── ngram_analysis.py       # N-gram pattern analysis
│   └── sites_of_interest.py    # Domain-specific investigations
│
└── power_bi/                   # Power BI report
    └── census_repackaged.pbix  # Main dashboard for Census repackage data analysis

πŸ—οΈ Architecture Overview

The project follows a medallion architecture (Bronze → Silver → Gold) in Databricks Unity Catalog:

Data Flow

  1. Bronze Layer (Raw Data)

    • raw_master_crawls: Common Crawl master index metadata
    • raw_all_crawls: All crawl file metadata (S3 object info)
    • sample_size_raw: Sampled WET crawl metadata
  2. Silver Layer (Cleaned Data)

    • cleaned_master_crawls_2025: Filtered 2025 master indexes
    • census_product_cleaned: Filtered and processed Census-related web content
  3. Gold Layer (Enriched Analytics)

    • census_repackaged_enriched: Classified content (cites vs. repackages)
    • unigrams_repackaged: N-gram analysis results
    • nlp_pipeline_summary: Pipeline execution summaries
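
A hedged sketch of one Bronze-to-Silver hop under this layout; the crawl_id column and the filter condition are assumptions, and `spark` is the session Databricks notebooks provide:

from pyspark.sql import functions as F

raw = spark.table("census_bureau_capstone.bronze.raw_master_crawls")

# Assumed column name: keep only 2025 crawl entries and drop duplicate rows.
cleaned = raw.filter(F.col("crawl_id").startswith("CC-MAIN-2025")).dropDuplicates()

(cleaned.write
    .mode("overwrite")
    .saveAsTable("census_bureau_capstone.silver.cleaned_master_crawls_2025"))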

🚀 Getting Started

Prerequisites

  • Databricks Workspace: Access to https://dbc-a1e27fcd-fa88.cloud.databricks.com (or your own workspace; see Configuration)
  • Unity Catalog: Access to census_bureau_capstone catalog (or create your own)
  • AWS Access: S3 read access to commoncrawl bucket and destination bucket for WET files
  • Databricks CLI: For deployment (optional - can use UI instead)
  • Python 3.10-3.13: With uv package manager

Databricks Setup

This project requires several Databricks components to be configured. For detailed setup instructions, refer to the official Databricks documentation (https://docs.databricks.com/).

Required Setup Steps

  1. Configure Secrets: Create a secret scope named aws_cc with AWS credentials

  2. Set Up Unity Catalog: Create catalog and schemas (bronze, silver, gold)

  3. Configure S3 Access: Set up cluster IAM role or access keys for S3

  4. Install Databricks Extension (for IDE development)
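
A sketch of steps 1 and 2 using the Databricks SDK for Python: the secret key names are assumptions, the scope, catalog, and schema names follow this README, and creating the catalog this way assumes your metastore has a default storage root.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from the environment or ~/.databrickscfg

# Step 1: secret scope with AWS credentials (key names are assumptions, not the pipeline's).
w.secrets.create_scope(scope="aws_cc")
w.secrets.put_secret(scope="aws_cc", key="aws_access_key_id", string_value="<your-key-id>")
w.secrets.put_secret(scope="aws_cc", key="aws_secret_access_key", string_value="<your-secret>")

# Step 2: Unity Catalog catalog and medallion schemas.
w.catalogs.create(name="census_bureau_capstone")
for schema in ("bronze", "silver", "gold"):
    w.schemas.create(name=schema, catalog_name="census_bureau_capstone")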

📦 Deployment

This project uses Databricks Asset Bundles for deployment. Each component (cc_segement_ingestion and nlp_pipeline) can be deployed independently.

Deployment Methods

Option 1: Using Databricks UI (Recommended)

  1. Open the Deployments panel (🚀 rocket icon) in Databricks
  2. Select your target environment (dev or prod)
  3. Click Deploy

Option 2: Using Databricks CLI

# Deploy to development (default); run from the repository root
(cd cc_segement_ingestion && databricks bundle deploy)
(cd nlp_pipeline && databricks bundle deploy)

# Deploy to production
(cd cc_segement_ingestion && databricks bundle deploy -t prod)
(cd nlp_pipeline && databricks bundle deploy -t prod)

Deployment Targets

  • Development (dev): Resources prefixed with [dev username], schedules paused
  • Production (prod): Full resource names, schedules active, production permissions

For detailed deployment instructions, see the READMEs in cc_segement_ingestion/ and nlp_pipeline/ and the Databricks Asset Bundles documentation.

🔄 Usage

Pipeline Components

1. Common Crawl Ingestion (cc_segement_ingestion/)

  • Ingests Common Crawl metadata from AWS S3
  • Notebooks: master_index.py, batch_2025_crawls.py, batch_last_5_years_crawls.py, incremental_ingestion_crawls.py
  • See cc_segement_ingestion README for detailed usage
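
For orientation, a hedged sketch of the kind of S3 object listing this component performs. The prefix is a placeholder based on Common Crawl's published crawl-data/<CRAWL-ID>/ layout; credentials come from the cluster IAM role or the aws_cc secret scope:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2025-",  # placeholder prefix
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])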

2. NLP Pipeline (nlp_pipeline/)

  • Processes WET files to extract and filter Census-related content
  • Runs automatically every 3 hours in production
  • See nlp_pipeline README for detailed usage
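
WET records can be read with a generic WARC parser such as warcio; this is an illustration only (the pipeline's own parsing lives in nlp_pipeline/src/pipeline_logic/), and the object key is a placeholder:

import io

import boto3
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client("s3", region_name="us-east-1")
obj = s3.get_object(Bucket="commoncrawl", Key="crawl-data/<CRAWL-ID>/segments/<...>.warc.wet.gz")  # placeholder

for record in ArchiveIterator(io.BytesIO(obj["Body"].read())):
    if record.rec_type == "conversion":  # WET text records
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        # ...filter `text` for Census-related content here...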

3. Analysis (analysis/)

  • Classification models, n-gram analysis, and domain investigations
  • See analysis README for detailed usage
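
An illustrative Spark MLlib pipeline in the spirit of classification_model.py; the input table, column names, and model choice are assumptions:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Assumed: a labeled DataFrame with a `text` column and a 0/1 `label` (cited vs. repackaged).
train_df = spark.table("census_bureau_capstone.gold.census_repackaged_enriched")

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)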

4. Power BI (power_bi/)

  • Contains census_repackaged.pbix, the main dashboard for exploring the classified (cited vs. repackaged) Census content

Running Jobs

After deployment, jobs can be run via:

  • Databricks UI: Workflows → Jobs → Run
  • Deployments Panel: Click job name → Run
  • Scheduled: Jobs run automatically based on configured schedules
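
Jobs can also be triggered programmatically with the Databricks SDK for Python; the name filter below is illustrative:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for job in w.jobs.list():
    if "nlp_pipeline" in (job.settings.name or ""):  # illustrative name match
        w.jobs.run_now(job_id=job.job_id)
        print(f"Triggered {job.settings.name}")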

For detailed job management, see the Databricks Jobs documentation.

🔧 Configuration

Bundle Configuration

Both pipelines are configured via databricks.yml files with two targets:

  • Development (dev): Development mode with prefixed resources and paused schedules
  • Production (prod): Production mode with active schedules and full permissions

Workspace: The bundles currently target https://dbc-a1e27fcd-fa88.cloud.databricks.com; to use them elsewhere, update the workspace host in each databricks.yml to point at your own workspace.

For configuration details, see the databricks.yml file in each pipeline and the Databricks Asset Bundles documentation.

📊 Data Sources

Common Crawl

  • Bucket: s3://commoncrawl/
  • Region: us-east-1
  • Data Types:
    • WARC: Web ARChive format (raw web content)
    • WAT: Web Archive Transformation (metadata)
    • WET: Web Extract Text (extracted text content)

This project primarily uses:

  • Master indexes for crawl discovery
  • WET files for text extraction and analysis
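
Each crawl also publishes a wet.paths.gz listing of its WET files (part of Common Crawl's documented layout); a hedged discovery sketch with a placeholder crawl ID:

import gzip
import io

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
obj = s3.get_object(Bucket="commoncrawl", Key="crawl-data/<CRAWL-ID>/wet.paths.gz")  # placeholder

with gzip.open(io.BytesIO(obj["Body"].read()), "rt") as fh:
    wet_keys = [line.strip() for line in fh]

print(f"{len(wet_keys)} WET files in this crawl; first: {wet_keys[0]}")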

For more detail on these formats, see the Common Crawl documentation (https://commoncrawl.org/).

Local Development

For local development, use Databricks Connect to run Spark code from your IDE against a remote Databricks cluster.
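
A minimal sketch, assuming the databricks-connect package (matching your cluster's Databricks Runtime) is installed and a connection profile is configured:

from databricks.connect import DatabricksSession

# Uses the connection details in ~/.databrickscfg (or DATABRICKS_* environment variables).
spark = DatabricksSession.builder.getOrCreate()
print(spark.table("census_bureau_capstone.bronze.raw_master_crawls").count())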

📚 Key Technologies

  • Databricks: Cloud data platform and compute
  • PySpark: Distributed data processing
  • Unity Catalog: Data governance and cataloging
  • Databricks Asset Bundles: Infrastructure as code for Databricks
  • boto3: AWS S3 integration
  • FastText: Language detection
  • Spark MLlib: Machine learning models
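
As one example of the FastText piece, language identification with the published lid.176.bin model (downloaded separately; how the pipeline actually wires this in may differ):

import fasttext

model = fasttext.load_model("lid.176.bin")  # pretrained fastText language-ID model
labels, scores = model.predict("Median household income fell, according to Census Bureau data.")
print(labels[0], float(scores[0]))  # e.g. __label__en and its confidence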

πŸ” Monitoring and Troubleshooting

Job Monitoring

Monitor jobs via:

  • Databricks UI: Workflows → Jobs → View runs
  • Deployments Panel: Click job name → View run history
  • Email Notifications: Configured for job failures

Troubleshooting

For common issues and solutions, refer to the individual component READMEs and the Databricks job run logs.

📖 Additional Documentation

  • Databricks documentation: https://docs.databricks.com/
  • Common Crawl documentation: https://commoncrawl.org/

👥 Contributing

When adding new features:

  1. Ingestion Pipeline: Add helper functions to cc_segement_ingestion/src/cc_segement_ingestion/helpers.py
  2. NLP Pipeline: Add logic to nlp_pipeline/src/pipeline_logic/
  3. Analysis: Add notebooks to analysis/ directory
  4. Update relevant README files with new functionality

🆘 Support

For issues or questions:

  • Check individual component READMEs for specific guidance
  • Review Databricks job logs for error details
  • Consult Databricks and Common Crawl documentation

Last Updated: 12/4/2025
Maintainer: kl1147@georgetown.edu
