A comprehensive Databricks-based pipeline for ingesting, processing, and analyzing Common Crawl web data to identify how U.S. Census Bureau data is being used across the internet. This project distinguishes between content that properly cites Census Bureau sources versus content that repackages Census data without proper attribution.
This project addresses a critical data governance question: How is U.S. Census Bureau data being used across the web, and is it being properly attributed?
The pipeline:
- Ingests Common Crawl metadata and web content from AWS S3
- Processes web text to identify Census Bureau-related content
- Classifies content as either properly citing Census sources or repackaging data without attribution
- Analyzes usage patterns through n-gram analysis and domain investigation
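To make the cite-vs-repackage distinction concrete, here is a deliberately simplified, dictionary-style heuristic. It is not the project's classifier (the actual models live in `analysis/classification_model.py` and use Spark MLlib), and the phrase lists below are invented examples:

```python
# Illustrative heuristic for the cite-vs-repackage distinction, NOT the
# project's actual classifier. Phrase lists here are invented examples.
CENSUS_TERMS = ["american community survey", "decennial census", "median household income"]
CITATION_MARKERS = ["source: u.s. census bureau", "census.gov", "according to the census bureau"]

def classify_page(text: str) -> str:
    """Label a page as 'cites', 'repackages', or 'unrelated'."""
    lowered = text.lower()
    mentions_census_data = any(term in lowered for term in CENSUS_TERMS)
    if not mentions_census_data:
        return "unrelated"
    has_attribution = any(marker in lowered for marker in CITATION_MARKERS)
    return "cites" if has_attribution else "repackages"

print(classify_page("Median income rose 4% (Source: U.S. Census Bureau, American Community Survey)."))
# -> cites
```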
```
common_crawl/
├── cc_segement_ingestion/        # Common Crawl metadata ingestion pipeline
│   ├── src/                      # Ingestion notebooks and helper functions
│   ├── databricks.yml            # Databricks Asset Bundle configuration
│   └── pyproject.toml            # Python dependencies
│
├── nlp_pipeline/                 # NLP processing pipeline for WET files
│   ├── src/                      # Pipeline notebooks and logic
│   ├── resources/                # Job configurations
│   ├── databricks.yml            # Databricks Asset Bundle configuration
│   └── pyproject.toml            # Python dependencies
│
├── analysis/                     # Classification and analysis notebooks
│   ├── classification_model.py   # ML models for cite vs. repackage classification
│   ├── dictionary.py             # Census Bureau terminology dictionaries
│   ├── ngram_analysis.py         # N-gram pattern analysis
│   └── sites_of_interest.py      # Domain-specific investigations
│
└── power_bi/                     # Power BI report
    └── census_repackaged.pbix    # Main dashboard for Census repackage data analysis
```
The project follows a medallion architecture (Bronze → Silver → Gold) in Databricks Unity Catalog:

- **Bronze Layer (Raw Data)**
  - `raw_master_crawls`: Common Crawl master index metadata
  - `raw_all_crawls`: All crawl file metadata (S3 object info)
  - `sample_size_raw`: Sampled WET crawl metadata
- **Silver Layer (Cleaned Data)**
  - `cleaned_master_crawls_2025`: Filtered 2025 master indexes
  - `census_product_cleaned`: Filtered and processed Census-related web content
- **Gold Layer (Enriched Analytics)**
  - `census_repackaged_enriched`: Classified content (cites vs. repackages)
  - `unigrams_repackaged`: N-gram analysis results
  - `nlp_pipeline_summary`: Pipeline execution summaries
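As a concrete illustration of one hop through these layers, the sketch below reads a bronze table and writes a silver table in Unity Catalog. The catalog and table names come from the layout above, but the filter column (`crawl_date`) is an assumption, not the actual schema:

```python
# Minimal sketch of the Bronze -> Silver hop in Unity Catalog.
# The crawl_date filter column is illustrative, not the real schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.table("census_bureau_capstone.bronze.raw_master_crawls")
silver = bronze.filter(F.col("crawl_date") >= "2025-01-01")

(silver.write
    .mode("overwrite")
    .saveAsTable("census_bureau_capstone.silver.cleaned_master_crawls_2025"))
```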
- **Databricks Workspace**: Access to https://dbc-a1e27fcd-fa88.cloud.databricks.com
- **Unity Catalog**: Access to the `census_bureau_capstone` catalog (or create your own)
- **AWS Access**: S3 read access to the `commoncrawl` bucket and a destination bucket for WET files
- **Databricks CLI**: For deployment (optional; the UI can be used instead)
- **Python 3.10-3.13**: With the `uv` package manager
This project requires several Databricks components to be configured. For detailed setup instructions, refer to the official Databricks documentation:
- **Configure Secrets**: Create a secret scope named `aws_cc` with AWS credentials (see the sketch after this list)
  - Databricks Secrets Documentation
  - Required keys: `aws_access_key_id`, `aws_secret_access_key`
- **Set Up Unity Catalog**: Create the catalog and schemas (bronze, silver, gold)
  - Unity Catalog Setup Guide
  - Target catalog: `census_bureau_capstone`
- **Configure S3 Access**: Set up a cluster IAM role or access keys for S3
  - AWS S3 Access with Databricks
  - Source: `s3://commoncrawl/` (public, read-only)
  - Destination: Your S3 bucket for WET file storage
- **Install Databricks Extension** (for IDE development)
  - Databricks Extension for VS Code
  - Or use the Databricks UI for deployment
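Once the `aws_cc` scope exists, a quick notebook-side sanity check might look like the following. This assumes the access-key route rather than an instance profile; `dbutils`, `spark`, and `display` are the Databricks notebook built-ins:

```python
# Read AWS credentials from the aws_cc secret scope and hand them to
# Spark for s3a:// access. A cluster IAM role needs none of this.
access_key = dbutils.secrets.get(scope="aws_cc", key="aws_access_key_id")
secret_key = dbutils.secrets.get(scope="aws_cc", key="aws_secret_access_key")

spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)

# Sanity check: list a public Common Crawl prefix.
display(dbutils.fs.ls("s3a://commoncrawl/crawl-data/"))
```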
This project uses Databricks Asset Bundles for deployment. Each component (cc_segement_ingestion and nlp_pipeline) can be deployed independently.
Option 1: Using Databricks UI (Recommended)
- Open the Deployments panel (🚀 rocket icon) in Databricks
- Select your target environment (dev or prod)
- Click Deploy
Option 2: Using Databricks CLI
```bash
# Deploy to development (default); run each command from the repo root
(cd cc_segement_ingestion && databricks bundle deploy)
(cd nlp_pipeline && databricks bundle deploy)

# Deploy to production
(cd cc_segement_ingestion && databricks bundle deploy -t prod)
(cd nlp_pipeline && databricks bundle deploy -t prod)
```

- Development (`dev`): Resources prefixed with `[dev username]`, schedules paused
- Production (`prod`): Full resource names, schedules active, production permissions
For detailed deployment instructions, see the README in each component directory.
1. Common Crawl Ingestion (cc_segement_ingestion/)
- Ingests Common Crawl metadata from AWS S3
- Notebooks: `master_index.py`, `batch_2025_crawls.py`, `batch_last_5_years_crawls.py`, `incremental_ingestion_crawls.py`
- See the cc_segement_ingestion README for detailed usage
2. NLP Pipeline (nlp_pipeline/)
- Processes WET files to extract and filter Census-related content (see the WET-reading sketch after this list)
- Runs automatically every 3 hours in production
- See nlp_pipeline README for detailed usage
3. Analysis (analysis/)
- Classification models, n-gram analysis, and domain investigations
- See analysis README for detailed usage
4. Power BI (power_bi/)
- Power BI report summarizing insights
- See the power_bi README for detailed usage
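For a sense of what the NLP pipeline consumes, the sketch below iterates over records in a WET file. It uses `warcio`, an illustrative library choice rather than a dependency named in this README, and a local placeholder filename stands in for a downloaded segment:

```python
# Read text records from a Common Crawl WET file and flag pages that
# mention the Census Bureau. warcio is an illustrative choice; the
# filename is a placeholder for a downloaded segment.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET records carry extracted text
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            if "census bureau" in text.lower():
                print(url)
```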
After deployment, jobs can be run via:
- Databricks UI: Workflows → Jobs → Run
- Deployments Panel: Click job name → Run
- Scheduled: Jobs run automatically based on configured schedules
For detailed job management, see the Jobs and Workflows documentation linked in the resources list below.
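Jobs can also be triggered programmatically. A minimal sketch with the Databricks Python SDK, assuming an authenticated CLI profile and a hypothetical job name:

```python
# Trigger a deployed job by name via the Databricks SDK. The job name
# is hypothetical; authentication comes from your CLI profile or
# environment variables.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
job = next(w.jobs.list(name="nlp_pipeline_job"))  # hypothetical name
run = w.jobs.run_now(job_id=job.job_id).result()  # blocks until the run finishes
print(f"Run ended with state: {run.state.result_state}")
```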
Both pipelines are configured via `databricks.yml` files with two targets:
- Development (`dev`): Development mode with prefixed resources and paused schedules
- Production (`prod`): Production mode with active schedules and full permissions
Workspace: https://dbc-a1e27fcd-fa88.cloud.databricks.com was the workspace used for this project; to reuse the bundles, point them at your own workspace.
For configuration details, see the `databricks.yml` file in each component directory.
- Bucket: `s3://commoncrawl/`
- Region: `us-east-1`
- Data Types:
  - WARC: Web ARChive format (raw web content)
  - WAT: Web Archive Transformation (metadata)
  - WET: Web Extract Text (extracted text content)
This project primarily uses:
- Master indexes for crawl discovery
- WET files for text extraction and analysis
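For exploration outside Databricks, the public bucket can be listed anonymously with boto3. A minimal sketch; the crawl ID follows Common Crawl's real naming pattern, but substitute whichever crawl you need:

```python
# List a few objects for one crawl in the public commoncrawl bucket,
# using unsigned (anonymous) access. The crawl ID is illustrative.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2025-05/segments/",
    MaxKeys=5,
)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```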
For local development, use Databricks Connect:
- Databricks Connect Documentation
- Recommended version: `databricks-connect>=15.4,<15.5` (compatible with the cluster runtime)
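A minimal connectivity check with Databricks Connect, assuming a configured profile and using a gold table from the medallion layout above:

```python
# Smoke test for Databricks Connect: attach to the remote cluster and
# count rows in one of the project's gold tables. Assumes your
# ~/.databrickscfg profile (or env vars) is already configured.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
df = spark.table("census_bureau_capstone.gold.census_repackaged_enriched")
print(f"Rows: {df.count()}")
```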
- Databricks: Cloud data platform and compute
- PySpark: Distributed data processing
- Unity Catalog: Data governance and cataloging
- Databricks Asset Bundles: Infrastructure as code for Databricks
- boto3: AWS S3 integration
- FastText: Language detection
- Spark MLlib: Machine learning models
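As one example from this stack, FastText language detection is typically run with the pretrained `lid.176.bin` model; whether the pipeline uses exactly that model file is an assumption:

```python
# Detect the language of extracted web text with FastText.
# lid.176.bin is the standard pretrained language-ID model (download
# from fasttext.cc first); its use here is illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")
labels, probs = model.predict("The Census Bureau released new population estimates.")
print(labels[0], round(float(probs[0]), 3))  # e.g. __label__en 0.98
```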
Monitor jobs via:
- Databricks UI: Workflows → Jobs → View runs
- Deployments Panel: Click job name → View run history
- Email Notifications: Configured for job failures
For common issues and solutions, refer to:
- cc_segement_ingestion README - Detailed ingestion pipeline docs
- nlp_pipeline README - Detailed NLP pipeline docs
- analysis README - Analysis notebook documentation
- Databricks Asset Bundles
- Asset Bundles in Workspace
- Unity Catalog
- Secrets Management
- Jobs and Workflows
- AWS S3 Integration
When adding new features:
- Ingestion Pipeline: Add helper functions to `cc_segement_ingestion/src/cc_segement_ingestion/helpers.py`
- NLP Pipeline: Add logic to `nlp_pipeline/src/pipeline_logic/`
- Analysis: Add notebooks to the `analysis/` directory
- Update relevant README files with new functionality
For issues or questions:
- Check individual component READMEs for specific guidance
- Review Databricks job logs for error details
- Consult Databricks and Common Crawl documentation
Last Updated: 12/4/2025
Maintainer: kl1147@georgetown.edu