A comprehensive Databricks-based pipeline for ingesting, processing, and analyzing Common Crawl web data to identify how U.S. Census Bureau data is being used across the internet. This project distinguishes between content that properly cites Census Bureau sources versus content that repackages Census data without proper attribution.
This project addresses a critical data governance question: How is U.S. Census Bureau data being used across the web, and is it being properly attributed?
The pipeline:
- Ingests Common Crawl metadata and web content from AWS S3
- Processes web text to identify Census Bureau-related content
- Classifies content as either properly citing Census sources or repackaging data without attribution
- Analyzes usage patterns through n-gram analysis and domain investigation
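To make the cite-vs-repackage distinction concrete, here is a deliberately simplified, dictionary-style heuristic. It is not the project's classifier (the actual models live in `analysis/classification_model.py` and use Spark MLlib), and the phrase lists below are invented examples:

```python
# Illustrative heuristic for the cite-vs-repackage distinction, NOT the
# project's actual classifier. Phrase lists here are invented examples.
CENSUS_TERMS = ["american community survey", "decennial census", "median household income"]
CITATION_MARKERS = ["source: u.s. census bureau", "census.gov", "according to the census bureau"]

def classify_page(text: str) -> str:
    """Label a page as 'cites', 'repackages', or 'unrelated'."""
    lowered = text.lower()
    mentions_census_data = any(term in lowered for term in CENSUS_TERMS)
    if not mentions_census_data:
        return "unrelated"
    has_attribution = any(marker in lowered for marker in CITATION_MARKERS)
    return "cites" if has_attribution else "repackages"

print(classify_page("Median income rose 4% (Source: U.S. Census Bureau, American Community Survey)."))
# -> cites
```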
```
common_crawl/
├── cc_segement_ingestion/        # Common Crawl metadata ingestion pipeline
│   ├── src/                      # Ingestion notebooks and helper functions
│   ├── databricks.yml            # Databricks Asset Bundle configuration
│   └── pyproject.toml            # Python dependencies
│
├── nlp_pipeline/                 # NLP processing pipeline for WET files
│   ├── src/                      # Pipeline notebooks and logic
│   ├── resources/                # Job configurations
│   ├── databricks.yml            # Databricks Asset Bundle configuration
│   └── pyproject.toml            # Python dependencies
│
├── analysis/                     # Classification and analysis notebooks
│   ├── classification_model.py   # ML models for cite vs. repackage classification
│   ├── dictionary.py             # Census Bureau terminology dictionaries
│   ├── ngram_analysis.py         # N-gram pattern analysis
│   └── sites_of_interest.py      # Domain-specific investigations
│
└── power_bi/                     # Power BI report
    └── census_repackaged.pbix    # Main dashboard for Census repackage data analysis
```
The project follows a medallion architecture (Bronze → Silver → Gold) in Databricks Unity Catalog:

- **Bronze Layer (Raw Data)**
  - `raw_master_crawls`: Common Crawl master index metadata
  - `raw_all_crawls`: All crawl file metadata (S3 object info)
  - `sample_size_raw`: Sampled WET crawl metadata
- **Silver Layer (Cleaned Data)**
  - `cleaned_master_crawls_2025`: Filtered 2025 master indexes
  - `census_product_cleaned`: Filtered and processed Census-related web content
- **Gold Layer (Enriched Analytics)**
  - `census_repackaged_enriched`: Classified content (cites vs. repackages)
  - `unigrams_repackaged`: N-gram analysis results
  - `nlp_pipeline_summary`: Pipeline execution summaries
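As a concrete illustration of one hop through these layers, the sketch below reads a bronze table and writes a silver table in Unity Catalog. The catalog and table names come from the layout above, but the filter column (`crawl_date`) is an assumption, not the actual schema:

```python
# Minimal sketch of the Bronze -> Silver hop in Unity Catalog.
# The crawl_date filter column is illustrative, not the real schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.table("census_bureau_capstone.bronze.raw_master_crawls")
silver = bronze.filter(F.col("crawl_date") >= "2025-01-01")

(silver.write
    .mode("overwrite")
    .saveAsTable("census_bureau_capstone.silver.cleaned_master_crawls_2025"))
```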
- **Databricks Workspace**: Access to https://dbc-a1e27fcd-fa88.cloud.databricks.com
- **Unity Catalog**: Access to the `census_bureau_capstone` catalog (or create your own)
- **AWS Access**: S3 read access to the `commoncrawl` bucket and a destination bucket for WET files
- **Databricks CLI**: For deployment (optional; the UI can be used instead)
- **Python 3.10-3.13**: With the `uv` package manager
This project requires several Databricks components to be configured. For detailed setup instructions, refer to the official Databricks documentation:
- **Configure Secrets**: Create a secret scope named `aws_cc` with AWS credentials (see the sketch after this list)
  - Databricks Secrets Documentation
  - Required keys: `aws_access_key_id`, `aws_secret_access_key`
- **Set Up Unity Catalog**: Create the catalog and schemas (bronze, silver, gold)
  - Unity Catalog Setup Guide
  - Target catalog: `census_bureau_capstone`
- **Configure S3 Access**: Set up a cluster IAM role or access keys for S3
  - AWS S3 Access with Databricks
  - Source: `s3://commoncrawl/` (public, read-only)
  - Destination: Your S3 bucket for WET file storage
- **Install Databricks Extension** (for IDE development)
  - Databricks Extension for VS Code
  - Or use the Databricks UI for deployment
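Once the `aws_cc` scope exists, a quick notebook-side sanity check might look like the following. This assumes the access-key route rather than an instance profile; `dbutils`, `spark`, and `display` are the Databricks notebook built-ins:

```python
# Read AWS credentials from the aws_cc secret scope and hand them to
# Spark for s3a:// access. A cluster IAM role needs none of this.
access_key = dbutils.secrets.get(scope="aws_cc", key="aws_access_key_id")
secret_key = dbutils.secrets.get(scope="aws_cc", key="aws_secret_access_key")

spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)

# Sanity check: list a public Common Crawl prefix.
display(dbutils.fs.ls("s3a://commoncrawl/crawl-data/"))
```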
This project uses Databricks Asset Bundles for deployment. Each component (cc_segement_ingestion and nlp_pipeline) can be deployed independently.
Option 1: Using Databricks UI (Recommended)
- Open the Deployments panel (🚀 rocket icon) in Databricks
- Select your target environment (dev or prod)
- Click Deploy
Option 2: Using Databricks CLI
```bash
# Deploy to development (default); run each command from the repo root
(cd cc_segement_ingestion && databricks bundle deploy)
(cd nlp_pipeline && databricks bundle deploy)

# Deploy to production
(cd cc_segement_ingestion && databricks bundle deploy -t prod)
(cd nlp_pipeline && databricks bundle deploy -t prod)
```

- Development (`dev`): Resources prefixed with `[dev username]`, schedules paused
- Production (`prod`): Full resource names, schedules active, production permissions
For detailed deployment instructions, see the README in each component directory.
1. Common Crawl Ingestion (cc_segement_ingestion/)
- Ingests Common Crawl metadata from AWS S3
- Notebooks: `master_index.py`, `batch_2025_crawls.py`, `batch_last_5_years_crawls.py`, `incremental_ingestion_crawls.py`
- See the cc_segement_ingestion README for detailed usage
2. NLP Pipeline (nlp_pipeline/)
- Processes WET files to extract and filter Census-related content (see the WET-reading sketch after this list)
- Runs automatically every 3 hours in production
- See nlp_pipeline README for detailed usage
3. Analysis (analysis/)
- Classification models, n-gram analysis, and domain investigations
- See analysis README for detailed usage
4. Power BI (power_bi/)
- Power BI report summarizing insights
- See the power_bi README for detailed usage
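For a sense of what the NLP pipeline consumes, the sketch below iterates over records in a WET file. It uses `warcio`, an illustrative library choice rather than a dependency named in this README, and a local placeholder filename stands in for a downloaded segment:

```python
# Read text records from a Common Crawl WET file and flag pages that
# mention the Census Bureau. warcio is an illustrative choice; the
# filename is a placeholder for a downloaded segment.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET records carry extracted text
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            if "census bureau" in text.lower():
                print(url)
```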
After deployment, jobs can be run via:
- Databricks UI: Workflows → Jobs → Run
- Deployments Panel: Click job name → Run
- Scheduled: Jobs run automatically based on configured schedules
For detailed job management, see the Jobs and Workflows documentation linked in the resources list below.
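Jobs can also be triggered programmatically. A minimal sketch with the Databricks Python SDK, assuming an authenticated CLI profile and a hypothetical job name:

```python
# Trigger a deployed job by name via the Databricks SDK. The job name
# is hypothetical; authentication comes from your CLI profile or
# environment variables.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
job = next(w.jobs.list(name="nlp_pipeline_job"))  # hypothetical name
run = w.jobs.run_now(job_id=job.job_id).result()  # blocks until the run finishes
print(f"Run ended with state: {run.state.result_state}")
```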
Both pipelines are configured via `databricks.yml` files with two targets:
- Development (`dev`): Development mode with prefixed resources and paused schedules
- Production (`prod`): Production mode with active schedules and full permissions
Workspace: https://dbc-a1e27fcd-fa88.cloud.databricks.com was the workspace used for this project; to reuse the bundles, point them at your own workspace.
For configuration details, see the `databricks.yml` file in each component directory.
- Bucket: `s3://commoncrawl/`
- Region: `us-east-1`
- Data Types:
  - WARC: Web ARChive format (raw web content)
  - WAT: Web Archive Transformation (metadata)
  - WET: Web Extract Text (extracted text content)
This project primarily uses:
- Master indexes for crawl discovery
- WET files for text extraction and analysis
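For exploration outside Databricks, the public bucket can be listed anonymously with boto3. A minimal sketch; the crawl ID follows Common Crawl's real naming pattern, but substitute whichever crawl you need:

```python
# List a few objects for one crawl in the public commoncrawl bucket,
# using unsigned (anonymous) access. The crawl ID is illustrative.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2025-05/segments/",
    MaxKeys=5,
)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```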
For local development, use Databricks Connect:
- Databricks Connect Documentation
- Recommended version: `databricks-connect>=15.4,<15.5` (compatible with the cluster runtime)
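A minimal connectivity check with Databricks Connect, assuming a configured profile and using a gold table from the medallion layout above:

```python
# Smoke test for Databricks Connect: attach to the remote cluster and
# count rows in one of the project's gold tables. Assumes your
# ~/.databrickscfg profile (or env vars) is already configured.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
df = spark.table("census_bureau_capstone.gold.census_repackaged_enriched")
print(f"Rows: {df.count()}")
```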
- Databricks: Cloud data platform and compute
- PySpark: Distributed data processing
- Unity Catalog: Data governance and cataloging
- Databricks Asset Bundles: Infrastructure as code for Databricks
- boto3: AWS S3 integration
- FastText: Language detection
- Spark MLlib: Machine learning models
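As one example from this stack, FastText language detection is typically run with the pretrained `lid.176.bin` model; whether the pipeline uses exactly that model file is an assumption:

```python
# Detect the language of extracted web text with FastText.
# lid.176.bin is the standard pretrained language-ID model (download
# from fasttext.cc first); its use here is illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")
labels, probs = model.predict("The Census Bureau released new population estimates.")
print(labels[0], round(float(probs[0]), 3))  # e.g. __label__en 0.98
```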
Monitor jobs via:
- Databricks UI: Workflows → Jobs → View runs
- Deployments Panel: Click job name → View run history
- Email Notifications: Configured for job failures
For common issues and solutions, refer to:
- cc_segement_ingestion README - Detailed ingestion pipeline docs
- nlp_pipeline README - Detailed NLP pipeline docs
- analysis README - Analysis notebook documentation
- Databricks Asset Bundles
- Asset Bundles in Workspace
- Unity Catalog
- Secrets Management
- Jobs and Workflows
- AWS S3 Integration
When adding new features:
- Ingestion Pipeline: Add helper functions to `cc_segement_ingestion/src/cc_segement_ingestion/helpers.py`
- NLP Pipeline: Add logic to `nlp_pipeline/src/pipeline_logic/`
- Analysis: Add notebooks to the `analysis/` directory
- Update relevant README files with new functionality
For issues or questions:
- Check individual component READMEs for specific guidance
- Review Databricks job logs for error details
- Consult Databricks and Common Crawl documentation
Last Updated: 12/4/2025
Maintainer: kl1147@georgetown.edu