allgpt-co/googlemapscontacts


US Healthcare Facility Discovery Platform

A complete platform to discover, verify, and export every healthcare facility in the United States — hospitals, clinics, dialysis centers, nursing homes, hospices, and more.

One-click pipeline from the browser. No terminal commands needed.


Quick Start

# 1. Install dependencies
pip install requests beautifulsoup4 pandas openpyxl lxml flask

# 2. Set your Serper API key (free tier at serper.dev includes 2,500 credits)
export SERPER_API_KEY='your_key_here'

# 3. Start the platform
python3 run.py

# 4. Open browser
http://localhost:5000

That's it. Everything runs from the browser.


What It Does

Searches Google Maps + imports government databases to build a complete list of US healthcare facilities with their official websites.

Input: select a state and click "Start Pipeline".
Output: CSV/Excel/JSON with facility name, address, phone, website, type, and confidence score.


How It Works — 8-Step Pipeline

User selects state (e.g., Wyoming) → Clicks "Start Pipeline"
                    │
    ┌───────────────┼───────────────────────────────────────┐
    │                                                       │
    ▼                                                       │
Step 1: CMS Hospital Import (FREE)                         │
  • Downloads 5,426 Medicare-certified hospitals            │
  • Source: data.cms.gov                                    │
  • Data: name, address, phone, CMS ID                     │
  • NO website, NO email                                   │
  • ~35 seconds                                            │
    │                                                       │
    ▼                                                       │
Step 2: Government Datasets (FREE)                          │
  • Dialysis centers (~7,600)                              │
  • Hospice providers (~5,500)                             │
  • Home health agencies (~11,000)                         │
  • Nursing homes (~15,000)                                │
  • Filtered to selected state                             │
  • ~3-8 minutes                                           │
    │                                                       │
    ▼                                                       │
Step 3: NPPES/NPI Import (FREE)                             │
  • Every billing healthcare provider has an NPI            │
  • Catches small practices CMS misses                     │
  • Source: npiregistry.cms.hhs.gov                        │
  • ~1-5 minutes per state                                 │
    │                                                       │
    ▼                                                       │
Step 4: Google Discovery (USES API CREDITS)                 │
  • Searches Google Maps + Web for each city × category    │
  • "hospital in Cheyenne WY", "dental in Casper WY"...   │
  • 5 cities × 26 categories = 130 searches per state      │
  • These results COME WITH websites + coordinates         │
  • ~5-8 minutes per state                                 │
    │                                                       │
    ▼                                                       │
Step 5: Normalize (FREE, ~15 seconds)                       │
  • Cleans names: "ST. JOSEPHS MED CTR LLC"                │
    → "Saint Josephs Medical Center"                        │
  • Formats phones: "2145551234" → "(214) 555-1234"        │
  • Standardizes addresses, URLs, states, ZIPs             │
    │                                                       │
    ▼                                                       │
Step 6: Find Websites (USES API CREDITS)                    │
  • For facilities WITHOUT a website (from Steps 1-3)      │
  • Searches Google 3 times per facility                   │
  • Scores candidates, picks best official website         │
  • Blacklists Yelp, Facebook, Healthgrades, etc.          │
  • ~3 credits per facility                                │
    │                                                       │
    ▼                                                       │
Step 7: Score (FREE, ~5 seconds)                            │
  • Assigns 0.0-1.0 confidence to each facility            │
  • Based on: has name, address, phone, website, NPI, etc. │
    │                                                       │
    ▼                                                       │
Step 8: Export (FREE, ~2 seconds)                           │
  • CSV (open in Excel)                                    │
  • XLSX (formatted Excel)                                 │
  • JSON (for developers)                                  │
    │                                                       │
    ▼                                                       │
  DONE — Dashboard shows results                           │
    └───────────────────────────────────────────────────────┘
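Step 5's cleanup rules can be sketched in a few lines of Python. This is a minimal illustration only: the `ABBREV` map and `SUFFIXES` set below are made up for the example, and the real logic lives in `engines/normalization.py`.

```python
import re

# Minimal sketch of Step 5 normalization. The ABBREV map and SUFFIXES set
# are illustrative; the real rules in engines/normalization.py are far
# more extensive.
ABBREV = {"ST.": "Saint", "MED": "Medical", "CTR": "Center"}
SUFFIXES = {"LLC", "INC", "LP", "PLLC"}

def normalize_name(raw: str) -> str:
    """'ST. JOSEPHS MED CTR LLC' -> 'Saint Josephs Medical Center'"""
    words = [w for w in raw.split() if w.upper().rstrip(".") not in SUFFIXES]
    expanded = [ABBREV.get(w.upper(), w) for w in words]
    return " ".join(w if w in ABBREV.values() else w.title() for w in expanded)

def normalize_phone(raw: str) -> str:
    """'2145551234' -> '(214) 555-1234'"""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop US country code
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}" if len(digits) == 10 else raw
```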

Data Sources

| Source | Type | Records | Has Website? | Cost |
|---|---|---|---|---|
| CMS Hospitals | Federal database | 5,426 | No | FREE |
| CMS Dialysis | Federal database | ~7,600 | No | FREE |
| CMS Hospice | Federal database | ~5,500 | No | FREE |
| CMS Home Health | Federal database | ~11,000 | No | FREE |
| CMS Nursing Homes | Federal database | ~15,000 | No | FREE |
| NPPES/NPI Registry | Federal database | Varies | No | FREE |
| Google Maps (via Serper) | Search API | ~20 per search | Yes | 1 credit/search |
| Google Web (via Serper) | Search API | ~10 per search | Yes | 1 credit/search |
| Website Resolution | Search API | 3 searches/facility | Finds website | 3 credits/facility |

Free tier: 2,500 Serper credits (enough for ~5-6 full state runs)


Web UI Pages

| Page | URL | What it shows |
|---|---|---|
| Dashboard | / | Pipeline controls, metrics, charts |
| Facilities | /facilities | Searchable list with filters (state, type, website) |
| Detail | /facilities/123 | Single facility: all fields, sources, scores |
| Map | /map | All facilities on a map (green = website, red = none) |
| Export | /export | Download CSV/Excel/JSON with state filter |

Pipeline Controls

Full Pipeline

Runs all 8 steps. Select a state, set website limit, click Start.

Find Websites Only

Skips import/discovery. Just searches Google for official websites of facilities already in the database. Use this after the first full run to increase coverage.

Skip Government Data

Checkbox to skip Step 2 (supplemental datasets). Makes the pipeline ~5 minutes faster.


API Credit Usage

| Operation | Credits | Notes |
|---|---|---|
| CMS import | 0 | Free government data |
| Supplemental | 0 | Free government data |
| NPPES import | 0 | Free government data |
| Serper Maps search | 1 | Per search query |
| Serper Web search | 1 | Per search query |
| Website resolution | 3 | Per facility (3 Google searches) |

Per-state estimates:

| State Size | Discovery | Websites (50) | Total |
|---|---|---|---|
| Small (WY, VT) | ~260 credits | ~150 credits | ~410 credits |
| Medium (AL, OR) | ~260 credits | ~150 credits | ~410 credits |
| Large (CA, TX) | ~728 credits | ~150 credits | ~878 credits |

Free tier: 2,500 credits. Paid: $1 per 1,000 credits.
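The per-state estimates follow directly from the credit table: each city × category pair costs one Maps plus one Web search, and each resolved website costs 3 credits. A hypothetical estimator (function and constant names are illustrative, not the repo's API):

```python
# Rough per-state Serper credit estimator, derived from the tables above.
# Assumes Step 4 issues one Maps + one Web search per city x category pair
# and Step 6 spends 3 credits per facility resolved.
CREDITS_PER_CITY_CATEGORY = 2   # 1 Maps search + 1 Web search
CREDITS_PER_WEBSITE = 3         # 3 Google searches per facility

def estimate_credits(num_cities: int, num_categories: int = 26,
                     website_limit: int = 50) -> int:
    discovery = num_cities * num_categories * CREDITS_PER_CITY_CATEGORY
    websites = website_limit * CREDITS_PER_WEBSITE
    return discovery + websites

# Small state, 5 cities:  5 * 26 * 2 + 50 * 3 = 410 credits
# Large state, 14 cities: 14 * 26 * 2 + 50 * 3 = 878 credits
```

At ~410 credits per small state, the 2,500-credit free tier covers roughly five to six full runs, matching the estimate above.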


Output Fields

Each exported record contains:

| Field | Description | Example |
|---|---|---|
| facility_name | Normalized name | Saint Josephs Medical Center |
| facility_type | Classification | Hospital, Clinic, Ambulatory |
| address_1 | Street address | 123 North Main Street Suite 200 |
| city | City | Cheyenne |
| state | 2-letter code | WY |
| zip | 5-digit ZIP | 82001 |
| phone_primary | Formatted phone | (307) 634-2273 |
| website_url | Official website | https://cheyenneregional.org |
| website_domain | Root domain | cheyenneregional.org |
| entity_confidence | Data quality (0-1) | 0.75 |
| website_confidence | Website match (0-1) | 0.92 |
| specialties | Medical specialties | cardiology, orthopedic |
| npi_ids | NPI number | 1234567890 |
| cms_ids | CMS/Medicare ID | 530010 |
| status | Active/closed | active |
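Put together, a single exported JSON record looks roughly like this (values taken from the field examples above; exact key order and typing may differ):

```json
{
  "facility_name": "Saint Josephs Medical Center",
  "facility_type": "Hospital",
  "address_1": "123 North Main Street Suite 200",
  "city": "Cheyenne",
  "state": "WY",
  "zip": "82001",
  "phone_primary": "(307) 634-2273",
  "website_url": "https://cheyenneregional.org",
  "website_domain": "cheyenneregional.org",
  "entity_confidence": 0.75,
  "website_confidence": 0.92,
  "specialties": "cardiology, orthopedic",
  "npi_ids": "1234567890",
  "cms_ids": "530010",
  "status": "active"
}
```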

Website Resolution — How It Works

For each facility without a website:

Facility: "Crenshaw Community Hospital" in Luverne, AL

  Search Google 3 times:
    1. "Crenshaw Community Hospital" Luverne AL official website
    2. Crenshaw Community Hospital Luverne AL
    3. Crenshaw Community Hospital +13343353374

  Collect ~20 candidate URLs from results

  Score each candidate (0 to 1):
    crenshawcommunityhospital.com  → 0.95  ✓ WINNER
    healthgrades.com/hospital/...  → 0.00  ✗ BLACKLISTED
    facebook.com/Crenshaw          → 0.00  ✗ BLACKLISTED
    yelp.com/biz/crenshaw          → 0.00  ✗ BLACKLISTED

  Scoring signals:
    +0.30 domain contains facility name
    +0.20 title contains facility name
    +0.10 snippet mentions city/state
    +0.10 domain is .com or .org
    +0.20 NOT a directory site
    +0.10 position #1 in results
    -0.50 IS a directory (yelp, facebook, etc.)

  Best score > 0.4 → save as official website

Blacklisted domains (never selected as official website): yelp.com, facebook.com, healthgrades.com, zocdoc.com, vitals.com, webmd.com, linkedin.com, instagram.com, twitter.com, youtube.com, yellowpages.com, bbb.org, indeed.com, google.com, wikipedia.org, npidb.org, mapquest.com
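The scoring signals can be sketched as a standalone function. This is illustrative only: the real implementation, weights, and helper names live in `engines/website_resolution.py`, and the `BLACKLIST` here is a subset of the full list above.

```python
# Illustrative sketch of the Step 6 candidate scorer; the real version in
# engines/website_resolution.py differs in names and detail.
BLACKLIST = {"yelp.com", "facebook.com", "healthgrades.com", "zocdoc.com",
             "vitals.com", "webmd.com", "yellowpages.com"}

def score_candidate(domain: str, title: str, snippet: str, position: int,
                    facility: str, city: str, state: str) -> float:
    name_key = facility.lower().replace(" ", "")
    score = 0.0
    if domain in BLACKLIST:
        score -= 0.50                     # IS a directory site
    else:
        score += 0.20                     # NOT a directory site
    if name_key in domain.replace("-", ""):
        score += 0.30                     # domain contains facility name
    if facility.lower() in title.lower():
        score += 0.20                     # title contains facility name
    if city.lower() in snippet.lower() or state.lower() in snippet.lower().split():
        score += 0.10                     # snippet mentions city/state
    if domain.endswith((".com", ".org")):
        score += 0.10                     # preferred TLD
    if position == 1:
        score += 0.10                     # top search result
    return max(0.0, min(1.0, score))      # clamp to [0, 1]

# A candidate is saved as the official website only if its best score > 0.4.
```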


Project Structure

serper_scraper/
├── run.py                      ← START HERE
├── database.py                 ← SQLite database (11 tables)
├── config.py                   ← API key + settings
│
├── connectors/                 ← Data source connectors
│   ├── serper_connector.py     ← Google Maps/Web via Serper API
│   ├── cms_connector.py        ← CMS hospital data (free)
│   ├── nppes_connector.py      ← NPI registry (free)
│   └── supplemental.py         ← Dialysis/hospice/nursing/home health
│
├── engines/                    ← Processing engines
│   ├── geo_discovery.py        ← Orchestrates all imports + discovery
│   ├── website_resolution.py   ← Finds official websites via Google
│   ├── normalization.py        ← Cleans names/phones/addresses
│   ├── scoring.py              ← Confidence scoring (0-1)
│   ├── deduplication.py        ← Finds and merges duplicates
│   └── ...                     ← 15+ more engines
│
├── web_ui/                     ← Flask web interface
│   ├── app.py                  ← Routes + pipeline runner
│   └── templates/
│       ├── index.html          ← Dashboard + pipeline controls
│       ├── facilities.html     ← Searchable facility list
│       ├── detail.html         ← Single facility detail
│       ├── map.html            ← Map view (Leaflet.js)
│       └── export.html         ← Export page
│
├── export/
│   └── exporter.py             ← CSV/XLSX/JSON generation
│
├── exports/                    ← Generated export files go here
│
└── healthcare_providers.db     ← SQLite database (auto-created)

Database Schema

| Table | Purpose | Key Fields |
|---|---|---|
| facilities | Main entity table | name, address, phone, website, type, confidence |
| organizations | Health systems/chains | name, type, location_count |
| source_records | Raw data lineage | source_type, raw_name, raw_payload |
| crawl_results | Website crawl data | url, status_code, extracted_json |
| website_candidates | Website scoring | candidate_url, score, reasons |
| discovery_jobs | Search job tracking | state, city, category, status |
| geo_cells | Grid-based coverage | lat/lng bounds, results_count |
| crawl_policies | robots.txt compliance | domain, rate_limit, is_excluded |
| change_log | Field change tracking | field_name, old_value, new_value |
| review_queue | Manual QA queue | review_type, priority, status |
| facilities_fts | Full-text search index | name, city, specialties |
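As a rough sketch, the two most central tables could be created like this. Column names mirror the export fields; the real 11-table schema is defined in `database.py` and will differ in detail.

```python
import sqlite3

# Illustrative slice of the schema: the main facilities table plus its
# FTS5 full-text index. Not the actual DDL from database.py.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facilities (
    id                 INTEGER PRIMARY KEY,
    facility_name      TEXT NOT NULL,
    facility_type      TEXT,
    address_1 TEXT, city TEXT, state TEXT, zip TEXT,
    phone_primary      TEXT,
    website_url        TEXT,
    website_domain     TEXT,
    entity_confidence  REAL,
    website_confidence REAL,
    npi_ids TEXT, cms_ids TEXT,
    status             TEXT DEFAULT 'active'
);
-- Full-text search over name/city/specialties, as in facilities_fts
CREATE VIRTUAL TABLE facilities_fts USING fts5(name, city, specialties);
""")
```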

Coverage Explained

Coverage = facilities with website / total facilities × 100

After Step 1-3 (imports):     0% coverage (no websites)
After Step 4 (Google):       ~5-10% coverage (Serper results have websites)
After Step 6 (resolve 50):   ~10-12% coverage
After Step 6 (resolve 500):  ~15-20% coverage
After Step 6 (resolve 5000): ~80-90% coverage (uses 15,000 API credits)

To increase coverage cheaply: Use "Find Websites Only" button repeatedly.


Troubleshooting

| Problem | Solution |
|---|---|
| 403 Forbidden from Serper | API key expired. Get a new one at serper.dev |
| database disk image is malformed | Delete healthcare_providers.db and restart |
| Pipeline stuck on Step 2 | Check "Skip govt data" or use a smaller state (e.g., WY) |
| unhashable type: 'dict' | Known scoring bug; non-critical, but scoring still errors on some records |
| Port 5000 in use | Kill the other process: `lsof -ti:5000 \| xargs kill` |
| No facilities on map | Only Serper-discovered facilities have lat/lng |

Testing

Run all 80 tests:

python3 -c "exec(open('TESTING_GUIDE.md').read().split('```bash')[-1].split('```')[0])"

Or test individual components:

# Database
python3 -c "from database import init_db; init_db(); print('DB OK')"

# CMS API
python3 -c "from connectors.cms_connector import fetch_cms_hospitals; print(len(fetch_cms_hospitals(limit=5)), 'hospitals')"

# Serper API
python3 -c "from connectors.serper_connector import search_maps; print(len(search_maps('hospital Houston TX').get('places',[])), 'places')"

Tech Stack

  • Python 3.10+ — all backend
  • Flask — web UI server
  • SQLite — database (no external DB needed)
  • Serper.dev — Google Maps/Search API
  • Leaflet.js — map visualization
  • Chart.js — dashboard charts
  • Pandas + openpyxl — Excel export

No Docker, no Redis, no PostgreSQL, no Node.js. Just Python + SQLite.


License

Internal project. Not for public distribution.
