nccsdata provides tools to download, filter, and analyze nonprofit organization data from the National Center for Charitable Statistics (NCCS). It reads IRS Business Master File (BMF) data stored as parquet files in a public S3 bucket, with support for predicate-pushdown filtering by state, county, NTEE subsector, and exempt organization type.
Note: This is version 2.0.0, a ground-up rewrite of the package. The v1 API (get_data(), preview_sample(), parse_ntee()) has been replaced. See the migration section below.
Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("UrbanInstitute/nccsdata")

nccs_read() downloads BMF data from S3 with optional filters.
Filtering happens at the Arrow level via predicate pushdown, so only
matching rows are read into memory.
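As a rough illustration of what pushdown means here, the sketch below builds a tiny local parquet file and filters it lazily with arrow and dplyr. The toy rows and the temporary file are assumptions made for the example; it does not touch the NCCS bucket and does not show how nccs_read() is implemented internally.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the BMF (example rows; column names borrowed from this README).
toy <- data.frame(
  ein = c("12-3456789", "98-7654321", "00-1234567"),
  geo_county = c("Lackawanna County", "Erie County", "Lackawanna County")
)

path <- tempfile(fileext = ".parquet")
arrow::write_parquet(toy, path)

# Nothing is read into R yet: this only builds a query against the file.
query <- arrow::open_dataset(path) |>
  dplyr::filter(geo_county == "Lackawanna County")

# collect() runs the scan; Arrow evaluates the filter before handing rows to R.
result <- dplyr::collect(query)
nrow(result)
```

The same idea is what makes collect = FALSE useful: further dplyr verbs are added to the query before any rows are materialized.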
library(nccsdata)
# All Pennsylvania nonprofits (default columns)
pa <- nccs_read(state = "PA")
# Arts nonprofits in New York
ny_arts <- nccs_read(state = "NY", ntee_subsector = "ART")
# 501(c)(3) private foundations in California
ca_pf <- nccs_read(state = "CA", org_type = "private_foundation")
# All 501(c)(3) public charities (the most common analyst cut)
pc <- nccs_read(org_type = "public_charity")
# Select specific columns
pa_slim <- nccs_read(
  state = "PA",
  columns = c("ein", "org_name_display", "geo_county", "income_amount")
)
# Lazy query for custom dplyr pipelines
query <- nccs_read(state = "PA", collect = FALSE)
result <- query |>
  dplyr::filter(geo_county == "Lackawanna County") |>
  dplyr::collect()

The upstream parquet stores most columns as character (vintage
stacking requires it). nccs_read() coerces known financial, date,
and indicator columns to their natural types on the collected tibble
by default — pass coerce = FALSE to opt out. ZIPs and other ID-like
codes are intentionally left as character.
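A quick base-R illustration of why ZIPs and similar codes are better kept as character: numeric coercion silently drops leading zeros. The two ZIP values are just example inputs, not package output.

```r
zip_chr <- c("02139", "18503")

# Converting to integer loses the leading zero of the first ZIP.
zip_int <- as.integer(zip_chr)
zip_int

# Keeping the character form preserves all five digits.
nchar(zip_chr)
```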
nccs_read() caches the geocoded master parquet (hundreds of MB) in
tools::R_user_dir("nccsdata", "cache") so subsequent calls in the
same or future sessions skip the S3 download. The cached copy refreshes
once it's more than cache_max_age days old (30 by default; the
upstream master is rebuilt monthly). Pass cache = FALSE to always
read from S3, or use nccs_cache_clear() to force the next call to
re-download.
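The age-based refresh rule can be pictured with a few lines of base R. This is only a sketch: is_cache_stale() is a hypothetical helper, not a function from nccsdata, and the package's actual check may differ.

```r
# Hypothetical helper: TRUE when the file is missing or older than max_age_days.
is_cache_stale <- function(path, max_age_days = 30) {
  if (!file.exists(path)) {
    return(TRUE)
  }
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  age_days > max_age_days
}

# Demonstrate on a throwaway file rather than the real cache.
demo_file <- tempfile(fileext = ".parquet")
file.create(demo_file)
fresh <- is_cache_stale(demo_file)  # just created, so not stale

# Backdate the modification time by 40 days to simulate an old download.
Sys.setFileTime(demo_file, Sys.time() - 40 * 24 * 60 * 60)
old <- is_cache_stale(demo_file)
```

Here fresh is FALSE and old is TRUE, mirroring the 30-day default described above.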
nccs_cache_dir()
nccs_cache_clear() # force re-download next time

nccs_read() reads the rolling geocoded master. For a specific dated
monthly snapshot — useful for reproducible research — use
nccs_vintage_url() to get the S3 URI, then read the CSV directly.
Vintage schemas differ by month and across the modern/legacy seam, so
inspect the matching data dictionary before filtering.
uri <- nccs_vintage_url("2023_07")
dict <- arrow::read_csv_arrow(nccs_vintage_url("2023_07", kind = "dictionary"))
bmf_2023_07 <- arrow::read_csv_arrow(uri)

nccs_summary() produces grouped count summaries from a collected data
frame.
pa <- nccs_read(state = "PA")
# Total count
nccs_summary(pa)
# Count by county
nccs_summary(pa, group_by = "geo_county")
# Count by county and subsector, export to CSV
nccs_summary(pa, group_by = c("geo_county", "nteev2_subsector"),
             output_csv = "pa_counts.csv")

nccs_catalog() lists valid values for nccs_read() filters without
any network calls.
nccs_catalog("state")
nccs_catalog("ntee_subsector")
nccs_catalog("exempt_org_type")
# Pass `labels = TRUE` for a code + description tibble, sourced from the
# bundled BMF lookup tables.
nccs_catalog("ntee_subsector", labels = TRUE)
nccs_catalog("foundation_code", labels = TRUE)

The BMF returned by nccs_read() is already normalized upstream, but
two helpers are exposed for users joining external CSVs or API responses
against it:
# Coerce EINs in any format to canonical XX-XXXXXXX
nccs_normalize_ein(c("123456789", "12-3456789", 1234567))
#> [1] "12-3456789" "12-3456789" "00-1234567"
# Coerce IRS binary-indicator columns to logical
nccs_as_indicator(c("Y", "N", "1", "2"))
#> [1] TRUE FALSE TRUE FALSE
# e-file indicator accepts E/P (2015, 2018+) and Y/N (2016-2017)
nccs_as_indicator(c("E", "P", "Y", "N"), scheme = "efile")

nccs_dictionary() returns a tibble describing all BMF columns, with
optional pattern filtering.
# All columns
nccs_dictionary()
# Find geocoding-related columns
nccs_dictionary("geo")
# Find NTEE-related columns
nccs_dictionary("ntee")

| v1 function | v2 replacement |
|---|---|
| get_data() | nccs_read() |
| preview_sample() | nccs_summary() |
| ntee_preview() / parse_ntee() | nccs_catalog("ntee_subsector") |

Key changes:
- Data source moved from legacy Core/BMF CSVs to geocoded BMF parquet files on S3.
- Filtering now uses Arrow predicate pushdown instead of downloading full files.
- Dependencies reduced from 12 packages to 3 (arrow, dplyr, utils).

Full documentation is available at https://urbaninstitute.github.io/nccsdata/.
- Browse the getting started vignette
- Open an issue on GitHub
- Contact the maintainer at tpoongundranar@urban.org