Split Data Axle Historic Business Academic archive files into subsets by U.S. Census geographic region and division.
Data Axle's business database provides 52 attributes about tens of millions of businesses across the United States, from Fortune 500 companies to small and home-based businesses.
- R (4.0+)
- R packages: stringr, dplyr, data.table, R.utils
Packages are installed automatically on first run if not already present.
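The install-on-first-run behavior could look roughly like the sketch below. This is not the code in functions.R; the function name load_packages and the CRAN mirror URL are assumptions made for illustration.

```r
# Hypothetical helper: load packages, installing any that are missing.
# The name load_packages and the default mirror are illustrative assumptions.
load_packages <- function(pkgs, repos = "https://cloud.r-project.org") {
  # Find the packages that are not installed yet.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!installed]
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  # Attach each package; library() raises an error if one is still unavailable.
  for (pkg in pkgs) {
    library(pkg, character.only = TRUE)
  }
  invisible(to_install)
}

# Base packages are always present, so nothing is installed in this demo.
newly_installed <- load_packages(c("stats", "utils"))
```

For this project the call would presumably pass the four packages listed above instead of the base packages used in the demo.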
Rscript region_split.R <input_path> [output_path]
- input_path -- Directory containing Data Axle .txt.gz or .txt files.
- output_path -- Directory where output files will be written. Defaults to input_path if not provided.
Rscript region_split.R ~/data/dataaxle ~/data/output
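As an illustration of the argument handling described above, here is a minimal sketch of how the optional output_path could default to input_path. The function name parse_paths is hypothetical and may not match what region_split.R actually does.

```r
# Hypothetical argument handling for:
#   Rscript region_split.R <input_path> [output_path]
parse_paths <- function(args) {
  if (length(args) < 1 || length(args) > 2) {
    stop("Usage: Rscript region_split.R <input_path> [output_path]")
  }
  input_path <- args[1]
  # Fall back to the input directory when no output directory is given.
  output_path <- if (length(args) == 2) args[2] else input_path
  list(input_path = input_path, output_path = output_path)
}

# Inside a script run with Rscript, the arguments would come from:
#   paths <- parse_paths(commandArgs(trailingOnly = TRUE))
paths <- parse_paths(c("~/data/dataaxle"))
```

With a single argument, paths$output_path is the same directory as paths$input_path.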
For each Data Axle archive file (*_Business_Academic_QCQ.txt) found in the
input directory, the script:
- Extracts .txt.gz files if the uncompressed .txt does not already exist.
- Reads the file and assigns standardized column names (52 attributes covering company info, SIC/NAICS codes, employee size, sales volume, location, and census geography).
- Writes a complete CSV copy to <year>-data-axle-complete/.
- Splits the data by Census region (Northeast, Midwest, South, West) and writes each subset to <year>-data-axle-region/.
- Splits the data by Census division (New England, Middle Atlantic, East North Central, etc.) and writes each subset to <year>-data-axle-division/.
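The extraction step can be sketched in base R as follows. The script itself lists R.utils as a dependency and may well use that package for decompression; this version swaps in base connections so it runs without extra packages. The helper name extract_if_needed and the demo file contents are made up for illustration.

```r
# Hypothetical helper: decompress a .txt.gz file unless the .txt already exists.
extract_if_needed <- function(gz_path) {
  txt_path <- sub("\\.gz$", "", gz_path)
  if (!file.exists(txt_path)) {
    con_in <- gzfile(gz_path, open = "rb")
    on.exit(close(con_in), add = TRUE)
    con_out <- file(txt_path, open = "wb")
    on.exit(close(con_out), add = TRUE)
    # Copy in fixed-size chunks so large archives need not fit in memory.
    repeat {
      chunk <- readBin(con_in, what = "raw", n = 1024L * 1024L)
      if (length(chunk) == 0) break
      writeBin(chunk, con_out)
    }
  }
  txt_path
}

# Demo with a tiny, made-up archive in a temporary directory.
demo_dir <- tempfile("dataaxle-")
dir.create(demo_dir)
gz_file <- file.path(demo_dir, "2024_Business_Academic_QCQ.txt.gz")
con <- gzfile(gz_file, open = "wt")
writeLines(c("COMPANY,STATE", "Acme,MD"), con)
close(con)

txt_file <- extract_if_needed(gz_file)
extracted_lines <- readLines(txt_file)
```

Calling extract_if_needed() a second time leaves the existing .txt untouched, which matches the "only if not already present" behavior described in the first step.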
Region and division assignments are defined in data/census-territory.csv,
which includes all 50 states, D.C., and U.S. territories.
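To show how a state-to-region mapping can drive the split, here is a small base R sketch. The column names (state, region, division) and the sample rows are assumptions; the real data/census-territory.csv and the Data Axle columns may be named differently, and the script may use dplyr or data.table rather than base R.

```r
# Made-up stand-in for data/census-territory.csv; real column names may differ.
territory <- data.frame(
  state    = c("MD", "NY", "CA", "IL"),
  region   = c("South", "Northeast", "West", "Midwest"),
  division = c("South Atlantic", "Middle Atlantic", "Pacific",
               "East North Central"),
  stringsAsFactors = FALSE
)

# Made-up business records with a state column.
businesses <- data.frame(
  company = c("Acme", "Globex", "Initech", "Hooli"),
  state   = c("MD", "NY", "NY", "CA"),
  stringsAsFactors = FALSE
)

# Attach region and division to each record by matching on state.
tagged <- merge(businesses, territory, by = "state", all.x = TRUE)

# One data frame per region and one per division.
by_region   <- split(tagged, tagged$region)
by_division <- split(tagged, tagged$division)
```

Note that split() drops rows whose grouping value is NA, so records with a state missing from the mapping would appear in the complete copy but in no regional subset under this approach.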
<output_path>/
  2024-data-axle-complete/
    2024-data-axle-complete.csv
  2024-data-axle-region/
    2024-data-axle-northeast.csv
    2024-data-axle-midwest.csv
    2024-data-axle-south.csv
    2024-data-axle-west.csv
  2024-data-axle-division/
    2024-data-axle-new-england.csv
    2024-data-axle-middle-atlantic.csv
    ...
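The naming pattern in the tree above can be reproduced with a small helper like the one below. This is only a sketch inferred from the file names shown; the function name output_file is hypothetical, and how the script derives the year is not covered here.

```r
# Hypothetical helper: build an output path that follows the layout above.
output_file <- function(output_path, year, level, name) {
  # "New England" -> "new-england"
  slug <- gsub("[^a-z0-9]+", "-", tolower(name))
  folder <- sprintf("%s-data-axle-%s", year, level)
  file.path(output_path, folder, sprintf("%s-data-axle-%s.csv", year, slug))
}

division_file <- output_file("out", 2024, "division", "New England")
# "out/2024-data-axle-division/2024-data-axle-new-england.csv"

region_file <- output_file("out", 2024, "region", "Midwest")
# "out/2024-data-axle-region/2024-data-axle-midwest.csv"
```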
region_split.R: Main script
functions.R: Helper function for package loading
data/census-territory.csv: State-to-region/division mapping
data/census_regions.csv: Census region reference data
MIT -- Johns Hopkins University Data Services