jhu-data-services/data-axle-region-split

Data Axle Region Split

Split Data Axle Historic Business Academic archive files into subsets by U.S. Census geographic region and division.

Data Axle's business database provides 52 attributes about tens of millions of businesses across the United States, from Fortune 500 companies to small and home-based businesses.

Requirements

  • R (4.0+)
  • R packages: stringr, dplyr, data.table, R.utils

Packages are installed automatically on first run if not already present.
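An install-if-missing helper of this kind could look roughly like the sketch below. The name `ensure_packages` and the default CRAN mirror are assumptions made for illustration; the helper in functions.R may be written differently.

```r
# Hypothetical helper: install any packages that are not yet available,
# then attach all of them. Not taken from functions.R.
ensure_packages <- function(pkgs, repos = "https://cloud.r-project.org") {
  # requireNamespace() returns FALSE when a package cannot be found
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  # character.only = TRUE lets library() accept the name as a string
  for (pkg in pkgs) {
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated with packages that ship with R, so nothing is downloaded here.
ensure_packages(c("stats", "utils"))
```

For this project the call would presumably be `ensure_packages(c("stringr", "dplyr", "data.table", "R.utils"))`.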

Usage

Rscript region_split.R <input_path> [output_path]
  • input_path -- Directory containing Data Axle .txt.gz or .txt files.
  • output_path -- Directory where output files will be written. Defaults to input_path if not provided.
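The argument handling described above can be sketched as follows. `parse_paths` is a hypothetical name, and the exact checks and error message in region_split.R are not shown in this README, so treat this as an illustration of the defaulting rule only.

```r
# Hypothetical helper: interpret the two positional arguments.
parse_paths <- function(args) {
  if (length(args) < 1) {
    stop("Usage: Rscript region_split.R <input_path> [output_path]")
  }
  input_path <- args[1]
  # output_path falls back to input_path when it is omitted
  output_path <- if (length(args) >= 2) args[2] else input_path
  list(input_path = input_path, output_path = output_path)
}

# Inside a script run with Rscript, the arguments would come from
# commandArgs(trailingOnly = TRUE). A fixed vector is used here instead.
paths <- parse_paths(c("~/data/dataaxle"))
print(paths$output_path)
```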

Example

Rscript region_split.R ~/data/dataaxle ~/data/output

What it does

For each Data Axle archive file (*_Business_Academic_QCQ.txt) found in the input directory, the script:

  1. Extracts .txt.gz files if the uncompressed .txt does not already exist.
  2. Reads the file and assigns standardized column names (52 attributes covering company info, SIC/NAICS codes, employee size, sales volume, location, and census geography).
  3. Writes a complete CSV copy to <year>-data-axle-complete/.
  4. Splits the data by Census region (Northeast, Midwest, South, West) and writes each subset to <year>-data-axle-region/.
  5. Splits the data by Census division (New England, Middle Atlantic, East North Central, etc.) and writes each subset to <year>-data-axle-division/.
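Step 1, extracting only when the uncompressed file is missing, can be sketched with base R connections. The README lists R.utils, which provides a `gunzip()` function, so the script may well use that instead; `extract_if_missing` is a hypothetical name.

```r
# Hypothetical helper: decompress <name>.txt.gz to <name>.txt unless the
# .txt file already exists. The .gz file is left in place.
extract_if_missing <- function(gz_path) {
  txt_path <- sub("\\.gz$", "", gz_path)
  if (!file.exists(txt_path)) {
    con_in <- gzfile(gz_path, open = "rb")
    on.exit(close(con_in), add = TRUE)
    con_out <- file(txt_path, open = "wb")
    on.exit(close(con_out), add = TRUE)
    # Copy in chunks so large archives do not have to fit in memory
    repeat {
      chunk <- readBin(con_in, what = raw(), n = 1e6)
      if (length(chunk) == 0) break
      writeBin(chunk, con_out)
    }
  }
  txt_path
}
```

Calling the function a second time on the same archive returns the existing .txt path without decompressing again.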

Region and division assignments are defined in data/census-territory.csv, which includes all 50 states, D.C., and U.S. territories.
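The mapping-then-split idea can be illustrated on toy data. This sketch uses base R only so that it runs without extra packages; the script itself lists dplyr and data.table, and the column names `state`, `region`, and `division` are assumptions rather than the actual headers in data/census-territory.csv.

```r
# Toy business records; "state" stands in for whichever column holds the state.
businesses <- data.frame(
  company = c("Acme Hardware", "Bay Books", "Lone Star Cafe", "Lake Diner"),
  state   = c("MD", "CA", "TX", "OH"),
  stringsAsFactors = FALSE
)

# Toy state-to-region/division mapping (column names assumed).
territory <- data.frame(
  state    = c("MD", "CA", "TX", "OH"),
  region   = c("South", "West", "South", "Midwest"),
  division = c("South Atlantic", "Pacific",
               "West South Central", "East North Central"),
  stringsAsFactors = FALSE
)

# Attach region and division to every record, keeping unmatched rows.
tagged <- merge(businesses, territory, by = "state", all.x = TRUE)

# One data frame per region and per division.
# Note: split() drops rows whose grouping value is NA.
by_region   <- split(tagged, tagged$region)
by_division <- split(tagged, tagged$division)

print(names(by_region))
print(vapply(by_region, nrow, integer(1)))
```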

Output structure (2024 example)

<output_path>/
  2024-data-axle-complete/
    2024-data-axle-complete.csv
  2024-data-axle-region/
    2024-data-axle-northeast.csv
    2024-data-axle-midwest.csv
    2024-data-axle-south.csv
    2024-data-axle-west.csv
  2024-data-axle-division/
    2024-data-axle-new-england.csv
    2024-data-axle-middle-atlantic.csv
    ...
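The naming pattern in this layout can be reproduced with a small helper. `output_file` is a hypothetical function written to match the example tree above; how region_split.R actually derives the year and builds its paths is not documented here.

```r
# Hypothetical helper: build an output path that follows the layout above.
# kind is "complete", "region", or "division"; name is the subset name.
output_file <- function(output_path, year, kind, name = "complete") {
  # "New England" -> "new-england"
  slug <- gsub(" ", "-", tolower(name), fixed = TRUE)
  folder <- paste0(year, "-data-axle-", kind)
  file.path(output_path, folder, paste0(year, "-data-axle-", slug, ".csv"))
}

print(output_file("out", 2024, "complete"))
# "out/2024-data-axle-complete/2024-data-axle-complete.csv"
print(output_file("out", 2024, "division", "New England"))
# "out/2024-data-axle-division/2024-data-axle-new-england.csv"
```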

Project files

  • region_split.R -- Main script
  • functions.R -- Helper function for package loading
  • data/census-territory.csv -- State-to-region/division mapping
  • data/census_regions.csv -- Census region reference data

License

MIT -- Johns Hopkins University Data Services

About

R script for splitting a refUSA dataset into subsets by region.
