diff --git a/README.md b/README.md index bf5c76ffe..7bf11d49e 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # SimPaths -by Matteo Richiardi, Patryk Bronka, Justin van de Ven +by CeMPA (Centre for Microsimulation and Policy Analysis). ## What is SimPaths and how to use it? @@ -8,139 +8,4 @@ SimPaths is an open-source framework for modelling individual and household life SimPaths models currently exist for the UK, Greece, Hungary, Italy, and Poland. This page refers to the UK model; the other European models are available at the corresponding [SimPathsEU](https://github.com/centreformicrosimulation/SimPathsEU) page. -The entire SimPaths documentation is available on its [website](https://centreformicrosimulation.github.io/SimPaths/), which includes: a detailed description of its building blocks; instructions on how to set up and run the model; information about contributing to the model's development. - -## Quick start - -### Prerequisites - -- Java 19 -- Maven 3.8+ -- Optional IDE: IntelliJ IDEA (import as a Maven project) - -### Build and run - -```bash -mvn clean package -java -jar multirun.jar -DBSetup -java -jar multirun.jar -``` - -The first command builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. - -To use a different config file: - -```bash -java -jar multirun.jar -config my_run.yml -``` - -For configuration options, see the annotated `config/default.yml`. For the data pipeline and further reference, see [`documentation/`](documentation/README.md). - - - - - \ No newline at end of file +The entire SimPaths documentation is available on its [website](https://centreformicrosimulation.github.io/SimPaths/), which includes: a detailed description of its building blocks; instructions on how to set up and run the model; information about contributing to the model's development. 
\ No newline at end of file diff --git a/documentation/README.md b/documentation/README.md deleted file mode 100644 index c9756cd50..000000000 --- a/documentation/README.md +++ /dev/null @@ -1,105 +0,0 @@ -# Data Pipeline Reference - -For building and running SimPaths, see the [root README](../README.md). For the full model documentation, see the [website](https://centreformicrosimulation.github.io/SimPaths/). - ---- - -This section explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them. - -The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately. - -### Data sources - -| Source | Description | Access | -|--------|-------------|--------| -| **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service | -| **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL | -| **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service | -| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](wiki/getting-started/data/tax-benefit-donors-uk.md) on the website | - -### Part 1 — Initial populations (`input/InitialPopulations/compile/`) - -**What it produces:** Annual CSV files `population_initial_UK_.csv` used as the starting population for each simulation run. 
- -**Master script:** `input/InitialPopulations/compile/00_master.do` - -The pipeline runs in numbered stages: - -| Script | What it does | -|--------|-------------| -| `01_prepare_UKHLS_pooled_data.do` | Pools and standardises UKHLS waves | -| `02_create_UKHLS_variables.do` | Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency) | -| `02_01_checks.do` | Data quality checks | -| `03_social_care_received.do` | Social care receipt variables | -| `04_social_care_provided.do` | Informal care provision variables | -| `05_create_benefit_units.do` | Groups individuals into benefit units (tax units) following UK tax-benefit rules | -| `06_reweight_and_slice.do` | Reweighting and year-specific slicing | -| `07_was_wealth_data.do` | Prepares Wealth and Assets Survey data | -| `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records | -| `09_finalise_input_data.do` | Final cleaning and formatting | -| `10_check_yearly_data.do` | Per-year consistency checks | -| `99_training_data.do` | Produces the de-identified training population committed to `input/InitialPopulations/training/` | - -#### Employment history sub-pipeline (`compile/do_emphist/`) - -Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable `liwwh` (months employed since Jan 2007) feeds into the labour supply models. 
- -| Script | Purpose | -|--------|---------| -| `00_Master_emphist.do` | Master; sets parameters and calls sub-scripts | -| `01_Intdate.do` – `07_Empcal1a.do` | Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification | - -### Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`) - -**What it produces:** The `reg_*.xlsx` coefficient tables read by `Parameters.java` at simulation startup. - -**Master script:** `input/InitialPopulations/compile/RegressionEstimates/master.do` - -> **Note:** Income and union-formation regressions depend on predicted wages, so `reg_wages.do` must complete before `reg_income.do` and `reg_partnership.do`. All other scripts can run in any order. - -**Required Stata packages:** `fre`, `tsspell`, `carryforward`, `outreg2`, `oparallel`, `gologit2`, `winsor`, `reghdfe`, `ftools`, `require` - -| Script | Module | Method | -|--------|--------|--------| -| `reg_wages.do` | Hourly wages | Heckman selection model (males and females separately) | -| `reg_income.do` | Non-labour income | Hurdle model (selection + amount); requires predicted wages | -| `reg_partnership.do` | Partnership formation/dissolution | Probit; requires predicted wages | -| `reg_education.do` | Education transitions | Generalised ordered logit | -| `reg_fertility.do` | Fertility | Probit | -| `reg_health.do` | Physical health (SF-12 PCS) | Linear regression | -| `reg_health_mental.do` | Mental health (GHQ-12, SF-12 MCS) | Linear regression | -| `reg_health_wellbeing.do` | Life satisfaction | Linear regression | -| `reg_home_ownership.do` | Homeownership transitions | Probit | -| `reg_retirement.do` | Retirement | Probit | -| `reg_leave_parental_home.do` | Leaving parental home | Probit | -| `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit | -| `reg_unemployment.do` | Unemployment transitions | Probit | -| `reg_financial_distress.do` | Financial distress | 
Probit | -| `programs.do` | Shared utility programs called by the estimation scripts | — | -| `variable_update.do` | Prepares and recodes variables before estimation | — | - -After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files). - -### Part 3 — Alignment targets (`input/DoFilesTarget/`) - -**What it produces:** The `align_*.xlsx` and `*_targets.xlsx` files that the alignment modules use to rescale simulated rates. - -| Script | Output file | -|--------|------------| -| `01_employment_shares_initpopdata.do` | `input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year | -| `01_inSchool_targets_initpopdata.do` | `input/inSchool_targets.xlsx` — school participation rates by year | -| `03_calculate_partneredShare_initialPop_BUlogic.do` | `input/partnered_share_targets.xlsx` — partnership shares by year | -| `03_calculate_partnership_target.do` | Supplementary partnership targets | -| `02_person_risk_employment_stats.do` | `employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction | - -Population projection targets (`align_popProjections.xlsx`) and fertility/mortality projections (`projections_*.xlsx`) come from ONS published projections and are not generated by these scripts. - -### When to re-run each part - -| Situation | What to re-run | -|-----------|---------------| -| Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) | -| Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation | -| Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) | - -After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation. 
diff --git a/documentation/SimPaths Stata Parameters.xlsx b/documentation/SimPaths_Stata_Parameters.xlsx similarity index 100% rename from documentation/SimPaths Stata Parameters.xlsx rename to documentation/SimPaths_Stata_Parameters.xlsx diff --git a/documentation/repository-guide.md b/documentation/repository-guide.md new file mode 100644 index 000000000..e571662e2 --- /dev/null +++ b/documentation/repository-guide.md @@ -0,0 +1,533 @@ +# SimPaths Repository Guide + +A guide to navigating the SimPaths repository structure and codebase. + +--- + +## Table of Contents + +1. [Repository Structure](#repository-structure) +2. [Core Components](#core-components) +3. [Key Directories Explained](#key-directories-explained) +4. [Sub-package Detail](#sub-package-detail) +5. [Data Pipeline Reference](#data-pipeline-reference) +6. [Development Workflow](#development-workflow) +7. [Code Navigation Tips](#code-navigation-tips) +8. [Additional Resources](#additional-resources) + +--- + +## Repository Structure + +``` +SimPaths/ +├── config/ # Configuration files for simulations +│ ├── default.yml # Default simulation parameters +│ ├── test_create_database.yml # Database creation test config +│ └── test_run.yml # Test run configuration +│ +├── documentation/ # Comprehensive documentation +│ ├── figures/ # Diagrams and illustrations +│ ├── wiki/ # Full documentation website +│ │ ├── getting-started/ # Setup and first simulation guides +│ │ ├── overview/ # Model description and modules +│ │ ├── user-guide/ # Running simulations +│ │ ├── developer-guide/ # Extending the model +│ │ │ └── repository-guide.md # Repository guide (copy for website) +│ │ ├── jasmine-reference/ # JAS-mine library reference +│ │ ├── research/ # Published papers +│ │ └── validation/ # Model validation results +│ ├── repository-guide.md # Repository structure and navigation guide +│ ├── SimPaths_Variable_Codebook.xlsx # Codebook of all variables in SimPaths +│ ├── SimPaths_Stata_Parameters.xlsx # Comparison 
of parameters: Stata do-files vs Java code +│ └── SimPathsUK_Schedule.xlsx # Detailed schedule of events and corresponding classes +│ +├── input/ # Input data and parameters +│ ├── InitialPopulations/ # Starting population data +│ │ ├── training/ # De-identified training population (included in repo) +│ │ └── compile/ # Stata pipeline: builds populations, estimates regressions +│ │ ├── do_emphist/ # Employment history reconstruction sub-pipeline +│ │ └── RegressionEstimates/ # Regression coefficient estimation scripts +│ ├── EUROMODoutput/ # Tax-benefit model outputs +│ │ └── training/ # Training UKMOD outputs (included in repo) +│ ├── DoFilesTarget/ # Stata scripts that generate alignment targets +│ ├── align_*.xlsx # Alignment files (population, employment, etc.) +│ ├── reg_*.xlsx # Regression parameter files +│ ├── scenario_*.xlsx # Scenario configuration files +│ ├── projections_*.xlsx # Mortality/fertility projections +│ ├── DatabaseCountryYear.xlsx # Database metadata +│ ├── EUROMODpolicySchedule.xlsx # Policy schedule +│ ├── policy parameters.xlsx # Tax-benefit parameters +│ ├── validation_statistics.xlsx # Validation targets +│ └── input.mv.db # H2 donor database (generated by setup) +│ +├── output/ # Simulation outputs +│ ├── [timestamp]_[seed]_[run]/ # Timestamped output folders +│ │ ├── csv/ +│ │ │ ├── Statistics1.csv # Income distribution, Gini, S-Index +│ │ │ ├── Statistics2.csv # Demographics by age and gender +│ │ │ ├── Statistics3.csv # Alignment diagnostics +│ │ │ ├── Person.csv # Person-level output +│ │ │ ├── BenefitUnit.csv # Benefit-unit-level output +│ │ │ └── Household.csv # Household-level output +│ │ ├── database/ # Run-specific persistence output +│ │ └── input/ # Copied run input artifacts +│ └── logs/ # Log files (with -f flag on multirun) +│ +├── src/ # Source code +│ ├── main/ +│ │ ├── java/simpaths/ +│ │ │ ├── data/ # Data handling and parameters +│ │ │ ├── experiment/ # Simulation execution classes +│ │ │ └── model/ # Core model 
implementation +│ │ │ ├── decisions/ # Intertemporal optimisation grids +│ │ │ ├── enums/ # Categorical variable definitions +│ │ │ ├── taxes/ # EUROMOD donor matching +│ │ │ └── lifetime_incomes/ # Synthetic income trajectory generation +│ │ └── resources/ # Configuration resources +│ └── test/ # Test classes +│ +├── validation/ # Validation scripts and results +│ ├── 01_estimate_validation/ # Estimation validation +│ └── 02_simulated_output_validation/ # Output validation +│ +├── pom.xml # Maven build configuration +├── singlerun.jar # Executable for single runs +├── multirun.jar # Executable for multiple runs +└── README.md # Project overview +``` + +--- + +## Core Components + +### 1. **Entry Points** + +#### SimPathsStart (`src/main/java/simpaths/experiment/SimPathsStart.java`) +- Main class for single simulation execution +- Handles GUI and command-line interfaces +- Manages database setup phases +- **Key methods**: + - `main()`: Entry point + - `runGUIdialog()`: Launch GUI + - `runGUIlessSetup()`: Command-line setup + +#### SimPathsMultiRun (`src/main/java/simpaths/experiment/SimPathsMultiRun.java`) +- Coordinates multiple simulation runs +- Manages parallel execution +- Aggregates results across runs +- Configurable via YAML files + +### 2. **Core Model** + +#### SimPathsModel (`src/main/java/simpaths/model/SimPathsModel.java`) +- Central simulation manager +- Implements `AbstractSimulationManager` from JAS-mine +- Defines the simulation schedule via `buildSchedule()` +- Manages all simulation modules and processes +- **Key responsibilities**: + - Population initialization + - Event scheduling + - Module coordination + - Time progression + +### 3. 
**Data & Parameters** + +#### Parameters (`src/main/java/simpaths/data/Parameters.java`) +- Global parameter storage +- Loads regression coefficients from Excel +- Manages country-specific configurations +- Stores alignment targets +- **Key data structures**: + - Regression coefficient maps + - Policy parameters + - Alignment targets + - EUROMOD variable definitions + +--- + +## Key Directories Explained + +### `/src/main/java/simpaths/` + +#### `data/` +**Purpose**: Data handling, parameter management, and utility classes + +- **Parameters.java**: Global parameter storage and Excel data loading +- **ManagerRegressions.java**: Regression coefficient management +- **CallEUROMOD.java** / **CallEMLight.java**: Interface with tax-benefit models +- **filters/**: Collection filters for querying simulated populations +- **startingpop/**: Initial population data parsing +- **statistics/**: Statistical utilities + +#### `experiment/` +**Purpose**: Simulation execution and coordination + +- **SimPathsStart.java**: Single-run entry point +- **SimPathsMultiRun.java**: Multi-run orchestration +- **SimPathsCollector.java**: Output collection and aggregation +- **SimPathsObserver.java**: GUI updates and monitoring + +#### `model/` +**Purpose**: Core simulation logic + +- **SimPathsModel.java**: Main simulation manager +- **Person.java**: Individual-level processes and attributes +- **BenefitUnit.java**: Fiscal unit processes +- **Household.java**: Residential unit processes +- **decisions/**: Labour supply and consumption optimization +- **enums/**: Type-safe enumerations (Gender, Country, HealthStatus, etc.) +- **taxes/**: Tax-benefit donor matching system +- **lifetime_incomes/**: Lifetime income projection utilities + +### `/input/` + +**Critical input files**: + +| File Pattern | Purpose | +|--------------|---------| +| `align_*.xlsx` | Alignment targets (population, employment, education, etc.) 
| +| `reg_*.xlsx` | Regression parameters for behavioral processes | +| `scenario_*.xlsx` | Policy scenarios and projections | +| `projections_*.xlsx` | Demographic projections (mortality, fertility) | +| `DatabaseCountryYear.xlsx` | Tracks current database country/year | +| `EUROMODpolicySchedule.xlsx` | Tax-benefit policy schedule | +| `policy parameters.xlsx` | Detailed policy parameters | + +**Subdirectories**: +- `InitialPopulations/`: Starting population databases +- `EUROMODoutput/`: Tax-benefit donor population data +- `DoFilesTarget/`: Stata-generated alignment targets + +### `/config/` + +YAML configuration files override default parameters. The main file is **default.yml**, which contains several configuration sections: + +- **model_args**: SimPathsModel parameters (alignment switches, behavioral responses) +- **collector_args**: Output options (CSV, database, statistics) +- **parameter_args**: Data directories and input years +- **innovation_args**: Experimental parameters for sensitivity analysis + +Additional configuration files for testing: **test_create_database.yml**, **test_run.yml** + +### `/documentation/wiki/` + +Complete documentation organized by audience: + +- **getting-started/**: Environment setup, data access, first simulation +- **overview/**: Model description, modules, parameterization +- **user-guide/**: GUI, parameter modification, multiple runs +- **developer-guide/**: JAS-mine architecture, internals, how-to guides +- **jasmine-reference/**: Statistical packages, alignment, regression tools +- **research/**: Published papers and validation results + +--- + +## Sub-package Detail + +The following sub-packages are self-contained subsystems whose internals are not obvious from the class names alone. + +### `model/decisions/` — IO engine + +When IO is enabled, computing optimal consumption–labour choices for every agent at every time step would be prohibitively slow. 
This package solves the problem once before the simulation runs: it constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents simply look up their current state in the pre-computed grid. + +| Class | Purpose | +| --- | --- | +| `DecisionParams` | Defines the state-space dimensions and grid parameters for the optimisation problem. | +| `ManagerPopulateGrids` | Populates the state-space grid points and evaluates value functions by backward induction. | +| `ManagerSolveGrids` | Solves for optimal policy at each grid point. | +| `ManagerFileGrids` | Reads and writes pre-computed grids to disk, so they can be reused across runs. | +| `Grids` | Container for the set of solved decision grids. | +| `States` | Enumerates the state variables that define each grid point. | +| `Expectations` / `LocalExpectations` | Computes expected future values over stochastic transitions. | +| `CESUtility` | CES utility function used in the optimisation. | + +### `model/taxes/` — EUROMOD donor matching + +Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. + +| Class | Purpose | +| --- | --- | +| `DonorTaxImputation` | Main entry point. Implements the three-step matching process: coarse-exact matching on characteristics, income proximity filtering, and candidate selection/averaging. | +| `KeyFunction` / `KeyFunction1`–`4` | Four progressively relaxed matching-key definitions. The system tries the tightest key first and falls back through wider keys if no donors are found. | +| `DonorKeys` | Builds composite matching keys from benefit-unit characteristics. | +| `DonorTaxUnit` / `DonorPerson` | Represent the pre-computed EUROMOD donor records loaded from the database. 
| +| `CandidateList` | Ranked list of donor matches for a given benefit unit, sorted by income proximity. | +| `Match` / `Matches` | Store the final selected donor(s) and their imputed tax-benefit values. | + +The `taxes/database/` sub-package handles loading donor data from the H2 database into memory (`TaxDonorDataParser`, `DatabaseExtension`, `MatchIndices`). + +### `model/lifetime_incomes/` — synthetic income trajectories + +When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles. + +| Class | Purpose | +| --- | --- | +| `ManagerProjectLifetimeIncomes` | Generates the synthetic income trajectory database for all birth cohorts in the simulation horizon. | +| `LifetimeIncomeImputation` | Matches each simulated person to a donor income trajectory via binary search on the income CDF. | +| `AnnualIncome` | Implements the AR(2) income process with age-gender anchoring. | +| `BirthCohort` | Groups individuals by birth year for cohort-level income projection. | +| `Individual` | Entity carrying age dummies and log GDP per capita for income regression. | + +CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. + +For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](wiki/overview/parameterisation.md) on the website. + +--- + +## Data Pipeline Reference + +This section explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them. + +The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately. 
+ +### Data sources + +| Source | Description | Access | +|--------|-------------|--------| +| **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service | +| **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL | +| **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service | +| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](wiki/getting-started/data/tax-benefit-donors-uk.md) on the website | + +### Part 1 — Initial populations (`input/InitialPopulations/compile/`) + +**What it produces:** Annual CSV files `population_initial_UK_.csv` used as the starting population for each simulation run. + +**Master script:** `input/InitialPopulations/compile/00_master.do` + +The pipeline runs in numbered stages: + +| Script | What it does | +|--------|-------------| +| `01_prepare_UKHLS_pooled_data.do` | Pools and standardises UKHLS waves | +| `02_create_UKHLS_variables.do` | Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency) | +| `02_01_checks.do` | Data quality checks | +| `03_social_care_received.do` | Social care receipt variables | +| `04_social_care_provided.do` | Informal care provision variables | +| `05_create_benefit_units.do` | Groups individuals into benefit units (tax units) following UK tax-benefit rules | +| `06_reweight_and_slice.do` | Reweighting and year-specific slicing | +| `07_was_wealth_data.do` | Prepares Wealth and Assets Survey data | +| `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records | +| `09_finalise_input_data.do` | Final cleaning and formatting | +| `10_check_yearly_data.do` | Per-year 
consistency checks | +| `99_training_data.do` | Produces the de-identified training population committed to `input/InitialPopulations/training/` | + +#### Employment history sub-pipeline (`compile/do_emphist/`) + +Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable `liwwh` (months employed since Jan 2007) feeds into the labour supply models. + +| Script | Purpose | +|--------|---------| +| `00_Master_emphist.do` | Master; sets parameters and calls sub-scripts | +| `01_Intdate.do` – `07_Empcal1a.do` | Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification | + +### Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`) + +**What it produces:** The `reg_*.xlsx` coefficient tables read by `Parameters.java` at simulation startup. + +**Master script:** `input/InitialPopulations/compile/RegressionEstimates/master.do` + +> **Note:** Income and union-formation regressions depend on predicted wages, so `reg_wages.do` must complete before `reg_income.do` and `reg_partnership.do`. All other scripts can run in any order. 
+ +**Required Stata packages:** `fre`, `tsspell`, `carryforward`, `outreg2`, `oparallel`, `gologit2`, `winsor`, `reghdfe`, `ftools`, `require` + +| Script | Module | Method | +|--------|--------|--------| +| `reg_wages.do` | Hourly wages | Heckman selection model (males and females separately) | +| `reg_income.do` | Non-labour income | Hurdle model (selection + amount); requires predicted wages | +| `reg_partnership.do` | Partnership formation/dissolution | Probit; requires predicted wages | +| `reg_education.do` | Education transitions | Generalised ordered logit | +| `reg_fertility.do` | Fertility | Probit | +| `reg_health.do` | Physical health (SF-12 PCS) | Linear regression | +| `reg_health_mental.do` | Mental health (GHQ-12, SF-12 MCS) | Linear regression | +| `reg_health_wellbeing.do` | Life satisfaction | Linear regression | +| `reg_home_ownership.do` | Homeownership transitions | Probit | +| `reg_retirement.do` | Retirement | Probit | +| `reg_leave_parental_home.do` | Leaving parental home | Probit | +| `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit | +| `reg_unemployment.do` | Unemployment transitions | Probit | +| `reg_financial_distress.do` | Financial distress | Probit | +| `programs.do` | Shared utility programs called by the estimation scripts | — | +| `variable_update.do` | Prepares and recodes variables before estimation | — | + +After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files). + +### Part 3 — Alignment targets (`input/DoFilesTarget/`) + +**What it produces:** The `align_*.xlsx` and `*_targets.xlsx` files that the alignment modules use to rescale simulated rates. 
+ +| Script | Output file | +|--------|------------| +| `01_employment_shares_initpopdata.do` | `input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year | +| `01_inSchool_targets_initpopdata.do` | `input/inSchool_targets.xlsx` — school participation rates by year | +| `03_calculate_partneredShare_initialPop_BUlogic.do` | `input/partnered_share_targets.xlsx` — partnership shares by year | +| `03_calculate_partnership_target.do` | Supplementary partnership targets | +| `02_person_risk_employment_stats.do` | `employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction | + +Population projection targets (`align_popProjections.xlsx`) and fertility/mortality projections (`projections_*.xlsx`) come from ONS published projections and are not generated by these scripts. + +### When to re-run each part + +| Situation | What to re-run | +|-----------|---------------| +| Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) | +| Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation | +| Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) | + +After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation. + +### Setup-generated artifacts + +Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: + +- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes +- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems +- `DatabaseCountryYear.xlsx` — year-specific macro parameters + +These must exist before any simulation run. If they are missing, re-run setup. 
+ +### Training mode + +The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. + +### Logging + +With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). + +--- + +## Development Workflow + +### 1. Understanding the Code + +**Start here**: +1. `SimPathsStart.java` — Understand initialization +2. `SimPathsModel.java` — Understand the simulation loop (`buildSchedule()`) +3. `Person.java`, `BenefitUnit.java`, `Household.java` — Understand agents +4. Module-specific methods in `Person.java` (e.g., `health()`, `education()`, `fertility()`) + +### 2. Key Design Patterns + +**JAS-mine Event Scheduling**: +```java +// In SimPathsModel.buildSchedule() +getEngine().getEventQueue().scheduleRepeat( + new SingleTargetEvent(this, Processes.UpdateYear), + 0.0, // Start time + 1.0 // Repeat interval +); +``` + +**Regression-based processes**: +```java +double score = Parameters.getRegression(RegressionName.HealthMentalHMLevel) + .getScore(regressors, Person.class.getDeclaredField("les_c4_lag1")); +``` + +**Alignment**: +```java +ResamplingAlignment.align( + population, // Collection to align + filter, // Subgroup filter + closure, // Alignment closure + targetValue // Target to match +); +``` + +### 3. Adding New Features + +**Example: Add a new person attribute** + +1. **Add field** to `Person.java`: + ```java + private Integer newAttribute; + ``` + +2. **Add getter/setter**: + ```java + public Integer getNewAttribute() { return newAttribute; } + public void setNewAttribute(Integer value) { this.newAttribute = value; } + ``` + +3. **Initialize** in constructor or relevant process method + +4. 
**Update database schema** if persisting (in `PERSON_VARIABLES_INITIAL`) + +5. **Add to outputs** in `SimPathsCollector.java` if needed + +**See**: `documentation/wiki/developer-guide/how-to/new-variable.md` + +### 4. Modifying Parameters + +**Regression coefficients**: Edit Excel files in `input/reg_*.xlsx` + +**Policy parameters**: Edit `input/policy parameters.xlsx` + +**Alignment targets**: Edit `input/align_*.xlsx` + +**Simulation options**: Edit `config/default.yml` or use GUI + +### 5. Adding GUI Parameters + +**Example**: +```java +@GUIparameter(description = "Enable new feature") +private Boolean enableNewFeature = true; +``` + +This automatically adds the parameter to the GUI interface. + +**See**: `documentation/wiki/developer-guide/how-to/add-gui-parameters.md` + +### 6. Testing + +Run tests via: +```bash +mvn test +``` + +Or via IDE test runner. + +### 7. Version Control + +**Branch naming conventions**: +- `feature/your-feature-name` — New features +- `bugfix/issue-number-description` — Bug fixes +- `docs/documentation-topic` — Documentation updates +- `experimental/your-description` — Experimental work + +**Main branches**: +- `main` — Stable release +- `develop` — Development integration + +--- + +## Code Navigation Tips + +**Find where a process runs**: +1. Search for the process name in `SimPathsModel.buildSchedule()` +2. Follow the method call to the implementation + +**Find regression parameters**: +1. Search for `Parameters.getRegression(RegressionName.XXX)` +2. The corresponding Excel file is in `input/reg_XXX.xlsx` + +**Find alignment logic**: +1. Search for classes ending in `Alignment` (e.g., `FertilityAlignment.java`) +2. Check `buildSchedule()` for when alignment occurs + +**Understand data flow**: +1. **Input**: Excel files → `Parameters.java` → Coefficient maps +2. **Process**: Regression score → Probability → Random draw → State change +3. 
**Output**: `SimPathsCollector.java` → CSV/Database + +--- + +## Additional Resources + +- **Full Documentation**: See `documentation/wiki/` for comprehensive guides +- **Getting Started**: `documentation/wiki/getting-started/` +- **Running Simulations**: `documentation/wiki/user-guide/` +- **Model Overview**: `documentation/wiki/overview/` +- **Issues**: [GitHub Issues](https://github.com/centreformicrosimulation/SimPaths/issues) diff --git a/documentation/wiki/developer-guide/internals/file-organisation.md b/documentation/wiki/developer-guide/internals/file-organisation.md deleted file mode 100644 index 0720bb9d1..000000000 --- a/documentation/wiki/developer-guide/internals/file-organisation.md +++ /dev/null @@ -1,142 +0,0 @@ -# File Organisation - -This page describes the directory and package layout of the SimPaths repository. For the generic JAS-mine project structure, see [Project Structure](../jasmine/project-structure.md). - -# Repository Structure - -``` -SimPaths/ -├── config/ # YAML configuration files for simulation runs -│ ├── default.yml # Default simulation parameters (fully annotated) -│ ├── test_create_database.yml # Database creation config (CI) -│ └── test_run.yml # Test run config (CI) -│ -├── documentation/ # Quick-reference docs (this folder) -│ ├── wiki/ # Website source (model description, guides, research) -│ ├── SimPaths_Variable_Codebook.xlsx # Variable definitions for output CSVs -│ ├── SimPaths Stata Parameters.xlsx # Parameter comparison: Stata do-files vs Java -│ └── SimPathsUK_Schedule.xlsx # Event schedule with corresponding Java classes -│ -├── input/ # Input data and parameters -│ ├── InitialPopulations/ -│ │ ├── training/ # De-identified training population (included in repo) -│ │ └── compile/ # Stata pipeline: builds populations, estimates regressions -│ │ ├── do_emphist/ # Employment history reconstruction sub-pipeline -│ │ └── RegressionEstimates/ # Regression coefficient estimation scripts -│ ├── EUROMODoutput/ -│ │ └── 
training/ # Training UKMOD outputs (included in repo) -│ ├── DoFilesTarget/ # Stata scripts that generate alignment targets -│ ├── reg_*.xlsx # Regression coefficient tables -│ ├── align_*.xlsx # Alignment targets -│ ├── projections_*.xlsx # ONS demographic projections -│ ├── scenario_*.xlsx # Scenario-specific parameter overrides -│ ├── policy parameters.xlsx # Tax-benefit policy parameters -│ ├── validation_statistics.xlsx # Validation targets -│ ├── input.mv.db # H2 donor database (generated by setup) -│ ├── EUROMODpolicySchedule.xlsx # Policy year mapping (generated by setup) -│ └── DatabaseCountryYear.xlsx # Macro parameters (generated by setup) -│ -├── output/ # Simulation outputs (created at runtime) -│ └── / -│ ├── csv/ -│ │ ├── Statistics1.csv # Income distribution, Gini, S-Index -│ │ ├── Statistics2.csv # Demographics by age and gender -│ │ ├── Statistics3.csv # Alignment diagnostics -│ │ ├── Person.csv # Person-level output -│ │ ├── BenefitUnit.csv # Benefit-unit-level output -│ │ └── Household.csv # Household-level output -│ ├── database/ # Run-specific persistence output -│ └── input/ # Copied run input artifacts -│ -├── src/ -│ ├── main/java/simpaths/ -│ │ ├── data/ # Parameters, input parsing, filters, statistics -│ │ ├── experiment/ # Entry points: SimPathsStart, SimPathsMultiRun, -│ │ │ # SimPathsCollector, SimPathsObserver -│ │ └── model/ # Core simulation: Person, BenefitUnit, Household, -│ │ ├── decisions/ # intertemporal optimisation grids -│ │ ├── enums/ # categorical variable definitions -│ │ ├── taxes/ # EUROMOD donor matching -│ │ └── lifetime_incomes/ # synthetic income trajectory generation -│ └── test/java/simpaths/ # Unit and integration tests -│ -├── validation/ # Stata validation scripts and reference graphs -│ ├── 01_estimate_validation/ # Predicted vs observed for each regression module -│ └── 02_simulated_output_validation/ # Simulated output vs UKHLS survey data -│ -├── pom.xml # Maven build configuration -├── singlerun.jar # 
Single-run executable -└── multirun.jar # Multi-run executable -``` - - -## Sub-package detail - -The following sub-packages are self-contained subsystems whose internals are not obvious from the class names alone. - -### `model/decisions/` — IO engine - -When IO is enabled, computing optimal consumption–labour choices for every agent at every time step would be prohibitively slow. This package solves the problem once before the simulation runs: it constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents simply look up their current state in the pre-computed grid. - -| Class | Purpose | -| --- | --- | -| `DecisionParams` | Defines the state-space dimensions and grid parameters for the optimisation problem. | -| `ManagerPopulateGrids` | Populates the state-space grid points and evaluates value functions by backward induction. | -| `ManagerSolveGrids` | Solves for optimal policy at each grid point. | -| `ManagerFileGrids` | Reads and writes pre-computed grids to disk, so they can be reused across runs. | -| `Grids` | Container for the set of solved decision grids. | -| `States` | Enumerates the state variables that define each grid point. | -| `Expectations` / `LocalExpectations` | Computes expected future values over stochastic transitions. | -| `CESUtility` | CES utility function used in the optimisation. | - -### `model/taxes/` — EUROMOD donor matching - -Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. - -| Class | Purpose | -| --- | --- | -| `DonorTaxImputation` | Main entry point. Implements the three-step matching process: coarse-exact matching on characteristics, income proximity filtering, and candidate selection/averaging. 
| -| `KeyFunction` / `KeyFunction1`–`4` | Four progressively relaxed matching-key definitions. The system tries the tightest key first and falls back through wider keys if no donors are found. | -| `DonorKeys` | Builds composite matching keys from benefit-unit characteristics. | -| `DonorTaxUnit` / `DonorPerson` | Represent the pre-computed EUROMOD donor records loaded from the database. | -| `CandidateList` | Ranked list of donor matches for a given benefit unit, sorted by income proximity. | -| `Match` / `Matches` | Store the final selected donor(s) and their imputed tax-benefit values. | - -The `taxes/database/` sub-package handles loading donor data from the H2 database into memory (`TaxDonorDataParser`, `DatabaseExtension`, `MatchIndices`). - -### `model/lifetime_incomes/` — synthetic income trajectories - -When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles. - -| Class | Purpose | -| --- | --- | -| `ManagerProjectLifetimeIncomes` | Generates the synthetic income trajectory database for all birth cohorts in the simulation horizon. | -| `LifetimeIncomeImputation` | Matches each simulated person to a donor income trajectory via binary search on the income CDF. | -| `AnnualIncome` | Implements the AR(2) income process with age-gender anchoring. | -| `BirthCohort` | Groups individuals by birth year for cohort-level income projection. | -| `Individual` | Entity carrying age dummies and log GDP per capita for income regression. | - -CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. - -For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. 
For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../documentation/wiki/overview/parameterisation.md) on the website. - -## Setup-generated artifacts - -Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: - -- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes -- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems -- `DatabaseCountryYear.xlsx` — year-specific macro parameters - -These must exist before any simulation run. If they are missing, re-run setup. - -## Training mode - -The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. - -## Logging - -With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). 
- ---- - diff --git a/documentation/wiki/developer-guide/internals/index.md b/documentation/wiki/developer-guide/internals/index.md index 25aa30320..1bd4468bf 100644 --- a/documentation/wiki/developer-guide/internals/index.md +++ b/documentation/wiki/developer-guide/internals/index.md @@ -5,7 +5,7 @@ This section documents the internal structure of the SimPaths codebase — how i ## Sections - [SimPaths API](api.md) — public API reference -- [File Organisation](file-organisation.md) — directory and package layout +- [Repository Guide](../repository-guide.md) — directory and package layout - [The SimPathsModel Class](simpaths-model.md) — the central model class - [Start Class Implementation](start-class-implementation.md) — SimPaths-specific start class - [MultiRun Implementation](multirun-implementation.md) — SimPaths-specific MultiRun class diff --git a/documentation/wiki/developer-guide/internals/multirun-implementation.md b/documentation/wiki/developer-guide/internals/multirun-implementation.md index ab2f03dfd..71ea39ffc 100644 --- a/documentation/wiki/developer-guide/internals/multirun-implementation.md +++ b/documentation/wiki/developer-guide/internals/multirun-implementation.md @@ -3,3 +3,140 @@ !!! warning "In progress" This page is under development. Contributions welcome — see the [Developer Guide](../index.md) for how to contribute. + +## Running the MultiRun in the command line + +### Prerequisites + +- Java 19 +- Maven 3.8+ +- Optional IDE: IntelliJ IDEA (import as a Maven project) + +### Build and run + +In the command line, navigate to the project directory and run the following commands: + +```bash +mvn clean package +java -jar multirun.jar -DBSetup +java -jar multirun.jar +``` + +The first command builds the JARs. The second creates the H2 donor database from the input data. The third runs the simulation using `default.yml`. 
+ +To use a different config file: + +```bash +java -jar multirun.jar -config my_run.yml +``` + +For configuration options, see the annotated `config/default.yml`. For the data pipeline and further reference, see [`documentation/`](../../developer-guide/repository-guide.md). + + + + + \ No newline at end of file diff --git a/documentation/wiki/developer-guide/repository-guide.md b/documentation/wiki/developer-guide/repository-guide.md new file mode 100644 index 000000000..f7b2c0c9e --- /dev/null +++ b/documentation/wiki/developer-guide/repository-guide.md @@ -0,0 +1,533 @@ +# SimPaths Repository Guide + +A guide to navigating the SimPaths repository structure and codebase. + +--- + +## Table of Contents + +1. [Repository Structure](#repository-structure) +2. [Core Components](#core-components) +3. [Key Directories Explained](#key-directories-explained) +4. [Sub-package Detail](#sub-package-detail) +5. [Data Pipeline Reference](#data-pipeline-reference) +6. [Development Workflow](#development-workflow) +7. [Code Navigation Tips](#code-navigation-tips) +8. 
[Additional Resources](#additional-resources) + +--- + +## Repository Structure + +``` +SimPaths/ +├── config/ # Configuration files for simulations +│ ├── default.yml # Default simulation parameters +│ ├── test_create_database.yml # Database creation test config +│ └── test_run.yml # Test run configuration +│ +├── documentation/ # Comprehensive documentation +│ ├── figures/ # Diagrams and illustrations +│ ├── wiki/ # Full documentation website +│ │ ├── getting-started/ # Setup and first simulation guides +│ │ ├── overview/ # Model description and modules +│ │ ├── user-guide/ # Running simulations +│ │ ├── developer-guide/ # Extending the model +│ │ │ └── repository-guide.md # Repository guide (copy for website) +│ │ ├── jasmine-reference/ # JAS-mine library reference +│ │ ├── research/ # Published papers +│ │ └── validation/ # Model validation results +│ ├── repository-guide.md # Repository structure and navigation guide +│ ├── SimPaths_Variable_Codebook.xlsx # Codebook of all variables in SimPaths +│ ├── SimPaths_Stata_Parameters.xlsx # Comparison of parameters: Stata do-files vs Java code +│ └── SimPathsUK_Schedule.xlsx # Detailed schedule of events and corresponding classes +│ +├── input/ # Input data and parameters +│ ├── InitialPopulations/ # Starting population data +│ │ ├── training/ # De-identified training population (included in repo) +│ │ └── compile/ # Stata pipeline: builds populations, estimates regressions +│ │ ├── do_emphist/ # Employment history reconstruction sub-pipeline +│ │ └── RegressionEstimates/ # Regression coefficient estimation scripts +│ ├── EUROMODoutput/ # Tax-benefit model outputs +│ │ └── training/ # Training UKMOD outputs (included in repo) +│ ├── DoFilesTarget/ # Stata scripts that generate alignment targets +│ ├── align_*.xlsx # Alignment files (population, employment, etc.) 
+│ ├── reg_*.xlsx # Regression parameter files +│ ├── scenario_*.xlsx # Scenario configuration files +│ ├── projections_*.xlsx # Mortality/fertility projections +│ ├── DatabaseCountryYear.xlsx # Database metadata +│ ├── EUROMODpolicySchedule.xlsx # Policy schedule +│ ├── policy parameters.xlsx # Tax-benefit parameters +│ ├── validation_statistics.xlsx # Validation targets +│ └── input.mv.db # H2 donor database (generated by setup) +│ +├── output/ # Simulation outputs +│ ├── [timestamp]_[seed]_[run]/ # Timestamped output folders +│ │ ├── csv/ +│ │ │ ├── Statistics1.csv # Income distribution, Gini, S-Index +│ │ │ ├── Statistics2.csv # Demographics by age and gender +│ │ │ ├── Statistics3.csv # Alignment diagnostics +│ │ │ ├── Person.csv # Person-level output +│ │ │ ├── BenefitUnit.csv # Benefit-unit-level output +│ │ │ └── Household.csv # Household-level output +│ │ ├── database/ # Run-specific persistence output +│ │ └── input/ # Copied run input artifacts +│ └── logs/ # Log files (with -f flag on multirun) +│ +├── src/ # Source code +│ ├── main/ +│ │ ├── java/simpaths/ +│ │ │ ├── data/ # Data handling and parameters +│ │ │ ├── experiment/ # Simulation execution classes +│ │ │ └── model/ # Core model implementation +│ │ │ ├── decisions/ # Intertemporal optimisation grids +│ │ │ ├── enums/ # Categorical variable definitions +│ │ │ ├── taxes/ # EUROMOD donor matching +│ │ │ └── lifetime_incomes/ # Synthetic income trajectory generation +│ │ └── resources/ # Configuration resources +│ └── test/ # Test classes +│ +├── validation/ # Validation scripts and results +│ ├── 01_estimate_validation/ # Estimation validation +│ └── 02_simulated_output_validation/ # Output validation +│ +├── pom.xml # Maven build configuration +├── singlerun.jar # Executable for single runs +├── multirun.jar # Executable for multiple runs +└── README.md # Project overview +``` + +--- + +## Core Components + +### 1. 
**Entry Points** + +#### SimPathsStart (`src/main/java/simpaths/experiment/SimPathsStart.java`) +- Main class for single simulation execution +- Handles GUI and command-line interfaces +- Manages database setup phases +- **Key methods**: + - `main()`: Entry point + - `runGUIdialog()`: Launch GUI + - `runGUIlessSetup()`: Command-line setup + +#### SimPathsMultiRun (`src/main/java/simpaths/experiment/SimPathsMultiRun.java`) +- Coordinates multiple simulation runs +- Manages parallel execution +- Aggregates results across runs +- Configurable via YAML files + +### 2. **Core Model** + +#### SimPathsModel (`src/main/java/simpaths/model/SimPathsModel.java`) +- Central simulation manager +- Extends `AbstractSimulationManager` from JAS-mine +- Defines the simulation schedule via `buildSchedule()` +- Manages all simulation modules and processes +- **Key responsibilities**: + - Population initialization + - Event scheduling + - Module coordination + - Time progression + +### 3. **Data & Parameters** + +#### Parameters (`src/main/java/simpaths/data/Parameters.java`) +- Global parameter storage +- Loads regression coefficients from Excel +- Manages country-specific configurations +- Stores alignment targets +- **Key data structures**: + - Regression coefficient maps + - Policy parameters + - Alignment targets + - EUROMOD variable definitions + +--- + +## Key Directories Explained + +### `/src/main/java/simpaths/` + +#### `data/` +**Purpose**: Data handling, parameter management, and utility classes + +- **Parameters.java**: Global parameter storage and Excel data loading +- **ManagerRegressions.java**: Regression coefficient management +- **CallEUROMOD.java** / **CallEMLight.java**: Interface with tax-benefit models +- **filters/**: Collection filters for querying simulated populations +- **startingpop/**: Initial population data parsing +- **statistics/**: Statistical utilities + +#### `experiment/` +**Purpose**: Simulation execution and coordination + +- 
**SimPathsStart.java**: Single-run entry point +- **SimPathsMultiRun.java**: Multi-run orchestration +- **SimPathsCollector.java**: Output collection and aggregation +- **SimPathsObserver.java**: GUI updates and monitoring + +#### `model/` +**Purpose**: Core simulation logic + +- **SimPathsModel.java**: Main simulation manager +- **Person.java**: Individual-level processes and attributes +- **BenefitUnit.java**: Fiscal unit processes +- **Household.java**: Residential unit processes +- **decisions/**: Labour supply and consumption optimization +- **enums/**: Type-safe enumerations (Gender, Country, HealthStatus, etc.) +- **taxes/**: Tax-benefit donor matching system +- **lifetime_incomes/**: Lifetime income projection utilities + +### `/input/` + +**Critical input files**: + +| File Pattern | Purpose | +|--------------|---------| +| `align_*.xlsx` | Alignment targets (population, employment, education, etc.) | +| `reg_*.xlsx` | Regression parameters for behavioral processes | +| `scenario_*.xlsx` | Policy scenarios and projections | +| `projections_*.xlsx` | Demographic projections (mortality, fertility) | +| `DatabaseCountryYear.xlsx` | Tracks current database country/year | +| `EUROMODpolicySchedule.xlsx` | Tax-benefit policy schedule | +| `policy parameters.xlsx` | Detailed policy parameters | + +**Subdirectories**: +- `InitialPopulations/`: Starting population databases +- `EUROMODoutput/`: Tax-benefit donor population data +- `DoFilesTarget/`: Stata-generated alignment targets + +### `/config/` + +YAML configuration files override default parameters. 
The main file is **default.yml**, which contains several configuration sections: + +- **model_args**: SimPathsModel parameters (alignment switches, behavioral responses) +- **collector_args**: Output options (CSV, database, statistics) +- **parameter_args**: Data directories and input years +- **innovation_args**: Experimental parameters for sensitivity analysis + +Additional configuration files for testing: **test_create_database.yml**, **test_run.yml** + +### `/documentation/wiki/` + +Complete documentation organized by audience: + +- **getting-started/**: Environment setup, data access, first simulation +- **overview/**: Model description, modules, parameterization +- **user-guide/**: GUI, parameter modification, multiple runs +- **developer-guide/**: JAS-mine architecture, internals, how-to guides +- **jasmine-reference/**: Statistical packages, alignment, regression tools +- **research/**: Published papers and validation results + +--- + +## Sub-package Detail + +The following sub-packages are self-contained subsystems whose internals are not obvious from the class names alone. + +### `model/decisions/` — IO engine + +When IO is enabled, computing optimal consumption–labour choices for every agent at every time step would be prohibitively slow. This package solves the problem once before the simulation runs: it constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents simply look up their current state in the pre-computed grid. + +| Class | Purpose | +| --- | --- | +| `DecisionParams` | Defines the state-space dimensions and grid parameters for the optimisation problem. | +| `ManagerPopulateGrids` | Populates the state-space grid points and evaluates value functions by backward induction. | +| `ManagerSolveGrids` | Solves for optimal policy at each grid point. 
| +| `ManagerFileGrids` | Reads and writes pre-computed grids to disk, so they can be reused across runs. | +| `Grids` | Container for the set of solved decision grids. | +| `States` | Enumerates the state variables that define each grid point. | +| `Expectations` / `LocalExpectations` | Computes expected future values over stochastic transitions. | +| `CESUtility` | CES utility function used in the optimisation. | + +### `model/taxes/` — EUROMOD donor matching + +Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records. + +| Class | Purpose | +| --- | --- | +| `DonorTaxImputation` | Main entry point. Implements the three-step matching process: coarse-exact matching on characteristics, income proximity filtering, and candidate selection/averaging. | +| `KeyFunction` / `KeyFunction1`–`4` | Four progressively relaxed matching-key definitions. The system tries the tightest key first and falls back through wider keys if no donors are found. | +| `DonorKeys` | Builds composite matching keys from benefit-unit characteristics. | +| `DonorTaxUnit` / `DonorPerson` | Represent the pre-computed EUROMOD donor records loaded from the database. | +| `CandidateList` | Ranked list of donor matches for a given benefit unit, sorted by income proximity. | +| `Match` / `Matches` | Store the final selected donor(s) and their imputed tax-benefit values. | + +The `taxes/database/` sub-package handles loading donor data from the H2 database into memory (`TaxDonorDataParser`, `DatabaseExtension`, `MatchIndices`). + +### `model/lifetime_incomes/` — synthetic income trajectories + +When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles. 
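As a rough illustration of the AR(2) idea only, the sketch below builds a log-income path as an anchor plus a deviation that follows a generic textbook AR(2) recursion. A geometric mean of income corresponds to an arithmetic mean of log income, so the anchor is added on the log scale. The class name `Ar2IncomeSketch`, the coefficients, the anchors, and the shocks are all assumptions made for illustration; the actual specification lives in `AnnualIncome` and may differ.

```java
// Hypothetical illustration; not the SimPaths AnnualIncome implementation.
class Ar2IncomeSketch {

    // Builds logIncome[t] = anchor[t] + dev[t], where for t >= 2
    // dev[t] = phi1 * dev[t-1] + phi2 * dev[t-2] + shock[t].
    // dev0 and dev1 are the deviations in the first two periods.
    // Assumes shock has at least as many entries as anchor.
    static double[] simulateLogIncome(double[] anchor, double[] shock,
                                      double phi1, double phi2,
                                      double dev0, double dev1) {
        int n = anchor.length;
        double[] dev = new double[n];
        double[] logIncome = new double[n];
        for (int t = 0; t < n; t++) {
            if (t == 0) {
                dev[t] = dev0;
            } else if (t == 1) {
                dev[t] = dev1;
            } else {
                dev[t] = phi1 * dev[t - 1] + phi2 * dev[t - 2] + shock[t];
            }
            logIncome[t] = anchor[t] + dev[t];
        }
        return logIncome;
    }

    public static void main(String[] args) {
        double[] anchor = {10.0, 10.1, 10.2, 10.3}; // assumed anchors on the log scale
        double[] shock = {0.0, 0.0, 0.05, -0.02};   // assumed shocks; random draws in practice
        double[] path = simulateLogIncome(anchor, shock, 0.6, 0.2, 0.1, 0.1);
        for (double value : path) {
            System.out.println(value);
        }
    }
}
```

Passing the shocks in as an array, rather than drawing them inside the method, keeps the sketch deterministic and easy to check by hand.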
+ +| Class | Purpose | +| --- | --- | +| `ManagerProjectLifetimeIncomes` | Generates the synthetic income trajectory database for all birth cohorts in the simulation horizon. | +| `LifetimeIncomeImputation` | Matches each simulated person to a donor income trajectory via binary search on the income CDF. | +| `AnnualIncome` | Implements the AR(2) income process with age-gender anchoring. | +| `BirthCohort` | Groups individuals by birth year for cohort-level income projection. | +| `Individual` | Entity carrying age dummies and log GDP per capita for income regression. | + +CSV filenames follow the pattern `.csv`. With a single run the suffix is `1`; with multiple runs each run produces its own numbered file. + +For a description of the variables in output CSV files, see `documentation/SimPaths_Variable_Codebook.xlsx`. For a description of each `reg_*`, `align_*`, and `scenario_*` input file, see [Model Parameterisation](../overview/parameterisation.md) on the website. + +--- + +## Data Pipeline Reference + +This section explains how the simulation-ready input files in `input/` are generated from raw survey data, and what to do if you need to update or extend them. + +The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately. 
+ +### Data sources + +| Source | Description | Access | +|--------|-------------|--------| +| **UKHLS** (Understanding Society) | Main household panel survey; waves 1 to O (UKDA-6614-stata) | Requires EUL licence from UK Data Service | +| **BHPS** (British Household Panel Survey) | Historical predecessor to UKHLS; used for pre-2009 employment history | Bundled with UKHLS EUL | +| **WAS** (Wealth and Assets Survey) | Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata) | Requires EUL licence from UK Data Service | +| **EUROMOD / UKMOD** | Tax-benefit microsimulation system | See [Tax-Benefit Donors (UK)](../getting-started/data/tax-benefit-donors-uk.md) on the website | + +### Part 1 — Initial populations (`input/InitialPopulations/compile/`) + +**What it produces:** Annual CSV files `population_initial_UK_.csv` used as the starting population for each simulation run. + +**Master script:** `input/InitialPopulations/compile/00_master.do` + +The pipeline runs in numbered stages: + +| Script | What it does | +|--------|-------------| +| `01_prepare_UKHLS_pooled_data.do` | Pools and standardises UKHLS waves | +| `02_create_UKHLS_variables.do` | Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency) | +| `02_01_checks.do` | Data quality checks | +| `03_social_care_received.do` | Social care receipt variables | +| `04_social_care_provided.do` | Informal care provision variables | +| `05_create_benefit_units.do` | Groups individuals into benefit units (tax units) following UK tax-benefit rules | +| `06_reweight_and_slice.do` | Reweighting and year-specific slicing | +| `07_was_wealth_data.do` | Prepares Wealth and Assets Survey data | +| `08_wealth_to_ukhls.do` | Merges WAS wealth into UKHLS records | +| `09_finalise_input_data.do` | Final cleaning and formatting | +| `10_check_yearly_data.do` | Per-year 
consistency checks | +| `99_training_data.do` | Produces the de-identified training population committed to `input/InitialPopulations/training/` | + +#### Employment history sub-pipeline (`compile/do_emphist/`) + +Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable `liwwh` (months employed since Jan 2007) feeds into the labour supply models. + +| Script | Purpose | +|--------|---------| +| `00_Master_emphist.do` | Master; sets parameters and calls sub-scripts | +| `01_Intdate.do` – `07_Empcal1a.do` | Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification | + +### Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`) + +**What it produces:** The `reg_*.xlsx` coefficient tables read by `Parameters.java` at simulation startup. + +**Master script:** `input/InitialPopulations/compile/RegressionEstimates/master.do` + +> **Note:** Income and union-formation regressions depend on predicted wages, so `reg_wages.do` must complete before `reg_income.do` and `reg_partnership.do`. All other scripts can run in any order. 
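A minimal sketch of the ordering constraint in the note, assuming a run plan is simply a list of script names. `RegressionOrderCheck` and `isValidOrder` are hypothetical names, not part of SimPaths or its Stata pipeline. The sketch only checks that `reg_wages.do`, when it is in the plan, is not scheduled after a script that needs predicted wages.

```java
import java.util.List;

// Hypothetical helper; not part of SimPaths or its Stata pipeline.
class RegressionOrderCheck {

    static final String WAGES = "reg_wages.do";

    // Scripts that the note says depend on predicted wages.
    static final List<String> NEEDS_WAGES = List.of("reg_income.do", "reg_partnership.do");

    // Returns false if reg_wages.do appears after a wage-dependent script.
    // Scripts missing from the plan impose no constraint.
    static boolean isValidOrder(List<String> plan) {
        int wagesAt = plan.indexOf(WAGES);
        for (String dependent : NEEDS_WAGES) {
            int dependentAt = plan.indexOf(dependent);
            if (dependentAt >= 0 && wagesAt > dependentAt) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Wages first, then an unrelated script, then a dependent one: valid.
        System.out.println(isValidOrder(List.of("reg_wages.do", "reg_fertility.do", "reg_income.do")));
        // Dependent script scheduled before wages: invalid.
        System.out.println(isValidOrder(List.of("reg_income.do", "reg_wages.do")));
    }
}
```

A plan that omits `reg_wages.do` is treated as valid here, on the assumption that predicted wages from an earlier run are still available; tighten the check if that assumption does not hold.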
+ +**Required Stata packages:** `fre`, `tsspell`, `carryforward`, `outreg2`, `oparallel`, `gologit2`, `winsor`, `reghdfe`, `ftools`, `require` + +| Script | Module | Method | +|--------|--------|--------| +| `reg_wages.do` | Hourly wages | Heckman selection model (males and females separately) | +| `reg_income.do` | Non-labour income | Hurdle model (selection + amount); requires predicted wages | +| `reg_partnership.do` | Partnership formation/dissolution | Probit; requires predicted wages | +| `reg_education.do` | Education transitions | Generalised ordered logit | +| `reg_fertility.do` | Fertility | Probit | +| `reg_health.do` | Physical health (SF-12 PCS) | Linear regression | +| `reg_health_mental.do` | Mental health (GHQ-12, SF-12 MCS) | Linear regression | +| `reg_health_wellbeing.do` | Life satisfaction | Linear regression | +| `reg_home_ownership.do` | Homeownership transitions | Probit | +| `reg_retirement.do` | Retirement | Probit | +| `reg_leave_parental_home.do` | Leaving parental home | Probit | +| `reg_socialcare.do` | Social care receipt and provision | Probit / ordered logit | +| `reg_unemployment.do` | Unemployment transitions | Probit | +| `reg_financial_distress.do` | Financial distress | Probit | +| `programs.do` | Shared utility programs called by the estimation scripts | — | +| `variable_update.do` | Prepares and recodes variables before estimation | — | + +After running, output Excel files are placed in `input/` (overwriting the existing `reg_*.xlsx` files). + +### Part 3 — Alignment targets (`input/DoFilesTarget/`) + +**What it produces:** The `align_*.xlsx` and `*_targets.xlsx` files that the alignment modules use to rescale simulated rates. 
+ +| Script | Output file | +|--------|------------| +| `01_employment_shares_initpopdata.do` | `input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year | +| `01_inSchool_targets_initpopdata.do` | `input/inSchool_targets.xlsx` — school participation rates by year | +| `03_calculate_partneredShare_initialPop_BUlogic.do` | `input/partnered_share_targets.xlsx` — partnership shares by year | +| `03_calculate_partnership_target.do` | Supplementary partnership targets | +| `02_person_risk_employment_stats.do` | `employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction | + +Population projection targets (`align_popProjections.xlsx`) and fertility/mortality projections (`projections_*.xlsx`) come from ONS published projections and are not generated by these scripts. + +### When to re-run each part + +| Situation | What to re-run | +|-----------|---------------| +| Adding a new data year to the simulation | Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets) | +| Re-estimating a behavioural module | Part 2 (the affected `reg_*.do` script only) + Stage 1 validation | +| Updating employment alignment targets | Part 3 (`01_employment_shares_initpopdata.do`) | + +After re-running any part, re-run setup (`singlerun -Setup` or `multirun -DBSetup`) to rebuild `input/input.mv.db` before running the simulation. + +### Setup-generated artifacts + +Running setup (`multirun -DBSetup`) creates or refreshes three files in `input/`: + +- `input.mv.db` — H2 database of EUROMOD donor tax-benefit outcomes +- `EUROMODpolicySchedule.xlsx` — maps simulation years to EUROMOD policy systems +- `DatabaseCountryYear.xlsx` — year-specific macro parameters + +These must exist before any simulation run. If they are missing, re-run setup. 
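The precondition above can be checked before launching a run. The following is a minimal sketch, assuming the three files sit directly in the `input/` directory as listed; `SetupArtifactCheck` and `missingArtifacts` are hypothetical names and are not part of the SimPaths codebase.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the SimPaths codebase.
class SetupArtifactCheck {

    // File names taken from the list of setup-generated artifacts.
    static final List<String> REQUIRED = List.of(
            "input.mv.db",
            "EUROMODpolicySchedule.xlsx",
            "DatabaseCountryYear.xlsx");

    // Returns the names of required files that are absent from inputDir.
    static List<String> missingArtifacts(Path inputDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(inputDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to ./input when no directory is passed on the command line.
        Path inputDir = Path.of(args.length > 0 ? args[0] : "input");
        List<String> missing = missingArtifacts(inputDir);
        if (missing.isEmpty()) {
            System.out.println("All setup artifacts present in " + inputDir);
        } else {
            System.out.println("Missing setup artifacts: " + missing + "; re-run setup.");
        }
    }
}
```

The check only confirms that the files exist; it does not verify that they are up to date with the current input data.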
+ +### Training mode + +The repository includes de-identified training data under `input/InitialPopulations/training/` and `input/EUROMODoutput/training/`. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation. + +### Logging + +With `-f` on `multirun.jar`, logs are written to `output/logs/run_.txt` (stdout) and `output/logs/run_.log` (log4j). + +--- + +## Development Workflow + +### 1. Understanding the Code + +**Start here**: +1. `SimPathsStart.java` — Understand initialization +2. `SimPathsModel.java` — Understand the simulation loop (`buildSchedule()`) +3. `Person.java`, `BenefitUnit.java`, `Household.java` — Understand agents +4. Module-specific methods in `Person.java` (e.g., `health()`, `education()`, `fertility()`) + +### 2. Key Design Patterns + +**JAS-mine Event Scheduling**: +```java +// In SimPathsModel.buildSchedule() +getEngine().getEventQueue().scheduleRepeat( + new SingleTargetEvent(this, Processes.UpdateYear), + 0.0, // Start time + 0, // Ordering among events scheduled at the same time + 1.0 // Repeat interval +); +``` + +**Regression-based processes**: +```java +double score = Parameters.getRegression(RegressionName.HealthMentalHMLevel) + .getScore(regressors, Person.class.getDeclaredField("les_c4_lag1")); +``` + +**Alignment**: +```java +ResamplingAlignment.align( + population, // Collection to align + filter, // Subgroup filter + closure, // Alignment closure + targetValue // Target to match +); +``` + +### 3. Adding New Features + +**Example: Add a new person attribute** + +1. **Add field** to `Person.java`: + ```java + private Integer newAttribute; + ``` + +2. **Add getter/setter**: + ```java + public Integer getNewAttribute() { return newAttribute; } + public void setNewAttribute(Integer value) { this.newAttribute = value; } + ``` + +3. **Initialize** in constructor or relevant process method + +4. 
**Update database schema** if persisting (in `PERSON_VARIABLES_INITIAL`) + +5. **Add to outputs** in `SimPathsCollector.java` if needed + +**See**: `documentation/wiki/developer-guide/how-to/new-variable.md` + +### 4. Modifying Parameters + +**Regression coefficients**: Edit Excel files in `input/reg_*.xlsx` + +**Policy parameters**: Edit `input/policy parameters.xlsx` + +**Alignment targets**: Edit `input/align_*.xlsx` + +**Simulation options**: Edit `config/default.yml` or use GUI + +### 5. Adding GUI Parameters + +**Example**: +```java +@GUIparameter(description = "Enable new feature") +private Boolean enableNewFeature = true; +``` + +This automatically adds the parameter to the GUI interface. + +**See**: `documentation/wiki/developer-guide/how-to/add-gui-parameters.md` + +### 6. Testing + +Run tests via: +```bash +mvn test +``` + +Or via IDE test runner. + +### 7. Version Control + +**Branch naming conventions**: +- `feature/your-feature-name` — New features +- `bugfix/issue-number-description` — Bug fixes +- `docs/documentation-topic` — Documentation updates +- `experimental/your-description` — Experimental work + +**Main branches**: +- `main` — Stable release +- `develop` — Development integration + +--- + +## Code Navigation Tips + +**Find where a process runs**: +1. Search for the process name in `SimPathsModel.buildSchedule()` +2. Follow the method call to the implementation + +**Find regression parameters**: +1. Search for `Parameters.getRegression(RegressionName.XXX)` +2. The corresponding Excel file is in `input/reg_XXX.xlsx` + +**Find alignment logic**: +1. Search for classes ending in `Alignment` (e.g., `FertilityAlignment.java`) +2. Check `buildSchedule()` for when alignment occurs + +**Understand data flow**: +1. **Input**: Excel files → `Parameters.java` → Coefficient maps +2. **Process**: Regression score → Probability → Random draw → State change +3. 
**Output**: `SimPathsCollector.java` → CSV/Database + +--- + +## Additional Resources + +- **Full Documentation**: See `documentation/wiki/` for comprehensive guides +- **Getting Started**: `documentation/wiki/getting-started/` +- **Running Simulations**: `documentation/wiki/user-guide/` +- **Model Overview**: `documentation/wiki/overview/` +- **Issues**: [GitHub Issues](https://github.com/centreformicrosimulation/SimPaths/issues) diff --git a/mkdocs.yml b/mkdocs.yml index 76a189813..a6d8d9baf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -119,6 +119,7 @@ nav: - Developer Guide: - developer-guide/index.md - Working in GitHub: developer-guide/working-in-github.md + - Repository Guide: developer-guide/repository-guide.md - JAS-mine Architecture: - developer-guide/jasmine/index.md - Project Structure: developer-guide/jasmine/project-structure.md @@ -129,7 +130,6 @@ nav: - SimPaths Internals: - developer-guide/internals/index.md - SimPaths API: developer-guide/internals/api.md - - File Organisation: developer-guide/internals/file-organisation.md - The SimPathsModel Class: developer-guide/internals/simpaths-model.md - Start Class Implementation: developer-guide/internals/start-class-implementation.md - MultiRun Implementation: developer-guide/internals/multirun-implementation.md