# setjoin

Record linkage that keeps groups together. Match persons while preserving household membership, students while respecting school assignments, or any hierarchical data where group integrity matters.

## Installation

```bash
pip install setjoin
```

## Quick Start

```python
import numpy as np
from setjoin import match, HierarchySpec

# Score matrix: how well does each source record match each target?
scores = np.array([
    [10.0, 2.0, 1.0, 1.0],   # Person A: best scores are on targets 0 and 1
    [9.0, 10.0, 1.0, 1.0],   # Person B: best scores are on targets 0 and 1
    [1.0, 1.0, 10.0, 2.0],   # Person C: best scores are on targets 2 and 3
    [1.0, 1.0, 9.0, 10.0],   # Person D: best scores are on targets 2 and 3
])

# Define household structure: persons 0,1 are in household 0; persons 2,3 in household 1
hierarchy = HierarchySpec(
    source_groups={0: [0, 1], 1: [2, 3]},
    target_groups={0: [0, 1], 1: [2, 3]},
)

# Match while keeping households together
result = match(scores, method="structure_aware", hierarchy=hierarchy)
print(result.matches)            # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(result.group_assignments)  # {0: 0, 1: 1} (source household -> target household)
```

## When to Use setjoin

- **Household/person matching**: Link survey respondents to administrative records while ensuring all household members map to the same target household
- **Hierarchical data joining**: Match students to schools, employees to firms, or items to orders where group membership must be preserved
- **Soft/probabilistic matching**: Get probability weights instead of hard assignments for uncertainty quantification
- **Calibration to known marginals**: Ensure matched records reproduce known population distributions (age, geography, etc.)

## Examples

### Basic Matching (No Hierarchy)

```python
import numpy as np
from setjoin import hungarian_match, greedy_match

scores = np.array([
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
])

# Optimal global assignment
result = hungarian_match(scores)
print(result.matches)      # [(0, 0), (1, 1), (2, 2)]
print(result.total_score)  # 30.0

# Fast greedy alternative
result = greedy_match(scores)
```
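On the diagonal matrix above, greedy and optimal matching agree. The sketch below shows a case where they differ. It does not use setjoin: `greedy_assign` and `optimal_assign` are hypothetical helpers written only for this illustration, and the brute-force search stands in for the Hungarian algorithm on a matrix small enough to enumerate.

```python
from itertools import permutations


def greedy_assign(scores):
    """Take the highest remaining score whose row and column are both still free."""
    cells = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)


def optimal_assign(scores):
    """Brute-force optimum over all permutations; only viable for tiny square matrices."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


def total(scores, pairs):
    return sum(scores[i][j] for i, j in pairs)


scores = [
    [10.0, 9.0],
    [9.0, 1.0],
]

greedy = greedy_assign(scores)    # grabs the 10 first, then is forced onto the 1
optimal = optimal_assign(scores)  # gives up the 10 to collect both 9s

print(greedy, total(scores, greedy))    # [(0, 0), (1, 1)] 11.0
print(optimal, total(scores, optimal))  # [(0, 1), (1, 0)] 18.0
```

Greedy commits to the single best cell and cannot undo that choice, which is why it can lose total score. It remains a reasonable choice when speed matters more than optimality.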

### Building Scores from DataFrames

```python
import pandas as pd
from setjoin import Scorer, FieldConfig

source = pd.DataFrame({"age": [25, 30, 35], "income": [50000, 60000, 70000]})
target = pd.DataFrame({"age": [26, 31, 34], "income": [51000, 59000, 72000]})

scorer = Scorer({
    "age": FieldConfig(weight=1.0, comparator="abs_diff"),
    "income": FieldConfig(weight=0.001, comparator="abs_diff"),
})
scores = scorer.score(source, target)
```
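To make the role of the weights concrete, here is a dependency-free sketch of a weighted absolute-difference score. It assumes the per-field results are combined as a weighted sum of negated distances. That combination rule is an assumption for illustration and may not match what `Scorer` does internally. `abs_diff_scores` is a hypothetical helper, not part of setjoin.

```python
def abs_diff_scores(source, target, weights):
    """score[i][j] = -sum over fields of weight * |source value - target value|."""
    n = len(next(iter(source.values())))
    m = len(next(iter(target.values())))
    return [
        [
            -sum(w * abs(source[f][i] - target[f][j]) for f, w in weights.items())
            for j in range(m)
        ]
        for i in range(n)
    ]


source = {"age": [25, 30, 35], "income": [50000, 60000, 70000]}
target = {"age": [26, 31, 34], "income": [51000, 59000, 72000]}
weights = {"age": 1.0, "income": 0.001}  # 0.001 puts 1000 income units on par with 1 year

scores = abs_diff_scores(source, target, weights)

# Best target per source row (highest score, i.e. smallest weighted distance)
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)  # [0, 1, 2]
```

Without the small income weight, income differences in the thousands would swamp age differences of a few years, so the weights are effectively a unit conversion between fields.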

### Structure-Aware Matching (Groups)

```python
import pandas as pd
from setjoin import match, HierarchySpec, Scorer, FieldConfig

# Survey data with household IDs
survey = pd.DataFrame({
    "household_id": [1, 1, 2, 2],
    "age": [35, 10, 45, 42],
    "income": [50000, 0, 60000, 58000],
})

# Admin records with household IDs
admin = pd.DataFrame({
    "household_id": [101, 101, 102, 102],
    "age": [36, 11, 44, 43],
    "income": [51000, 0, 59000, 57000],
})

# Build score matrix (higher = better match; abs_diff returns negated distances)
scorer = Scorer({
    "age": FieldConfig(weight=1.0, comparator="abs_diff"),
    "income": FieldConfig(weight=0.0001, comparator="abs_diff"),
})
scores = scorer.score(survey, admin)

# Define hierarchy from dataframes
hierarchy = HierarchySpec.from_dataframe(
    survey, admin,
    source_group_col="household_id",
    target_group_col="household_id",
)

# Match: all members of a survey household map to the same admin household
result = match(scores, method="structure_aware", hierarchy=hierarchy)
```
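The group-integrity constraint can be verified independently of the matcher. The sketch below checks that every source group lands in exactly one target group. `check_group_integrity` is a hypothetical helper, not part of setjoin, and it works on a plain list of `(source_index, target_index)` pairs like the one printed in the Quick Start.

```python
def check_group_integrity(matches, source_group_of, target_group_of):
    """Return {source_group: target_group}; raise if any source group is split."""
    mapping = {}
    for s, t in matches:
        sg, tg = source_group_of[s], target_group_of[t]
        # setdefault records the first target group seen for this source group
        if mapping.setdefault(sg, tg) != tg:
            raise ValueError(
                f"source group {sg} is split across target groups {mapping[sg]} and {tg}"
            )
    return mapping


# Row index -> household_id, mirroring the survey/admin frames above
source_group_of = {0: 1, 1: 1, 2: 2, 3: 2}
target_group_of = {0: 101, 1: 101, 2: 102, 3: 102}

household_map = check_group_integrity(
    [(0, 0), (1, 1), (2, 2), (3, 3)], source_group_of, target_group_of
)
print(household_map)  # {1: 101, 2: 102}
```

A matching such as `[(0, 0), (1, 2)]` would raise, because survey household 1 would be spread over admin households 101 and 102.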

### Soft Matching (Uncertainty)

```python
import numpy as np
from setjoin import soft_match

scores = np.array([
    [10.0, 9.0],
    [9.0, 10.0],
])

# Get probabilistic weights instead of hard assignments
weights = soft_match(scores, regularization=0.5)
print(weights.matrix)     # Soft assignment probabilities
print(weights.to_hard())  # Convert to hard matches when needed
```
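Entropy-regularized transport is commonly solved with Sinkhorn iterations. The dependency-free sketch below illustrates that general technique on the same 2x2 matrix. It is not setjoin's implementation. Treating `regularization` as the temperature in `exp(score / regularization)` is an assumption, and `sinkhorn_weights` is a hypothetical name.

```python
import math


def sinkhorn_weights(scores, regularization=0.5, n_iters=200):
    """Alternately rescale rows and columns of exp(score / regularization)."""
    n, m = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift scores for numerical stability
    kernel = [[math.exp((s - top) / regularization) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * m
    col_target = n / m  # rows sum to 1, so the n units of mass are shared by m columns
    for _ in range(n_iters):
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_target / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


scores = [
    [10.0, 9.0],
    [9.0, 10.0],
]
weights = sinkhorn_weights(scores, regularization=0.5)
for row in weights:
    print([round(w, 3) for w in row])
# [0.881, 0.119]
# [0.119, 0.881]
```

Under this formulation, a smaller regularization value pushes the weights toward a hard 0/1 assignment, while a larger one spreads mass more evenly across candidates.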

### Calibration to Known Marginals

```python
import numpy as np
import pandas as pd
from setjoin import calibrated_match, CalibrationSpec

scores = np.eye(100) * 10  # 100 records
source_df = pd.DataFrame({"region": ["north"] * 60 + ["south"] * 40})

# Target: 50/50 split, not the 60/40 in source
calibration = CalibrationSpec(
    margins={"region": {"north": 0.5, "south": 0.5}}
)

result = calibrated_match(scores, source_df, calibration)
print(result.weights)               # Calibration weights for each match
print(result.calibration_achieved)  # Achieved proportions
```
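For a single margin, calibration reduces to post-stratification: each record is weighted by its category's target share divided by the observed share. The sketch below works through the 60/40 to 50/50 case above. `poststratify` is a hypothetical helper, not setjoin's API. With several margins, raking repeats this adjustment one variable at a time until the weights stabilize.

```python
from collections import Counter


def poststratify(categories, target_shares):
    """Weight each record by target share / observed share of its category."""
    counts = Counter(categories)
    n = len(categories)
    return [target_shares[c] / (counts[c] / n) for c in categories]


regions = ["north"] * 60 + ["south"] * 40
weights = poststratify(regions, {"north": 0.5, "south": 0.5})

# Weighted share of "north" after calibration
north_share = sum(w for w, r in zip(weights, regions) if r == "north") / sum(weights)

print(round(weights[0], 4), round(weights[-1], 4))  # 0.8333 1.25
print(round(north_share, 4))                        # 0.5
```

Over-represented categories get weights below 1 and under-represented ones get weights above 1, while the total weight stays equal to the number of records.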

## API Overview

| Function | Purpose |
|----------|---------|
| `match()` | Main entry point; routes to greedy, hungarian, or structure_aware |
| `hungarian_match()` | Optimal 1-to-1 assignment maximizing total score |
| `greedy_match()` | Fast heuristic picking highest scores first |
| `structure_aware_match()` | Optimal assignment preserving group structure |
| `soft_match()` | Probabilistic weights via entropy-regularized transport |
| `calibrated_match()` | Match + rake weights to hit target marginals |
| `Scorer` | Build score matrices from DataFrames with configurable comparators |
| `HierarchySpec` | Define group structure for structure-aware matching |
| `CalibrationSpec` | Define target marginal distributions |

## License

MIT