Commit cb79258

committed
v0
0 parents  commit cb79258

27 files changed

Lines changed: 4020 additions & 0 deletions

.github/workflows/ci.yml

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12", "3.13"]

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python ${{ matrix.python-version }}
        run: uv python install ${{ matrix.python-version }}

      - name: Install dependencies
        run: uv sync --all-extras

      - name: Lint with ruff
        run: uv run ruff check setjoin tests

      - name: Type check with mypy
        run: uv run mypy setjoin

      - name: Run tests
        run: uv run pytest tests/ -v

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python
        run: uv python install 3.12

      - name: Install dependencies
        run: uv sync --all-extras

      - name: Check formatting with ruff
        run: uv run ruff format --check setjoin tests

.github/workflows/docs.yml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
name: Documentation

on:
  push:
    branches: [main]

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python
        run: uv python install 3.12

      - name: Install dependencies
        run: uv sync --all-extras

      - name: Build documentation
        run: uv run sphinx-build docs docs/_build/html

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: docs/_build/html

  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4

.gitignore

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.venv/
venv/
ENV/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/

# Type checking
.mypy_cache/
.dmypy.json
dmypy.json

# Linting
.ruff_cache/

# Documentation
docs/_build/

# LaTeX
*.aux
*.bbl
*.blg
*.fdb_latexmk
*.fls
*.log
*.out
*.synctex.gz
*.toc

# OS
.DS_Store
Thumbs.db

# uv
uv.lock

README.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
# setjoin

Record linkage that keeps groups together. Match persons while preserving household membership, students while respecting school assignments, or any hierarchical data where group integrity matters.

## Installation

```bash
pip install setjoin
```

## Quick Start

```python
import numpy as np
from setjoin import match, HierarchySpec

# Score matrix: how well does each source record match each target?
scores = np.array([
    [10.0, 2.0, 1.0, 1.0],   # Person A scores high with targets 0,1
    [9.0, 10.0, 1.0, 1.0],   # Person B scores high with targets 0,1
    [1.0, 1.0, 10.0, 2.0],   # Person C scores high with targets 2,3
    [1.0, 1.0, 9.0, 10.0],   # Person D scores high with targets 2,3
])

# Define household structure: persons 0,1 are in household 0; persons 2,3 in household 1
hierarchy = HierarchySpec(
    source_groups={0: [0, 1], 1: [2, 3]},
    target_groups={0: [0, 1], 1: [2, 3]},
)

# Match while keeping households together
result = match(scores, method="structure_aware", hierarchy=hierarchy)
print(result.matches)            # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(result.group_assignments)  # {0: 0, 1: 1} - household mappings
```
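
To make the idea of "keeping households together" concrete, here is a minimal brute-force sketch in plain Python. It does not use setjoin and is not its implementation; the helper names `best_assignment` and `group_preserving_match` are hypothetical, and the sketch assumes every source group is mapped to a target group of the same size.

```python
from itertools import permutations


def best_assignment(scores, rows, cols):
    """Brute-force the best one-to-one pairing of rows to cols (equal sizes assumed)."""
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(cols):
        total = sum(scores[r][c] for r, c in zip(rows, perm))
        if total > best_total:
            best_total, best_pairs = total, list(zip(rows, perm))
    return best_total, best_pairs


def group_preserving_match(scores, source_groups, target_groups):
    """Map each source group to one target group, then pair members inside it."""
    src_ids, tgt_ids = list(source_groups), list(target_groups)
    best_total, best_plan = float("-inf"), None
    # Try every way of sending source groups to distinct target groups.
    for perm in permutations(tgt_ids, len(src_ids)):
        total, pairs = 0.0, []
        for g, h in zip(src_ids, perm):
            group_total, group_pairs = best_assignment(
                scores, source_groups[g], target_groups[h]
            )
            total += group_total
            pairs.extend(group_pairs)
        if total > best_total:
            best_total = total
            best_plan = (sorted(pairs), dict(zip(src_ids, perm)))
    return best_plan


scores = [
    [10.0, 2.0, 1.0, 1.0],
    [9.0, 10.0, 1.0, 1.0],
    [1.0, 1.0, 10.0, 2.0],
    [1.0, 1.0, 9.0, 10.0],
]
groups = {0: [0, 1], 1: [2, 3]}
matches, group_assignments = group_preserving_match(scores, groups, groups)
print(matches)            # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(group_assignments)  # {0: 0, 1: 1}
```

Enumerating permutations is only practical for tiny inputs; it is used here to keep the constraint itself visible.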

## When to Use setjoin

- **Household/person matching**: Link survey respondents to administrative records while ensuring all household members map to the same target household
- **Hierarchical data joining**: Match students to schools, employees to firms, or items to orders where group membership must be preserved
- **Soft/probabilistic matching**: Get probability weights instead of hard assignments for uncertainty quantification
- **Calibration to known marginals**: Ensure matched records reproduce known population distributions (age, geography, etc.)

## Examples

### Basic Matching (No Hierarchy)

```python
import numpy as np
from setjoin import hungarian_match, greedy_match

scores = np.array([
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
])

# Optimal global assignment
result = hungarian_match(scores)
print(result.matches)      # [(0, 0), (1, 1), (2, 2)]
print(result.total_score)  # 30.0

# Fast greedy alternative
result = greedy_match(scores)
```
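
The difference between an optimal assignment and a greedy one is easiest to see on a matrix where they disagree. The following plain-Python sketch is independent of setjoin; `greedy_pairs` and `optimal_pairs` are hypothetical helpers, the optimal search is brute force rather than the Hungarian algorithm, and a square score matrix is assumed.

```python
from itertools import permutations


def greedy_pairs(scores):
    """Repeatedly take the highest remaining score whose row and column are free."""
    cells = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)


def optimal_pairs(scores):
    """Brute-force the one-to-one assignment with the highest total score."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return list(enumerate(best))


def total(scores, pairs):
    return sum(scores[i][j] for i, j in pairs)


scores = [
    [10.0, 9.0],
    [9.0, 1.0],
]
greedy = greedy_pairs(scores)
optimal = optimal_pairs(scores)
print(greedy, total(scores, greedy))    # [(0, 0), (1, 1)] 11.0
print(optimal, total(scores, optimal))  # [(0, 1), (1, 0)] 18.0
```

Greedy grabs the single best cell first and is then forced into a poor second pairing, which is the trade-off accepted in exchange for speed.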

### Building Scores from DataFrames

```python
import pandas as pd
from setjoin import Scorer, FieldConfig

source = pd.DataFrame({"age": [25, 30, 35], "income": [50000, 60000, 70000]})
target = pd.DataFrame({"age": [26, 31, 34], "income": [51000, 59000, 72000]})

scorer = Scorer({
    "age": FieldConfig(weight=1.0, comparator="abs_diff"),
    "income": FieldConfig(weight=0.001, comparator="abs_diff"),
})
scores = scorer.score(source, target)
```
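
As a rough mental model of what such a score matrix contains, the sketch below computes a weighted negative absolute difference per field in plain Python. This README states elsewhere that `abs_diff` returns negative distances; combining fields as a weighted sum is an assumption made for illustration, and `abs_diff_scores` is a hypothetical helper, not part of setjoin.

```python
def abs_diff_scores(source, target, weights):
    """score[i][j] = -sum over fields of weight * |source[i] - target[j]|."""
    n_source = len(next(iter(source.values())))
    n_target = len(next(iter(target.values())))
    return [
        [
            -sum(w * abs(source[f][i] - target[f][j]) for f, w in weights.items())
            for j in range(n_target)
        ]
        for i in range(n_source)
    ]


source = {"age": [25, 30, 35], "income": [50000, 60000, 70000]}
target = {"age": [26, 31, 34], "income": [51000, 59000, 72000]}
scores = abs_diff_scores(source, target, {"age": 1.0, "income": 0.001})

# Higher (closer to zero) is better, so the best target is the row-wise argmax.
best_target = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best_target)  # [0, 1, 2]
```

The small income weight keeps a field measured in tens of thousands from swamping a field measured in years.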

### Structure-Aware Matching (Groups)

```python
import pandas as pd
from setjoin import match, HierarchySpec, Scorer, FieldConfig

# Survey data with household IDs
survey = pd.DataFrame({
    "household_id": [1, 1, 2, 2],
    "age": [35, 10, 45, 42],
    "income": [50000, 0, 60000, 58000],
})

# Admin records with household IDs
admin = pd.DataFrame({
    "household_id": [101, 101, 102, 102],
    "age": [36, 11, 44, 43],
    "income": [51000, 0, 59000, 57000],
})

# Build score matrix (higher = better match, abs_diff returns negative distances)
scorer = Scorer({
    "age": FieldConfig(weight=1.0, comparator="abs_diff"),
    "income": FieldConfig(weight=0.0001, comparator="abs_diff"),
})
scores = scorer.score(survey, admin)

# Define hierarchy from dataframes
hierarchy = HierarchySpec.from_dataframe(
    survey, admin,
    source_group_col="household_id",
    target_group_col="household_id",
)

# Match: all members of survey household 1 -> same admin household
result = match(scores, method="structure_aware", hierarchy=hierarchy)
```

### Soft Matching (Uncertainty)

```python
import numpy as np
from setjoin import soft_match

scores = np.array([
    [10.0, 9.0],
    [9.0, 10.0],
])

# Get probabilistic weights instead of hard assignments
weights = soft_match(scores, regularization=0.5)
print(weights.matrix)     # Soft assignment probabilities
print(weights.to_hard())  # Convert to hard matches when needed
```
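
For intuition about entropy-regularized transport, here is a small Sinkhorn iteration in plain Python. It is a sketch of the general technique, not setjoin's code: the `sinkhorn` helper is hypothetical, and uniform row and column marginals (total mass 1) are an assumption that may differ from how `soft_match` normalizes its output.

```python
import math


def sinkhorn(scores, regularization=0.5, n_iter=200):
    """Entropy-regularized transport plan with uniform row and column marginals."""
    n, m = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift scores for numerical stability
    kernel = [[math.exp((s - top) / regularization) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to hit the target marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


plan = sinkhorn([[10.0, 9.0], [9.0, 10.0]])
for row in plan:
    print([round(p, 3) for p in row])
```

A smaller `regularization` pushes the plan toward a hard assignment, while a larger value spreads mass more evenly across candidates.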

### Calibration to Known Marginals

```python
import numpy as np
import pandas as pd
from setjoin import calibrated_match, CalibrationSpec

scores = np.eye(100) * 10  # 100 records
source_df = pd.DataFrame({"region": ["north"] * 60 + ["south"] * 40})

# Target: 50/50 split, not the 60/40 in source
calibration = CalibrationSpec(
    margins={"region": {"north": 0.5, "south": 0.5}}
)

result = calibrated_match(scores, source_df, calibration)
print(result.weights)               # Calibration weights for each match
print(result.calibration_achieved)  # Achieved proportions
```
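
With a single margin, calibration weights reduce to a ratio of target share to observed share. The sketch below shows that arithmetic in plain Python for the 60/40 to 50/50 case; `rake_one_margin` is a hypothetical helper, and how setjoin combines these weights with the matching step is not shown here.

```python
from collections import Counter


def rake_one_margin(categories, target_shares):
    """Weight each record by target share / observed share of its category."""
    counts = Counter(categories)
    n = len(categories)
    return [target_shares[c] / (counts[c] / n) for c in categories]


regions = ["north"] * 60 + ["south"] * 40
weights = rake_one_margin(regions, {"north": 0.5, "south": 0.5})


def weighted_share(category):
    """Share of total weight carried by records in the given category."""
    total = sum(weights)
    return sum(w for w, c in zip(weights, regions) if c == category) / total


print(round(weights[0], 4), round(weights[-1], 4))  # 0.8333 1.25
print(round(weighted_share("north"), 4))            # 0.5
```

With several margins, raking (iterative proportional fitting) repeats this adjustment one margin at a time until the weighted shares stop changing.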

## API Overview

| Function | Purpose |
|----------|---------|
| `match()` | Main entry point - routes to greedy, hungarian, or structure_aware |
| `hungarian_match()` | Optimal 1-to-1 assignment maximizing total score |
| `greedy_match()` | Fast heuristic picking highest scores first |
| `structure_aware_match()` | Optimal assignment preserving group structure |
| `soft_match()` | Probabilistic weights via entropy-regularized transport |
| `calibrated_match()` | Match + rake weights to hit target marginals |
| `Scorer` | Build score matrices from DataFrames with configurable comparators |
| `HierarchySpec` | Define group structure for structure-aware matching |
| `CalibrationSpec` | Define target marginal distributions |

## License

MIT

docs/api.rst

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
API Reference
=============

Matching
--------

.. automodule:: setjoin.matchers
   :members:
   :undoc-members:
   :show-inheritance:

Scoring
-------

.. automodule:: setjoin.scorers
   :members:
   :undoc-members:
   :show-inheritance:

Hierarchy
---------

.. automodule:: setjoin.hierarchy
   :members:
   :undoc-members:
   :show-inheritance:

Diagnostics
-----------

.. automodule:: setjoin.diagnostics
   :members:
   :undoc-members:
   :show-inheritance:

Types
-----

.. automodule:: setjoin.types
   :members:
   :undoc-members:
   :show-inheritance:

Visualization
-------------

.. automodule:: setjoin.plots
   :members:
   :undoc-members:
   :show-inheritance:
