\documentclass[a4paper,11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{longtable}
\usepackage{array}
\usepackage{algorithm}
\usepackage{algpseudocode}
\title{\textbf{Context-Aware Row Merge, Engineering Version}\\
\large Compatibility-Constrained Graph Search for Reconstructing Sparse Rows}
\author{
Roman Pleshkov\\
\texttt{dmgcodevil@gmail.com}
}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
This note explains the context-aware row merge algorithm from an engineering point of view. The goal is to help a software engineer understand the problem, implement the algorithm, and reason about its practical behavior without reading theorem-heavy material. The algorithm takes sparse, partially overlapping rows, builds a graph over exact tuple matches, and reconstructs more complete rows by searching for missing labels under a compatibility constraint. The key design choice in the v2 algorithm is that graph traversal is allowed only when the candidate row remains consistent with the values already selected for the row being reconstructed. This prevents the transitive leakage that often appears in naive graph-based merging. We describe the data model, graph construction, search strategy, tie handling, provenance, practical tuning, and a recommended benchmark plan.
\end{abstract}
\section{Pseudocode First}
If you only want the implementation idea, this section is enough to get started. The rest of the paper explains why these steps are structured this way and how they map to the repository code.
\begin{algorithm}
\caption{Build Graph from Exact Tuple Matches}
\begin{algorithmic}[1]
\Procedure{BuildGraph}{rows}
\State graph $\gets$ empty graph
\For{each row in rows}
\State graph.rows[row.id] $\gets$ row
\For{each tuple $t$ in row.tuples}
\State bucket $\gets$ graph.buckets[t.label][t.value]
\For{each existing node $n$ already in bucket}
\If{$n.row\_id \neq row.id$}
\State connect row.id and $n.row\_id$
\EndIf
\EndFor
\State insert node(row.id, tuple-index, t.value) into bucket
\EndFor
\EndFor
\State \Return graph
\EndProcedure
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Resolve One Missing Label}
\begin{algorithmic}[1]
\Procedure{FindBestTuple}{graph, sourceRowId, label, selected}
\If{label not present anywhere in graph}
\State \Return None
\EndIf
\State perfectSeeds $\gets [\ ]$
\State directSeeds $\gets [\ ]$
\State sourceRow $\gets$ graph.rows[sourceRowId]
\For{each candidate row in graph}
\If{candidate.id = sourceRowId}
\State continue
\EndIf
\If{candidate is not compatible with selected}
\State continue
\EndIf
\If{candidate overlaps neither sourceRow nor selected}
\State continue
\EndIf
\State seedScore $\gets$ selected-match-gain(candidate, selected)
\If{candidate is a high-priority compatible seed}
\State append $(candidate.id, seedScore)$ to perfectSeeds
\ElsIf{candidate overlaps sourceRow}
\State edgeScore $\gets$ exact-overlap-strength(sourceRow, candidate)
\State append $(candidate.id, seedScore + edgeScore)$ to directSeeds
\EndIf
\EndFor
\If{perfectSeeds is not empty}
\State result $\gets$ \Call{BestFirstSearch}{graph, sourceRowId, perfectSeeds, label, selected}
\If{result is not None}
\State \Return result
\EndIf
\EndIf
\If{directSeeds is empty}
\For{each compatible one-hop neighbor of sourceRow}
\State append neighbor with its local edge score and selected gain to directSeeds
\EndFor
\EndIf
\State \Return \Call{BestFirstSearch}{graph, sourceRowId, directSeeds, label, selected}
\EndProcedure
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Merge Rows}
\begin{algorithmic}[1]
\Procedure{Merge}{graph, columns}
\State table $\gets \{\}$
\For{each sourceRow in graph}
\State selected $\gets$ copy of sourceRow.tuples
\State provenance $\gets$ paths that start from sourceRow
\State totalScore $\gets 0$
\For{each column in columns}
\If{column already present in selected}
\State continue
\EndIf
\State result $\gets$ \Call{FindBestTuple}{graph, sourceRow.id, column, selected}
\If{result is not None}
\State add result.tuples to selected[column]
\State add result.provenance to provenance[column]
\State totalScore $\gets$ totalScore + result.score
\EndIf
\EndFor
\State key $\gets$ canonicalized value sets for requested columns
\If{key not in table or totalScore $>$ table[key].score}
\State table[key] $\gets$ merged row (selected, provenance, totalScore)
\EndIf
\EndFor
\State remove rows subsumed by broader ambiguity-preserving rows
\State \Return table values
\EndProcedure
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Weighted Compatible Search}
\begin{algorithmic}[1]
\Procedure{BestFirstSearch}{graph, startRowId, seedRows, label, selected}
\State priorityQueue $\gets$ max-heap ordered by score
\State bestScoreByRow $\gets$ default $-\infty$
\State bestTupleScore $\gets -\infty$
\State bestTuples $\gets [\ ]$
\For{each $(rowId, seedScore)$ in seedRows}
\State push $(seedScore, rowId, [startRowId, rowId])$ to priorityQueue
\State bestScoreByRow[rowId] $\gets$ max(bestScoreByRow[rowId], seedScore)
\EndFor
\While{priorityQueue not empty}
\State $(currentScore, currentRowId, path)$ $\gets$ pop highest-score item
\If{currentScore $<$ bestScoreByRow[currentRowId]}
\State continue
\EndIf
\State currentRow $\gets$ graph.rows[currentRowId]
\If{currentRow is not compatible with selected}
\State continue
\EndIf
\For{each tuple $t$ in currentRow with $t.label =$ label}
\If{currentScore $>$ bestTupleScore}
\State bestTupleScore $\gets$ currentScore
\State bestTuples $\gets [t]$
\State reset provenance for winning tuples
\ElsIf{currentScore $=$ bestTupleScore}
\State append $t$ to bestTuples
\EndIf
\State append path to provenance of $t$
\EndFor
\If{currentScore $\leq$ bestTupleScore}
\State continue
\EndIf
\For{each compatible one-hop neighbor nextRow of currentRow}
\If{nextRow.id already appears in path}
\State continue
\EndIf
\State nextScore $\gets$ currentScore + exact-overlap-strength(currentRow, nextRow)
\Statex \hspace{\algorithmicindent} + selected-match-gain(nextRow, selected) - path penalty
\If{nextScore $>$ bestScoreByRow[nextRow.id]}
\State bestScoreByRow[nextRow.id] $\gets$ nextScore
\State push $(nextScore, nextRow.id, path + [nextRow.id])$ to priorityQueue
\EndIf
\EndFor
\EndWhile
\If{bestTuples is empty}
\State \Return None
\EndIf
\State \Return winning tuples with bestTupleScore and provenance
\EndProcedure
\end{algorithmic}
\end{algorithm}
\section{Problem Setting}
Assume we extract structured tuples from several noisy or incomplete sources. Each source produces rows such as:
\begin{verbatim}
R1 = [(TRANSACTION_ID, 11111), (TRANSACTION_AMOUNT, 999.0)]
R2 = [(USER, alex), (USER_ID, 33333), (TRANSACTION_ID, 11111)]
R3 = [(PRODUCT_MAKE, lenovo), (PRODUCT_MODEL, ideapad 3i), (USER, alex)]
\end{verbatim}
No single row contains the full entity. The task is to merge the rows into a more complete table while avoiding incorrect cross-entity combinations.
The problem is harder than a simple join or deduplication:
\begin{itemize}
\item rows are sparse,
\item rows only partially overlap,
\item some values are shared by multiple entities,
\item a path can look locally valid while becoming globally inconsistent after a few hops.
\end{itemize}
The implementation goal is:
\begin{quote}
Given a source row $R$ and a target list of labels, fill in the missing labels by following graph connections to compatible rows, preserve ambiguity when scores tie, and keep provenance for every chosen value.
\end{quote}
\section{Core Idea}
The algorithm treats each extracted row as a node in a graph. Two rows are connected only if they share at least one exact label-value tuple. To fill a missing label for a source row, the algorithm searches through this graph and scores candidate rows. The search is not allowed to move through rows that contradict the values already selected for the partial merge.
This gives three important properties:
\begin{itemize}
\item Graph edges are concrete and explainable because they come from exact shared tuples.
\item Search remains local and deterministic.
\item The compatibility rule blocks many incorrect transitive merges.
\end{itemize}
\section{Why the Compatibility Constraint Matters}
Consider the following rows:
\begin{verbatim}
R1 = [(L1, V1)]
R2 = [(L1, V2)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]
R4 = [(L3, V3), (L4, V5), (L5, V6)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R6 = [(L1, V2), (L3, V3), (L4, V5)]
\end{verbatim}
Without compatibility checks, both $R1$ and $R2$ can reach rows containing $(L3, V3)$, and the graph may accidentally mix values from both branches. In practice that leads to rows like:
\begin{verbatim}
[(L1, V1), (L3, V3), (L4, V5), (L5, V6)]
\end{verbatim}
even though $(L4, V5)$ belongs to the branch anchored by $(L1, V2)$.
The v2 algorithm fixes this by making compatibility explicit:
\begin{quote}
Two rows are compatible if every overlapping label agrees on at least one identical value.
\end{quote}
Once the merge for a source row has already selected $(L1, V1)$ and $(L4, V4)$, any future candidate row that only supports $(L1, V2)$ or $(L4, V5)$ is rejected.
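The compatibility rule can be sketched in a few lines of Python. Here a row is represented as a dictionary from label to a set of normalized values; this representation is illustrative only and does not match the repository's \texttt{Row} and \texttt{Tuple} classes exactly:
\begin{verbatim}
# Sketch of the compatibility rule. A row is modeled here as a
# dict mapping label -> set of normalized values (an assumption
# for illustration, not the repository's data model).

def rows_compatible(row_a, row_b):
    """Two rows are compatible if every label they share agrees
    on at least one identical value."""
    for label in row_a.keys() & row_b.keys():
        if not (row_a[label] & row_b[label]):
            return False
    return True

selected = {"L1": {"V1"}, "L4": {"V4"}}
candidate_good = {"L1": {"V1"}, "L3": {"V3"}}
candidate_bad = {"L1": {"V2"}, "L3": {"V3"}}

print(rows_compatible(selected, candidate_good))  # True
print(rows_compatible(selected, candidate_bad))   # False
\end{verbatim}
With \texttt{selected} containing $(L1, V1)$ and $(L4, V4)$, the second candidate is rejected exactly as described above, because its only value for $L1$ disagrees with the selected state.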
\section{Data Model}
The implementation uses a small set of domain objects:
\begin{itemize}
\item \texttt{Tuple}: one extracted label-value pair plus metadata.
\item \texttt{Row}: a list of tuples coming from one source record.
\item \texttt{Graph}: the row graph plus label-specific buckets for efficient lookup.
\item \texttt{SearchResult}: the best tuple or tied tuples for one requested label.
\item \texttt{DataRow}: one merged output row with score, values, and provenance.
\end{itemize}
Each tuple stores:
\begin{itemize}
\item \texttt{row\_id}: source row identifier,
\item \texttt{label},
\item \texttt{value}: normalized value used for matching,
\item \texttt{source\_value}: original display value,
\item \texttt{path}: source path inside the document,
\item \texttt{value\_type},
\item \texttt{weight}.
\end{itemize}
The normalized \texttt{value} is what the graph matches on. The original \texttt{source\_value} is what the UI or report should display.
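As a rough sketch, the tuple fields listed above could be modeled with Python dataclasses. Field names follow the list in this section; the actual classes in \texttt{core.py} may differ in detail:
\begin{verbatim}
# Minimal dataclass sketch of the data model described above.
# Field names mirror the paper's list; treat this as a starting
# point, not the repository's exact definitions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Tuple:
    row_id: str          # source row identifier
    label: str
    value: str           # normalized value used for matching
    source_value: str    # original display value
    path: str = ""       # source path inside the document
    value_type: str = "string"
    weight: float = 0.0

@dataclass
class Row:
    id: str
    tuples: list = field(default_factory=list)

t = Tuple(row_id="R1", label="TRANSACTION_ID",
          value="11111", source_value="11,111")
print(t.value, t.source_value)  # matching value vs. display value
\end{verbatim}
The split between \texttt{value} and \texttt{source\_value} is what lets the graph match on normalized data while the UI still shows the original spelling.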
\section{Graph Construction}
\subsection{Exact Matching}
The v2 algorithm uses exact tuple identity for connectivity:
\[
\texttt{tuples\_match}(t_1, t_2) \iff
t_1.\texttt{label} = t_2.\texttt{label}
\land
t_1.\texttt{value} = t_2.\texttt{value}
\]
This is intentionally stricter than token overlap. Token overlap is useful during extraction or normalization, but graph merge becomes much safer if graph edges are built only from exact tuple matches.
\subsection{Buckets}
For each label, the implementation keeps a bucket keyed by normalized value. When a new tuple is added, it is inserted into the bucket for its label and value, and the row is connected to all previous rows that contain the same label-value pair.
In repository terms, this is the role of \texttt{Graph.add\_tuple()} in \texttt{core.py}.
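A bucketed construction can be sketched as follows. Buckets are keyed by a (label, normalized value) pair, and each insertion connects the new row to every row already holding that pair. This mirrors the role of \texttt{Graph.add\_tuple()} described above, not its exact code:
\begin{verbatim}
# Sketch of bucketed graph construction (illustrative, not the
# repository's Graph class).
from collections import defaultdict

class Graph:
    def __init__(self):
        self.buckets = defaultdict(set)  # (label, value) -> {row_id}
        self.edges = defaultdict(set)    # row_id -> {row_id}

    def add_tuple(self, row_id, label, value):
        bucket = self.buckets[(label, value)]
        # Connect to every earlier row sharing this exact tuple.
        for other_id in bucket:
            if other_id != row_id:
                self.edges[row_id].add(other_id)
                self.edges[other_id].add(row_id)
        bucket.add(row_id)

g = Graph()
g.add_tuple("R1", "TRANSACTION_ID", "11111")
g.add_tuple("R2", "TRANSACTION_ID", "11111")
print(sorted(g.edges["R1"]))  # ['R2']
\end{verbatim}
Note that the per-insertion cost is proportional to the bucket size, which is why high-frequency values dominate construction time.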
\subsection{Connectivity}
Two rows are directly connected if:
\begin{itemize}
\item they share at least one exact tuple, and
\item they are compatible on every overlapping label.
\end{itemize}
The function \texttt{Graph.get\_directly\_connected\_rows\_with\_scores()} computes these neighbors and gives each neighbor a local edge score.
\section{Selected State}
The selected state is the current partial reconstruction for the source row. It is the most important implementation concept in v2.
The merge starts with:
\begin{verbatim}
selected_state := copy(source_row)
\end{verbatim}
That means existing values from the source row are not treated as hints. They are part of the current state and constrain the rest of the search.
As more labels are resolved, the selected state grows monotonically. Every new search step must remain compatible with the already selected values.
\section{Scoring}
For one search step from the current row to a directly connected row, the implementation uses:
\[
\text{next\_score} =
\text{current\_score}
+ \text{edge\_gain}
+ \text{selected\_gain}
- \text{PATH\_PENALTY}
\]
where:
\begin{itemize}
\item \textbf{edge\_gain}: score from exact shared tuples between the two rows,
\item \textbf{selected\_gain}: bonus when the candidate row agrees with values already selected for the output row,
\item \textbf{PATH\_PENALTY}: constant cost per traversed edge.
\end{itemize}
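The additive step can be sketched directly from the formula. The constants below are illustrative placeholders, not the repository's tuned values, and tuple weights are treated as zero for brevity:
\begin{verbatim}
# Sketch of one scoring step. Rows are dicts of label -> set of
# values; the three constants are assumptions for illustration.
DEFAULT_TUPLE_MATCH_GAIN = 1.0   # per shared exact tuple
SELECTED_MATCH_GAIN = 1.0        # per agreement with selected state
PATH_PENALTY = 0.5               # per traversed edge

def shared_count(row_a, row_b):
    return sum(len(row_a[l] & row_b[l])
               for l in row_a.keys() & row_b.keys())

def edge_gain(row_a, row_b):
    return shared_count(row_a, row_b) * DEFAULT_TUPLE_MATCH_GAIN

def selected_gain(row, selected):
    return shared_count(row, selected) * SELECTED_MATCH_GAIN

def next_score(current_score, current_row, next_row, selected):
    return (current_score
            + edge_gain(current_row, next_row)
            + selected_gain(next_row, selected)
            - PATH_PENALTY)

cur = {"L1": {"V1"}}
nxt = {"L1": {"V1"}, "L3": {"V3"}}
print(next_score(0.0, cur, nxt, selected={"L1": {"V1"}}))  # 1.5
\end{verbatim}
One shared tuple (edge gain 1.0) plus one agreement with the selected state (gain 1.0) minus the hop penalty (0.5) yields 1.5 for this expansion.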
\subsection{Edge Gain}
For exact tuple matching, one shared tuple contributes:
\[
\text{strength\_tuple}(t_1, t_2) =
\texttt{DEFAULT\_TUPLE\_MATCH\_GAIN}
+ t_1.\texttt{weight}
+ t_2.\texttt{weight}
\]
The edge score is the sum over all shared exact tuples.
\subsection{Selected Gain}
The selected gain increases the score of rows that agree with what is already known for the current output row. In the repository, this is implemented by \texttt{strength\_selected()}.
This term is what makes the algorithm context-aware. A row is not judged only by its local overlap with the previous row; it is also judged by how well it fits the whole partial reconstruction.
\subsection{Constant Path Penalty}
The v2 paper models path cost as a constant penalty per traversed edge. In implementation terms, each graph expansion subtracts:
\[
\texttt{PATH\_PENALTY}
\]
This discourages long chains unless those chains continue to provide enough evidence to justify the extra hop.
\section{Search Strategy}
For each missing label, the algorithm performs a weighted best-first search.
\subsection{Seed Rows}
The search uses two kinds of starting points:
\begin{itemize}
\item \textbf{Perfect seeds}: rows already compatible with the selected state and strongly aligned with what is known so far.
\item \textbf{Direct seeds}: compatible rows directly connected to the source row.
\end{itemize}
The implementation checks the stronger seeds first. If a high-quality answer is found there, the algorithm does not need to expand weaker candidates.
\subsection{Best-Score-Per-Node}
The v2 paper corrects an important point from the original presentation: a simple visited set is not enough. The same row may be reached with different scores depending on which path brought us there.
So the search keeps:
\begin{verbatim}
best_score_by_row[row_id]
\end{verbatim}
and only expands a row state if the new path score is at least as good as the best score seen for that row.
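The dominance check can be sketched as a small helper. A row is expanded only when the new path reaches it with a score at least as good as the best score recorded for that row:
\begin{verbatim}
# Sketch of the best-score-per-row check that replaces a plain
# visited set (illustrative helper, not repository code).
def should_expand(best_score_by_row, row_id, score):
    if score < best_score_by_row.get(row_id, float("-inf")):
        return False            # dominated by an earlier path
    best_score_by_row[row_id] = score
    return True

best = {}
print(should_expand(best, "R7", 2.0))  # True  (first visit)
print(should_expand(best, "R7", 1.0))  # False (worse path)
print(should_expand(best, "R7", 3.0))  # True  (better path)
\end{verbatim}
A plain visited set would have rejected the third call, losing the higher-scoring path.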
\subsection{Tie Handling}
If two candidate tuples for the target label end with the same best score, the implementation keeps both values. This is useful in ambiguous settings where the graph contains insufficient evidence to choose a single answer.
The final merged row may therefore contain multiple values for a label. This is a feature, not an error.
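Tie preservation reduces to a standard keep-all-maxima loop over the candidate tuples, sketched here with (value, score) pairs:
\begin{verbatim}
# Sketch of tie preservation: every tuple that reaches the target
# label with the joint-best score is kept.
best_score = float("-inf")
best_values = []

for value, score in [("V4", 3.0), ("V5", 3.0), ("V6", 2.0)]:
    if score > best_score:
        best_score, best_values = score, [value]  # new winner
    elif score == best_score:
        best_values.append(value)                 # preserved tie

print(best_values)  # ['V4', 'V5'] -- both tied values survive
\end{verbatim}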
\subsection{Provenance}
Every winning tuple carries provenance as one or more row paths showing how the search reached it. In practice this makes the merge inspectable:
\begin{itemize}
\item which row contributed the final value,
\item which intermediate rows connected the source row to that value,
\item whether the result was unique or tied.
\end{itemize}
\section{End-to-End Merge Procedure}
For each source row:
\begin{enumerate}
\item initialize the selected state with the tuples already present in the source row,
\item for each requested output label not already present, run best-first search,
\item if a best candidate exists, add it to the selected state,
\item keep provenance for all selected tuples,
\item form a canonical key from the resolved values,
\item keep the highest-scoring merged row for each key.
\end{enumerate}
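Steps 5 and 6 can be sketched as a canonical-key deduplication. The key freezes the resolved value sets for the requested columns into something hashable, and only the highest-scoring merged row survives per key (illustrative helpers, not the repository's exact code):
\begin{verbatim}
# Sketch of canonical-key deduplication for merged rows.
def canonical_key(selected, columns):
    return tuple((c, frozenset(selected.get(c, set())))
                 for c in sorted(columns))

table = {}

def keep_best(selected, columns, score):
    key = canonical_key(selected, columns)
    if key not in table or score > table[key][1]:
        table[key] = (selected, score)

keep_best({"L1": {"V1"}, "L4": {"V4"}}, ["L1", "L4"], 3.0)
keep_best({"L1": {"V1"}, "L4": {"V4"}}, ["L1", "L4"], 2.0)
print(len(table))  # 1 -- the lower-scoring duplicate is dropped
\end{verbatim}
Because value sets (not single values) go into the key, rows that preserve ambiguity get their own key and survive until the final subsumption cleanup.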
After all rows are merged, the implementation runs a small cleanup step that removes rows subsumed by a more general ambiguity-preserving row. This avoids returning both:
\begin{verbatim}
L4 = {V4}
\end{verbatim}
and
\begin{verbatim}
L4 = {V4, V5}
\end{verbatim}
when the second row already dominates the first under the repository's tie policy.
\section{How This Maps to \texttt{core.py}}
The paper-level procedures above map directly to the implementation in \texttt{core.py}. If you want to read the code after reading the pseudocode, this is the best entry point:
\begin{longtable}{|p{0.28\textwidth}|p{0.64\textwidth}|}
\hline
\textbf{Code location} & \textbf{Responsibility} \\
\hline
\texttt{Tuple}, \texttt{Row}, \texttt{DataRow} & Core data structures. \\
\hline
\texttt{tuples\_match()}, \texttt{tuples\_compatible()} & Exact tuple identity and compatibility rules. \\
\hline
\texttt{Graph.add\_tuple()} & Bucketed graph construction and exact-value connections. \\
\hline
\texttt{Graph.get\_directly\_connected\_rows\_with\_scores()} & Compatible neighbors plus local edge scores. \\
\hline
\texttt{Graph.bfs\_find\_best()} & Weighted best-first search with tie preservation and provenance. \\
\hline
\texttt{Graph.find\_best\_tuple()} & Seed selection and per-label search orchestration. \\
\hline
\texttt{merge()} & End-to-end row completion and final table assembly. \\
\hline
\texttt{main.py} & CLI entry point for loading entity files and printing the merged table. \\
\hline
\end{longtable}
\section{Complexity}
Let:
\begin{itemize}
\item $n$ be the number of rows,
\item $m$ be the number of tuples,
\item $d$ be the average number of compatible graph neighbors per row.
\end{itemize}
Graph construction is approximately linear in the number of tuples, plus the collision cost inside each label-value bucket.
The expensive part is search:
\begin{itemize}
\item for each source row,
\item for each missing output label,
\item run a best-first search over reachable compatible rows.
\end{itemize}
In practice, performance is dominated by:
\begin{itemize}
\item how many rows share high-frequency values,
\item how strong the compatibility filter is,
\item how many output labels are missing,
\item how often tie states must be preserved.
\end{itemize}
For engineering use, the main scalability lever is not cleverer graph theory. It is better normalization and better value selection for graph edges.
\section{Implementation Guidance}
\subsection{Normalize Before You Merge}
This algorithm depends heavily on exact tuple identity. If one source emits \texttt{01-06-2025} and another emits \texttt{01062025}, they should be normalized to the same internal value before graph construction.
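A toy normalizer illustrates the point; this particular function is made up for the example and is not part of the repository:
\begin{verbatim}
# Illustrative date normalization before graph construction: both
# source spellings collapse to one internal value, so the graph can
# connect them by exact match. Hypothetical example function.
import re

def normalize_date(raw):
    digits = re.sub(r"\D", "", raw)   # strip separators
    return digits if len(digits) == 8 else raw

print(normalize_date("01-06-2025"))  # '01062025'
print(normalize_date("01062025"))    # '01062025'
\end{verbatim}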
\subsection{Use Exact Matching for Graph Edges}
If fuzzy matching is needed, use it before the graph stage to normalize values or assign weights. Do not silently make graph connectivity fuzzy unless you are ready to handle many more false merges.
\subsection{Keep Ambiguity Visible}
Do not force a single winner when the graph does not justify it. Keeping tied values lets downstream systems decide whether to ask for more evidence, show multiple options, or apply domain-specific tie-breaking.
\subsection{Make Provenance a First-Class Output}
The easiest way to debug row merge is to inspect the path that produced the answer. Provenance is also useful for UI explanations, human review, and regression testing.
\section{Failure Modes}
Even with compatibility constraints, some problems remain:
\begin{itemize}
\item High-frequency values such as common names or default dates may create many weak edges.
\item If extraction produced wrong labels, compatibility may reject good rows or accept bad ones.
\item If normalization is inconsistent, exact matching will miss valid links.
\item The current scoring model is additive and hand-tuned. In some domains, weights may need calibration.
\end{itemize}
These are good reasons to keep a benchmark suite and report error versus controlled noise.
\section{Recommended Evaluation}
For publication and for engineering confidence, the strongest next step is to evaluate the algorithm on a synthetic or semi-synthetic benchmark where noise can be controlled. The benchmark should vary:
\begin{itemize}
\item sparsity,
\item fragmentation,
\item value corruption,
\item distractor rows,
\item shared-value collisions,
\item extraction label noise.
\end{itemize}
The report should show how row reconstruction error changes as each noise source increases. A separate companion report is usually cleaner than packing all benchmark details into the main paper.
\section{Conclusion}
The engineering takeaway is simple:
\begin{quote}
Treat row merge as graph search over exact shared tuples, but only allow paths that stay compatible with the partial row already reconstructed.
\end{quote}
That one constraint is what turns a plausible graph heuristic into a safer and more implementable system. The current repository already contains the essential pieces: exact connectivity, compatibility-aware search, tie preservation, provenance, and a CLI for running the merge on extracted entity files.
\end{document}