ContextLens/report_noise_evaluation.tex at main · dmgcodevil/ContextLens · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
\documentclass[a4paper,11pt]{article}

\usepackage[margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{longtable}

\title{\textbf{Noise Benchmark and Evaluation Report}\\
\large Context-Aware Row Merge}
\author{
    Roman Pleshkov\\
    \texttt{dmgcodevil@gmail.com}
}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
This document is a practical benchmark and reporting template for evaluating the context-aware row merge algorithm under controlled noise. The goal is to measure how reconstruction quality changes as the input becomes sparser, noisier, and more ambiguous. The recommended setup is a synthetic or semi-synthetic generator that starts from clean canonical entities, renders multiple partial row views per entity, injects noise with tunable severity, and exports the result in the same JSON format used by the repository. The main outcome is a set of curves showing error rate as a function of noise level, plus diagnostic metrics for incompatibility, ambiguity, and coverage.
\end{abstract}

\section{Why a Separate Evaluation Report Helps}

The main paper should explain the idea, the algorithm, and the implementation choices. A benchmark section inside the paper is still useful, but the full experimental setup usually grows quickly:

\begin{itemize}
\item dataset generation details,
\item noise families and parameter sweeps,
\item multiple metrics,
\item plots across domains and random seeds,
\item ablations and baseline comparisons.
\end{itemize}

For that reason, this report is designed to stand alone. It can also be attached as an appendix or supplementary material later.

\section{Evaluation Goal}

We want to answer four engineering questions:

\begin{enumerate}
\item How well does the merge reconstruct complete rows when inputs are clean?
\item How quickly does error increase as noise increases?
\item Which kinds of noise hurt most: sparsity, collisions, corruption, or extraction mistakes?
\item Does the v2 compatibility constraint reduce bad cross-entity merges compared with looser search variants?
\end{enumerate}

\section{Benchmark Design}

\subsection{Canonical Clean Table}

Start from a clean latent table of entities. Each entity contains the full set of labels we want the merge to reconstruct. For example, in a finance-commerce domain:

\begin{verbatim}
TRANSACTION_ID
TRANSACTION_DATE
TRANSACTION_AMOUNT
USER
USER_ID
PRODUCT_MAKE
PRODUCT_MODEL
PRODUCT_PRICE
\end{verbatim}

Each canonical entity should be internally coherent and free of ambiguity.

\subsection{Source-Specific Partial Views}

For every canonical entity, generate several partial rows that simulate different sources:

\begin{itemize}
\item transaction logs,
\item user activity logs,
\item inventory records,
\item shipping records,
\item customer support summaries.
\end{itemize}

Each source should expose only a subset of labels. This creates the sparse partially overlapping rows that the algorithm is designed to merge.

\subsection{Export Format}

The easiest path is to generate benchmark data directly in the repository's current entity JSON format so the existing loader can consume it unchanged.

For each row, export:

\begin{itemize}
\item \texttt{id},
\item \texttt{format},
\item \texttt{file\_path},
\item list of tuples with \texttt{path}, \texttt{value}, \texttt{source\_value}, \texttt{value\_type}, and \texttt{label}.
\end{itemize}

\section{Noise Families}

The benchmark should not use a single vague notion of noise. It should vary several independent noise families.

\subsection{Missingness}

Randomly drop tuple values from rows. This measures how much the algorithm depends on dense local evidence.

\subsection{Row Fragmentation}

Split what would have been a larger row into multiple smaller rows. This forces the merge to use longer graph paths.

\subsection{Shared-Value Collisions}

Increase the rate of values shared across different entities, such as:

\begin{itemize}
\item common first names,
\item repeated prices,
\item repeated dates,
\item reused product family names.
\end{itemize}

This is the most important stress test for the compatibility constraint because collisions create tempting but dangerous graph edges.

\subsection{Value Corruption}

Corrupt values with realistic extraction noise:

\begin{itemize}
\item typos,
\item OCR-style digit swaps,
\item punctuation changes,
\item whitespace variation,
\item alternate date formats,
\item casing variation.
\end{itemize}

This measures the dependence of the graph on normalization quality.

\subsection{Label Noise}

Occasionally assign the wrong label to an extracted value. This simulates extraction failures and tests whether compatibility becomes too brittle when labels are wrong.

\subsection{Distractor Rows}

Inject rows that do not belong to any target entity or mix fields from unrelated entities. This tests the algorithm's ability to avoid being pulled into irrelevant paths.

\section{Noise Severity Scale}

For each noise family, define a simple severity scale:

\begin{longtable}{|l|p{0.68\textwidth}|}
\hline
\textbf{Level} & \textbf{Example interpretation} \\
\hline
0 & Clean data or nearly clean data. \\
\hline
1 & Light noise that should still be easy for the merge. \\
\hline
2 & Moderate noise that introduces ambiguity but keeps most links recoverable. \\
\hline
3 & Heavy noise with frequent missing values and confusing shared values. \\
\hline
4 & Severe noise where a substantial part of the graph is ambiguous or corrupted. \\
\hline
\end{longtable}

The exact percentages can differ per noise family. For example:

\begin{itemize}
\item missingness: 0\%, 10\%, 25\%, 40\%, 60\%,
\item corruption: 0\%, 2\%, 5\%, 10\%, 20\%,
\item distractors: 0\%, 10\%, 25\%, 50\%, 100\% extra rows relative to clean rows.
\end{itemize}

\section{Metrics}

The report should track more than a single accuracy number.

\subsection{Per-Label Value Error}

For every requested label, compare the predicted set of values with the ground truth set of values for the entity.

\[
\text{label error rate} = 1 - \text{exact set match accuracy}
\]

This makes tie handling explicit: returning multiple values is only correct when the ground truth for that label is genuinely ambiguous under the chosen benchmark definition.

\subsection{Exact Row Reconstruction Accuracy}

Count a merged row as correct only if all requested labels match the ground truth entity exactly.

This is the strongest end-to-end metric.

\subsection{Incompatible Merge Rate}

Track how often the output row contains values that cannot all belong to the same canonical entity. This is the clearest metric for showing the value of the v2 compatibility constraint.

\subsection{Coverage}

Measure the fraction of requested labels that receive at least one non-placeholder value. A system that avoids mistakes by returning nothing should not look artificially strong.

\subsection{Ambiguity Rate}

Track how often the algorithm returns multiple tied values for one label. This is not automatically bad. It tells us where the graph lacks enough evidence to decide.

\section{Recommended Comparisons}

The most useful evaluation includes at least these systems:

\begin{enumerate}
\item \textbf{v2 algorithm}: exact tuple matching, compatibility constraint, best-score-per-node, tie preservation.
\item \textbf{Original-style variant}: looser search without explicit compatibility gating.
\item \textbf{Ablation without selected gain}: measures the value of context-awareness.
\item \textbf{Ablation without path penalty}: measures whether long chains are overused.
\end{enumerate}

If only one implementation is available today, the report can still reserve sections for these comparisons and fill them in later.

\section{Experimental Protocol}

\begin{enumerate}
\item Choose one or more domains.
\item Generate a clean canonical table.
\item Render partial source rows for each entity.
\item Sweep one noise family at a time while keeping others fixed.
\item For each severity level, run multiple random seeds.
\item Merge rows using the requested output labels.
\item Align merged rows to canonical entities.
\item Compute metrics and confidence intervals.
\end{enumerate}

Sweeping one noise family at a time is important. If all noise sources change together, the resulting plots are hard to interpret.

\section{Suggested Plots}

The headline figure should be:

\begin{quote}
Error rate versus noise severity.
\end{quote}

Recommended plots:

\begin{itemize}
\item exact row error versus missingness,
\item exact row error versus shared-value collision rate,
\item incompatible merge rate versus collision rate,
\item coverage versus missingness,
\item ambiguity rate versus fragmentation,
\item per-label error bars for each domain.
\end{itemize}

\section{Result Tables}

The following tables are useful templates for the final report.

\subsection{Overall Summary}

\begin{longtable}{|l|l|l|l|l|}
\hline
\textbf{System} & \textbf{Noise family} & \textbf{Severity} & \textbf{Exact row accuracy} & \textbf{Incompatible merge rate} \\
\hline
v2 & missingness & 0 & TBD & TBD \\
\hline
v2 & missingness & 1 & TBD & TBD \\
\hline
v2 & collision & 0 & TBD & TBD \\
\hline
v2 & collision & 1 & TBD & TBD \\
\hline
original-style & collision & 1 & TBD & TBD \\
\hline
\end{longtable}

\subsection{Per-Label Performance}

\begin{longtable}{|l|l|l|l|}
\hline
\textbf{Label} & \textbf{Noise family} & \textbf{Severity} & \textbf{Label error rate} \\
\hline
TRANSACTION\_ID & collision & 2 & TBD \\
\hline
USER & collision & 2 & TBD \\
\hline
PRODUCT\_MODEL & corruption & 2 & TBD \\
\hline
\end{longtable}

\section{Interpretation Guidance}

What we hope to see:

\begin{itemize}
\item error should increase gradually rather than catastrophically under moderate missingness,
\item v2 should show a clearly lower incompatible merge rate than looser search variants,
\item collision noise should hurt more than pure missingness,
\item normalization-sensitive labels should degrade fastest under value corruption.
\end{itemize}

What would be a red flag:

\begin{itemize}
\item high exact-row accuracy but also high incompatible merge rate,
\item large drops in coverage even at mild noise levels,
\item instability across random seeds,
\item ambiguity exploding in low-noise settings.
\end{itemize}

\section{Implementation Roadmap}

If we want to turn this report into a full reproducible benchmark, the next engineering steps are:

\begin{enumerate}
\item build a clean canonical entity generator,
\item build a renderer that creates source-specific sparse rows,
\item add configurable noise injectors,
\item export data in the current entity JSON format,
\item add an evaluation runner that executes the merge and computes metrics,
\item generate CSV summaries and plots.
\end{enumerate}

\section{How to Attach This Material}

There are two clean options:

\begin{itemize}
\item \textbf{Separate report}: best if the benchmark becomes large, with many plots and implementation details.
\item \textbf{Appendix or supplement}: good if the paper stays short and you only want one or two pages of experimental detail in the main text.
\end{itemize}

My recommendation is to keep the engineering paper focused on the algorithm and include only a short evaluation summary there. Put the full benchmark methodology and all noise-sweep plots in a separate attachment.

\section{Conclusion}

This report template is designed to make the algorithm defensible from an engineering point of view. A graph merge method is much easier to trust when we can answer a very concrete question:

\begin{quote}
How fast does it fail as the input becomes noisier, and what type of noise causes the failure?
\end{quote}

That is the central story this benchmark should tell.

\end{document}