Problem
CSV loaders silently skip rows with malformed data:
    # pit38/data_sources/stock_loader/csv_loader.py:43-45
    except (ValueError, KeyError) as e:
        logger.warning(f"Skipping invalid row: {row}. Error: {str(e)}")
        return None
User symptoms:
- Upload CSV with 500 rows
- 3 rows have parsing errors (non-standard date, malformed amount)
- CLI reports: Processed 497 transactions from 1 file(s)
- Warning goes to stderr at WARNING level; the user usually runs at DEBUG, where it's buried
- Tax calculation is wrong by 3 rows. User doesn't know.
For a tax tool, silent data loss is a correctness issue, not a UX inconvenience. A missing SELL row means the user underreports income.
Goal
Default behaviour: fail-fast on malformed rows. User must see all parsing errors and decide to ignore them explicitly.
Distinguish two categories:
- RowSkipped: legitimately not our concern (e.g. Binance DEPOSIT rows are not taxable events). These can be silently dropped.
- RowMalformed: data corruption, ambiguous parsing, invalid date, etc. These should raise.
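As a rough illustration of the split, a row parser for a Binance-style export might classify rows as below. The column names (Operation, Time, Amount), the date format, and the classify_row helper are assumptions made for this example, not taken from the real loaders.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class RowSkipped:
    reason: str


@dataclass(frozen=True)
class RowMalformed:
    row_number: int
    row_data: dict
    error: str


# Assumption: DEPOSIT is the only non-taxable operation in this toy example.
NON_TAXABLE = {"DEPOSIT"}


def classify_row(row: dict, row_number: int):
    """Return a (time, amount) tuple, a RowSkipped, or a RowMalformed."""
    if row.get("Operation") in NON_TAXABLE:
        # Legitimately not our concern: dropped on purpose, with a reason.
        return RowSkipped(reason="not a tax event")
    try:
        when = datetime.strptime(row["Time"], "%Y-%m-%d %H:%M:%S")
        amount = Decimal(row["Amount"])
    except (KeyError, ValueError, InvalidOperation) as e:
        # Anything we cannot parse is reported, never silently dropped.
        return RowMalformed(row_number, row, f"{type(e).__name__}: {e}")
    return when, amount
```

The point of the sketch is that the "skip" decision is made explicitly by the parser, while every parsing failure is turned into a value the caller has to handle.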
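Acceptance criteria
- RowSkipped(reason: str): intentional drop
- RowMalformed(row_number: int, row_data: dict, error: Exception): fail signal
- BaseCsvLoader._parse_row() returns Record | RowSkipped | RowMalformed instead of Record | None
- MalformedCsvException summarizing all malformed rows (not just the first) so the user fixes them all at once
- --best-effort (default: off) to opt into the old silent-continue behaviour
- N transactions loaded, M rows skipped (legitimate), K rows malformed
- --best-effort: print rows + errors, exit non-zero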
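One possible shape for the opt-in flag and the three-count summary line, sketched with argparse. Whether the real CLI in stock.py and crypto.py uses argparse is not stated here, so the parser setup, the program name, and the helper names build_parser and summary_line are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical program name; the real entry points are stock.py / crypto.py.
    parser = argparse.ArgumentParser(prog="pit38-example")
    parser.add_argument("files", nargs="+")
    parser.add_argument(
        "--best-effort",
        action="store_true",  # default: off, so loading stays strict
        help="continue past malformed rows instead of failing",
    )
    return parser


def summary_line(loaded: int, skipped: int, malformed: int) -> str:
    """Format the three counts so skipped and malformed rows are never conflated."""
    return (
        f"{loaded} transactions loaded, "
        f"{skipped} rows skipped (legitimate), "
        f"{malformed} rows malformed"
    )
```

With this layout the loader's strict argument would simply be `not args.best_effort`, so the safe behaviour is what the user gets without passing anything.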
Implementation sketch
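    import csv
    import dataclasses
    import logging
    from abc import ABC
    from dataclasses import dataclass

    logger = logging.getLogger(__name__)

    @dataclass(frozen=True)
    class RowSkipped:
        reason: str  # "not a tax event" / etc

    @dataclass(frozen=True)
    class RowMalformed:
        row_number: int
        row_data: dict
        error: str

    class MalformedCsvException(Exception):
        def __init__(self, problems: list[RowMalformed]):
            self.problems = problems
            super().__init__(f"{len(problems)} malformed rows")

    class BaseCsvLoader(ABC):
        # Record is the loaders' existing record type.
        @classmethod
        def load(cls, file_path, strict: bool = True) -> list[Record]:
            records = []
            problems = []
            # Row numbers count data rows from 1 (the header is not counted).
            for i, row in enumerate(csv.DictReader(...), start=1):
                result = cls._parse_row(row)
                match result:
                    case RowSkipped():
                        continue
                    case RowMalformed() as p:
                        # _parse_row does not know its position; stamp it here.
                        problems.append(dataclasses.replace(p, row_number=i))
                    case record:
                        records.append(record)
            # Strict mode: report every malformed row at once, not just the first.
            if problems and strict:
                raise MalformedCsvException(problems)
            # Best-effort mode: keep going, but log each dropped row.
            for p in problems:
                logger.warning(f"Row {p.row_number}: {p.error}")
            return records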
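To make the intended behaviour concrete, here is a self-contained toy version of the same loop that can be run against an in-memory CSV. It is a sketch under assumptions: the two-column layout (type, amount), the module-level parse_row and load functions, and reading from a text stream instead of file_path are simplifications for the example. It uses isinstance checks rather than match so that it also runs on Python versions before 3.10.

```python
import csv
import io
from dataclasses import dataclass, replace
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class RowSkipped:
    reason: str


@dataclass(frozen=True)
class RowMalformed:
    row_number: int
    row_data: dict
    error: str


class MalformedCsvException(Exception):
    def __init__(self, problems):
        self.problems = problems
        super().__init__(f"{len(problems)} malformed rows")


def parse_row(row: dict):
    # Toy parser: DEPOSIT rows are skipped, unparseable amounts are malformed.
    if row["type"] == "DEPOSIT":
        return RowSkipped("not a tax event")
    try:
        return Decimal(row["amount"])
    except InvalidOperation:
        # Row number is a placeholder here; load() fills in the real one.
        return RowMalformed(0, row, f"bad amount: {row['amount']!r}")


def load(stream, strict: bool = True):
    records, problems = [], []
    for i, row in enumerate(csv.DictReader(stream), start=1):
        result = parse_row(row)
        if isinstance(result, RowSkipped):
            continue
        if isinstance(result, RowMalformed):
            problems.append(replace(result, row_number=i))
        else:
            records.append(result)
    # Collect all problems first, then fail once with the full list.
    if problems and strict:
        raise MalformedCsvException(problems)
    return records


SAMPLE = "type,amount\nSELL,10.5\nDEPOSIT,100\nSELL,12;30\nBUY,abc\n"


def demo():
    """Return (malformed row numbers seen in strict mode, records in lenient mode)."""
    try:
        load(io.StringIO(SAMPLE))
        strict_rows = []
    except MalformedCsvException as exc:
        strict_rows = [p.row_number for p in exc.problems]
    lenient = load(io.StringIO(SAMPLE), strict=False)
    return strict_rows, lenient
```

In strict mode the sample fails with both bad rows reported together (data rows 3 and 4), while strict=False returns only the one parseable SELL record. That difference is exactly what the user currently cannot see.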
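Scale
BaseCsvLoader (refactor(#9 PR 2): BaseCsvLoader ABC and subclass loaders #56), both loaders, CLI (stock.py, crypto.py)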
Priority
P0 for correctness. Silent failure in tax tool = real underreporting risk.
Related