Skip to content

refactor: Fail-fast on malformed CSV rows (remove silent skip) #62

@przemyslawbialon

Description

@przemyslawbialon

Problem

CSV loaders silently skip rows with malformed data:

# pit38/data_sources/stock_loader/csv_loader.py:43-45
except (ValueError, KeyError) as e:
    logger.warning(f"Skipping invalid row: {row}. Error: {str(e)}")
    return None

User symptoms:

  1. Upload CSV with 500 rows
  2. 3 rows have parsing errors (non-standard date, malformed amount)
  3. CLI reports Processed 497 transactions from 1 file(s)
  4. Warning goes to stderr at WARNING level, user usually runs at DEBUG where it's buried
  5. Tax calculation is wrong by 3 rows. User doesn't know.

For a tax tool, silent data loss is a correctness issue, not a UX inconvenience. A missing SELL row means the user underreports income.

Goal

Default behaviour: fail-fast on malformed rows. User must see all parsing errors and decide to ignore them explicitly.

Distinguish two categories:

  • RowSkipped — legitimately not our concern (e.g. Binance DEPOSIT rows are not taxable events). These can be silently dropped.
  • RowMalformed — data corruption, ambiguous parsing, invalid date, etc. These should raise.

Acceptance criteria

  • Introduce two exception-like classes or a result type:
    • RowSkipped(reason: str) — intentional drop
    • RowMalformed(row_number: int, row_data: dict, error: Exception) — fail signal
  • BaseCsvLoader._parse_row() returns Record | RowSkipped | RowMalformed instead of Record | None
  • After loading, loader raises a single MalformedCsvException summarizing all malformed rows (not just the first) so user fixes them all at once
  • CLI flag --best-effort (default: off) to opt into old silent-continue behaviour
  • CLI output always prints a summary: N transactions loaded, M rows skipped (legitimate), K rows malformed
  • If K > 0 and not --best-effort: print rows + errors, exit non-zero
  • Existing tests updated (malformed data in fixtures was silently skipped before, now raises)
  • New test: malformed CSV → raise with all errors enumerated

Implementation sketch

@dataclass(frozen=True)
class RowSkipped:
    reason: str  # "not a tax event" / etc

@dataclass(frozen=True)
class RowMalformed:
    row_number: int
    row_data: dict
    error: str

class MalformedCsvException(Exception):
    def __init__(self, problems: list[RowMalformed]):
        self.problems = problems
        super().__init__(f"{len(problems)} malformed rows")

class BaseCsvLoader(ABC):
    @classmethod
    def load(cls, file_path, strict: bool = True) -> list[Record]:
        records = []
        problems = []
        for i, row in enumerate(csv.DictReader(...), start=1):
            result = cls._parse_row(row)
            match result:
                case RowSkipped(): continue
                case RowMalformed() as p: 
                    problems.append(dataclasses.replace(p, row_number=i))
                case record: records.append(record)
        if problems and strict:
            raise MalformedCsvException(problems)
        for p in problems:
            logger.warning(f"Row {p.row_number}: {p.error}")
        return records

Scale

Priority

P0 for correctness. Silent failure in tax tool = real underreporting risk.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcorrectnessTax-correctness impact (high priority)refactorRefactoring

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions