Add compare tool #280
As I mentioned above, most of the diffing logic is done in the other PR. If you can, please focus on the CLI itself in this one. To run it, do:
poetry install
poetry shell
compare dataframes DF1 DF2
# or
compare etl_catalog CHANNEL NAMESPACE DATASET TABLE
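For readers following the CLI shape described above, here is a minimal sketch of a `compare` command with the two subcommands `dataframes` and `etl_catalog`. It uses the standard library's `argparse` purely for illustration; the PR itself may well use a different CLI library, and `build_parser` is a hypothetical helper name, not something taken from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the two invocations shown above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="compare")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # compare dataframes DF1 DF2
    dataframes = subparsers.add_parser("dataframes", help="Compare two dataframe files.")
    dataframes.add_argument("df1")
    dataframes.add_argument("df2")

    # compare etl_catalog CHANNEL NAMESPACE DATASET TABLE
    etl_catalog = subparsers.add_parser(
        "etl_catalog", help="Compare a local table against the catalog."
    )
    for name in ("channel", "namespace", "dataset", "table"):
        etl_catalog.add_argument(name)

    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["dataframes", "a.csv", "b.csv"])
print(args.command, args.df1, args.df2)
```

The file names `a.csv` and `b.csv` are placeholders. The actual comparison logic is out of scope here, since it lives in the other PR.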
Hi Daniel, thanks a lot for this! The CLI looks very neat. I can't go through the PR now in detail, but I have some general preliminary comments:
Marigold left a comment
Looks good! A couple of minor comments (feel free to ignore). I'm gonna try it on real datasets when I have a chance, but that shouldn't prevent merging this.
def yield_list_lines(
    description: str, items: Iterable[Any]
) -> Generator[str, None, None]:
    sublines = [item for item in items]
(Could be list(items) I think)
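To illustrate the suggestion, here is a small sketch showing that `list(items)` produces the same result as the identity comprehension for any iterable, including a one-shot generator. The helper `materialize` is a hypothetical name used only for this example.

```python
from typing import Any, Iterable, List


def materialize(items: Iterable[Any]) -> List[Any]:
    """Copy any iterable into a list; equivalent to [item for item in items]."""
    return list(items)


# Both forms consume the iterable once and yield an equal list.
via_list = materialize(n * n for n in range(4))
via_comprehension = [item for item in (n * n for n in range(4))]
assert via_list == via_comprehension
```

The two forms behave the same; `list(items)` simply states the intent more directly.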
)


def load_table(path_str: str) -> catalog.Table:
(There's a similar method that could be used instead, but it'd probably look weirder than your method)
    help="Print truncated lists if they are longer than the given length.",
)
def etl_catalog(
    channel: str,
(For me it would be more natural to write it as a path / URI channel/namespace/dataset/table instead of having them separated)
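A rough sketch of what the single-path variant could look like: one `channel/namespace/dataset/table` string is split into the same four parts the command currently takes separately. `TableRef` and `parse_table_path` are hypothetical names for illustration and are not part of the PR.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The four components currently passed as separate CLI arguments."""

    channel: str
    namespace: str
    dataset: str
    table: str


def parse_table_path(path: str) -> TableRef:
    """Split 'channel/namespace/dataset/table' into its four components."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"expected channel/namespace/dataset/table, got {path!r}"
        )
    return TableRef(*parts)


# Placeholder values, not a real catalog entry.
ref = parse_table_path("some_channel/some_namespace/some_dataset/some_table")
print(ref.channel, ref.table)
```

With this shape the command would take a single positional argument, at the cost of validating the path format itself.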
1dbf7bb to f29bc79

This PR implements the first version of a compare CLI that allows easy comparison of either arbitrary dataframes (feather, csv, parquet) or ETL files (in that case, against the version currently uploaded in the production catalog). Much of the diffing logic lives in the data-utils-py repo and is handled in this PR: owid/owid-datautils-py#29.
This PR currently includes a temporary copy of the new HighlevelDiff class for development. The copy will be removed from this repo and the class imported from data-utils-py once the PR over there is merged.