Conversation
|
@rhelder thanks for your work. I am interesting in merging this and would like to support you with your work on this. I have some initial comments: CLIThere has been a very short discussion in #346 (comment). Is there a reason why you chose I would prefer not to use This --format option could later be extended to also accept ImplementationI have no strong feelings about the csv format and the mapping to yaml/dict/lists. As long as we can express all the data formats that we need it is fine, I think. I am also fine with extending the Contact.parse/create_from/update/... functions for new data types, or adding similar functions. I like the idea that I saw in the code that updating via dict or yaml uses different code paths in the beginning but the converges after the yaml is parsed. (This represents the idea that I have with the CodeI will look at the code more thoroughly at some later time. Please remember to run the tests and type checker to ensure that the code fulfills some baseline. If something is unclear just ask :) |
|
I have dropped support for python 3.9 in main and updated the type hints to use the "new" union operator. So you can rebase your branch and do not have to fix these type related ci failures. |
|
Thanks for your comments @lucc! I'm glad to be able to keep working on this.
No, not really. Out of caution, I aired on the side of adding things rather than changing things, at least where that was possible – but that was just for the sake of getting you a first draft, not because I thought it was necessarily good design. (On the contrary, I didn't want to be overly opinionated about the design at this stage.) I like your suggestion to add a The one part of the duplicated code that isn't strictly duplicated is the help text –
I apologize that I overlooked the type checks – I did run the unit tests, but I didn't notice the type checks (I haven't worked with the Thanks again for reviewing the PR! |
|
Both unit tests and type checks should be passing now. Continuing to work on the other things. |
This will make it possible to create contacts from data formats other than YAML without rendering the data as a YAML string.
This module reads CSV files and returns data that can be read by the 'khard.contacts' module. The module will be used to create new contacts from CSV.
'--format' specifies the output format, either 'yaml' or 'csl'. YAML is still the default output format for 'khard template'. The CSV template will be used when importing contacts from CSV. Just as a YAML template is opened in the user's editor when 'khard new' is invoked without any input, a CSV template will be opened in the user's editor when 'khard new --format csv' is invoked without any input. (This will not be terribly useful, since CSV is not very pleasant to edit in a text editor, but it will be consistent with the current behavior of 'khard new').
This is not particularly useful on its own, but it will be used as a fallback when importing contacts from CSV, in the event that the user doesn't supply any input (i.e., like 'khard new' does).
Check whether or not YAML files and CSV files containing the same data produce equivalent Contact objects. Make one of the CSV files 'jumbled' to show that column order doesn't matter to getting the right result.
|
I've folded all of the functionality from Type and unit tests are passing, but obviously doc tests are failing. They are also failing on I'll look forward to your code review, when you have the time. Sorry for my own slow pace – busy time at work. Thanks! |
lucc
left a comment
There was a problem hiding this comment.
Sorry for the delay. I had a look at the code and left some comments.
| """ | ||
| with contextlib.ExitStack() as stack: | ||
| filename = stack.enter_context( | ||
| self.write_temp_file(template, ".csv") |
There was a problem hiding this comment.
If you only have one file you do not need a stack here. One simple with statement should be enough.
| description="print an empty yaml (default) or CSV template", | ||
| help="print an empty yaml (default) or CSV template") | ||
| template_parser.add_argument( | ||
| "-O", "--format", choices=("yaml", "csv"), default="yaml", |
There was a problem hiding this comment.
I am confused by the -O, I would have expected -f as a short option. What is the reason?
|
|
||
|
|
||
| class Parser: | ||
| """An iterator over rows in a CSV file that returns contact data.""" |
There was a problem hiding this comment.
I do not see any contact data related code in here. I would say this parser can parse csv with nested fields.
| """An iterator over rows in a CSV file that returns contact data.""" | ||
|
|
||
| def __init__(self, input_from_stdin_or_file: str, delimiter: str) -> None: | ||
| """Parse first row to determine structure of contact data. |
There was a problem hiding this comment.
-contact data
+nested field structure| first_row = next(self.reader) | ||
| self.template, self.columns = self._parse_headers(first_row) | ||
|
|
||
| def __iter__(self) -> Iterator[dict]: |
There was a problem hiding this comment.
Can we constraint the dicts in the type signatures more? I think all keys will be strings. And the values might be a union of string, list and dict. When you have to repeat this complex type a lot you ca add an alias at the top of this file or to helpers/typing.py. Compare the aliases there.
| if subkey: | ||
| template[key].setdefault(idx, {}) | ||
| template[key][idx].update({subkey: None}) | ||
| columns.append( |
There was a problem hiding this comment.
Can you put these on one line? The line breaks seem unnecessary.
|
|
||
| return template, columns | ||
|
|
||
| def _get_data(self, row: list[str]) -> None: |
There was a problem hiding this comment.
It seems to me that this function stores its "output" on self via the local variable data_structure. The comment also tries to explain this somehow.
It seems to me that this data is only read by _process_data and the function _get_data and _process_data are only called right after each other in parse(). Can this function be refactored to return the data it would store on self.template and then the instance variable can be removed?
|
|
||
| :param first_row: First row of the CSV file, which must contain column | ||
| headers. | ||
| :returns: The "template" dict and the "columns" list. The structure of |
There was a problem hiding this comment.
I am loosing track as soon as I try to understand what this should look like and what it should do. Maybe a very simple example would be good, or a simple test just for this function. The example in the doc comment above is not for the return value of this function, or is it?
| @@ -0,0 +1,161 @@ | |||
| # Contact template for khard version 0.1.dev1192+g40c9de648 | |||
There was a problem hiding this comment.
The test file does not need all these comments. I think you can remove them.
|
|
||
| class TestCSVParser(unittest.TestCase): | ||
| """Tests the csv module and khard.contacts.Contact.from_dict().""" | ||
| def test_yaml_and_csv_produce_equivalent_contacts(self): |
There was a problem hiding this comment.
This is one big test that probably tests all code in the csv parser. There are two problems with this test:
- It tests everything in one test case, so if any part of the code changes the whole test fails and the developers have little indication why.
- I have to open several other files (the test fixtures) to see what is happening here.
Can you please add some small tests for the parser hat each parse a very small csv data (no files needed) and assert that the dict that is returned is the right one.
E.g:
def test_foo(self):
csv = "foo,bar,baz\nx,y,z"
expected = [{"foo":"x","bar":"y","baz":"z"}]
actual = list(Parser(csv, ","))
self.assertEqual(actual, expected)Like this you could tests many corner cases of the parser and the tests would be easy to read and understand.
The big-box address-book providers (Outlook, Google, Apple) all support importing contacts from CSV files. Khard does not support this, nor to my knowledge does any console address-book software. Because contact data is often distributed in CSV form, it is useful to be able to import contacts from CSV, and it's practically indispensable if you need to manage a large number of contacts. For example, I'm a teacher at the university level, and my students' contact info (most importantly, their emails, to which I frequently need to send group messages) is given to me by the University in CSV format. Because I usually have fifty students at a time, it is not feasible to to import that many contacts into khard without some kind of scripting solution. Since I was using khard's API anyway, I decided to have a go at implementing the feature. The feature was non-trivial to implement, so apologies in advance for the length of this PR. Thanks for reviewing it, and for your work on khard!
CLI
The new feature is implemented via a new subcommand,
import(csvis also provided as an alias).khard importis designed to be as consistent withkhard newas possible.khard importtakes the same options askhard new:-a,--addressbookspecifies the address book into which the contacts should be imported. The user is asked to specify an address book if this option is not supplied.-i,--input-filespecifies the CSV file from which the contacts should be imported (stdin by default).--open-editor,--editgives the user the option to review/edit the contact after the successful creation of each contact (not unlike Apple Contacts, which asks you to review each imported contact).--vcard-versionis the same as forkhard new.khard importtakes one additional option,-dor--delimiter, which allows you to specify what field delimiter is used in your CSV file (',' by default).Like
khard new, if no input is supplied tokhard import, the user's text editor is opened to edit a temporary file containing a template -- in this case, a CSV template. I don't think there's really a use case for this, since CSV files are not very user-friendly to edit in a text editor. But I thought it was important both for consistency withkhard newand also with UNIX tools in general -- if you don't provide any input tocat, for example, it will just hang, because anything you type after executing the command still counts as stdin. So even though opening a CSV template when the user fails to supply any input is not very useful, it at least will be unsurprising to the user.I modified
khard templateto be able to show the CSV template if-cor--csvis passed to it. It still shows the YAML template by default (or, superfluously, with the-yor--yamloption), so this change doesn't break anything. (This functionality could be assigned to a subcommand other thantemplate, but I thought it was neater to fold it intotemplate.)Implementation
There are two main obstacles to implementing the ability to import contacts from CSV. The first is that CSV is a (very) simple data format, and that contacts are complex (evidenced by the fact that khard models them with YAML, which is a complex data format). The solution to this is to specify a clear standard for validly formatting CSV files. Fortunately, Google (https://support.google.com/contacts/answer/15147365?hl=en-GB&co=GENIE.Platform%3DDesktop#zippy=%2Cuse-a-template-spreadsheet-to-create-a-csv-file-to-import) and Outlook (https://support.microsoft.com/en-us/office/create-or-edit-csv-files-to-import-into-outlook-4518d70d-8fe9-46ad-94fa-1494247193c7) provide some clues as to how to specify a standard in a way that might be more or less familiar to people.
The details of this are discussed in a compressed and dry way in the API documentation, and will need to be spelled out in a more friendly way in user-facing documentation (which I am happy to write if you are interested in merging this PR). Here's a quick overview of how column headers need to be specified in order to get certain data structures:
To get something equivalent to the YAML structure
First name: Bruce, the column header should be 'First name' (where 'Bruce' is a value, in that column, in one of the subsequent rows of the CSV file).To get something equivalent to the YAML structure
'Justice League' should be in a column named 'Organisation 1' and 'Wayne Enterprises should be in a column named 'Organisation 2'.
To get something equivalent to the YAML structure
'work' should be in a column named 'Email 1 - type', and 'thebat@justice.org' should be in a column named 'Email 1 - value', in the same row. In the same row, 'home' should be in a column named 'Email 2 - type', and 'bruce@gmail.com' should be in a column named 'Email 2 - value', also in the same row.
The same idea as in 3 applies to addresses. To get something equivalent to the YAML structure
'home' should be in a column named 'Address 1 - type', '1007 Mountain Drive' should be in a column named 'Address 1 - Street', 'Gotham City' should be in a column named 'Address 1 - City', and 'USA' should be in a column named 'Address 1 - Country'.
Finally, in structures like those in 3 and 4, lists are supported. To get something equivalent to the YAML structure
'work' should be in a column named 'Email 1 - type' and in a column named 'Email 2 - type'. 'thebat@justice.org' should be in a column named 'Email 1 - value', and 'bruce@wayne.com' should be in a column named 'Email 2 - value'.
Note that the numbers ('Email 1', 'Email 2') are more or less arbitrary; they are just meant to say that different values are associated with each other (which CSV is not capable of conveying on its own). But the numbers need to start with 1, and they need to be in a sequence. For example, if there's an 'Email 3', there must also be an 'Email 2' and 'Email 1'. (The order that they're presented in the CSV file doesn't matter, though.)
Although the above standard is straightforward enough, processing the data into a form that
khard.YAMLEditable.update()can read (to avoid any duplication of the contact-creation logic that's already performed by that method) can get messy fast. To keep things organized, I ended up implementing the CSV parsing code as a separate module. (The specifics are documented, in hopefully a clear way, within the module.)The second obstacle is that khard's API (both public and private) doesn't readily support creating Contacts out of anything but YAML input. So, although in general I aimed for my PR to only add things, and not change things, there were a couple of places where I had to pry the API open a bit to allow Contacts to be created from the data returned by the CSV parser. The alternative would have been to convert the CSV parser data into YAML, just to be converted back into the same dict -- which seemed pretty silly. However, none of the changes I made were breaking, in the sense that no one who calls into the API will need to make any changes to their code on account of the changes I made.
In particular:
khard.YAMLEditable._parse_yaml()method, so that I could use the same validation logic when creating a new Contact from a dict.khard.YAMLEditable.update()to be either a string or a dictionary. If it's a string, it's parsed as YAML input just like it was before. If it's a dictionary, the dictionary is validated, and then Contact creation proceeds as normal.khard.Contact.from_dict().khard.helpers.interactive.Editor.write_temp_file().Again, these changes should not be noticed.
Tests
The PR includes tests verifying that YAML files and a CSV file containing the same data produce equivalent contacts. The tests also verify that the order of columns in the CSV file does not change the validity of the final result.
To-do
If you are interested in merging this, the remaining tasks I can think of are
importsubcommand.I'm happy to do all of these, of course.
I hope this PR can be helpful. Thanks again for your work on khard!