First of all, thanks a lot for creating gndiff. It's gonna be so useful for me.
I have just tried it with a small file, to get a feel for how it works.
I already found some issues to comment on:
- The description of the input file formats is unclear. It says "Prepare two files with names. There are 3 possible file formats:" but only two formats are actually mentioned: (1) a simple list, one name per line; (2) a CSV file with some other fields (see below).
Also, it is unclear to me if the CSV format applies only to the reference.csv file or also to source.csv:
- The names to be matched (source.csv) might also have their own IDs, but it is unclear to me whether gndiff expects the user to provide them (i.e., to add them as a new column in the output, so it is easier for me to rejoin that output with my original database).
I think that is not the idea, because the output already provides an auto-numbered index. So I understand source.csv would usually contain just one field, with names and nothing else (just one column). With one possible exception:
- Except when using `Family`: I suppose in that case it should be present in both files. Correct? But I couldn't make it work properly in my attempts (see below).
- I might have misunderstood the input CSV format description above. But if `Family` and `TaxonID` are optional fields, then the JSON output sometimes contains errors:
1. If I don't provide a `Family` column in reference.csv, then `referenceRecords[n].family` in the JSON output contains the same value as `name` (the `ScientificName` field provided in my reference.csv file).
2. If I provide a `Family` column in reference.csv (even with empty values), then the JSON output seems correct (`referenceRecords[n].family` contains the family values I provided).
3. But if I also provide a `Family` column in source.csv, then the JSON output includes a new `sourceRecord.id`, which contains the same value as `sourceRecord.name`.
4. If source.csv contains other columns (e.g., ScientificName + LifeForm), then the JSON output has `sourceRecord.family` = `sourceRecord.id` = `sourceRecord.name` (all containing the ScientificName provided in source.csv).
So I am a bit confused. I think it would be worth providing a couple of sample input files and stating explicitly whether they can or should contain other columns.
Regarding `family`: a real example of the "tricky homonyms where family helps to resolve taxa from each other" would be useful too (I think family is not going to solve anything in my case, but just to be sure). I also wonder how this "use family" option affects speed: does it make matching faster or slower for large datasets?
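To make the column question concrete, here is a minimal Python sketch that generates the input variants I have in mind. The header names (`ScientificName`, `Family`, `TaxonID`) are the ones discussed above; the `TaxonID` values are made up, and whether these layouts are what gndiff actually expects is exactly what I am asking, so please read it as an illustration and not as the documented format.

```python
import csv
import io


def make_csv(header, rows):
    """Return CSV text with a header line followed by the given rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Variant A: source file with names only (what I assume is the usual case).
source_names_only = make_csv(
    ["ScientificName"],
    [["Pinus sylvestris"], ["Quercus robur"]],
)

# Variant B: source file that also carries Family.
source_with_family = make_csv(
    ["ScientificName", "Family"],
    [["Pinus sylvestris", "Pinaceae"], ["Quercus robur", "Fagaceae"]],
)

# Reference file with all three fields; the TaxonID values are made up.
reference_full = make_csv(
    ["TaxonID", "ScientificName", "Family"],
    [
        ["ref-1", "Pinus sylvestris", "Pinaceae"],
        ["ref-2", "Quercus robur", "Fagaceae"],
    ],
)

print(reference_full)
```

Having sample files like these in the README, with a note on which columns are required, optional, or ignored, would remove most of my doubts.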
- CSV/TSV outputs are missing column headers? This might seem irrelevant, but it makes it a bit difficult to check whether the output content is correct. Also, I cannot proceed with further tasks, like merging this output with other tabular data by means of column joins (I can try to figure out the headers and add them myself, but it would be safer if gndiff did it, to avoid mistakes).
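For reference, this is roughly the workaround I would have to use for now, as a Python sketch. The column names in `ASSUMED_HEADER` and their order are pure guesses on my side (hypothetical, not taken from gndiff), which is precisely why doing this by hand feels risky. The sample row and the `LifeForm` value are only there to show the join.

```python
import csv
import io

# ASSUMPTION: these column names and their order are guesses for illustration;
# the real layout of gndiff's CSV output would need to be confirmed.
ASSUMED_HEADER = ["SourceName", "MatchType", "ReferenceName"]


def add_header(raw_csv, header):
    """Parse header-less CSV text into a list of dicts keyed by `header`."""
    reader = csv.reader(io.StringIO(raw_csv))
    return [dict(zip(header, row)) for row in reader if row]


def join_on(rows, other, key, other_key):
    """Left-join `rows` with `other` where row[key] == record[other_key]."""
    index = {rec[other_key]: rec for rec in other}
    return [{**row, **index.get(row[key], {})} for row in rows]


# Header-less output line (made up) and a tiny stand-in for my own database.
raw_output = "Pinus sylvestris,Exact,Pinus sylvestris L.\n"
my_database = [{"ScientificName": "Pinus sylvestris", "LifeForm": "tree"}]

rows = add_header(raw_output, ASSUMED_HEADER)
joined = join_on(rows, my_database, "SourceName", "ScientificName")
print(joined[0])
```

If I guess the column order wrong, the join silently attaches data to the wrong field, so headers written by gndiff itself would be much safer.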
EDIT: I have just realized that some of the above suggestions were already addressed by @Adafede in a previous closed issue (#12).
Sorry about that. My comments are pretty verbose, so @dimus might still find some helpful feedback in some of them.
This is a new one:
- Shouldn't the output include some sort of calculated numeric similarity between the matched names? I have some cases where the JSON output produces several "Exact" matches (i.e., two referenceRecords for the same sourceRecord) because my reference.csv contains two similar versions of the same name (e.g., a `subsp.` rank vs. a `var.` rank, identical in everything else), but my source.csv only contains one (e.g., the `subsp.`). How can I decide which one is the most similar in these cases?
I had better post an example in a new comment to illustrate this.
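In the meantime, this is the kind of post-processing I could do myself, sketched in Python. It uses the generic string similarity from the standard library (`difflib.SequenceMatcher`), which is my own stand-in and not how gndiff scores matches, and the names are placeholders.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two name strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(source_name, candidates):
    """Pick the candidate reference name most similar to the source name."""
    return max(candidates, key=lambda name: similarity(source_name, name))


# Placeholder names: two reference records differing only in rank.
source_name = "Aus bus subsp. cus"
reference_names = ["Aus bus var. cus", "Aus bus subsp. cus"]

for name in reference_names:
    print(name, round(similarity(source_name, name), 3))

print("best:", best_match(source_name, reference_names))
```

A score like this in the gndiff output itself would let me pick the closest record without extra tooling.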
Thanks a lot in advance !!