Hi @dimus
This is more a question than an issue.
My initial use case for gndiff was comparing two files, so its .csv input design is perfect for that:
- I call the executable from a Python script.
- That script generates the reference.csv and source.csv files.
- Then it calls the gndiff command-line executable, passing those filenames.
- It captures the output and processes its content in Python.
That just works.
Now I am wondering about some other possible use cases with frequent repetitive gndiff calls (also from Python).
My concern is whether so many disk writes/reads of .csv files could or should be avoided.
Imagine my script is parsing a long list of new specimens to be included in a museum collection.
I might prefer to gndiff-match them one by one, for whatever reason (my script might need to perform other intermediate tasks in a certain order before processing the next specimen name).
So I would be passing gndiff a small source.csv with just one row, but many times.
In such a scenario, would it make sense not to create a source.csv on disk (which means a Python file write plus a gndiff file read), but somehow pass the source info as a parameter instead?
Maybe this is already possible, although I am not sure what syntax I should try.
Or maybe this doesn't make sense at all because the script performance would be similar either way (i.e., the intermediate tasks are slower than the gndiff call).
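One possible middle ground without any changes to gndiff itself, at least on Linux: write the one-row source.csv into tmpfs (e.g. /dev/shm), so the "file" lives in RAM and never touches the physical disk, while gndiff's file-based interface stays unchanged. A sketch (the column header and gndiff's argument order are assumptions):

```python
import os
import subprocess
import tempfile

def write_one_row_csv(name):
    """Write a one-row source.csv into RAM-backed tmpfs when available
    (Linux: /dev/shm), so the file never touches the physical disk."""
    tmp_dir = "/dev/shm" if os.path.isdir("/dev/shm") else None
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", dir=tmp_dir, delete=False
    ) as f:
        f.write("ScientificName\n")  # header name is an assumption
        f.write(name + "\n")
        return f.name

def match_one_name(name, reference_csv):
    """Match a single name against reference_csv through gndiff.
    The argument order is an assumption; check `gndiff --help`."""
    source_csv = write_one_row_csv(name)
    try:
        result = subprocess.run(
            ["gndiff", source_csv, reference_csv],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    finally:
        os.unlink(source_csv)
```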
Of course, I can always design my script to process all gndiff-matching operations in advance.
I am just thinking before scripting, and I am not a professional, so don't take me too seriously.
Somewhat related to this: in #13 I suggested the possibility of running gndiff as a server (so we can run gndiff on one machine and call it from others).
If that feature ever becomes possible, I wonder how such a server would work.
- I guess the idea is repeating exactly the same flow: gndiff receives two files over HTTP, does the work, and returns the output as the HTTP response.
- But another possible scenario is running it as a server with a predefined reference list: reference.csv is not passed in HTTP requests but is defined at server start time. The requests would then only contain a list of source taxa (or just one taxon) to match against that reference.csv, so again the server could be receiving small but repetitive matching tasks.
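If that "preloaded reference" server ever existed, the client side could be very thin. Everything below is made up for illustration: the URL, port, endpoint, and JSON payload shape are hypothetical, not a real gndiff API.

```python
import json
import urllib.request

def build_match_request(names, url="http://localhost:8080/match"):
    """Build an HTTP request carrying only the source names; the
    reference list is assumed to be loaded at server start time.
    URL and payload shape are hypothetical, not a real gndiff API."""
    payload = json.dumps({"sourceNames": names}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

def match_names(names):
    """Send the names and return the parsed JSON response (hypothetical API)."""
    req = build_match_request(names)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# match_names(["Puma concolor"])  # would return the server's matches
```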
Just wondering