file disk read vs other input options #28

@abubelinha


Hi @dimus
This is more a question than an issue.

My initial idea of gndiff was to compare two files, so its .csv input design is perfect for that:

  • I call the executable from a Python script.
  • That script generates the reference.csv and source.csv files.
  • Then it calls the gndiff command-line executable, passing those filenames.
  • It captures the output and processes its content from Python again.

That just works.
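The steps above could be sketched roughly like this. Note the assumptions: I am not sure of the exact CSV header gndiff expects (here a made-up Id,ScientificName layout) nor of its exact argument order, so check `gndiff -h` before relying on this.

```python
import csv
import shutil
import subprocess

def write_csv(path, rows, header=("Id", "ScientificName")):
    # Write (id, name) rows into a CSV file; the header layout is an
    # assumption, not gndiff's documented format.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def run_gndiff(source_path, reference_path):
    # Assumed CLI shape: `gndiff <source.csv> <reference.csv>`.
    cmd = ["gndiff", source_path, reference_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout  # process this back in Python

write_csv("reference.csv", [("1", "Pinus sylvestris L.")])
write_csv("source.csv", [("1", "Pinus silvestris")])

if shutil.which("gndiff"):  # only run when the binary is installed
    output = run_gndiff("source.csv", "reference.csv")
```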

Now I am wondering about some other possible use cases involving frequent, repetitive gndiff calls (also from Python).
My concern is whether so many disk writes and reads of .csv files could or should be avoided.

Imagine my script is parsing a long list of new specimens to be added to a museum collection.
I might prefer to gndiff-match them one by one, for whatever reason (my script might need to perform other intermediate tasks in a certain order before processing the next specimen name).
So I would be passing gndiff a tiny source.csv with just one row, but many times over.

In such a scenario, would it make sense not to create a source.csv on disk (which means a Python file write plus a gndiff file read), but to pass the source info as a parameter instead?
Maybe this is already possible somehow, although I am not sure what syntax I should try.
Or maybe it doesn't make sense at all, because the script's performance would be similar either way (i.e. the intermediate tasks are slower than the gndiff call).

Of course, I can always design my script to run all the gndiff-matching operations in advance.
I am just thinking before scripting, and I am not a professional, so don't take me too seriously.


Somewhat related to this, in #13 I suggested the possibility of running gndiff as a server (so we could run gndiff on one machine and call it from others).
If that feature ever becomes possible, I wonder how such a server would work.

  • I guess the idea would be to repeat exactly the same flow: gndiff receives two files, does the work, and returns the output as an HTTP response.
  • But another possible scenario is running it as a server for a predefined reference list: reference.csv is not passed in the HTTP requests but defined at server start-up. The requests would then contain only a list of source taxa (or a single taxon) to match against that reference.csv, so again the server could be receiving small but repetitive matching tasks.
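A client call for that second scenario might look like the sketch below. Everything here is hypothetical: gndiff has no server mode today, and the host, port, `/match` endpoint, and JSON shape are invented purely for illustration.

```python
import json
import urllib.request

def build_match_request(names, host="http://localhost:8080"):
    # Hypothetical: the server was started with a fixed reference.csv,
    # so the request body carries only the source taxa to match.
    payload = json.dumps({"sourceTaxa": names}).encode("utf-8")
    return urllib.request.Request(
        host + "/match",  # invented endpoint, for illustration only
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def match_against_server(names, host="http://localhost:8080"):
    req = build_match_request(names, host)
    with urllib.request.urlopen(req) as resp:  # needs a running server
        return json.load(resp)

# Usage (would only work against a running server):
# result = match_against_server(["Pinus silvestris"])
```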

Just wondering
