I would like to try these ideas out on a fork if that makes more sense, and
merge it later.
Currently, sympl assumes that the arrays inside the state dictionary are
instances of DataArray. While this made sense initially, I'm continually
coming up against performance issues (like #43).
For instance,
- get_numpy_array uses the .transpose() function of DataArray,
  which is very slow. I wrote an equivalent version of this function
  which instead used the numpy version, which is ~20-30% faster (and passes all tests).
- Accessing an attribute like .values or .dims involves multiple function calls, since
  they are properties which reference other properties internally.
- Creating a new DataArray has a huge __init__ overhead, with all kinds of checks
  which are really not necessary in our use case.
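
To make the first point concrete, here is a minimal sketch of reordering axes by dimension name with plain numpy. It is only an illustration, not the actual replacement for get_numpy_array: the function name `transpose_to_dims` and its arguments are hypothetical, and it assumes the dimension names are already available as a plain tuple.

```python
import numpy as np


def transpose_to_dims(values, dims, out_dims):
    """Reorder the axes of a plain numpy array by dimension name.

    values   -- numpy array holding the data
    dims     -- tuple of dimension names, one per axis of `values`
    out_dims -- the same names in the desired order
    """
    if sorted(dims) != sorted(out_dims):
        raise ValueError("out_dims must be a permutation of dims")
    # Position of each requested dimension in the current layout.
    axes = [dims.index(name) for name in out_dims]
    # np.transpose returns a view, so no data is copied here.
    return np.transpose(values, axes)


data = np.zeros((4, 3, 2))  # axes: ('lat', 'lon', 'mid_levels')
column_first = transpose_to_dims(
    data, ("lat", "lon", "mid_levels"), ("mid_levels", "lat", "lon")
)
print(column_first.shape)  # (2, 4, 3)
```

The name lookup happens once on a short tuple, so the per-call cost is essentially that of np.transpose itself.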
These issues really come to the fore when writing models which work with a single
column of data, which is currently the major use case for climt at least.
While it is desirable to keep the DataArray interface, it would be really helpful
downstream if sympl described an API which any array object must implement.
This will require some re-writing of internal code which assumes that the
arrays are DataArrays, but in the end will allow more performant array representations
like unyt to be used seamlessly in sympl components.
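
As a starting point for discussion, such an API could be as small as the three attributes sympl already reads from a DataArray (.values, .dims and .attrs). The sketch below is an assumption about what that contract might look like, not something sympl provides today; `StateArray` and `LightArray` are hypothetical names.

```python
from typing import Any, Dict, Protocol, Tuple, runtime_checkable


@runtime_checkable
class StateArray(Protocol):
    """Hypothetical minimal interface; not part of sympl today."""

    values: Any              # underlying numeric buffer, e.g. a numpy array
    dims: Tuple[str, ...]    # one dimension name per axis
    attrs: Dict[str, Any]    # metadata such as {'units': 'K'}


class LightArray:
    """Bare-bones container that satisfies StateArray with no validation."""

    __slots__ = ("values", "dims", "attrs")

    def __init__(self, values, dims, attrs=None):
        self.values = values
        self.dims = tuple(dims)
        self.attrs = dict(attrs or {})


temperature = LightArray([290.0, 285.0, 280.0], ["mid_levels"], {"units": "K"})
# isinstance() on a runtime_checkable Protocol only checks that the
# attributes exist, not their types.
print(isinstance(temperature, StateArray))
```

Because the protocol is structural, an existing array class that exposes these three attributes would satisfy it without inheriting from anything in sympl.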
This might also require sympl to allow an implementing library to replace functions
like get_numpy_array with custom versions.
In general, it might be good to specify a number of functions that an implementing library
must provide which can replace the logic that currently resides within __call__ of any
sympl component. This will make it easy to add functionality without having to
build custom subclasses of the base sympl components, which is undesirable.
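
One possible shape for this, sketched under the assumption that a component delegates its array handling to a small backend object. None of these names exist in sympl: `ArrayBackend`, `extract`, `wrap` and the stand-in `Component` class are all hypothetical, and the arithmetic is a placeholder for real physics.

```python
class ArrayBackend:
    """Hypothetical bundle of hooks a component would delegate to."""

    def extract(self, state, name, dims):
        """Return raw data for `name`; the default does no reshaping."""
        return state[name]

    def wrap(self, raw, name, dims):
        """Turn raw output back into a state quantity."""
        return raw


class Component:
    """Stand-in for a sympl component; not the real base class."""

    def __init__(self, backend=None):
        self.backend = backend or ArrayBackend()

    def __call__(self, state):
        name, dims = "air_temperature", ("mid_levels",)
        raw = self.backend.extract(state, name, dims)
        result = [t + 1.0 for t in raw]  # placeholder for the physics
        return {name: self.backend.wrap(result, name, dims)}


class TupleBackend(ArrayBackend):
    """Example override that a downstream library might supply."""

    def wrap(self, raw, name, dims):
        return tuple(raw)


state = {"air_temperature": [280.0, 281.0]}
default_out = Component()(state)
custom_out = Component(TupleBackend())(state)
```

The component class itself is untouched in both cases; only the object passed to it changes.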
IMO this also makes sense since sympl is a framework, and it need
not be opinionated about what kind of arrays are used, or how the validation
of these arrays and their dimensions is done. sympl could register
callbacks based on the type of the input array formats and use them
for validation and reshaping.
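
A rough sketch of type-based callback registration, using only the standard library. The registry, the `register_validator` decorator and the `validate` function are hypothetical names chosen for illustration, and a plain list stands in for a real array type.

```python
_VALIDATORS = {}  # maps an array type to its validation callback


def register_validator(array_type):
    """Decorator that records `func` as the validator for `array_type`."""
    def decorator(func):
        _VALIDATORS[array_type] = func
        return func
    return decorator


def validate(array, expected_dims):
    """Look up a validator by walking the type's MRO, most specific first."""
    for klass in type(array).__mro__:
        if klass in _VALIDATORS:
            return _VALIDATORS[klass](array, expected_dims)
    raise TypeError(f"no validator registered for {type(array).__name__}")


@register_validator(list)
def _validate_list(array, expected_dims):
    # A flat list can only represent a single dimension.
    return len(expected_dims) == 1


print(validate([1.0, 2.0], ("mid_levels",)))
```

Walking the MRO means a subclass of a registered array type picks up its parent's callback unless it registers its own.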