Incorporate I/O component in the compute benchmarks

For the compute benchmarks, we've been generating and persisting **the data in memory** for every combination of `chunk_size` and `chunking_scheme` prior to the computations:

```yaml
  chunk_size:
    - 32MB
    - 64MB
    - 128MB
    - 256MB
  chunking_scheme:
    - spatial
    - temporal
    - auto
```
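To make the size of this parameter space concrete, here is a minimal sketch that enumerates the combinations. The `to_bytes` helper is hypothetical (not part of this repo), and it assumes `MB` means 10**6 bytes, which may differ from how the benchmark code interprets the strings.

```python
from itertools import product

# Values copied from the YAML configuration above.
chunk_sizes = ["32MB", "64MB", "128MB", "256MB"]
chunking_schemes = ["spatial", "temporal", "auto"]


def to_bytes(size: str) -> int:
    """Convert a string such as '32MB' to bytes.

    Hypothetical helper; assumes 1 MB = 10**6 bytes.
    """
    if not size.endswith("MB"):
        raise ValueError(f"unsupported size string: {size!r}")
    return int(size[:-2]) * 10**6


# Every (chunk_size, chunking_scheme) pair that currently gets its own
# in-memory dataset before the computations run.
combinations = list(product(chunk_sizes, chunking_schemes))

for size, scheme in combinations:
    print(f"{scheme:>8} {size:>6} -> {to_bytes(size)} bytes per chunk")
```

Each pair would need its own on-disk dataset once an I/O component is added, so the number of stored datasets grows with the product of the two lists.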

Per discussions with @rabernat, @kmpaul, @tinaok, @guillaumeeb, it is crucial to have an I/O component that emulates real use cases: _the data will almost always live on the filesystem and be bigger than what we can persist into memory_. 


## I/O benchmarks 

A few months ago, @kmpaul and @halehawk conducted an [IOR-based I/O scaling study](https://github.com/NCAR/ior_scaling) (C/MPI-based code) that compared:

- Z5
- netCDF4
- HDF5
- PnetCDF
- MPIIO
- POSIX


In [zarr-hdf-benchmarks](https://github.com/rabernat/zarr_hdf_benchmarks) (Python/mpi4py-based code), @rabernat compared both the `write` and `read` components. 

-----
How should we go about incorporating an I/O component in the compute benchmarks?

- Should we focus on the `read` component by writing a dataset with the same chunking and compression to both netCDF4 and Zarr for every `chunk_size` and `chunking_scheme` combination, and then testing a variety of access approaches?
- Should the `write` component be taken into consideration too?
- One of our long-term goals for this repo is that the benchmarks should be runnable on different platforms (HPC, Cloud) and storage systems. Both https://github.com/rabernat/zarr_hdf_benchmarks and https://github.com/NCAR/ior_scaling are MPI-dependent, and I was wondering whether the I/O components for these benchmarks can be Python/Dask based?
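As a starting point for discussion, here is a rough sketch of what an MPI-free timing harness for the `write` and `read` components could look like. It uses only the Python standard library: plain binary files stand in for netCDF4/Zarr stores, and the `time_io` function and its parameters are hypothetical, not existing code in this repo. A real version would presumably swap the file operations for Dask-backed writes and reads.

```python
import os
import tempfile
import time
from pathlib import Path


def time_io(directory: Path, n_chunks: int, chunk_bytes: int) -> dict:
    """Time writing and re-reading `n_chunks` files of `chunk_bytes` each.

    Hypothetical helper: one file per chunk stands in for a chunked store.
    """
    payload = os.urandom(chunk_bytes)

    # Write phase: one file per chunk, flushed to disk before the timer stops.
    start = time.perf_counter()
    for i in range(n_chunks):
        with open(directory / f"chunk_{i}.bin", "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
    write_s = time.perf_counter() - start

    # Read phase: read every chunk back and count the bytes.
    start = time.perf_counter()
    bytes_read = 0
    for i in range(n_chunks):
        bytes_read += len((directory / f"chunk_{i}.bin").read_bytes())
    read_s = time.perf_counter() - start

    return {"write_s": write_s, "read_s": read_s, "bytes_read": bytes_read}


# Small sizes so the sketch runs quickly; real runs would use 32MB-256MB chunks.
with tempfile.TemporaryDirectory() as tmp:
    result = time_io(Path(tmp), n_chunks=4, chunk_bytes=1_000_000)

print(result)
```

One caveat with this approach: reading files immediately after writing them will likely be served from the operating system's page cache, so the `read` timing here would understate real filesystem cost. A realistic benchmark would need datasets larger than memory or some way to bypass the cache.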