Incorporate I/O component in the compute benchmarks #26

@andersy005

Description

For the compute benchmarks, we've been generating the data and persisting it in memory for every combination of chunk_size and chunking_scheme prior to the computations:

  chunk_size:
    - 32MB
    - 64MB
    - 128MB
    - 256MB
  chunking_scheme:
    - spatial
    - temporal
    - auto
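
To make the size of this parameter space concrete, here is a minimal sketch that enumerates every chunk_size and chunking_scheme combination from the configuration above. The helper names (`parse_size`, `benchmark_grid`) are hypothetical and not taken from the benchmark code, and the sketch assumes "MB" means decimal megabytes (10**6 bytes).

```python
from itertools import product

# Values copied from the configuration above.
CHUNK_SIZES = ["32MB", "64MB", "128MB", "256MB"]
CHUNKING_SCHEMES = ["spatial", "temporal", "auto"]


def parse_size(text):
    """Convert a string such as '32MB' to bytes.

    Assumption: 'MB' is a decimal megabyte (10**6 bytes).
    """
    if not text.endswith("MB"):
        raise ValueError(f"unsupported size string: {text!r}")
    return int(text[:-2]) * 10**6


def benchmark_grid():
    """Return one dict per (chunk_size, chunking_scheme) combination."""
    return [
        {"chunk_size": size, "chunk_bytes": parse_size(size), "chunking_scheme": scheme}
        for size, scheme in product(CHUNK_SIZES, CHUNKING_SCHEMES)
    ]


if __name__ == "__main__":
    for case in benchmark_grid():
        print(case)
```

With four chunk sizes and three schemes, this yields 12 cases; any I/O component would need a dataset (or a read pattern) for each of them.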

Per discussions with @rabernat, @kmpaul, @tinaok, @guillaumeeb, it is crucial to have an I/O component that emulates real use cases: the data will almost always live on the filesystem and be bigger than what we can persist into memory.

I/O benchmarks

A few months ago, @kmpaul and @halehawk conducted an IOR-based I/O scaling study (C/MPI-based code) that compared:

  • Z5
  • netCDF4
  • HDF5
  • PnetCDF
  • MPIIO
  • POSIX

In zarr-hdf-benchmarks (Python/mpi4py-based code), @rabernat compared both the write and read components.


How should we go about incorporating an I/O component in the compute benchmarks?

  • Should we focus on the read component by writing a dataset with the same chunking and compression to both netCDF4 and Zarr for every chunk_size and chunking_scheme combination, and then testing a variety of access approaches?
  • Should the write component be taken into consideration too?
  • One of our long-term goals for this repo is that the benchmarks should be runnable on different platforms (HPC, Cloud) and storage systems. Both https://github.com/rabernat/zarr_hdf_benchmarks and https://github.com/NCAR/ior_scaling depend on MPI, and I was wondering whether the I/O components for these benchmarks could be Python/Dask based instead?
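
As a starting point for the questions above, here is a rough sketch of how the write and read phases could be timed separately without MPI. It uses only the Python standard library and plain binary files as a stand-in for chunks; it is not the netCDF4 or Zarr code path, and all function names (`timer`, `write_chunks`, `read_chunks`, `run_io_benchmark`) are hypothetical. In an actual benchmark, the bodies of the write and read helpers would presumably be replaced by the corresponding xarray/Dask calls.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def timer(results, label):
    """Record the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def write_chunks(directory, n_chunks, chunk_bytes):
    """Write `n_chunks` files of `chunk_bytes` random bytes each (stand-in for a chunked store)."""
    payload = os.urandom(chunk_bytes)
    for i in range(n_chunks):
        with open(os.path.join(directory, f"chunk-{i:04d}.bin"), "wb") as f:
            f.write(payload)


def read_chunks(directory):
    """Read every chunk file back and return the total number of bytes read."""
    total = 0
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "rb") as f:
            total += len(f.read())
    return total


def run_io_benchmark(n_chunks=4, chunk_bytes=1_000_000):
    """Time the write and read phases separately and return the measurements."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        with timer(results, "write_s"):
            write_chunks(tmp, n_chunks, chunk_bytes)
        with timer(results, "read_s"):
            results["bytes_read"] = read_chunks(tmp)
    return results


if __name__ == "__main__":
    print(run_io_benchmark())
```

One caveat with this layout: reading immediately after writing will often be served from the operating system's page cache, so a realistic read benchmark would need the dataset to be written ahead of time (or to exceed available memory), which is consistent with the motivation stated above.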
