Optimise row count in hdf5 chunks? #200

@kyleaoman

Description

Currently SOAP stores hdf5 datasets in 1000-row chunks. It's not clear that this is a good strategy. When reading the entire catalogue, 1000-row chunks probably slow reading down unnecessarily: larger chunks usually improve raw read performance, up to a point (e.g. swift snapshots use 2^20 ~ 1M row chunks, which seems like a reasonably good compromise).

However, reading spatially-masked data is also supposed to be supported, and the median number of halos in a top-level cell is only ~200 (I had a look at colibre L200m6 and L400m7), although it does peak at several tens of thousands. This means that any read with a mask probably comes with a large overhead, because the parts of each chunk lying outside the cells of interest have to be read anyway. That's not so bad for masks covering a large fraction of the box (e.g. octants), but smaller regions could suffer quite a bit, and it's a very large overhead for a workflow like swiftgalaxy that masks down to a single row.
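
For concreteness, here's a minimal h5py sketch of the trade-off (file and dataset names are made up, not SOAP code). The chunk shape is fixed at dataset creation time, and any read, masked or not, is served in whole chunks:

```python
# Hypothetical illustration of the two chunking strategies discussed above.
import h5py
import numpy as np

n_halos = 10_000_000

with h5py.File("soap_catalogue_example.hdf5", "w") as f:
    # Current SOAP-style strategy: 1000-row chunks (~8 kB per float64 column).
    f.create_dataset("small_chunks", data=np.arange(n_halos, dtype=np.float64),
                     chunks=(1000,))
    # Snapshot-style strategy: 2**20-row chunks (~8 MB per float64 column).
    f.create_dataset("large_chunks", data=np.arange(n_halos, dtype=np.float64),
                     chunks=(2**20,))

with h5py.File("soap_catalogue_example.hdf5", "r") as f:
    # A spatial mask selecting ~200 halos in one top-level cell.
    sel = slice(500_000, 500_200)
    # With 1000-row chunks this touches one or two ~8 kB chunks;
    # with 2**20-row chunks the whole ~8 MB chunk containing those rows is read.
    cell_small = f["small_chunks"][sel]
    cell_large = f["large_chunks"][sel]
```

Full-catalogue reads pull in the opposite direction: fewer, larger chunks mean fewer chunk lookups and filter invocations, which is presumably why the snapshots settled on ~1M-row chunks.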

In summary, it's not clear what the right approach is, but 1000-row chunks seem suboptimal both for reading entire datasets and for reading masked ones. Perhaps they turn out to be a reasonable compromise that supports both, but I don't think that has been tested.
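
A rough benchmark along these lines could put numbers on the compromise (all names here are hypothetical, and OS/HDF5 chunk-cache effects would need controlling for a fair comparison, e.g. by re-reading cold files):

```python
# Sketch: time a full read, a ~200-row "cell" read and a single-row read
# for a few candidate chunk sizes. Purely illustrative, not SOAP code.
import time
import h5py
import numpy as np

n_halos = 10_000_000
data = np.random.default_rng(42).random(n_halos)

for chunk_rows in (1000, 2**14, 2**17, 2**20):
    fname = f"chunk_test_{chunk_rows}.hdf5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("x", data=data, chunks=(chunk_rows,), compression="gzip")

    with h5py.File(fname, "r") as f:
        dset = f["x"]

        t0 = time.perf_counter()
        _ = dset[...]                    # whole-catalogue read
        t_full = time.perf_counter() - t0

        t0 = time.perf_counter()
        _ = dset[5_000_000:5_000_200]    # ~200 rows, one "top-level cell"
        t_cell = time.perf_counter() - t0

        t0 = time.perf_counter()
        _ = dset[1_234_567]              # single row, swiftgalaxy-like access
        t_row = time.perf_counter() - t0

    print(f"{chunk_rows:>8} rows/chunk: "
          f"full={t_full:.3f}s cell={t_cell * 1e3:.2f}ms row={t_row * 1e3:.2f}ms")
```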

We also need to consider how compression filters interact with this. I assume that compression operates chunk by chunk, but don't know for sure. If it is chunk-by-chunk, then applying compression to (1, 1) chunks, for example, is probably completely useless?
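
For what it's worth, HDF5 does apply its filter pipeline per chunk, so a quick check like this sketch (hypothetical filename, deliberately compressible data) should make the (1, 1) case obvious: gzip gets 8 bytes at a time to work on, plus per-chunk overhead on top.

```python
# Compare on-disk storage of the same (compressible) data with tiny vs full-size chunks.
import h5py
import numpy as np

data = np.zeros((100_000, 1))  # deliberately highly compressible

with h5py.File("compression_test.hdf5", "w") as f:
    f.create_dataset("tiny_chunks", data=data, chunks=(1, 1), compression="gzip")
    f.create_dataset("big_chunks", data=data, chunks=(100_000, 1), compression="gzip")

with h5py.File("compression_test.hdf5", "r") as f:
    for name in ("tiny_chunks", "big_chunks"):
        # Bytes actually allocated on disk for the (compressed) chunks.
        print(name, f[name].id.get_storage_size(), "bytes on disk")
```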
