Optimise row count in hdf5 chunks? #200

@kyleaoman

Description

Currently SOAP stores hdf5 datasets in 1000-row chunks. It's not clear that this is a good strategy. When reading the entire catalogue, 1000-row chunks probably slow reading down unnecessarily: larger chunks usually improve raw read performance, up to a point (e.g. swift snapshots use 2^20 ~ 1M row chunks, which seems like a reasonably good compromise).

However, reading spatially-masked data is also supposed to be supported, and the median number of halos in a top-level cell is only ~200 (I had a look at colibre L200m6 and L400m7), although it does peak at several tens of thousands. This means that any read with a mask probably comes with a large overhead, because the parts of each chunk lying outside the cells of interest have to be read anyway. That's not so bad for masks covering a large fraction of the box (e.g. octants), but smaller regions could suffer quite a bit, and it's a very large overhead for a workflow like swiftgalaxy that masks down to a single row.
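
For concreteness, here's a minimal h5py sketch of the trade-off (file and dataset names are made up, not SOAP code). The chunk shape is fixed at dataset creation time, and any read, masked or not, is served in whole chunks:

```python
# Hypothetical illustration of the two chunking strategies discussed above.
import h5py
import numpy as np

n_halos = 10_000_000

with h5py.File("soap_catalogue_example.hdf5", "w") as f:
    # Current SOAP-style strategy: 1000-row chunks (~8 kB per float64 column).
    f.create_dataset("small_chunks", data=np.arange(n_halos, dtype=np.float64),
                     chunks=(1000,))
    # Snapshot-style strategy: 2**20-row chunks (~8 MB per float64 column).
    f.create_dataset("large_chunks", data=np.arange(n_halos, dtype=np.float64),
                     chunks=(2**20,))

with h5py.File("soap_catalogue_example.hdf5", "r") as f:
    # A spatial mask selecting ~200 halos in one top-level cell.
    sel = slice(500_000, 500_200)
    # With 1000-row chunks this touches one or two ~8 kB chunks;
    # with 2**20-row chunks the whole ~8 MB chunk containing those rows is read.
    cell_small = f["small_chunks"][sel]
    cell_large = f["large_chunks"][sel]
```

Full-catalogue reads pull in the opposite direction: fewer, larger chunks mean fewer chunk lookups and filter invocations, which is presumably why the snapshots settled on ~1M-row chunks.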

In summary, it's not clear what the right approach is, but 1000-row chunks seem suboptimal both for reading entire datasets and for reading masked ones. Perhaps they turn out to be a reasonable compromise that supports both, but I don't think that has been tested.
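
A rough benchmark along these lines could put numbers on the compromise (all names here are hypothetical, and OS/HDF5 chunk-cache effects would need controlling for a fair comparison, e.g. by re-reading cold files):

```python
# Sketch: time a full read, a ~200-row "cell" read and a single-row read
# for a few candidate chunk sizes. Purely illustrative, not SOAP code.
import time
import h5py
import numpy as np

n_halos = 10_000_000
data = np.random.default_rng(42).random(n_halos)

for chunk_rows in (1000, 2**14, 2**17, 2**20):
    fname = f"chunk_test_{chunk_rows}.hdf5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("x", data=data, chunks=(chunk_rows,), compression="gzip")

    with h5py.File(fname, "r") as f:
        dset = f["x"]

        t0 = time.perf_counter()
        _ = dset[...]                    # whole-catalogue read
        t_full = time.perf_counter() - t0

        t0 = time.perf_counter()
        _ = dset[5_000_000:5_000_200]    # ~200 rows, one "top-level cell"
        t_cell = time.perf_counter() - t0

        t0 = time.perf_counter()
        _ = dset[1_234_567]              # single row, swiftgalaxy-like access
        t_row = time.perf_counter() - t0

    print(f"{chunk_rows:>8} rows/chunk: "
          f"full={t_full:.3f}s cell={t_cell * 1e3:.2f}ms row={t_row * 1e3:.2f}ms")
```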

We also need to consider how compression filters interact with this. I assume that compression operates chunk by chunk, but don't know for sure. If it is chunk-by-chunk, then applying compression to (1, 1) chunks, for example, is probably completely useless?
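
For what it's worth, HDF5 does apply its filter pipeline per chunk, so a quick check like this sketch (hypothetical filename, deliberately compressible data) should make the (1, 1) case obvious: gzip gets 8 bytes at a time to work on, plus per-chunk overhead on top.

```python
# Compare on-disk storage of the same (compressible) data with tiny vs full-size chunks.
import h5py
import numpy as np

data = np.zeros((100_000, 1))  # deliberately highly compressible

with h5py.File("compression_test.hdf5", "w") as f:
    f.create_dataset("tiny_chunks", data=data, chunks=(1, 1), compression="gzip")
    f.create_dataset("big_chunks", data=data, chunks=(100_000, 1), compression="gzip")

with h5py.File("compression_test.hdf5", "r") as f:
    for name in ("tiny_chunks", "big_chunks"):
        # Bytes actually allocated on disk for the (compressed) chunks.
        print(name, f[name].id.get_storage_size(), "bytes on disk")
```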
