Currently SOAP stores hdf5 datasets in 1000-row chunks. It's not clear that this is a good strategy. For reading the entire catalogue, 1000-row chunks probably slow things down unnecessarily: larger chunks usually improve raw read performance, up to a point (e.g. swift snapshots use 2^20 ~ 1M-row chunks, which seems like a reasonably good compromise). However, reading spatially-masked data is supposed to be supported, and the median number of halos in a top-level cell is only ~200 (I had a look at colibre L200m6 and L400m7), although the per-cell count does peak at several tens of thousands. Since HDF5 reads whole chunks, any masked read probably comes with a large overhead: the parts of each chunk lying outside the cells of interest have to be read anyway. That's not so bad for masks covering a large fraction of the box (e.g. octants), but smaller regions could suffer quite a bit, and the overhead is very large for a workflow like swiftgalaxy that masks down to a single row (see the sketch below for some illustrative numbers).
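To make that concrete, here is a back-of-the-envelope sketch of the read amplification for a contiguous block of rows; the helper function is purely illustrative, with the row counts taken from the figures above:

```python
# Minimal sketch: rows read from disk per row actually wanted, assuming a
# contiguous block of rows and that HDF5 reads every touched chunk in full.
# The helper is hypothetical; row counts are the ones quoted in this issue.
import math

def rows_read_per_row_wanted(rows_wanted: int, chunk_rows: int) -> float:
    """Lower bound: a contiguous block touches at least
    ceil(rows_wanted / chunk_rows) chunks (one more if it straddles
    a chunk boundary)."""
    chunks_touched = math.ceil(rows_wanted / chunk_rows)
    return chunks_touched * chunk_rows / rows_wanted

for chunk_rows in (1000, 2**14, 2**20):
    for rows_wanted in (1, 200, 30_000):  # single row, median cell, peak cell
        amp = rows_read_per_row_wanted(rows_wanted, chunk_rows)
        print(f"chunks=({chunk_rows:>7},), rows wanted {rows_wanted:>6}: "
              f"~{amp:.0f}x rows read")
```

With 2^20-row chunks a single median-sized cell still pulls in a full ~1M-row chunk, a ~5000x read amplification, while the current 1000-row chunks give ~5x.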
In summary, it's not clear what the right approach is, but 1000-row chunks seem to be optimal neither for reading entire datasets nor for reading masked ones. Perhaps they turn out to be a good compromise that supports both, but I don't think that has been tested.
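A rough timing harness along the following lines would be enough to test it. Everything here is made up for illustration (file names, the dataset name `prop`, the 2^22-row size, random data); measuring a real SOAP catalogue on the actual file system would be the real test:

```python
# Sketch: time a full read and a small masked read (a contiguous ~200-row
# block, the median cell size quoted above) for a few chunk sizes. gzip is
# enabled since real catalogues are compressed, so masked reads also pay
# per-chunk decompression. All names and sizes are illustrative.
import os
import time
import numpy as np
import h5py

n_rows = 2**22
data = np.random.default_rng(0).random(n_rows)
start = n_rows // 2  # a "cell" somewhere in the middle of the catalogue

for chunk_rows in (1000, 2**14, 2**20):
    fname = f"bench_{chunk_rows}.hdf5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("prop", data=data, chunks=(chunk_rows,),
                        compression="gzip")
    with h5py.File(fname, "r") as f:
        t0 = time.perf_counter()
        _ = f["prop"][...]  # full-catalogue read
        t_full = time.perf_counter() - t0
    with h5py.File(fname, "r") as f:  # reopen so the chunk cache is cold
        t0 = time.perf_counter()
        _ = f["prop"][start:start + 200]  # one median-sized cell
        t_mask = time.perf_counter() - t0
    os.remove(fname)
    print(f"chunks=({chunk_rows:>7},): full read {t_full:.3f} s, "
          f"masked read {t_mask * 1e3:.2f} ms")
```

(The OS page cache will still warm up between runs, so for realistic numbers the files should be evicted from cache or read from cold storage.)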
Also need to consider how compression filters interact with this. HDF5 applies its filter pipeline chunk by chunk, so the achievable compression ratio is limited to what a single chunk's worth of data offers; applying compression to (1,1) chunks, for example, is essentially useless and just adds per-chunk overhead.
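This is easy to demonstrate; the sketch below (arbitrary dataset names, zero-filled stand-in data) writes the same highly compressible array with gzip at a few chunk sizes and compares the on-disk storage:

```python
# Sketch: gzip operates per chunk, so tiny chunks compress essentially
# nothing and accumulate per-chunk overhead instead. Names are arbitrary.
import numpy as np
import h5py

data = np.zeros(2**16)  # highly compressible stand-in data (512 KiB raw)

with h5py.File("compression_test.hdf5", "w") as f:
    for chunk_rows in (1, 1000, 2**16):
        dset = f.create_dataset(f"chunks_{chunk_rows}", data=data,
                                chunks=(chunk_rows,), compression="gzip")
        print(f"chunks=({chunk_rows:>6},): "
              f"{dset.id.get_storage_size():>10} bytes on disk")
```

This also ties back to the masked-read question: larger chunks compress better but force more decompression per masked read, so the two trade-offs need to be weighed together.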