Execute all readColumnChunk concurrently for a given RowGroup #33
ZJONSSON wants to merge 2 commits into ironSource:master
LGTM. If I/O reduction over the network is a concern, we could also optionally disable reading the header entirely, as it is just a sanity check and not required to understand the file.
However, we might also want to benchmark this on files backed by a spinning disk and/or give the user the option to disable parallel/out-of-order reading. I'm not sure off the top of my head whether our writer does it, but other writers might write out the column chunks in order (in the data file) so that readers can benefit from read-ahead optimization.
@ZJONSSON, we also need a benchmark test suite to make sure we are actually improving performance, and to see in which scenarios. We're gonna spend some time tomorrow morning doing this.
I agree that concurrency should not be infinite. However, I think there are better ways to control it than hard-coding sequential execution for tasks that could run in parallel. One way to create controls around maximum concurrent reads would be to wrap the get method in a simple queue where maximum concurrency is defined in options (with a default value). Additionally, the number of actual requests could be optimized by inspecting any simultaneous requests (in the
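The queue idea in the comment above could be sketched roughly as follows. This is only an illustration, not code from this PR: `limitConcurrency` and `maxConcurrency` are hypothetical names, and the wrapped function stands in for whatever read or get method the reader exposes.

```javascript
// Wrap an async function so that at most `maxConcurrency` calls run at once.
// Both names are hypothetical; the default of 4 is an arbitrary assumption.
function limitConcurrency(fn, maxConcurrency = 4) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrency || waiting.length === 0) return;
    active++;
    const { args, resolve, reject } = waiting.shift();
    // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
    Promise.resolve()
      .then(() => fn(...args))
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // start the next queued call, if any
      });
  };

  // Callers get a promise right away; the call itself may be deferred.
  return (...args) =>
    new Promise((resolve, reject) => {
      waiting.push({ args, resolve, reject });
      next();
    });
}
```

With a wrapper like this, a limit of 1 would give the fully sequential behaviour for users who want it (for example on a spinning disk), while a larger limit allows parallel reads without unbounded concurrency.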
On the second point, here is a quick branch (very much wip) on the optimization of simultaneous requests. Any reads with close to consecutive segments, i.e. the
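The linked branch is not shown here, so the following is only a guess at the general technique: merging reads whose byte ranges are adjacent or nearly adjacent into one larger request, then slicing the result back apart. The names `coalesceRanges`, `readCoalesced`, and `maxGap`, and the `read(offset, length)` signature, are assumptions made for this sketch.

```javascript
// Merge byte ranges whose gaps are at most `maxGap` bytes, so that several
// small reads can be served by one larger request. Hypothetical helper.
function coalesceRanges(requests, maxGap = 0) {
  const sorted = [...requests].sort((a, b) => a.offset - b.offset);
  const merged = [];
  for (const req of sorted) {
    const last = merged[merged.length - 1];
    const end = req.offset + req.length;
    if (last && req.offset - (last.offset + last.length) <= maxGap) {
      // Extend the previous range to cover this request (handles overlaps too).
      last.length = Math.max(last.offset + last.length, end) - last.offset;
      last.parts.push(req);
    } else {
      merged.push({ offset: req.offset, length: req.length, parts: [req] });
    }
  }
  return merged;
}

// Issue one read per merged range and hand each caller its own slice.
// `read(offset, length)` is assumed to resolve to a Buffer or Uint8Array.
async function readCoalesced(read, requests, maxGap = 0) {
  const results = new Map();
  await Promise.all(coalesceRanges(requests, maxGap).map(async range => {
    const buffer = await read(range.offset, range.length);
    for (const part of range.parts) {
      const start = part.offset - range.offset;
      results.set(part, buffer.subarray(start, start + part.length));
    }
  }));
  // Return slices in the same order the requests were given.
  return requests.map(req => results.get(req));
}
```

The trade-off is that a larger `maxGap` means fewer requests but more unused bytes transferred, so a sensible value would depend on the latency and bandwidth of the underlying storage.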
Offers significant speed improvement when the reader has slow i/o (over network instead of from disk)
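As a rough illustration of why this helps on slow I/O (not the PR's actual diff), compare a sequential loop with `Promise.all`. The `readColumnChunk` below is a hypothetical stub that only simulates latency, and the `readRowGroup*` names are invented for this sketch.

```javascript
// Hypothetical stand-in for a per-column read; it only simulates I/O latency.
function readColumnChunk(column, latencyMs) {
  return new Promise(resolve =>
    setTimeout(() => resolve({ column, values: [] }), latencyMs));
}

// Sequential: each read waits for the previous one, so total time is
// roughly the sum of all latencies.
async function readRowGroupSequential(columns, latencyMs) {
  const chunks = [];
  for (const column of columns) {
    chunks.push(await readColumnChunk(column, latencyMs));
  }
  return chunks;
}

// Concurrent: all reads start at once, so total time is roughly the slowest
// single read. Promise.all keeps results in input order.
function readRowGroupConcurrent(columns, latencyMs) {
  return Promise.all(columns.map(column => readColumnChunk(column, latencyMs)));
}
```

Both variants return the chunks in column order, so callers should not need to change; only the wall-clock time differs when each read has noticeable latency.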
Read both header and footer concurrently, but make header error the first one to throw (if there are errors)
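One way to get this behaviour, sketched under assumptions rather than taken from the commit itself, is `Promise.allSettled`: both reads start together, and the header's failure is checked first regardless of which read finished first. The function name, its two callback parameters, and the choice to return the footer value are all hypothetical.

```javascript
// Start both reads at once, then surface a header failure ahead of a footer
// failure. `readHeader` and `readFooter` are assumed to return promises.
async function readHeaderAndFooter(readHeader, readFooter) {
  const [header, footer] = await Promise.allSettled([readHeader(), readFooter()]);
  // Check the header first so its error wins even if the footer failed sooner.
  if (header.status === 'rejected') throw header.reason;
  if (footer.status === 'rejected') throw footer.reason;
  return footer.value;
}
```

A plain `Promise.all` would instead reject with whichever error happened first, which could make the reported error depend on timing.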
516d098 to 6d1376a
Update readme with correct package name