Parallelization Enhancement Ideas #1412
ArtinSarraf started this conversation in Ideas
Replies: 1 comment 2 replies
Interesting! Will need to think about this some more. But some initial questions:
Hi, as I've been working with the Parallelizable/Collect capabilities of Hamilton, I've come across a couple of limitations that I'm having trouble finding adequate workarounds for. I have some ideas that could address them that I'd like to discuss with the Hamilton team.
The limitations:

Limitation 1 - nested parallelization. One `Parallelizable` block cannot be nested inside another (e.g. fan out over product lines, and then fan out again over the products within each line). The workaround that I considered was flattening to just one `Parallelizable` that returns a tuple of (product_lineX, productY), but this is no good because combining/processing the data then needs to be done for both levels in one function (`all_data`). This means you lose the parallelization benefits at the product_line level (negligible in this simple case, but not in all real-world cases), and the logic in `all_data` also becomes more complicated, as it must now be responsible for parsing/processing multiple levels.
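For concreteness, a minimal sketch of that flattened workaround with the current API (the node names and the `products_by_line` input are placeholders of mine; only the `(product_line, product)` tuple shape comes from the description above):

```python
# Sketch of the flattened workaround -- names are illustrative placeholders.
from typing import Dict, List, Tuple

from hamilton.htypes import Collect, Parallelizable


def line_and_product(
    products_by_line: Dict[str, List[str]],
) -> Parallelizable[Tuple[str, str]]:
    # One flattened fan-out: a branch per (product_line, product) pair,
    # instead of a fan-out per product_line containing a nested fan-out.
    for product_line, products in products_by_line.items():
        for product in products:
            yield product_line, product


def product_data(line_and_product: Tuple[str, str]) -> Tuple[str, str, int]:
    product_line, product = line_and_product
    return product_line, product, len(product)  # placeholder computation


def all_data(product_data: Collect[Tuple[str, str, int]]) -> Dict[str, list]:
    # This single function now has to handle *both* levels: group by
    # product_line and then combine within each line.
    grouped: Dict[str, list] = {}
    for product_line, product, value in product_data:
        grouped.setdefault(product_line, []).append((product, value))
    return grouped
```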
Limitation 2 - accessing individual path results (see Parallel Execution: KeyError: 'key df not found in cache' #1029). Within a Parallelizable/Collect block you can only retrieve the collected result, not the result of a single path. The workaround I considered here was something like this:
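(A reconstructed sketch: the per-path work is a stand-in, but the key-threading and the `all_data_raw` / `all_data` split are what the text below refers to.)

```python
# Sketch of the workaround -- each path's parameterization key is threaded
# through every function so results can be indexed after Collect.
from typing import Dict, List, Tuple

import pandas as pd
from hamilton.htypes import Collect, Parallelizable


def path_key(keys: List[str]) -> Parallelizable[str]:
    # Fan out over the values the paths are parameterized on.
    for key in keys:
        yield key


def path_result(path_key: str) -> Tuple[str, pd.DataFrame]:
    # Terminal node of each path; returns (key, result) so the key
    # survives the Collect step.  Real per-path work would go here.
    return path_key, pd.DataFrame({"key": [path_key]})


def all_data_raw(
    path_result: Collect[Tuple[str, pd.DataFrame]],
) -> Dict[str, pd.DataFrame]:
    # Result indexed by individual path -- this is what you slice from.
    return dict(path_result)


def all_data(all_data_raw: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    # The desired combined result across all paths.
    return pd.concat(all_data_raw.values(), ignore_index=True)
```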
This pattern will give you the desired combined result (`all_data`), and also a result indexed by individual path (`all_data_raw`) that you can slice from. However, it breaks down in a few ways:
(1) You need to pass around a tuple containing the info that the path is parameterized on, through each function.
(2) It will not allow you to access intermediate results in the path, only the terminal node result, which is not necessarily the specific path result you want.
(3) If you only ever want a specific path result and not the others, you'll still need to compute all the others, resulting in wasted compute (alternatively you could reconfigure your driver to avoid this, but that also adds complexity).
Proposal:
The idea is still rough and not fully fleshed out, but what I would propose is an additional syntax for defining parallelizable paths. I think it's best if I start with a simple example to illustrate the idea:
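(A rough sketch: passing `driver` into a function, `driver.call`, and call-style `Collect` are the proposed syntax, not anything Hamilton supports today, and all node names are made up for illustration.)

```python
# Proposed syntax only -- none of this is current Hamilton API.

def product_line_data(driver, product_line: str) -> list:
    # Dependencies are declared by "calling" them through the driver
    # rather than via function arguments; during the graph build,
    # driver.call returns a deferred reference and records an edge.
    raw = driver.call("load_raw", product_line=product_line)
    # Collect is the delimiting point: stop declaring dependencies and
    # resolve their values now.
    return Collect(raw)


def all_data(driver) -> list:
    results = []
    # Parallel branches are expressed with a plain Python for loop; the
    # graph build runs this function and follows its driver calls.
    for product_line in Collect(driver.call("product_lines")):
        results.append(driver.call("product_line_data", product_line=product_line))
    return Collect(results)
```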
The idea here is that when you want to do some branching, you can provide your Hamilton function a reference to the Hamilton driver itself. Instead of encoding a function's dependencies through its arguments, you encode them by "calling" them through the driver, with `Collect` acting as the delimiting point that says "stop defining dependencies and collect their resolved values now." These "calls" would be used during the graph build: rather than inspecting the function's arguments, you run the function and follow its driver calls to its dependencies, each of which can in turn be another "driver" function or a standard Hamilton function. And what this means for parallelization is that it's simply represented with a regular Python for loop!
This unlocks a few benefits:
(1) Similar to plain Python - this follows extremely closely how a naive plain-Python implementation would be coded. Just replace `driver.call` with the direct Python call and remove the `Collect` calls, and you have your standard Python code. Plain for loops look the same, args are passed the same way, etc., so it's easy for users to understand what's going on.
(2) Nested parallelization is possible - you could simply do something like the sketch after this list.
(3) You can call an individual path - e.g. `driver.execute('B', x=1)` for some path-defining node `B` (or `driver.call` within another Hamilton function) without needing to build all the other possible paths and without needing to alter/reconfigure the driver.
(4) You can get a static graph - with `Parallelizable` the parallelized branches are necessarily only known at run time. With this approach you can get the paths in a static graph during the graph build, which opens up further optimizations/inspections.
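To make (1) and (2) concrete, here is a sketch of the nested case under the same proposed syntax (again placeholder names, not a runnable API). Per (1), replacing each `driver.call("node", ...)` with a direct call to `node(...)` and dropping the `Collect` wrappers gives the equivalent plain-Python version:

```python
# Proposed syntax only -- nested fan-out is just two ordinary for loops.

def all_data(driver) -> list:
    results = []
    for product_line in Collect(driver.call("product_lines")):
        # Outer parallel level: one branch per product line.
        for product in Collect(driver.call("products", product_line=product_line)):
            # Inner parallel level: one branch per product within the line.
            results.append(driver.call("product_data", product=product))
    return Collect(results)
```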
Would be happy to hear the team's thoughts and discuss further. Thanks!