Skip to content

Fix heartbeat for process restarts#4

Open
schleyfox wants to merge 2 commits intomasterfrom
fix-heartbeat-for-process-restarts
Open

Fix heartbeat for process restarts#4
schleyfox wants to merge 2 commits intomasterfrom
fix-heartbeat-for-process-restarts

Conversation

@schleyfox
Copy link
Copy Markdown
Collaborator

When the process backing a worker exits, pool-hall replaces it with a
new process. Process start up can take a nontrivial amount of time.
The heartbeat module was not aware of process replacement, so it
continued expecting heartbeats while the process was starting up. This
can lead to a worker being reported as stalled, when it merely booting
up, leading to confusion over underlying problem (stalls = infinite
loops, process replacement = unhandled errors or the like).

This factors in worker state to monitoring and tweaks delta checking to
avoid reporting start up time.

/cc @andyfangdz @goatslacker @martinwin

Copy link
Copy Markdown
Collaborator

@ljharb ljharb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

src/heartbeat.js Outdated
this.emit('workerStall', workerId);
this.emit('info:workerHeartbeatDelta', workerId, delta);
.filter(([, worker]) => !worker.isDead())
.filter(([, worker]) => worker.state === 'up').forEach(([workerId]) => {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sinc the filters are on their own lines, the forEach should be too

Copy link
Copy Markdown
Collaborator

@andyfangdz andyfangdz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. Thanks for doing this!

It looks like we now won’t be detecting stalls that are caused by process startup. This is reasonable as time to first heartbeat should be treated separately. Perhaps a future task 😄

@schleyfox
Copy link
Copy Markdown
Collaborator Author

@andyfangdz yes, though I think we already weren't adequately tracking that due to monitoring only starting after all workers came up for the first time. Monitoring successful boot and start up time is def a useful, but distinct thing.

When the process backing a worker exits, pool-hall replaces it with a
new process.  Process start up can take a nontrivial amount of time.
The heartbeat module was not aware of process replacement, so it
continued expecting heartbeats while the process was starting up.  This
can lead to a worker being reported as stalled, when it merely booting
up, leading to confusion over underlying problem (stalls = infinite
loops, process replacement = unhandled errors or the like).

This factors in worker state to monitoring and tweaks delta checking to
avoid reporting start up time.
@schleyfox schleyfox force-pushed the fix-heartbeat-for-process-restarts branch from 253a1e6 to 45aa881 Compare October 22, 2018 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants