Skip to content

feat: support agent graceful shutdown#909

Open
uuuyuqi wants to merge 4 commits intoagentscope-ai:mainfrom
uuuyuqi:uuuyuqi/feat-graceful-shutdown
Open

feat: support agent graceful shutdown#909
uuuyuqi wants to merge 4 commits intoagentscope-ai:mainfrom
uuuyuqi:uuuyuqi/feat-graceful-shutdown

Conversation

@uuuyuqi
Copy link
Copy Markdown
Contributor

@uuuyuqi uuuyuqi commented Mar 10, 2026

AgentScope-Java Version

1.0.10-SNAPSHOT

Description

Closes #907

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with mvn spotless:apply
  • All tests are passing (mvn test)
  • Javadoc comments are complete and follow project conventions
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

uuuyuqi added 2 commits March 6, 2026 16:08
Change-Id: I5958adfad87fdfd64db45cffbb36cb3c85d83850
…hutdownManager

Change-Id: I82ffdab04bc0d6c5eed62d23510c350426e97f64
@uuuyuqi uuuyuqi requested a review from a team March 10, 2026 10:57
Change-Id: I694c2d4edd2e683b9c1d2aec7ce074af5c2aeef7
@uuuyuqi uuuyuqi changed the title [feat] support agent graceful shutdown feat: support agent graceful shutdown Mar 11, 2026
Change-Id: I9ef35944e9c7eebf30b03468c1bbfcde897b77c7
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements a framework-level graceful shutdown lifecycle for AgentScope-Java, adding request tracking, checkpoint-based interruption, session persistence for resume, and ordered JVM shutdown cleanup (agents first, HTTP transports last), plus a new example module and comprehensive tests.

Changes:

  • Add GracefulShutdownManager + hook integration to reject new requests during shutdown and interrupt in-flight requests at safe checkpoints / timeout.
  • Persist shutdown-interrupted state to Session and deduplicate retried input on the next request.
  • Add a tool-execution shutdown guard and a unified JVM shutdown hook to close HTTP transports after agent termination; include new examples and tests.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
agentscope-examples/pom.xml Adds the new graceful-shutdown example module to the build.
agentscope-examples/graceful-shutdown/pom.xml New Spring Boot-based example module for graceful shutdown demos/e2e.
agentscope-examples/graceful-shutdown/src/main/resources/application.yml Example config for Spring graceful shutdown + agent shutdown options.
agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/smoke/GracefulShutdownExample.java Standalone smoke scenarios for shutdown + resume behavior.
agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/ShutdownE2eApplication.java Spring Boot app entrypoint for e2e shutdown testing.
agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/AgentController.java REST endpoints for chat + readiness/status + triggering shutdown.
agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/AgentService.java Creates agents per request, configures shutdown config, handles shutdown exceptions.
agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/SlowAnalysisTool.java Slow tool to simulate long-running tool execution during shutdown.
agentscope-core/src/main/java/io/agentscope/core/shutdown/ShutdownState.java New global shutdown state enum.
agentscope-core/src/main/java/io/agentscope/core/shutdown/PartialReasoningPolicy.java Policy enum for saving/discarding partial reasoning on shutdown.
agentscope-core/src/main/java/io/agentscope/core/shutdown/GracefulShutdownConfig.java Shutdown timeout + partial reasoning policy configuration record.
agentscope-core/src/main/java/io/agentscope/core/shutdown/AgentShuttingDownException.java Exception for rejected/interrupted requests during shutdown.
agentscope-core/src/main/java/io/agentscope/core/shutdown/ShutdownInterruptedState.java Session-persisted flag for shutdown-interrupted sessions.
agentscope-core/src/main/java/io/agentscope/core/shutdown/ShutdownSessionBinding.java Session binding record used by shutdown persistence.
agentscope-core/src/main/java/io/agentscope/core/shutdown/ActiveRequestContext.java Per-request context: interrupt + save-to-session behavior.
agentscope-core/src/main/java/io/agentscope/core/shutdown/GracefulShutdownManager.java Core singleton: state machine, tracking, timeout monitor, persistence integration.
agentscope-core/src/main/java/io/agentscope/core/shutdown/GracefulShutdownHook.java System hook: safe-checkpoint interruption + resume input deduplication.
agentscope-core/src/main/java/io/agentscope/core/shutdown/AgentScopeJvmShutdownHook.java Unified JVM shutdown hook: graceful shutdown → await → transport shutdown.
agentscope-core/src/main/java/io/agentscope/core/agent/AgentBase.java Integrates request registration/unregistration + shutdown request rejection; adds interrupt source.
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java Binds session for shutdown persistence; saves/discards partial reasoning; system-interrupt handling.
agentscope-core/src/main/java/io/agentscope/core/tool/ToolExecutor.java Races tool execution against global shutdown-timeout signal.
agentscope-core/src/main/java/io/agentscope/core/session/Session.java Adds per-key delete API (default method).
agentscope-core/src/main/java/io/agentscope/core/session/JsonSession.java Implements per-key delete.
agentscope-core/src/main/java/io/agentscope/core/session/InMemorySession.java Implements per-key delete.
agentscope-core/src/main/java/io/agentscope/core/model/transport/HttpTransportFactory.java Removes per-transport shutdown hook in favor of unified shutdown orchestration.
agentscope-core/src/test/java/io/agentscope/core/shutdown/GracefulShutdownTest.java Comprehensive tests covering shutdown lifecycle, hooks, persistence, and tracking.

}

@Override
public void loadFrom(Session session, SessionKey sessionKey) {
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadIfExists() binds the session to GracefulShutdownManager, but loadFrom() does not. If callers use loadFrom(session, key) and then rely on graceful-shutdown persistence/dedup, the manager won't have the session binding and won’t be able to save/clear the shutdown-interrupted flag. Consider also binding in loadFrom() for consistent behavior.

Suggested change
public void loadFrom(Session session, SessionKey sessionKey) {
public void loadFrom(Session session, SessionKey sessionKey) {
shutdownManager.bindSession(this, session, sessionKey);

Copilot uses AI. Check for mistakes.
Comment on lines +403 to +405
resetInterruptFlag();
GracefulShutdownManager.getInstance().ensureAcceptingRequests();
GracefulShutdownManager.getInstance().registerRequest(this);
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acquireExecution() sets running to true before calling GracefulShutdownManager.ensureAcceptingRequests(). If ensureAcceptingRequests() throws (shutdown in progress), the Mono.using resource supplier fails and releaseExecution() will not run, leaving running=true and effectively bricking the agent instance for future calls. Consider checking shutdown acceptance before flipping running, or wrapping the shutdown check/registration in try/catch to revert running on failure.

Suggested change
resetInterruptFlag();
GracefulShutdownManager.getInstance().ensureAcceptingRequests();
GracefulShutdownManager.getInstance().registerRequest(this);
try {
resetInterruptFlag();
GracefulShutdownManager.getInstance().ensureAcceptingRequests();
GracefulShutdownManager.getInstance().registerRequest(this);
} catch (RuntimeException ex) {
if (checkRunning) {
running.set(false);
}
throw ex;
}

Copilot uses AI. Check for mistakes.
Comment on lines +155 to +159
String requestId = UUID.randomUUID().toString();
ActiveRequestContext ctx =
new ActiveRequestContext(requestId, agentBase, session, sessionKey);
activeRequestsByAgentId.put(agent.getAgentId(), ctx);
return requestId;
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Active requests are tracked in activeRequestsByAgentId keyed only by agentId. If checkRunning is disabled (concurrent call()s on the same agent), a second registerRequest() will overwrite the first context; later unregisterRequest() can drop the remaining in-flight request, causing getActiveRequestCount() to be wrong and potentially transitioning to TERMINATED while work is still running. Tracking should be per-request (e.g., keyed by requestId) or support multiple contexts per agentId.

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +168
return;
}
activeRequestsByAgentId.remove(agent.getAgentId());
sessionBindings.remove(agent.getAgentId());
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unregisterRequest() removes the agent's sessionBindings entry on every request completion. For agent instances that call loadIfExists() once and are then reused across multiple call()s, this drops the session binding and prevents shutdown persistence / interrupted-flag clearing on subsequent requests. Consider keeping the binding for the agent lifecycle (or providing an explicit unbind/close path) rather than removing it on every request cleanup.

Copilot uses AI. Check for mistakes.
Comment on lines +124 to +128
* @param sessionKey the session identifier
* @param key the state key to delete
*/
default void delete(SessionKey sessionKey, String key) {
// Default no-op; implementations should override if they support per-key deletion
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new Session.delete(SessionKey, String) defaults to a no-op. GracefulShutdownManager.checkAndClearShutdownInterrupted() relies on this to clear the shutdown-interrupted flag; any custom Session that doesn't override this will never clear the flag and may keep discarding input on every retry. Consider failing fast (e.g., UnsupportedOperationException) or changing shutdown logic to clear the flag via an overwrite instead of requiring per-key deletion support.

Suggested change
* @param sessionKey the session identifier
* @param key the state key to delete
*/
default void delete(SessionKey sessionKey, String key) {
// Default no-op; implementations should override if they support per-key deletion
* <p>The default implementation throws {@link UnsupportedOperationException}. Implementations
* that support per-key deletion must override this method.
*
* @param sessionKey the session identifier
* @param key the state key to delete
* @throws UnsupportedOperationException if the session implementation does not support per-key
* deletion
*/
default void delete(SessionKey sessionKey, String key) {
throw new UnsupportedOperationException(
"Per-key deletion is not supported by this Session implementation");

Copilot uses AI. Check for mistakes.
Comment on lines +306 to +310
activeRequestsByAgentId.clear();
sessionBindings.clear();
shutdownStartedAt.set(null);
monitorStarted.set(false);
shutdownTimeoutSignal.set(Sinks.empty());
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetForTesting() resets monitorStarted to false but does not cancel/shutdown the already-scheduled monitor.scheduleAtFixedRate(...) task (nor recreate the executor). If performGracefulShutdown() is invoked again after a reset, additional monitor tasks can be scheduled, leading to duplicated timeout enforcement/logging and potential test flakiness. Consider keeping a ScheduledFuture to cancel on reset, or shutting down/recreating the executor in resetForTesting().

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +52
* <p>All registered transports and the default transport will be closed
* during JVM shutdown, orchestrated by {@code AgentScopeJvmShutdownHook}
* after all agents have finished processing.
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the per-transport shutdown hook removed, HttpTransportFactory now relies on AgentScopeJvmShutdownHook (registered from GracefulShutdownManager) to call HttpTransportFactory.shutdown(). If an application uses models/transports without ever loading/creating an AgentBase (so GracefulShutdownManager is never initialized), transports may no longer be closed on JVM shutdown despite this Javadoc. Consider ensuring the unified hook is registered whenever HttpTransportFactory is used (e.g., trigger GracefulShutdownManager.getInstance() from HttpTransportFactory).

Copilot uses AI. Check for mistakes.
() -> {
try {
manager.performGracefulShutdown();
manager.awaitTermination(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The awaitTermination in the JVM hook currently uses the same timeout as force-interrupt. Since enforceTimeoutAndInterrupt triggers the interrupt at the exact moment awaitTermination times out, the Agent is left with no buffer to process the interrupt or unregister before the connection is closed by HttpTransportFactory.shutdown().

We should add a grace period (e.g., an additional 5s) to the awaitTermination timeout to ensure the Agent has sufficient time to handle the interrupt and perform a clean teardown.

throw new IllegalStateException("Agent is still running, please wait for it to finish");
}
resetInterruptFlag();
GracefulShutdownManager.getInstance().ensureAcceptingRequests();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The running.compareAndSet(false, true) is executed before ensureAcceptingRequests(). Since the latter is part of the resourceSupplier in Mono.using, if it throws an AgentShuttingDownException, the cleanup callback (releaseExecution) will not be triggered (as the resource was never successfully initialized).

This results in a state leak where the running flag is stuck at true, permanently preventing this Agent instance from accepting any subsequent requests. We should ensure the state is reset if the resource acquisition fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support agent graceful shutdown

3 participants