
OCaml5-aligned redesign: make continuation/task ownership carry dispatch roots #282

@proboscis

Description


Summary

This is the fundamental follow-up beyond #277 and #281.

  • #277 moved dispatch authority out of DispatchContext / DispatchState and onto machine-owned VM state
  • #281 tracks the remaining immediate compromise around dispatch-root reachability

This issue is broader:

Redesign continuations so they carry enough machine-owned control state to resume as first-class suspended computations, aligned as closely as practical with the OCaml 5 model.

The point is not merely to delete one map or add better resume patches.
The point is to eliminate the architectural reason ad hoc resume reconstruction was needed in the first place.

Context

The runtime is already much cleaner than before:

  • DispatchContext is gone as runtime authority
  • DispatchState is gone
  • dispatch is modeled by machine-owned Frame::Dispatch(...)
  • transient HandlerDispatch wrapper frames are gone
  • handler completion now routes through an explicit handler boundary frame
  • traceback repair heuristics were reduced by moving more truth into Rust-side trace emission
  • the test suite has repeatedly returned to green during this work

Current core validation baseline remains:

  • make sync
  • cargo check -q -p doeff-vm --manifest-path packages/doeff-vm/Cargo.toml
  • uv run pytest -q

Clarification: The Real Remaining Problem

The remaining problem is not just where a lookup map lives.
It is how the VM represents a suspended computation.

The current design still leans too much toward:

  • continuation = execution snapshot (frames_snapshot, mode, pending python, scope store)
  • outer handler/interceptor structure = reconstructed later from surrounding machine state

That split is what keeps producing edge cases like:

  • resumed computation losing the outer cache handler
  • CachePutEffect becoming unhandled after an Await
  • resumed code observing the wrong handler/interceptor topology
  • traceback needing compensating logic

The correct target is:

a continuation should be a first-class suspended computation object, carrying enough structural control context that resume is reactivation, not heuristic reconstruction.

Why The Existing “Topology Reinstall” Framing Is Not Final

Saying “reinstall the topology” is only acceptable as an intermediate description.
As a final architecture target, it sounds like a patch layered on top of an insufficient continuation representation.

The target should be stated more precisely:

  • continuation owns both execution-state snapshot and structural-boundary snapshot
  • resume reactivates that object into machine state
  • shared mutable state remains outside the continuation, in handler-local / runtime store
  • resume does not guess or repair outer handler/interceptor structure from incidental caller state

Correct Architecture Target

Continuation model

A continuation must carry two kinds of machine-owned state:

  1. Execution snapshot
  • frames / resumable control stack
  • mode
  • pending python state
  • scope store
  • marker / caller linkage needed for re-entry
  2. Structural boundary snapshot
  • prompt boundaries (WithHandler installations)
  • interceptor boundaries (WithIntercept installations)
  • enough identity to preserve which handler/interceptor installations are visible after resume

This must apply to both:

  • started continuations captured from running code
  • unstarted continuations created explicitly
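As a rough illustration of that requirement, here is a minimal sketch in which both the started and unstarted variants carry the structural envelope. All type and field names here are hypothetical stand-ins, not the actual doeff-vm-core definitions:

```rust
// Hypothetical sketch: names are illustrative stand-ins, not doeff-vm-core types.

#[derive(Clone, Debug, PartialEq)]
enum BoundaryEntry {
    Prompt { handler_scope_id: u64 }, // a WithHandler installation
    Intercept { id: u64 },            // a WithIntercept installation
}

#[derive(Clone, Debug, Default, PartialEq)]
struct ExecutionSnapshot {
    frames: Vec<String>, // stand-in for the resumable control stack
    mode: String,        // stand-in for the VM mode
}

#[derive(Clone, Debug, PartialEq)]
enum Continuation {
    // Captured from running code: real frames plus the envelope they ran under.
    Started {
        execution: ExecutionSnapshot,
        envelope: Vec<BoundaryEntry>,
    },
    // Created explicitly and not yet run: no frames yet, but it still carries
    // the envelope it must be activated under.
    Unstarted {
        entry: String,
        envelope: Vec<BoundaryEntry>,
    },
}

impl Continuation {
    // Both variants expose their structural envelope uniformly, so resume
    // never has to ask the caller what the topology was.
    fn envelope(&self) -> &[BoundaryEntry] {
        match self {
            Continuation::Started { envelope, .. } => envelope,
            Continuation::Unstarted { envelope, .. } => envelope,
        }
    }
}
```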

Shared mutable state

The continuation must not clone or own shared mutable runtime state.
That state stays external and shared, for example:

  • handler-local store cells
  • scheduler state
  • semaphore state
  • promise tables
  • cache handler state

From the VM’s perspective, these remain opaque mutable store-backed state.
The continuation must only preserve enough control/topology information to reach and use them again after resume.
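A small sketch of that ownership rule, using `Rc<RefCell<...>>` as a stand-in for the runtime store (the `CacheCell` and `capture` names are hypothetical):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Illustrative ownership sketch: the continuation holds a handle to shared
// store state, never a clone of the state itself.

type CacheCell = Rc<RefCell<Vec<(&'static str, i32)>>>;

#[derive(Clone)]
struct Continuation {
    // Control/topology fields elided; only the store handle matters here.
    cache: CacheCell,
}

fn capture(cache: &CacheCell) -> Continuation {
    // Rc::clone copies the pointer, not the cache contents: the state stays
    // external and shared.
    Continuation { cache: Rc::clone(cache) }
}
```

Because the continuation holds only a handle, mutations made by handlers while the computation is suspended are visible after resume, with no cloning or merging required.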

Live authority

Live authority for installed handler scope remains:

  • PromptBoundary.handler_scope_id

That remains the runtime source of truth for active scope.
Any continuation-carried scope metadata is snapshot metadata, not live authority.

What This Means Operationally

Resume semantics

Resume(k, value) / Transfer(k, value) should be defined as:

  • reactivate a suspended computation object
  • re-enter it under the same structural handler/interceptor envelope it had when captured
  • continue stepping from there

Not as:

  • materialize a bare execution segment from frames only
  • then try to reconstruct the missing outer structure from current caller state
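The contrast can be sketched with a toy VM in which resume reinstalls the captured envelope wholesale. `Vm`, `Suspended`, and `Envelope` are illustrative names, not the doeff-vm API:

```rust
// Toy resume-as-reactivation sketch.

#[derive(Clone, Debug, Default, PartialEq)]
struct Envelope(Vec<u64>); // handler/interceptor scope ids, outermost first

#[derive(Default)]
struct Vm {
    envelope: Envelope,
    frames: Vec<&'static str>,
}

#[derive(Clone)]
struct Suspended {
    envelope: Envelope,
    frames: Vec<&'static str>,
}

impl Vm {
    // Capture takes both snapshots together.
    fn capture(&self) -> Suspended {
        Suspended {
            envelope: self.envelope.clone(),
            frames: self.frames.clone(),
        }
    }

    // Resume(k, value): reinstall the captured envelope and frames, then
    // deliver the value. Nothing is rebuilt from the caller's current state.
    fn resume(&mut self, k: &Suspended, value: &'static str) {
        self.envelope = k.envelope.clone();
        self.frames = k.frames.clone();
        self.frames.push(value); // stand-in for delivering the resume value
    }
}
```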

Consequence

The final architecture should not rely on ad hoc “reinstall helper” logic as the conceptual model.
Implementation may still temporarily build segments/prompts during activation, but the design target is:

  • continuation object already owns the structure that is being reactivated
  • no semantic dependence on caller-chain guesswork or heuristic reattachment

Concrete Data Model To Aim At

The exact names may differ, but the direction should be explicit.

Continuation-owned structural entries

Conceptually, started and unstarted continuations should both be able to carry a chain like:

```rust
enum ReinstallChainEntry {
    Handler(HandlerInstallSpec),
    Interceptor(InterceptorInstallSpec),
}

struct HandlerInstallSpec {
    handler: KleisliRef,
    identity: Option<PyShared>,
    handler_scope_id: Option<HandlerScopeId>,
    types: Option<Vec<PyShared>>,
}

struct InterceptorInstallSpec {
    interceptor: KleisliRef,
    types: Option<Vec<PyShared>>,
    mode: InterceptMode,
    metadata: Option<CallMetadata>,
}
```

This is not meant as a long-term “patch helper”; it is the structural part of the continuation.
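To make "the structural part of the continuation" concrete, here is a toy activation loop over a reduced version of the chain above, with the install specs collapsed to plain ids; all names are illustrative. The key property is that activation is a replay of continuation-owned structure, not a reconstruction from caller state:

```rust
// Simplified activation sketch (specs collapsed to ids; names illustrative).

#[derive(Clone, Debug, PartialEq)]
enum ReinstallChainEntry {
    Handler { handler_scope_id: u64 },
    Interceptor { id: u64 },
}

#[derive(Default, Debug)]
struct MachineState {
    installed: Vec<ReinstallChainEntry>, // active boundaries, outermost first
}

fn activate(machine: &mut MachineState, chain: &[ReinstallChainEntry]) {
    // Walk outermost-first and install each boundary the continuation owns.
    for entry in chain {
        machine.installed.push(entry.clone());
    }
}
```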

Ownership split still matters

The owner/visible dispatch split discovered earlier is still valid:

  • bookkeeping owner
  • resumed-into / visible dispatch affiliation

But that split belongs inside a stronger continuation representation, not alongside a continuation that only partially knows its own topology.

Scheduler Implication

Scheduler remains user-space handler logic.
It must not become the semantic owner of dispatch/continuation rules.

The continuation model itself must be strong enough that the scheduler can simply transport suspended computations.
That means:

  • no scheduler-specific reattachment hacks
  • no special VM-side lookup state for outer handler recovery
  • no need for scheduler to reconstruct missing handler topology
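The intended end state can be sketched as a scheduler that is nothing but a queue of opaque suspended computations; `Task` and `Scheduler` are hypothetical names for illustration:

```rust
use std::collections::VecDeque;

// Sketch of "scheduler as transport": if a continuation is structurally
// complete, the scheduler only queues and dequeues it.

#[derive(Clone, Debug, PartialEq)]
struct Task {
    id: u32,
    envelope: Vec<u64>, // topology travels inside the suspended computation
}

#[derive(Default)]
struct Scheduler {
    ready: VecDeque<Task>,
}

impl Scheduler {
    // Transport only: no reattachment hacks, no VM-side lookup state,
    // no reconstruction of missing handler topology.
    fn park(&mut self, task: Task) {
        self.ready.push_back(task);
    }

    fn next(&mut self) -> Option<Task> {
        self.ready.pop_front()
    }
}
```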

Await Implication

The final sync Await design should follow the same rule:

  • sync Await runtime is Rust-owned and handler-local
  • continuation-object reactivation must preserve the outer handler/interceptor structure naturally
  • resumed code after Await must still see outer handlers such as cache handlers and interceptors without special post-hoc repair

Traceback / Trace Implication

Traceback should become a pure projection of runtime truth.
That requires:

  • handler identity in trace payloads should come from runtime-owned stable identity (for example handler_scope_id)
  • active-chain suppression / stack attach should happen during active-chain assembly, not in the formatter
  • doeff/traceback.py should become render-only
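A render-only traceback is a pure function of runtime-owned facts. The sketch below assumes a hypothetical `TraceEntry` payload carrying a stable `handler_scope_id`; the names are illustrative, not the trace_state.rs types:

```rust
// Sketch of traceback as a pure projection: rendering consumes runtime-owned
// facts and performs no repair.

struct TraceEntry {
    handler_scope_id: u64, // runtime-owned stable identity
    label: &'static str,
}

// A pure function of runtime truth: same entries in, same text out.
// Suppression and stack-attach decisions happen upstream, during
// active-chain assembly, never here.
fn render(entries: &[TraceEntry]) -> String {
    entries
        .iter()
        .map(|e| format!("  at {} (scope {})", e.label, e.handler_scope_id))
        .collect::<Vec<_>>()
        .join("\n")
}
```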

Exact Current Code Areas

Core runtime:

  • packages/doeff-vm-core/src/continuation.rs
  • packages/doeff-vm-core/src/vm.rs
  • packages/doeff-vm-core/src/segment.rs
  • packages/doeff-vm-core/src/dispatch.rs
  • packages/doeff-vm-core/src/trace_state.rs

Scheduler / transport:

  • packages/doeff-core-effects/src/scheduler/mod.rs

Await:

  • packages/doeff-core-effects/src/handlers/mod.rs
  • doeff/handlers/await_handlers.py

What Has Already Been Learned From Experiments

Experiments have already established that:

  • pure caller-chain recovery is insufficient
  • a bare execution snapshot is insufficient for started continuation resume
  • nested dispatch exposes that bookkeeping owner and resumed-into topology are not the same concept
  • simply copying a few scope ids is not enough
  • the remaining failures are strongly tied to continuation topology, not just mutable state placement

Those experiments should inform the redesign instead of being repeated blindly.

Validation Targets

The main regression detectors remain:

  • tests/effects/test_finally_semaphore_over_release.py
  • tests/effects/test_effect_combinations.py
  • tests/test_dispatch_completion.py
  • tests/core/test_runtime_regressions_manual.py
  • tests/core/test_traceback_format_default.py
  • tests/core/test_spec_trace_001_examples.py
  • tests/core/test_traceback_spec_compliance.py
  • uv run pytest -q

Acceptance Criteria

  1. A continuation is explicitly treated as a first-class suspended computation object, not merely a frame snapshot that requires heuristic outer-structure repair
  2. Started continuation resume preserves outer handler/interceptor topology without relying on ad hoc caller-chain reconstruction
  3. Shared mutable runtime state remains external and store-backed; it is not cloned into continuations
  4. Live scope authority remains PromptBoundary.handler_scope_id
  5. Traceback becomes a pure projection of runtime truth; formatter-side repair is gone
  6. make sync, cargo check -q -p doeff-vm --manifest-path packages/doeff-vm/Cargo.toml, and uv run pytest -q all stay green

Relationship To Other Issues

  • #277: primary dispatch architecture rewrite
  • #281: immediate follow-up to remove remaining dispatch-root lookup compromise

This issue is the architectural parent of both the remaining started-continuation and traceback work.
If #281 is “delete the last fallback map,” this issue is now more precisely:

make continuations structurally complete enough that resume is reactivation of suspended computation state, not heuristic reconstruction of lost topology.
