✨ Add stepwise intermediate reward for RL #526
Shaobo-Zhou wants to merge 122 commits into munich-quantum-toolkit:main
Conversation
Update action space and feature space
Update actions
Update action space
Fix: resolve pre-commit issues and add missing annotations
Remove example_test.py
Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>
Fix windows runtime warning issue
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/approx_reward.py`:
- Around line 62-63: The code repeatedly calls estimate_basis_gate_counts (which
internally transpiles) for the same circuit state; modify calculate_reward
(and/or predictorenv.py's step) to memoize the basis-gate counts keyed by a
canonical representation of the quantum circuit (e.g., serialized QASM or a hash
of qc.data + device target), reuse the cached result for subsequent calls, and
fall back to calling get_basis_gates_from_target and estimate_basis_gate_counts
only on cache misses; ensure the cache key and lookup are used wherever basis =
get_basis_gates_from_target(device) and counts = estimate_basis_gate_counts(qc,
basis_gates=basis) are invoked so duplicate transpilation is avoided.
- Around line 30-40: The loop in estimate_basis_gate_counts uses deprecated
tuple unpacking of qc_t.data; replace it by iterating over CircuitInstruction
objects and access attributes directly (operation/qubits/clbits). Concretely, in
estimate_basis_gate_counts change the loop to iterate like "for circ_instr in
qc_t.data:", extract the gate with "instr = circ_instr.operation" (and use
circ_instr.qubits / circ_instr.clbits if needed), then use instr.name and the
existing BLACKLIST/counts logic.
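The memoization suggested in the first comment can be sketched as follows. This is a hedged illustration, not the PR's code: `cached_basis_gate_counts`, the injected `serialize`/`estimate` callables, and the QASM-string cache key are stand-ins for the actual helpers in `approx_reward.py`:

```python
_counts_cache: dict[str, dict[str, int]] = {}


def cached_basis_gate_counts(qc, basis_gates, serialize, estimate):
    """Return basis-gate counts, transpiling only on a cache miss.

    `serialize` turns the circuit into a canonical string (e.g. serialized
    QASM) and `estimate` is the expensive transpile-based counter; both are
    injected here because this sketch does not assume qiskit is installed.
    """
    key = serialize(qc) + "|" + ",".join(sorted(basis_gates))
    if key not in _counts_cache:  # only transpile on a miss
        _counts_cache[key] = estimate(qc, basis_gates=basis_gates)
    return _counts_cache[key]
```

Repeated calls for the same circuit and basis then reuse the cached counts instead of transpiling again.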
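The second comment's fix — iterating `CircuitInstruction` objects instead of tuple-unpacking `qc_t.data` — could look roughly like this. The `BLACKLIST` contents are taken from the review thread; no qiskit import is assumed, only the `.data`/`.operation` attribute shape of a transpiled circuit:

```python
BLACKLIST = {"measure", "reset", "delay", "barrier"}  # per the review thread


def count_basis_gates(qc_t):
    """Count operations by name, iterating CircuitInstruction objects
    instead of the deprecated (instr, qargs, cargs) tuple unpacking."""
    counts: dict[str, int] = {}
    for circ_instr in qc_t.data:      # each item is a CircuitInstruction
        instr = circ_instr.operation  # access attributes, not tuple positions
        if instr.name in BLACKLIST:
            continue
        counts[instr.name] = counts.get(instr.name, 0) + 1
    return counts
```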
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 259-261: The code only suppresses delta_reward when transitioning
from "approx"→"exact"; change the logic in the reward computation (referencing
prev_reward_kind, new_kind, and delta_reward in predictorenv.py) to suppress the
delta for any regime change (i.e., if prev_reward_kind != new_kind then set
delta_reward = 0.0) and ensure prev_reward_kind is still updated after this
check so later steps see the new regime; adjust the conditional that currently
checks (self.prev_reward_kind == "approx" and new_kind == "exact") to a generic
regime-change check and keep existing comments about metrics comparability.
- Around line 623-624: Replace the local duplicate gate_blacklist in
_ensure_device_averages_cached with the shared BLACKLIST constant from
approx_reward to avoid divergence; import BLACKLIST from the approx_reward
module and use it when building basis_ops (i.e., change basis_ops = [name for
name in op_names if name not in gate_blacklist] to use BLACKLIST), removing the
local {"measure","reset","delay","barrier"} definition.
- Around line 263-268: The two branches computing reward_val in predictorenv.py
are identical; replace the if/elif block that compares delta_reward with a
single assignment using the existing symbols: set reward_val = self.reward_scale
* delta_reward (preserving delta_reward == 0.0 result) inside the same function
where delta_reward and reward_val are used so you remove the redundant
conditional logic.
- Around line 218-229: Calculate and assign prev_reward and prev_reward_kind
only when the chosen action is non-terminal: move the call to calculate_reward
from before apply_action into the branch that handles non-terminal actions
(i.e., the branch that proceeds after altered_qc is not None and not the
terminate action), so terminal/terminate actions skip the expensive pre-action
evaluation; ensure apply_action, calculate_reward, and the used_actions logic
remain unchanged. Also initialize prev_reward and prev_reward_kind in reset()
(or add a guard before use) so the first step has defined values when no prior
reward exists.
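The generic regime-change check from the first `predictorenv.py` item amounts to a one-line condition; a minimal sketch (the function name is illustrative, the symbols mirror the comment):

```python
def shaped_delta(prev_kind: str, new_kind: str,
                 prev_reward: float, new_reward: float) -> float:
    """Step delta, suppressed on ANY regime change ("approx" <-> "exact" in
    either direction), since metrics from different regimes are not
    directly comparable."""
    if prev_kind != new_kind:
        return 0.0
    return new_reward - prev_reward
```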
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 268-352: Update the calculate_reward signature to narrow the mode
parameter from plain str to a Literal type (Literal["auto", "exact", "approx"])
to catch invalid modes at type-check time; add the appropriate import for
Literal (from typing or typing_extensions depending on project compatibility)
and update any affected type hints/exports so static checkers (mypy/pyright)
pick it up—no logic changes needed inside calculate_reward or its use of
self._is_native_and_mapped, expected_fidelity,
approx_estimated_success_probability, etc.
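A sketch of the suggested signature change — only the annotation narrows; the body below is a stub, not the PR's implementation:

```python
from typing import Literal

RewardMode = Literal["auto", "exact", "approx"]


def calculate_reward(mode: RewardMode = "auto") -> tuple[float, str]:
    """Stub illustrating the narrowed annotation. With the Literal type,
    mypy/pyright reject calls like calculate_reward(mode="aprox") at
    type-check time; runtime behavior is unchanged."""
    resolved = "exact" if mode in ("auto", "exact") else "approx"
    return 0.0, resolved
```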
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
104-104: 🧹 Nitpick | 🔵 Trivial — Use lazy logging formatting for consistency.
String concatenation in the log call is inconsistent with the %s-style formatting used elsewhere in this file (e.g., lines 217, 271). Ruff rule G003 discourages eager string building in logging calls.
♻️ Suggested fix
- logger.info("Init env: " + reward_function)
+ logger.info("Init env: %s", reward_function)
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 244-245: The call to calculate_reward at the start of each step is
redundant because self.prev_reward and self.prev_reward_kind already hold the
post-action reward from the previous step; remove the redundant invocation in
the step logic (the line calling calculate_reward that sets
self.prev_reward/self.prev_reward_kind) and instead initialize/seed
self.prev_reward and self.prev_reward_kind during reset() by calling
calculate_reward() once there, ensuring the end-of-step code that updates
prev_reward/prev_reward_kind (lines around where post-action reward is set)
continues to overwrite for the next step; adjust any references in step() and
reset() accordingly to reuse the carried-over prev_reward/prev_reward_kind
without recomputing on each new step.
- Around line 726-735: _determine_valid_actions_for_state and
_is_native_and_mapped duplicate expensive GatesInBasis/CheckMap work; cache the
pass instances and share computed results to avoid repeated checks. Create
cached pass instances on self (e.g., self._gates_in_basis and self._check_map)
initialized once and reused in both determine_valid_actions_for_state and
_is_native_and_mapped, and have determine_valid_actions_for_state store the last
computed flags (e.g., self._last_only_nat_gates and self._last_mapped) after
running the checks so _is_native_and_mapped can return those cached booleans if
available instead of re-running the passes; fall back to running the cached pass
instances if no cached results exist and update the cache.
- Around line 218-242: The estimated_hellinger_distance branch duplicates the
apply→decompose→update-state→check-valid-actions logic found later; extract that
shared sequence into a helper (e.g. _apply_and_update or
_apply_action_and_update_state) which calls apply_action, performs the gate-type
decompose loop, sets self.state and self.state._layout, increments
self.num_steps, updates self.valid_actions via determine_valid_actions_for_state
and raises RuntimeError if empty; then replace the duplicated block in the
reward_function == "estimated_hellinger_distance" branch to call the helper and
only keep the branch-specific reward logic (calculate_reward(mode="exact") vs
no_effect_penalty) before returning create_feature_dict(self.state), reward,
done, False, {}.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/compilation/test_predictor_rl.py`:
- Around line 149-204: Tests test_approx_reward_ef and test_approx_reward_esp
duplicate setup and assertions; replace them with a single parametrized test
using pytest.mark.parametrize over figure_of_merit values ("expected_fidelity"
and "estimated_success_probability") that creates Predictor(figure_of_merit=...)
and monkeypatches predictor.env._is_native_and_mapped the same way, runs val,
kind = predictor.env.calculate_reward(...), asserts common conditions (kind ==
"approx", 0<=val<=1, predictor.env._dev_avgs_cached, isinstance checks and
len(predictor.env._err_by_gate)>0, len(predictor.env._dur_by_gate)>0), and then
include the ESP-specific assertions (assert len(..._dur_by_gate)>0 and assert
predictor.env._tbar is None or predictor.env._tbar > 0.0) only when the
parameter == "estimated_success_probability"; keep references to Predictor,
predictor.env.calculate_reward, _is_native_and_mapped, _err_by_gate,
_dur_by_gate, and _tbar to locate the code.
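The suggested parametrized test might look as follows. `Predictor`, `env.calculate_reward`, and the cached attributes (`_err_by_gate`, `_dur_by_gate`, `_tbar`) come from the PR under review; the fakes below are stand-ins so the sketch is self-contained:

```python
import pytest


class _FakeEnv:
    """Stand-in for predictor.env with just the attributes the test checks;
    in the PR these are populated from device calibration data."""
    _err_by_gate = {"cx": 0.01}
    _dur_by_gate = {"cx": 3.2e-8}
    _tbar = 1.0e-4

    def calculate_reward(self, mode="auto"):
        return 0.9, "approx"


class Predictor:
    """Stand-in for the PR's Predictor class."""
    def __init__(self, figure_of_merit):
        self.figure_of_merit = figure_of_merit
        self.env = _FakeEnv()


@pytest.mark.parametrize("fom", ["expected_fidelity", "estimated_success_probability"])
def test_approx_reward(fom):
    predictor = Predictor(figure_of_merit=fom)
    val, kind = predictor.env.calculate_reward(mode="approx")
    # assertions common to both figures of merit
    assert kind == "approx"
    assert 0.0 <= val <= 1.0
    assert len(predictor.env._err_by_gate) > 0
    # ESP-only assertions, gated on the test parameter `fom`
    if fom == "estimated_success_probability":
        assert len(predictor.env._dur_by_gate) > 0
        assert predictor.env._tbar is None or predictor.env._tbar > 0.0
```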
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
562-576: 🧹 Nitpick | 🔵 Trivial
`determine_valid_actions_for_state` creates fresh pass instances while `_native_and_mapped` reuses cached ones. Both methods perform the same GatesInBasis + CheckMap checks.
`determine_valid_actions_for_state` still instantiates new objects each call (lines 564, 574), while `_native_and_mapped` uses the cached instances from `__init__`. Consider reusing the cached instances here too, or calling `_native_and_mapped` internally.
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 705-717: The cache in _native_and_mapped uses id(qc) which can be
recycled and cause stale hits; either remove the cache entirely and always run
_gates_in_basis_check and _check_map, or replace the id-based key with a safe
version counter: add a _state_version integer on the class (initialize in
__init__), increment _state_version whenever self.state is reassigned, store and
compare _last_state_version (instead of _last_qc_id) alongside
_last_native_mapped in _native_and_mapped, and update that version when you set
self.state so the cached (only_native, mapped) is only reused for the current
state.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 175-178: The condition is checking the imported type alias
figure_of_merit instead of the test parameter fom, so the ESP-specific
assertions (predictor.env._dur_by_gate and predictor.env._tbar) never run;
update the conditional to use the test parameter (if fom ==
"estimated_success_probability") so the assertions execute for the ESP reward
path and validate duration/tbar caching for predictor.env._dur_by_gate and
predictor.env._tbar.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 636-639: The suppress(KeyError, AttributeError, TypeError) around
the basis_ops loop is too broad and can hide legitimate TypeError bugs when
calling target.operation_from_name or reading op.num_qubits; change the
exception list to (KeyError, AttributeError) (i.e., remove TypeError) and
optionally add a debug/log line inside the except path that records the gate
name and the caught exception so skipped gates are visible; locate the loop
using basis_ops, target.operation_from_name, arity_by_name and op.num_qubits to
make the change.
- Around line 238-243: The code path for reward_function ==
"estimated_hellinger_distance" only gives the episode terminal step a real
reward and assigns no_effect_penalty for every non-terminal step, which leaves
the agent without shaping signal; update the branch in predictorenv.py (the
block checking self.reward_function == "estimated_hellinger_distance") to either
implement a non-terminal shaping heuristic or, if this is intentional, add a
concise explanatory comment above the block referencing that choice and the
involved symbols (self.reward_function, "estimated_hellinger_distance",
self.no_effect_penalty, self.calculate_reward, and self.action_terminate_index)
so readers know why calculate_reward is only called at done and why non-terminal
steps receive no signal.
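The narrowed exception handling from the first item could be sketched as follows; `target.operation_from_name` and `op.num_qubits` mirror the API referenced in the comment, and the helper name is illustrative:

```python
from contextlib import suppress


def build_arity_map(target, basis_ops):
    """Skip only gates the target genuinely doesn't know about
    (KeyError/AttributeError); a TypeError here would indicate a real bug
    and is deliberately allowed to propagate, per the review."""
    arity_by_name: dict[str, int] = {}
    for name in basis_ops:
        with suppress(KeyError, AttributeError):
            op = target.operation_from_name(name)
            arity_by_name[name] = op.num_qubits
    return arity_by_name
```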
In `@tests/compilation/test_predictor_rl.py`:
- Around line 165-170: The monkeypatch targets a non-existent method
`_is_native_and_mapped` and has no effect; update the test to force the
approximate path by calling calculate_reward with mode="approx" instead of
relying on device mapping heuristics. Locate the call to
predictor.env.calculate_reward(qc=qc, mode="auto") in the test and change the
mode argument to "approx" (remove the monkeypatch line that sets
`_is_native_and_mapped`), ensuring the test exercises the approximate reward
branch in calculate_reward rather than depending on GatesInBasis/CheckMap
behavior.
```python
done = action == self.action_terminate_index

if self.reward_function == "estimated_hellinger_distance":
    reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty
    self.state._layout = self.layout  # noqa: SLF001
    return create_feature_dict(self.state), reward_val, done, False, {}
```
🧹 Nitpick | 🔵 Trivial
Hellinger path provides no shaping signal — non-terminal steps always get a flat penalty.
This is a design choice, but it means the RL agent receives zero useful gradient signal during an episode for estimated_hellinger_distance — reward information only comes at termination. If this is intentional (e.g., because there's no approximate Hellinger model), a brief comment explaining the rationale would help future readers.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 242-242: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
Indeed, there's no approximate Hellinger model. However, I don't understand why this must imply a no-effect penalty here (the circuit might have changed)?
@flowerthrower You're absolutely right — this does look like a design oversight rather than an intentional choice.
The current code assigns no_effect_penalty to every non-terminal step when using estimated_hellinger_distance, regardless of whether the circuit actually improved or degraded. This is inconsistent with how the environment handles expected_fidelity and estimated_success_probability below (lines 244+), where:
- The previous reward is cached
- A new reward is computed after each action
- The step reward is the scaled delta (`reward_scale * (new_val - prev_reward)`)
- Only when `delta == 0.0` does it fall back to `no_effect_penalty`
The absence of an approximate Hellinger model simply means you can't use the transpile-based proxy — but you can still compute the exact Hellinger distance at each step via self.calculate_reward(mode="exact") and shape rewards based on the delta, just like EF/ESP do.
Suggested fix:
Remove the special-case early return for Hellinger and let it flow through the same delta-based shaping logic below, but always call calculate_reward(mode="exact") (since mode="auto" would also choose "exact" anyway, as there's no approximate path). That would give the agent meaningful gradient signal during the episode.
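The suggested delta-based shaping could look roughly like this — a sketch, not the PR's code; `scale` and `no_effect_penalty` mirror the env attributes discussed above:

```python
def hellinger_step_reward(prev_val: float, new_val: float,
                          scale: float, no_effect_penalty: float) -> float:
    """Shape on the change in the exact Hellinger-based metric at every
    step, mirroring the EF/ESP delta logic; fall back to the penalty only
    when the action had no measurable effect."""
    delta = new_val - prev_val
    if delta == 0.0:
        return no_effect_penalty
    return scale * delta
```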
✏️ Learnings added
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/predictor PR: 526
File: src/mqt/predictor/rl/predictorenv.py:238-243
Timestamp: 2026-02-09T13:15:50.387Z
Learning: When using reward_function == "estimated_hellinger_distance" in PredictorEnv (src/mqt/predictor/rl/predictorenv.py), the current implementation only computes the exact reward at episode termination and assigns no_effect_penalty to all non-terminal steps, which prevents intermediate shaping signal. This is likely unintentional; the Hellinger distance can be computed exactly at each step and used for delta-based reward shaping, similar to expected_fidelity and estimated_success_probability.
flowerthrower
left a comment
Hi there, great work so far. Only a few comments left from my end.
Signed-off-by: Patrick Hopf <81010725+flowerthrower@users.noreply.github.com>
@coderabbitai provide a detailed summary of the structural and methodical changes in this PR |
🏗️ Structural Changes
New Module
Core Environment Refactoring
Description
This PR introduces a shaped, step-wise reward signal for the RL-based compiler.
For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:
Exact regime (native + mapped circuits)
If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.
Approximate regime (non-native / unmapped circuits)
If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.
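The two-regime scheme described above can be condensed into a short sketch. All callables here (`exact_metric`, `approx_metric`, `is_native_and_mapped`) are illustrative stand-ins, not the PR's actual API; in the PR, the exact metric is the calibration-aware figure of merit and the approximate one comes from the canonical cost model:

```python
def stepwise_reward(prev_qc, current_qc, exact_metric, approx_metric,
                    is_native_and_mapped, scale=1.0):
    """Shaped step reward in two regimes: exact for native+mapped circuits,
    approximate (cost-model proxy) otherwise; the delta is suppressed on a
    regime change since the two metrics are not comparable."""
    def evaluate(qc):
        if is_native_and_mapped(qc):
            return exact_metric(qc), "exact"
        return approx_metric(qc), "approx"  # conservative cost-model proxy

    prev_val, prev_kind = evaluate(prev_qc)
    new_val, new_kind = evaluate(current_qc)
    if prev_kind != new_kind:  # regime change -> no comparable delta
        return 0.0, new_kind
    return scale * (new_val - prev_val), new_kind
```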
Checklist: